In April 2020, the APOC standard library added procedures that wrap the NLP APIs of each of the big cloud providers - AWS, GCP, and Azure. These procedures extract text from a node property and then send that text to APIs that extract entities, key phrases, categories, or sentiment.
We’re going to use the GCP Entity Extraction procedures on our articles. The GCP NLP API returns Wikipedia pages for entities where those pages exist.
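Depending on your APOC version, the NLP procedures may ship as an extra dependency jar rather than as part of APOC core, so it's worth confirming that they're actually available before going any further. One quick, optional sanity check is to ask APOC itself:
CALL apoc.help("nlp")
YIELD name, signature
RETURN name, signature
ORDER BY name;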
Before we do that, we’ll need to create an API key that has access to the Natural Language API. Assuming that we’ve already created a GCP account, we can generate a key by following the instructions at console.cloud.google.com/apis/credentials. Once we’ve created a key, we’ll create a parameter that contains it:
:param key => "<insert-key-here>"
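The arrow syntax above is Neo4j Browser's single-parameter form (cypher-shell accepts the same syntax, as far as I'm aware). If you prefer to set everything in one go, Browser also accepts a map:
:params {key: "<insert-key-here>"}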
We’re going to use the apoc.nlp.gcp.entities.stream procedure, which will return a stream of entities found for the text content contained in a node property.
Before running this procedure against all of the articles, let’s run it against one of them to see what data is returned:
MATCH (a:Article {uri: "https://dev.to/lirantal/securing-a-nodejs--rethinkdb--tls-setup-on-docker-containers"})
CALL apoc.nlp.gcp.entities.stream(a, {
  nodeProperty: 'body',
  key: $key
})
YIELD node, value
SET node.processed = true
WITH node, value
UNWIND value.entities AS entity
RETURN entity
LIMIT 5;
Each row contains a name property that describes the entity. salience is an indicator of the importance or centrality of that entity to the entire document text. Some entities also contain a Wikipedia URL, which is found via the metadata.wikipedia_url key. The first entity, RethinkDB, is the only entity in this list that has such a URL.
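If you want to eyeball those fields directly, a small variation on the query above (it calls the GCP API again for the same article, so it costs another request) projects the name, salience, and Wikipedia URL, ordered by salience:
MATCH (a:Article {uri: "https://dev.to/lirantal/securing-a-nodejs--rethinkdb--tls-setup-on-docker-containers"})
CALL apoc.nlp.gcp.entities.stream(a, {
  nodeProperty: 'body',
  key: $key
})
YIELD node, value
UNWIND value.entities AS entity
RETURN entity.name AS name,
       entity.salience AS salience,
       entity.metadata.wikipedia_url AS wikipediaUrl
ORDER BY salience DESC
LIMIT 10;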
We’re going to filter the rows returned to only include ones that have a Wikipedia URL, and we’ll then connect the Article nodes to the WikipediaPage nodes that have that URL.
Let’s have a look at how we’re going to do this for one article:
MATCH (a:Article {uri: "https://dev.to/lirantal/securing-a-nodejs--rethinkdb--tls-setup-on-docker-containers"})
// (1) Extract entities from the article's body text
CALL apoc.nlp.gcp.entities.stream(a, {
  nodeProperty: 'body',
  key: $key
})
YIELD node, value
SET node.processed = true
WITH node, value
// (2) Process each entity returned for the article
UNWIND value.entities AS entity
// (3) Keep only the entities that have a Wikipedia URL
WITH entity, node
WHERE entity.metadata.wikipedia_url IS NOT NULL
// (4) Create a WikipediaPage node for that URL and link the article to it
MERGE (page:Resource {uri: entity.metadata.wikipedia_url})
SET page:WikipediaPage
MERGE (node)-[:HAS_ENTITY]->(page);
We can see how running this query connects the article and taxonomy subgraphs by looking at the resulting Neo4j Browser visualization.
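If you want to recreate that view yourself, a query along these lines (using the same article uri, and only following the HAS_ENTITY relationships we just created) should return the relevant part of the graph:
MATCH (a:Article {uri: "https://dev.to/lirantal/securing-a-nodejs--rethinkdb--tls-setup-on-docker-containers"})
MATCH path = (a)-[:HAS_ENTITY]->(:WikipediaPage)
RETURN path;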
Now we can run the entity extraction technique over the rest of the articles with help from the apoc.periodic.iterate procedure again:
CALL apoc.periodic.iterate(
  "MATCH (a:Article)
   WHERE a.processed IS NULL
   RETURN a",
  "CALL apoc.nlp.gcp.entities.stream([item in $_batch | item.a], {
     nodeProperty: 'body',
     key: $key
   })
   YIELD node, value
   SET node.processed = true
   WITH node, value
   UNWIND value.entities AS entity
   WITH entity, node
   WHERE entity.metadata.wikipedia_url IS NOT NULL
   MERGE (page:Resource {uri: entity.metadata.wikipedia_url})
   SET page:WikipediaPage
   MERGE (node)-[:HAS_ENTITY]->(page)",
  {batchMode: "BATCH_SINGLE", batchSize: 10, params: {key: $key}})
YIELD batches, total, timeTaken, committedOperations
RETURN batches, total, timeTaken, committedOperations;
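Once the batch job has finished, a quick sanity check (just counting the labels and relationship type we created above) tells us how many Wikipedia pages we’ve linked to:
MATCH (:Article)-[:HAS_ENTITY]->(page:WikipediaPage)
RETURN count(DISTINCT page) AS wikipediaPages,
       count(*) AS hasEntityRelationships;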