In April 2020, the APOC standard library added procedures that wrap the NLP APIs of each of the big cloud providers - AWS, GCP, and Azure.
These procedures extract text from a node property and then send that text to APIs that extract entities, key phrases, categories, or sentiment.
We’re going to use the GCP Entity Extraction procedures on our articles.
The GCP NLP API returns Wikipedia pages for entities where those pages exist.
Before we do that, we’ll need to create an API key that has access to the Natural Language API.
Assuming we’ve already created a GCP account, we can generate a key by following the instructions at console.cloud.google.com/apis/credentials.
Once we’ve created a key, we’ll create a parameter that contains it:
:param key => "<insert-key-here>"
We’re going to use the apoc.nlp.gcp.entities.stream procedure, which will return a stream of entities found for the text content contained in a node property.
Before running this procedure against all of the articles, let’s run it against one of them to see what data is returned:
MATCH (a:Article {uri: "https://dev.to/lirantal/securing-a-nodejs--rethinkdb--tls-setup-on-docker-containers"})
CALL apoc.nlp.gcp.entities.stream(a, {
  nodeProperty: 'body',
  key: $key
})
YIELD node, value
// Mark the article as processed so that the batch run below can skip it
SET node.processed = true
WITH node, value
UNWIND value.entities AS entity
RETURN entity
LIMIT 5;
Table 8. Results
entity
{name: "RethinkDB", salience: 0.47283632, metadata: {mid: "/m/0134hdhv", wikipedia_url: "https://en.wikipedia.org/wiki/RethinkDB"}, type: "ORGANIZATION", mentions: [{type: "PROPER", text: {content: "RethinkDB", beginOffset: -1}}, {type: "PROPER", text: {content: "RethinkDB", beginOffset: -1}}, {type: "PROPER", text: {content: "RethinkDB", beginOffset: -1}}, {type: "PROPER", text: {content: "RethinkDB", beginOffset: -1}}, {type: "PROPER", text: {content: "pemThe RethinkDB", beginOffset: -1}}]}
{name: "connection", salience: 0.04166339, metadata: {}, type: "OTHER", mentions: [{type: "COMMON", text: {content: "connection", beginOffset: -1}}, {type: "COMMON", text: {content: "connection", beginOffset: -1}}]}
{name: "work", salience: 0.028608896, metadata: {}, type: "OTHER", mentions: [{type: "COMMON", text: {content: "work", beginOffset: -1}}]}
{name: "projects", salience: 0.028608896, metadata: {}, type: "OTHER", mentions: [{type: "COMMON", text: {content: "projects", beginOffset: -1}}]}
{name: "database", salience: 0.01957906, metadata: {}, type: "OTHER", mentions: [{type: "COMMON", text: {content: "database", beginOffset: -1}}]}
Each row contains a name property that describes the entity.
salience is an indicator of the importance or centrality of that entity to the entire document text.
Some entities also contain a Wikipedia URL, which is found via the metadata.wikipedia_url key.
The first entity, RethinkDB, is the only entity in this list that has such a URL.
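To make those fields easier to scan, we could project just the name, salience, and Wikipedia URL from the stream, ordered by the most salient entities first. This is a sketch of ours, not a query from the original article:

MATCH (a:Article {uri: "https://dev.to/lirantal/securing-a-nodejs--rethinkdb--tls-setup-on-docker-containers"})
CALL apoc.nlp.gcp.entities.stream(a, {
  nodeProperty: 'body',
  key: $key
})
YIELD node, value
UNWIND value.entities AS entity
// Accessing a missing map key returns null, so entities without
// Wikipedia metadata simply show a null wikipediaUrl
RETURN entity.name AS name,
       entity.salience AS salience,
       entity.metadata.wikipedia_url AS wikipediaUrl
ORDER BY salience DESC
LIMIT 5;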
We’re going to filter the returned rows to include only those entities that have a Wikipedia URL, and then connect the Article nodes to WikipediaPage nodes representing those URLs.
Let’s have a look at how we’re going to do this for one article:
MATCH (a:Article {uri: "https://dev.to/lirantal/securing-a-nodejs--rethinkdb--tls-setup-on-docker-containers"})
CALL apoc.nlp.gcp.entities.stream(a, {
  nodeProperty: 'body',
  key: $key
})
YIELD node, value                                             // (1)
SET node.processed = true
WITH node, value
UNWIND value.entities AS entity
WITH entity, node
WHERE entity.metadata.wikipedia_url IS NOT NULL               // (2)
MERGE (page:Resource {uri: entity.metadata.wikipedia_url})    // (3)
SET page:WikipediaPage
MERGE (node)-[:HAS_ENTITY]->(page)                            // (4)

(1) node is the article, and value contains the extracted entities.
(2) Only include entities that have a Wikipedia URL.
(3) Find a node that matches the Wikipedia URL, creating one if it doesn’t already exist.
(4) Create a HAS_ENTITY relationship between the Article node and the WikipediaPage node.
We can see how running this query connects the article and taxonomy subgraphs by looking at the resulting visualization in Neo4j Browser.
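To reproduce that view ourselves, we can return the new relationships as paths, which Neo4j Browser renders as a graph. This query is a sketch of ours, not from the original article:

// Show the article and the Wikipedia pages it is now linked to
MATCH path = (:Article {uri: "https://dev.to/lirantal/securing-a-nodejs--rethinkdb--tls-setup-on-docker-containers"})-[:HAS_ENTITY]->(:WikipediaPage)
RETURN path;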
Now we can run the entity extraction technique over the rest of the articles, again with help from the apoc.periodic.iterate procedure. The batchMode: "BATCH_SINGLE" setting passes each batch of articles to the inner statement as the $_batch parameter, a list of maps (hence [item in $_batch | item.a]), so apoc.nlp.gcp.entities.stream processes each batch of 10 articles in a single procedure call:
CALL apoc.periodic.iterate(
  "MATCH (a:Article)
   WHERE a.processed IS NULL
   RETURN a",
  "CALL apoc.nlp.gcp.entities.stream([item in $_batch | item.a], {
     nodeProperty: 'body',
     key: $key
   })
   YIELD node, value
   SET node.processed = true
   WITH node, value
   UNWIND value.entities AS entity
   WITH entity, node
   WHERE entity.metadata.wikipedia_url IS NOT NULL
   MERGE (page:Resource {uri: entity.metadata.wikipedia_url})
   SET page:WikipediaPage
   MERGE (node)-[:HAS_ENTITY]->(page)",
  {batchMode: "BATCH_SINGLE", batchSize: 10, params: {key: $key}})
YIELD batches, total, timeTaken, committedOperations
RETURN batches, total, timeTaken, committedOperations;
Table 9. Results

batches | total | timeTaken | committedOperations
--------|-------|-----------|--------------------
4       | 31    | 29        | 31
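As a quick sanity check (our addition, not part of the original article), we can confirm how many articles were processed and how many distinct Wikipedia pages they now link to; the article count should line up with the total column above:

// Count processed articles and the distinct Wikipedia pages they link to
MATCH (a:Article)
WHERE a.processed = true
WITH count(a) AS processedArticles
MATCH (:Article)-[:HAS_ENTITY]->(page:WikipediaPage)
RETURN processedArticles, count(DISTINCT page) AS wikipediaPages;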