Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
10-21-2020 05:59 AM
We are currently enriching our knowledge graph with NLP derived metadata about the content on GOV.UK. See this blog post for context.
Our desired outcome is each piece of content (a page on GOV.UK; labelled a Cid; circa. ~ 500k pages) will have a relationship with the various entities it contains or mentions in its text. This can be Organisations, People, Dates and Services etc..
We use a modified version of BERT (trained on our data; we call it GovNER) to identify entities within each page. The output is a csv that looks like this (I've removed some entities and included only one row for ease of reading):
base_path,entities,updated_at,govner_version
/government/news/new-digital-resource-for-charity-trustees-launched,"[{""end"": 144, ""entity"": ""charity"", ""entity_type"": ""ORGANIZATION"", ""start"": 137}, {""end"": 2788, ""entity"": ""Commission"", ""entity_type"": ""ORGANIZATION"", ""start"": 2778}, {""end"": 4155, ""entity"": ""Commission"", ""entity_type"": ""ORGANIZATION"", ""start"": 4145}, {""end"": 4822, ""entity"": ""Commission"", ""entity_type"": ""ORGANIZATION"", ""start"": 4812}, {""end"": 1557, ""entity"": ""questions"", ""entity_type"": ""CONTACT"", ""start"": 1548}, {""end"": 2175, ""entity"": ""email"", ""entity_type"": ""CONTACT"", ""start"": 2166}, {""end"": 2847, ""entity"": ""questions"", ""entity_type"": ""CONTACT"", ""start"": 2838}, {""end"": 2979, ""entity"": ""questions"", ""entity_type"": ""CONTACT"", ""start"": 2970}]",2020-10-15 17:56:30.955360,0.1
The base_path is unique for each Cid, so we can use this to look up our Cids. We then would like to extract the data from the entities column which is in JSON format. As you can see there are a variety of entity_types which corresponds to different node labels. We need to iterate through and create a labelled node if it doesn't exist then create a relationship between the Cid and that entity. We would like to store the location of that entity in the text (start, end) as properties of the edge (in 3.5.16 I believe you can only store one property on an edge?).
Historically, we've used python to get our data into a simple nodelist and edgelist format, with one csv per nodelist label and one edge csv per relationship. We'd normally do the data wrangling in python then have a simple LOAD CSV in Cypher to load in each of the different node labels as the graph gets built. This will result in quite a few intermediary csvs, is there a cleaner way to do this using just Cypher and / or APOC?
We've not reviewed the best way to do this for about a year and are aware that things might have moved on with Cypher and APOC.
We are using Community edition 3.5.16.
10-21-2020 03:21 PM
I'm not really following what you're expecting the graph to look like. Perhaps you can manually create a small sample of your expected graph, at https://console.neo4j.org? The start and end don't really make sense to me without context.
That said, I can probably get you most of the way there:
LOAD CSV WITH HEADERS FROM "https://someurl.gov.uk/yourfile.csv" AS csvLine
MERGE (cid :Cid {base_path: csvLine.base_path})
WITH cid, apoc.convert.fromJsonMap(csvLine.entities) as entities
UNWIND entities AS entity
CREATE (e :Entity) SET e = entity
CREATE (cid)-[r :REL]->(e) SET r.start = e.start, r.end = e.end
10-23-2020 02:32 AM
Thanks!
My first attempt looked like this
// my first attempt
WITH apoc.convert.fromJsonList(
'
[
{"end": 144, "entity": "charity", "entity_type":
"ORGANIZATION", "start": 137},
{"end": 2788, "entity": "Commission", "entity_type": "ORGANIZATION", "start": 2778},
{"end": 1557, "entity": "questions", "entity_type": "CONTACT", "start": 1548}, {"end": 2175, "entity": "email", "entity_type": "CONTACT", "start": 2166}, {"end": 2847, "entity": "questions", "entity_type": "CONTACT", "start": 2838}, {"end": 2979, "entity": "questions", "entity_type": "CONTACT", "start": 2970}
]
'
) AS batch
WITH batch
UNWIND batch as row
//RETURN row
// https://gist.github.com/jexp/caeb53acfe8a649fecade4417fb8876a#lookup-by-id-or-even-list-of-ids
// this doesn't do what we want, we want to specify the label if we can
MERGE (to:Entity { entity_type: row.entity_type, name: row.entity})
WITH row, to
MATCH (from:Cid {name: "/government/news/new-digital-resource-for-charity-trustees-launched"})
MERGE (from)-[rel:CONTAINS_ENTITY { start_end_indices: '[' + row.start + ', ' + row.end + ']'} ]->(to)
And didn't do quite what we wanted.
So a workaround for this caveat Labels cannot be created dynamically, but you could use apoc.do.case to selectively assign a label based on existing data.
might be to create a general Entity
and store the entity_type
as a property, then use apoc.do.case
to cycle through and add an additional more specific label as desired?
Rather than having the umbrella Entity
label, we want the entity type to determine the label of the node. So we have a bunch of labels like; Person, Contact, Organisation, Finance, Service etc.
Thanks for your help by the well, it's reassured that I'm on the right track
10-23-2020 08:03 AM
Got the entities in and moved their entity_type
property to a label as desired.
// This proof of principle in cypher takes some test data from the entities column in entity_report.csv,
// the govner output (CAVEAT: had to remove double quotes from things)
// TODO: get it reading from the .csv
WITH apoc.convert.fromJsonList(
'
[
{"end": 144, "entity": "charity", "entity_type":
"ORGANIZATION", "start": 137},
{"end": 2788, "entity": "Commission", "entity_type": "ORGANIZATION", "start": 2778},
{"end": 1557, "entity": "questions", "entity_type": "CONTACT", "start": 1548}, {"end": 2175, "entity": "email", "entity_type": "CONTACT", "start": 2166}, {"end": 2847, "entity": "questions", "entity_type": "CONTACT", "start": 2838}, {"end": 2979, "entity": "questions", "entity_type": "CONTACT", "start": 2970}
]
'
) AS entities
WITH entities
UNWIND entities as entity
// this doesn't do what we want, we want to specify the label dynamically if we can
MERGE (e:Entity { entity_type: entity.entity_type, name: entity.entity})
WITH e, entity
// TODO: replace specific url with cid.name from matching row in csv
MATCH (cid:Cid {name: "/government/news/new-digital-resource-for-charity-trustees-launched"})
MERGE (cid)-[r :HAS_ENTITY { start_end_indices: [ toInteger(entity.start) , toInteger(entity.end) ]} ]->(e)
;
// We can't set labels dynamically
// Instead we move the entity_type property to a label, we choose not to remove it
// https://neo4j.com/labs/apoc/4.1/graph-updates/graph-refactoring/property-value-label/
MATCH (e:Entity)
CALL apoc.create.addLabels( id(e), [ apoc.text.upperCamelCase(e.entity_type) ] )
YIELD node
// we don't remove the generic Entity label
//REMOVE node.genre
RETURN node
;
10-23-2020 10:23 AM
Honestly, yours may be the best approach. Dynamically defining the labels when creating the node will still require a two-step cypher. The code you have in your last post will be easier to adjust and maintain than a large apoc.do.case
.
I think you've got it. That said, if you really want to explore the case
approach, here's a snippet to get you started:
MATCH (cid:Cid {name: "/government/news/new-digital-resource-for-charity-trustees-launched"})
WITH cid, apoc.convert.fromJsonList(
'
[
{"end": 144, "entity": "charity", "entity_type":
"ORGANIZATION", "start": 137},
{"end": 2788, "entity": "Commission", "entity_type": "ORGANIZATION", "start": 2778},
{"end": 1557, "entity": "questions", "entity_type": "CONTACT", "start": 1548}, {"end": 2175, "entity": "email", "entity_type": "CONTACT", "start": 2166}, {"end": 2847, "entity": "questions", "entity_type": "CONTACT", "start": 2838}, {"end": 2979, "entity": "questions", "entity_type": "CONTACT", "start": 2970}
]
'
) AS entities
UNWIND entities as entity
WITH cid, entity
WHERE entity.entity_type IS NOT NULL AND entity.entity IS NOT NULL
CALL apoc.do.case([
type = "ORGANIZATION", "
MERGE (e:Organization {name: name})
MERGE (cid)-[r :ORG { start_end_indices: [ toInteger(entity.start) , toInteger(entity.end) ]} ]->(e)
",
type = "CONTACT", "
MERGE (e:Contact {name: name})
MERGE (cid)-[r :CONTACT { start_end_indices: [ toInteger(entity.start) , toInteger(entity.end) ]} ]->(e)
"
],
"
MERGE (e:EntityUnhandled { entity_type: type, name: name})
MERGE (cid)-[r :HAS_ENTITY { start_end_indices: [ toInteger(entity.start) , toInteger(entity.end) ]} ]->(e)
",
{cid: cid, entity: entity, type:entity.entity_type, name: entity.entity}
) YIELD value
RETURN value
;
11-02-2020 09:13 AM
For completion, when reading from the file.
// This writes entities from entity_report.csv into the graph
// for testing on local
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///entity_report.csv" AS csvLine
CREATE (cid:Cid {name: csvLine.base_path})
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///entity_report.csv" AS csvLine
MATCH (cid:Cid {name: csvLine.base_path})
WITH cid, apoc.convert.fromJsonList(csvLine.entities) AS entities
UNWIND entities as entity
// this doesn't do what we want, we want to specify the label dynamically if we can
MERGE (e:Entity { entity_type: entity.entity_type, name: entity.entity})
WITH e, entity, cid
MERGE (cid)-[r :HAS_ENTITY { start_end_indices: [ toInteger(entity.start) , toInteger(entity.end) ]} ]->(e)
;
// SECOND CYPHER
// We can't set labels dynamically
// Instead we move the entity_type property to a label, we choose not to remove it, as we change it's case
// https://neo4j.com/labs/apoc/4.1/graph-updates/graph-refactoring/property-value-label/
// TODO:: Organisation instead of Organization; could fix here or upstream
MATCH (e:Entity)
CALL apoc.create.addLabels( id(e), [ apoc.text.upperCamelCase(e.entity_type) ] )
YIELD node
REMOVE node.entity_type
RETURN node
;
All the sessions of the conference are now available online