Neo4j

TonyOD27 · ‎01-26-2021

neo4j version: Community 4.2.0
dbms.memory.heap.initial_size=25G
dbms.memory.heap.max_size=25G
dbms.memory.pagecache.size=8G
desktop version: 1.3.11

Hello,

What would be the best way to write a cypher query that deletes an existing node from Neo4j if the unique ID on that node is not present in a csv file? I think I found a solution for this but I don't think I fully understand it. First I'm trying to write the query in pure Cypher so that I can pass it to my python script using the neo4j driver.

I'm running daily data extraction python scripts to create csv's and I need a cypher query to delete nodes from the existing Neo4j graph database if the node id is not found in the new csv file.

I think I found the answer here: Neo4J Delete Nodes With Field Value Not in CSV with Cypher - Stack Overflow

The daily csv files with about 100k rows have a format like this:

person_id, fname, lname, location
1, name1, lname1, london
2, name2, lname2, munich
3, name3, lname3, beijing
4, name4, lname4, tokyo

Let's say that on the next day, I run my python script and now the csv looks like this:

person_id, fname, lname, location
1, name1, lname1, london
2, name2, lname2, munich
3, name3, lname3, beijing

What would be the cypher query to delete the node with person_id = 4?

Based on the stack overflow thread, I have this:

:auto USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS
FROM 'file:///'+$filecsv AS row

// Create a collection of the name ID's that you can check against in the cypher below.
WITH COLLECT(toInt(row.person_id)) AS newlist

//Find the nodes from your graph database to compare against newlist
MATCH(n:People)
WHERE EXISTS (n.name_id)
AND
NOT n.name_id IN newlist

DETACH DELETE n;

So am I right if I describe this query with the following statements?

Step 1: Load the daily file

:auto USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS
FROM 'file:///'+$filecsv AS row

Step 2: create a collection of the name ID's that show up in the new csv file. This function stores the name id's in the list so that you can check for existance your current name id's in the graph database

// Create a collection of the name ID's that you can check against in the cypher below.
WITH COLLECT(toInt(row.person_id)) AS newlist

Step 3: Retrieve the existing nodes from the graph database. And delete the nodes if:

a) the current name already exists, and
b) the current name_id is not found in the new list.

//Find the nodes from your graph database to compare against newlist
MATCH(n:People)
WHERE EXISTS (n.name_id)
AND
NOT n.name_id IN newlist

DETACH DELETE n;

This seems to be working for me, but is there a better to write this if I plan to execute this query with the Python neo4j driver on a daily basis?

Any insights would be greatly appreciated.

Thank you

clem · ‎01-26-2021

Another way to do this (and might work better if you have lots of data), is to create a temporary Label on each node, indicating that that particular node was in the CSV.

Then run another pass, where you delete all nodes that don't have the Label.

Then delete the temporary Label.

Something like:

// Label nodes to keep
LOAD CSV WITH HEADERS
FROM 'file:///'+$filecsv AS row
MATCH ( n:People {name_id:toInt(row.person_id)} ). // find node to keep
SET n:Keep;  // add temporary label

// Find Nodes not to Keep and delete them
MATCH (n:Person)
WHERE NOT n:Keep
DETACH DELETE n;

// Clean up: Remove the temporary Keep label
MATCH(n:Person:Keep)
REMOVE n:Keep // remove the temp Label
RETURN n;

The reason for my suggestion is I'm not sure what the overhead is for building your list from the CSV file. In addition, I believe Cypher has to do a linear search to not find name_id IN LIST operator.

Labeled Nodes are internally kept in sets that are fast to access, so that checking for the (non) existence of a Node with a Label is fast.

I believe your code is order N^2 and my way is order N.

View solution in original post

clem · ‎01-26-2021

Another way to do this (and might work better if you have lots of data), is to create a temporary Label on each node, indicating that that particular node was in the CSV.

Then run another pass, where you delete all nodes that don't have the Label.

Then delete the temporary Label.

Something like:

// Label nodes to keep
LOAD CSV WITH HEADERS
FROM 'file:///'+$filecsv AS row
MATCH ( n:People {name_id:toInt(row.person_id)} ). // find node to keep
SET n:Keep;  // add temporary label

// Find Nodes not to Keep and delete them
MATCH (n:Person)
WHERE NOT n:Keep
DETACH DELETE n;

// Clean up: Remove the temporary Keep label
MATCH(n:Person:Keep)
REMOVE n:Keep // remove the temp Label
RETURN n;

The reason for my suggestion is I'm not sure what the overhead is for building your list from the CSV file. In addition, I believe Cypher has to do a linear search to not find name_id IN LIST operator.

Labeled Nodes are internally kept in sets that are fast to access, so that checking for the (non) existence of a Node with a Label is fast.

I believe your code is order N^2 and my way is order N.

TonyOD27 · ‎01-27-2021

Oh I think I better understand now what you mean by label. I had to read this:

So if I'm going to use a second label on the node then I don't need any more indeces, right?

clem · ‎01-27-2021

Exactly! Behind the scenes, Nodes with the same Label are collected in a Set so that searching for a Node is very fast.

TonyOD27 · ‎01-27-2021

Just curious, but how were you able to determine the differences in running time complexities between the two codes?

I read the Cypher docs referring to the IN LIST operator in the WHERE clause and, although it does not mention the linear search, it does make sense why using the IN LIST would perform slower than searching for nodes with Labels.

In terms of overhead, I'm writing cypher queries to perform the 3 steps listed below on 13 different node Labels with each label containing approximately 30,000 nodes, give or take. In total, my graph database contains 400,000 nodes connected by 550,000,000 million relationships (there's a stinking super node in there, still working on it).

If I'm going to set temporary labels to each of these sets of nodes, wouldn't I need an index on the n.keep property of the People node in order to make this updating process faster than the query from the Stack Overflow question? I suppose having this index should not add too much overhead if the total number of nodes remains at about 400,000. I'll give this a shot for now because I don't think there's too much overhead.

Since we're talking about overhead, I'm going to diverge away from the original question and just provide more info. Here are the other 2 cypher queries I'm using to update my graph database on a daily basis:

Step 1: If a Person_ID exists in Neo4j but not in the new csv, delete the node (this is what I am asking here)

Step 2: If a Person_ID exists in Neo4j and in the collected data, compare the node properties. If node properties are different, update the properties on the node. Here I'm using this query:

:auto USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS
FROM 'file:///'+$file AS row
MATCH(n:People {name_id: toInteger(row.person_id)})
SET n += {firstName: row.fname,
			lastName: row.lname,
			location: row.location};

Step 3: If person_id exists in the new csv but not in Neo4j, create the new node:

:auto USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS
FROM 'file:///'+$file AS row
MERGE (n:People {name_id: toInteger(row.person_id)})
ON CREATE SET n += {firstName: row.fname,
			lastName: row.lname,
			location: row.location};

clem · ‎01-27-2021

I'm making an educated guess on the run times.

I presume that a Cypher List is just an array. (I could be wrong, but I couldn't find any documentation that said otherwise.). Computer Science knows such things are linearly searched.

I know that Nodes with the same Label is collected as a Set which makes if fast to scan. I presume Set membership test is done via a Hash, so it should be order O(1). But I could be wrong.

TonyOD27 · ‎01-27-2021

Thank you very much for your help!

Neo4j

Delete Node if the Unique Property Constraint is not in CSV