
How can I match a single node and stop searching, so I can update its relationships faster while importing large data?

I am working on a large dataset using Neo4j 4.1 Community Edition. Every hour, more than 2 million relationships and 10K nodes need to be updated or created. It already takes me around one hour to update about 600K relationships, even though I have carefully prepared separate node and relationship CSV files with no duplicate rows in either file.

The process I run every hour to update the graph is:

  1. MERGE the nodes first by importing the node CSV file. (This process is fast.)
:auto USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///node.csv' AS line
// Upsert one Event node per CSV row
MERGE (status:Event {event_name: line.node, type: line.node_attribute});
  2. MATCH the two endpoint nodes and MERGE the relationship. (This process is very slow; see the note after the query.)
:auto USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///relationship.csv' AS line
// Look up both endpoint nodes by their identifying properties, then upsert the relationship
MATCH (prev_status:Event {event_name: line.start_node, type: line.start_node_attribute})
MATCH (status:Event {event_name: line.end_node, type: line.end_node_attribute})
MERGE (prev_status)-[t:TO {attribute_0: coalesce(line.edge_attribute_0, 'None'), attribute_1: coalesce(line.edge_attribute_1, 'None'), dt: date('2020-10-26'), weight: toFloat(line.edge_weight)}]->(status);
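A side note on the MERGE semantics in step 2: because every property appears inside the relationship pattern, MERGE only reuses an existing relationship when all four properties match exactly; otherwise it creates a new one. If only the date is meant to identify the relationship (an assumption about the data model, not something stated above), a sketch like the following would merge on that key alone and fill in the other properties only on creation:

:auto USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///relationship.csv' AS line
MATCH (prev_status:Event {event_name: line.start_node, type: line.start_node_attribute})
MATCH (status:Event {event_name: line.end_node, type: line.end_node_attribute})
// Merge on the assumed identifying key only; set the remaining properties on creation
MERGE (prev_status)-[t:TO {dt: date('2020-10-26')}]->(status)
ON CREATE SET t.attribute_0 = coalesce(line.edge_attribute_0, 'None'),
              t.attribute_1 = coalesce(line.edge_attribute_1, 'None'),
              t.weight = toFloat(line.edge_weight);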

In my graph, there will never be more than one node with the same type and properties. I am wondering whether I can speed up the relationship update by stopping the search after the first matching node is found and working only on that node. My current Cypher keeps scanning for other nodes even though I know no other node satisfies the given condition. I suspect the update could be faster if I avoided that, but I don't know whether it actually matters, or whether there is some other bottleneck making this process so slow.

If not, is there any suggestion for making this update faster? I did a lot of research on it but haven't found a solution yet.

1 REPLY

I have figured out a way to improve the performance by adding indexes on the node properties, which makes the relationship update much faster than before (~10 minutes to update 2 million relationships):

CREATE INDEX FOR (e:Event) ON (e.event_name);
CREATE INDEX FOR (e:Event) ON (e.type);

I don't know if I am using indexes correctly here, but it does improve the efficiency. In the Neo4j docs I also saw composite indexes, but I don't know whether a composite index fits my use case. Hopefully the docs can include more details and examples.
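For what it's worth, Neo4j 4.x does support composite indexes over multiple properties of the same label. Since the MATCH clauses above always look up event_name and type together, a single composite index like the sketch below could in principle serve both lookups (the index name here is made up):

CREATE INDEX event_name_type FOR (e:Event)
ON (e.event_name, e.type);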

However, another issue has come up with LOAD CSV: while the file includes around 2 million rows (relationships), the process reports completion after updating only about 520K relationships. I am wondering if there is any limit on the LOAD CSV operation, even though I am using PERIODIC COMMIT. My current workaround is to run the exact same query again to update the rest of the relationships in the CSV file.
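One alternative worth trying for this kind of batched import is apoc.periodic.iterate, which batches the inner update itself and reports how many batches failed, making partial runs easier to detect than with PERIODIC COMMIT. This is only a sketch: it assumes the APOC plugin is installed, which the post above does not mention.

CALL apoc.periodic.iterate(
  // Outer statement streams the CSV rows
  "LOAD CSV WITH HEADERS FROM 'file:///relationship.csv' AS line RETURN line",
  // Inner statement runs per row, committed in batches
  "MATCH (prev_status:Event {event_name: line.start_node, type: line.start_node_attribute})
   MATCH (status:Event {event_name: line.end_node, type: line.end_node_attribute})
   MERGE (prev_status)-[t:TO {attribute_0: coalesce(line.edge_attribute_0, 'None'), attribute_1: coalesce(line.edge_attribute_1, 'None'), dt: date('2020-10-26'), weight: toFloat(line.edge_weight)}]->(status)",
  {batchSize: 10000, parallel: false}
)
YIELD batches, total, failedBatches
RETURN batches, total, failedBatches;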