Upload large amounts of data on Neo4j Community Edition

Hello

I am working with Neo4j Community Edition running on EC2 (r5.16xlarge instance type). I am trying to upload data from S3 buckets.

I have a number of CSV files (each with 1M records) that I am trying to load into Neo4j. I initially used LOAD CSV and, after reading a few topics on the community forum, switched to apoc.load.csv. This is also taking a long time to load the data. My query looks something like this:

CALL apoc.periodic.iterate('
CALL apoc.load.csv({file_path}) yield map as row return row
','
MERGE ....
MERGE ....
MERGE ...
...
...
...
...
...
',
{batchSize:10000, parallel:true});

As seen above, the query has a lot of MERGE operations. Loading even 10K records takes more than a minute, and I need to load millions of records every minute. Someone on the forum suggested that I try neo4j-admin import, but for my use case I need to mutate the graph with new data every hour.

I tried changing the EC2 instance type to increase memory and CPU, but without success. Please advise on how to go about this.

Thank you!

5 REPLIES

Do you have indexes created on the properties you are using in the MERGE operations?

The MERGE operations are on nodes and links, not on their properties. I’m using MERGE to create nodes and links. Is there a better way to do this?

The nodes and links shouldn’t be duplicated. This is why I’m using merge.

Thanks.

I understand that you do MERGE on nodes. But you should have a property that differentiates two nodes, right? Ideally this is the property on which you don't want duplicate nodes, which is why you use MERGE rather than CREATE. What I am saying is that you need indices on these node properties for MERGE to be efficient; without one, each MERGE has to scan every node with that label to check whether a match already exists.
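
As a rough sketch of what that could look like: the label `Person` and the property `id` below are hypothetical placeholders, since the actual labels and properties are not shown in the question. The syntax assumes Neo4j 4.4 or later; older versions use a different form, noted in the comments. Schema commands like these are run once, before the load, not as part of the load query.

```cypher
// Hypothetical label and property; substitute the ones used in your MERGE clauses.
// A uniqueness constraint prevents duplicates and is backed by an index,
// so MERGE on (:Person {id: ...}) becomes an index lookup instead of a label scan.
// Neo4j 4.4+ syntax. On older 4.x versions the equivalent form is:
//   CREATE CONSTRAINT person_id ON (p:Person) ASSERT p.id IS UNIQUE
CREATE CONSTRAINT person_id IF NOT EXISTS
FOR (p:Person) REQUIRE p.id IS UNIQUE;

// Alternative if uniqueness should not be enforced: a plain index.
// Use this instead of the constraint above, not in addition to it.
// CREATE INDEX person_id_index IF NOT EXISTS
// FOR (p:Person) ON (p.id);
```

Repeat this for every label that appears in a MERGE. Running the load query with EXPLAIN or PROFILE should then show an index seek rather than a label scan for each MERGE.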

The ID property of each node is what differentiates one node from another. Can you please suggest how to create indices on it while loading the data, to make the load faster? Thanks.
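
For illustration, here is a minimal sketch of how the load itself might be shaped, assuming an index or uniqueness constraint already exists on the ID property of each label. All labels, the relationship type, the CSV column names, and the file path are hypothetical, because the original query elides them.

```cypher
// Hypothetical names: labels Person and Company, relationship WORKS_AT,
// CSV columns person_id, person_name, company_id, and the file path.
CALL apoc.periodic.iterate(
  'CALL apoc.load.csv($file_path) YIELD map AS row RETURN row',
  // MERGE only on the indexed key, then set the remaining properties.
  // Merging on the full property map would bypass the index lookup.
  'MERGE (p:Person {id: row.person_id})
     ON CREATE SET p.name = row.person_name
   MERGE (c:Company {id: row.company_id})
   MERGE (p)-[:WORKS_AT]->(c)',
  // parallel:false is a deliberate choice in this sketch: parallel batches
  // that MERGE relationships onto the same nodes can contend for locks.
  {batchSize: 10000, parallel: false, params: {file_path: 'file:///people.csv'}}
);
```

Each node is merged on its key alone and the relationship is merged between the already-bound nodes, so every MERGE can use the index. Whether `parallel: true` is safe depends on how much the batches overlap on the same nodes.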