How to speed up uploading data from csv in graph db

Hello, this is my first topic in the Neo4j community, and I am still learning Neo4j. I am trying to load data into a Neo4j graph database from CSV files, using a Python script I have written. Some of the CSV files are large (3.2 GB or more) and contain roughly 50 million rows or more. I did a bulk import first and it worked well, but I now need to load data into an existing database, so I am using LOAD CSV. Since the data is very large, I am using the APOC library (version 3.5.0.4) for its parallel features. My current Cypher query is:

CALL apoc.periodic.iterate('
                     load csv with headers from "file:///relcashoutTest.csv" AS row return row ','
                     MATCH (a:CUSTOMER {WALLETID: row.CUSTOMER})
                     MATCH (c:AGENT{WALLETID: row.AGENT})
                     MERGE(a)-[r:CASHOUT]->(c)
                     return count(*)
                     ',{batchSize:1000, iterateList:true, parallel:true}) 

This query is for a single CASHOUT relationship, but I have others; the Python script builds them dynamically. The good news is that node creation works properly and takes around 105 seconds. The problem is building relationships between the nodes. My Amazon instance has 32 CPU cores and 240 GB of RAM. At first the parallelism works fine, but after a while it no longer uses all cores; in my case it stays between 2 and 7 cores. I printed some statistics: creating 10 relationships takes 39 seconds. Yesterday I ran the relationship query above for 8 hours and got no output. I am unsure about constraints and indexes: I assumed they would not help because of the read/write trade-off. Kindly help me solve this problem. The Python script with this query works fine for small data sets. Thank you in advance. My Neo4j version is 3.5.8.

1 ACCEPTED SOLUTION

Remove the parallel option and increase the batch size to 50k or so.
You can also remove the RETURN count(*).

What is your heap/page-cache configuration for Neo4j?

Do you have the constraints?

Can you share:

EXPLAIN MATCH (a:CUSTOMER {WALLETID: row.CUSTOMER})
                 MATCH (c:AGENT{WALLETID: row.AGENT})
                 MERGE(a)-[r:CASHOUT]->(c)
                 return count(*)
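
As a rough sketch of the changes suggested above, the Python script that builds the queries could generate the call like this. The helper name build_iterate_query and its parameters are hypothetical; the labels, the WALLETID property, and the CSV columns are taken from the CASHOUT query in the question and are only assumed to be the same for the other relationship types.

```python
def build_iterate_query(csv_file, rel_type, batch_size=50000):
    """Return an apoc.periodic.iterate call that merges one relationship type.

    Hypothetical helper: it assumes every CSV has CUSTOMER and AGENT columns,
    as in the CASHOUT example from the question.
    """
    # Outer statement: stream the CSV rows.
    outer = f'LOAD CSV WITH HEADERS FROM "file:///{csv_file}" AS row RETURN row'
    # Inner statement: look up both endpoints, then merge the relationship.
    # No RETURN count(*) at the end, as suggested in the reply.
    inner = (
        "MATCH (a:CUSTOMER {WALLETID: row.CUSTOMER}) "
        "MATCH (c:AGENT {WALLETID: row.AGENT}) "
        f"MERGE (a)-[:{rel_type}]->(c)"
    )
    # parallel is switched off and the batch size raised to 50k.
    return (
        f"CALL apoc.periodic.iterate('{outer}', '{inner}', "
        f"{{batchSize: {batch_size}, iterateList: true, parallel: false}})"
    )


query = build_iterate_query("relcashoutTest.csv", "CASHOUT")
print(query)
```

Cypher cannot take a relationship type as a query parameter, which is why the type is interpolated into the string here; only trusted values should be passed as rel_type.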


6 REPLIES

Hi,
Good choice to go with apoc.
Can you try increasing the batchSize to maybe 10K?
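Alternatively, try a plain LOAD CSV query prefixed with USING PERIODIC COMMIT 10000. The hint applies to LOAD CSV itself, so it has to go directly before LOAD CSV rather than before the CALL statement.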
Here is the documentation of this clause: https://neo4j.com/docs/cypher-manual/3.5/query-tuning/using/#query-using-periodic-commit-hint and https://neo4j.com/docs/cypher-manual/3.5/clauses/load-csv/#load-csv-importing-large-amounts-of-data

I tried increasing the batch size to 10000, but the result is the same.

Maybe avoid parallel when merging relationships; it can be a recipe for lock contention, because relationship creation requires locks on the start and end nodes. If the same nodes appear multiple times in the CSV, there can be contention and deadlocks between concurrently executing batches.
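
To get a feel for how likely such contention is, one could count how often each endpoint appears in the CSV before loading it. This is only a sketch: the function name endpoint_repeats is hypothetical, the column names come from the query in the question, and the four sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter


def endpoint_repeats(rows, start_col="CUSTOMER", end_col="AGENT"):
    """Count how many CSV rows touch each start node and each end node."""
    counts = Counter()
    for row in rows:
        counts[("start", row[start_col])] += 1
        counts[("end", row[end_col])] += 1
    return counts


# Invented sample data; in practice, pass csv.DictReader(open(path, newline="")).
sample = io.StringIO("CUSTOMER,AGENT\nc1,a1\nc2,a1\nc3,a1\nc1,a2\n")
counts = endpoint_repeats(csv.DictReader(sample))

# Nodes that appear in more than one row are the ones parallel batches
# would have to lock repeatedly.
hot = [(key, n) for key, n in counts.most_common() if n > 1]
print(hot)
```

A few agents that appear in a large share of the rows would mean that most batches compete for the same node locks.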

Also make sure you have indexes on :CUSTOMER(WALLETID) and :AGENT(WALLETID)
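
For reference, a minimal sketch of generating the schema statements from the Python script, using the Neo4j 3.5 syntax. The helper name schema_statements is hypothetical, and the unique-constraint variant assumes WALLETID really is unique per label (a unique constraint also provides the index used by the MATCH lookups).

```python
def schema_statements(labels, prop="WALLETID", unique=True):
    """Build Neo4j 3.5 schema statements for the lookup property of each label."""
    statements = []
    for label in labels:
        if unique:
            # Assumes the property is unique per label; the constraint
            # is backed by an index.
            statements.append(
                f"CREATE CONSTRAINT ON (n:{label}) ASSERT n.{prop} IS UNIQUE"
            )
        else:
            # Plain index, for properties that may repeat.
            statements.append(f"CREATE INDEX ON :{label}({prop})")
    return statements


for statement in schema_statements(["CUSTOMER", "AGENT"]):
    print(statement)
```

Each statement would be run once, before the relationship load, so that every MATCH becomes an index lookup instead of a label scan.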

@Michael thanks for your reply. I added a constraint on the ID and performance has improved significantly.

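Try using CREATE instead of MERGE, i.e.,

CREATE (a)-[r:CASHOUT]->(c)

MERGE first has to check whether the relationship already exists, so it is slower than CREATE. Note that CREATE skips that check, so it adds a duplicate relationship whenever the same pair of nodes appears more than once.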
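
If CREATE is used, duplicate pairs in the CSV would need to be removed first, since CREATE does not check for an existing relationship. A minimal sketch in Python, assuming the CUSTOMER and AGENT columns from the question; the function name unique_pairs and the sample rows are hypothetical.

```python
def unique_pairs(rows, start_col="CUSTOMER", end_col="AGENT"):
    """Yield each (start, end) pair once, in first-seen order."""
    seen = set()
    for row in rows:
        pair = (row[start_col], row[end_col])
        if pair not in seen:
            seen.add(pair)
            yield pair


# Invented sample rows; real input could come from csv.DictReader.
rows = [
    {"CUSTOMER": "c1", "AGENT": "a1"},
    {"CUSTOMER": "c1", "AGENT": "a1"},  # repeated pair, dropped
    {"CUSTOMER": "c2", "AGENT": "a1"},
]
pairs = list(unique_pairs(rows))
print(pairs)
```

This keeps every distinct pair in memory and only removes duplicates within the file; relationships that already exist in the database would still be duplicated by CREATE.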