08-19-2019 02:08 AM
Hello, this is my first topic in the Neo4j community and I am learning Neo4j. I am currently trying to load data into a Neo4j graph database from CSV files, using a Python script I have written. Some of my CSV files are large (3.2 GB or more) and contain roughly 50 million rows or above. I did a bulk import first and it worked well, but now I need to load data into an existing database, so I switched to LOAD CSV. Since my data is very large, I am using the APOC library (version 3.5.0.4) for its parallel features. My current Cypher query is:
CALL apoc.periodic.iterate('
  LOAD CSV WITH HEADERS FROM "file:///relcashoutTest.csv" AS row RETURN row
','
  MATCH (a:CUSTOMER {WALLETID: row.CUSTOMER})
  MATCH (c:AGENT {WALLETID: row.AGENT})
  MERGE (a)-[r:CASHOUT]->(c)
  RETURN count(*)
', {batchSize: 1000, iterateList: true, parallel: true})
This query is for a single CASHOUT relationship, but I have others; my Python script handles them dynamically. The good news is that node creation works properly, finishing in around 105 seconds. The problem is building the relationships between nodes.

My Amazon instance has 32 CPU cores and 240 GB of RAM. I have observed that the parallelism works fine at first, but after a while it no longer uses all the cores; in my case it gets stuck between 2 and 7 cores. I printed some statistics: creating 10 relationships takes 39 seconds. Yesterday I ran the relationship query above for 8 hours and got no output. I assumed constraints and indexing would not be helpful because of the read/write trade-off, but I am not sure. Kindly help me solve this problem; my Python script with this query works fine for small data sets. Thank you in advance. My Neo4j version is 3.5.8.
08-25-2019 11:29 AM
Remove the parallel and increase the batch size to 50k or such.
You can also remove the RETURN count(*)
What is your heap/page-cache configuration for Neo4j?
Do you have the constraints?
Can you share:
EXPLAIN MATCH (a:CUSTOMER {WALLETID: row.CUSTOMER})
MATCH (c:AGENT {WALLETID: row.AGENT})
MERGE (a)-[r:CASHOUT]->(c)
RETURN count(*)
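Putting that advice together, the revised call might look like this (a sketch: parallel disabled, batch size raised to 50k, and the RETURN count(*) dropped, per the suggestions above):

CALL apoc.periodic.iterate('
  LOAD CSV WITH HEADERS FROM "file:///relcashoutTest.csv" AS row RETURN row
','
  MATCH (a:CUSTOMER {WALLETID: row.CUSTOMER})
  MATCH (c:AGENT {WALLETID: row.AGENT})
  MERGE (a)-[:CASHOUT]->(c)
', {batchSize: 50000, iterateList: true, parallel: false})

With parallel off, batches run sequentially, so there is no lock contention between them on the shared nodes.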
08-19-2019 03:13 PM
Hi,
Good choice to go with apoc.
Can you try increasing the batchSize to maybe 10k?
Also, try using this statement before the CALL statement: USING PERIODIC COMMIT 10000
Here is the documentation for this clause: https://neo4j.com/docs/cypher-manual/3.5/query-tuning/using/#query-using-periodic-commit-hint and https://neo4j.com/docs/cypher-manual/3.5/clauses/load-csv/#load-csv-importing-large-amounts-of-data
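For reference, USING PERIODIC COMMIT is a hint written at the start of a standalone LOAD CSV query (rather than inside apoc.periodic.iterate, which does its own batching). A sketch using the same file and labels as the original question:

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///relcashoutTest.csv" AS row
MATCH (a:CUSTOMER {WALLETID: row.CUSTOMER})
MATCH (c:AGENT {WALLETID: row.AGENT})
MERGE (a)-[:CASHOUT]->(c)

This commits every 10,000 rows, keeping transaction state (and therefore heap usage) bounded during a large import.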
08-21-2019 04:39 AM
I tried increasing the batch size to 10000, but the result is the same.
08-19-2019 04:46 PM
Maybe avoid parallel when merging relationships; that can be a recipe for lock contention, as relationship creation requires locks on both the start and end nodes. If the same nodes appear multiple times in the CSV, there can be contention and deadlock issues between concurrently executing batches.
Also make sure you have indexes on :CUSTOMER(WALLETID) and :AGENT(WALLETID).
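In Neo4j 3.5 syntax, those indexes (or unique constraints, which also create a backing index and speed up the MATCH lookups) can be created like this, assuming WALLETID is unique per node:

CREATE CONSTRAINT ON (c:CUSTOMER) ASSERT c.WALLETID IS UNIQUE;
CREATE CONSTRAINT ON (a:AGENT) ASSERT a.WALLETID IS UNIQUE;
// or, if WALLETID values may repeat, plain indexes:
CREATE INDEX ON :CUSTOMER(WALLETID);
CREATE INDEX ON :AGENT(WALLETID);

Without an index, each MATCH on WALLETID is a full label scan, which is why per-row lookups over 50 million rows become unworkably slow.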
08-28-2019 09:43 PM
@Michael thanks for your reply. I have put a constraint on the id and the performance has increased significantly.
11-16-2022 04:45 AM
Try using CREATE instead of MERGE, i.e.,
CREATE (a)-[r:CASHOUT]->(c)
Since MERGE must first check whether the relationship already exists, it is less efficient than CREATE. (Note that CREATE will produce duplicate relationships if the same pair appears more than once in the CSV.)