04-22-2020 04:02 PM
Greetings,
I have the following LOAD CSV Cypher wrapped inside apoc.periodic.iterate.
When I increase the batch size I get locking errors, and the load runs at a very slow pace. What is a way to increase the loading speed without getting deadlock errors?
I also tried increasing the batch size with parallel set to false, but it is still very slow. I need to load three files like this, each with 1.5 million records. It ran for almost two hours and only about 500K records had been processed. I have a box with 24 cores and 96 GB of memory, and the CPU is barely used.
Is there a way to make this go faster? Thanks a bunch in advance.
CALL apoc.periodic.iterate('
LOAD CSV WITH HEADERS FROM
"file:///<path>/filename.csv" AS line return line ','
MATCH (c:Company {name: line.COM_CODE})
MATCH (frt:FreightReadingTariff {name: toInteger(line.READ_TARIFF)})
MERGE (fb:FreightBasis {name: line.FBASIS})
CREATE (r1:Renege {ID: line.SERID + "_" + line.XXXX + "_" + line.YYYY + "_" + line.ABC + "_" + line.MNO + "_" + line.NNN,
serid: toInteger(line.SERID), xxxx: toInteger(line.XXXX), yyyy: line.YYYY, abc: line.ABC, mno: line.MNO, nnn: toInteger(line.NNN),
a1a: line.a1a, a2b: line.a2b, a2c: line.a2c, a3c: line.a3c, a4c: line.a4c, mmm: toInteger(line.MMM),
n12: toInteger(line.N12), mn1: line.MN1, eff1: line.eff1, disc1: line.disc1, ftp: line.ftp, stp: line.stp,
dowtype: line.dow, pcat: line.pcat, dcat: line.dcat, unav: line.unav, txtbl: toInteger( line.txtbl),
ch1: toInteger( line.ch1)})
MERGE (c)-[:company_to_renege]->(r1)
MERGE (fb)-[:freightbasis_to_renege]-(r1)
MERGE (frt)-[:freightreadingtariff_to_renege]-(r1)
',{batchSize:1, iterateList:true, parallel:true})
04-22-2020 05:15 PM
What version of Neo4j? What version of APOC?
Do you have any indexes defined on the labels involved in the query? Can you return the output of this Cypher statement:
CALL db.indexes();
04-22-2020 09:49 PM
Version of Neo4j: 4.0.3
Version of APOC : 4.0.0.7
Yes, I have defined constraints on all the labels I am using.
|id|name|state|populationPercent|uniqueness|type|entityType|labelsOrTypes|properties|provider|
|---|---|---|---|---|---|---|---|---|---|
|7|"constraint_117e43f5"|"ONLINE"|100.0|"UNIQUE"|"BTREE"|"NODE"|["Company"]|["name"]|"native-btree-1.0"|
|30|"constraint_b2b1ced2"|"ONLINE"|100.0|"UNIQUE"|"BTREE"|"NODE"|["FreightReadingTariff"]|["name"]|"native-btree-1.0"|
|21|"constraint_d2391f5c"|"ONLINE"|100.0|"UNIQUE"|"BTREE"|"NODE"|["Renege"]|["ID"]|"native-btree-1.0"|
|9|"constraint_e95fcfb3"|"ONLINE"|100.0|"UNIQUE"|"BTREE"|"NODE"|["FreightBasis"]|["name"]|"native-btree-1.0"|
04-23-2020 01:35 PM
Was there a reason for wrapping the LOAD CSV inside apoc.periodic.iterate rather than using a simple LOAD CSV?
04-23-2020 02:01 PM
Thanks @dana.canzano. I started with a plain LOAD CSV, but it was way too slow as well, so I thought putting it inside apoc.periodic.iterate would increase the parallelism. Not much luck so far.
04-23-2020 04:56 PM
How have you configured your conf/neo4j.conf, in particular the parameters
dbms.memory.heap.initial_size
dbms.memory.heap.max_size
dbms.memory.pagecache.size
How much RAM is on the machine?
Have you prefaced the LOAD CSV statement with EXPLAIN so as to understand the query plan generated?
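For example, a minimal sketch (reusing the file path placeholder and labels from the query above) that returns the plan without executing anything:

// EXPLAIN produces the query plan without running the statement or touching the data
EXPLAIN
LOAD CSV WITH HEADERS FROM "file:///<path>/filename.csv" AS line
MATCH (c:Company {name: line.COM_CODE})
MERGE (fb:FreightBasis {name: line.FBASIS})
RETURN count(*)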
04-23-2020 06:51 PM
dbms.memory.heap.initial_size=28000m
dbms.memory.heap.max_size=28000m
dbms.memory.pagecache.size=45g
I think I ran a command that suggested these numbers, and I am using them. It definitely got me out of the OutOfMemory exceptions I was getting earlier.
Total memory on the box is 96 GB. I always see about 50 GB of free memory.
I have been running EXPLAIN on the other queries, but I don't understand exactly how to interpret the output and improve things. Is there any video out there that can explain this EXPLAIN? :-)
Thanks again for your help.
04-28-2020 03:57 AM
Hi @skmami,
Memory usually means RAM, not the hard drive.
Could you please tell us how much data you have to load?
Are you loading via the browser, or through Java/Python code?
Are you not seeing much improvement with apoc.periodic.iterate?
04-28-2020 08:49 AM
I have about 30 GB of data.
I am loading using LOAD CSV.
I am running into lock issues with periodic iterate.
Thanks
04-28-2020 10:15 AM
Please share the query you are trying, and also let us know how much free RAM you have.
04-23-2020 03:58 PM
I came across a tool called Kettle (Pentaho Data Integration), hoping it would let me load the data faster. It looks like this product is a hassle to even install. I am wondering if anyone uses it with the Neo4j plugins for real work, beyond just playing around?
04-23-2020 05:01 PM
In general it's not a good idea to execute loading queries that create relationships in parallel, since relationship creation requires locks on both nodes. Running such batches in parallel can lead to lock contention and deadlocks.
A higher batch size without parallel tends to work fairly well.
You should of course make sure you have the right indexes created to support your initial MATCH and MERGE operations, indexes on:
:Company(name)
, :FreightReadingTariff(name)
, and :FreightBasis(name)
Also run an EXPLAIN of the query and ensure there are no Eager operators. While these are often needed for correct execution, they require manifestation of all results at that point in the plan, effectively disabling periodic-commit behavior, which is another ingredient for locking issues and potential deadlocks.
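Applied to the query in this thread, that advice amounts to something like the sketch below. The batchSize of 10000 is purely illustrative, and the Renege node properties are trimmed for brevity; the full property list from the original query would go in the CREATE.

// Same body as the original query, but larger batches and no parallelism,
// so relationship locks on the shared Company/FreightBasis/Tariff nodes don't collide.
CALL apoc.periodic.iterate('
  LOAD CSV WITH HEADERS FROM "file:///<path>/filename.csv" AS line RETURN line
','
  MATCH (c:Company {name: line.COM_CODE})
  MATCH (frt:FreightReadingTariff {name: toInteger(line.READ_TARIFF)})
  MERGE (fb:FreightBasis {name: line.FBASIS})
  CREATE (r1:Renege {ID: line.SERID + "_" + line.XXXX + "_" + line.YYYY + "_" + line.ABC + "_" + line.MNO + "_" + line.NNN})
  MERGE (c)-[:company_to_renege]->(r1)
  MERGE (fb)-[:freightbasis_to_renege]->(r1)
  MERGE (frt)-[:freightreadingtariff_to_renege]->(r1)
', {batchSize:10000, parallel:false})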
04-23-2020 06:58 PM
Thank you @andrew.bowman.
I did create unique constraints on Company, FreightReadingTariff, and FreightBasis as shown below.
CREATE CONSTRAINT ON (f:FreightBasis) ASSERT f.name IS UNIQUE;
CREATE CONSTRAINT ON (f:Company) ASSERT f.name IS UNIQUE;
CREATE CONSTRAINT ON (f:FreightReadingTariff) ASSERT f.name IS UNIQUE;
Is this what you mean by creating an index?
I am thinking I should break this loading query into separate jobs, so I can create the nodes in batches in parallel, and then create the relationships in batches with no parallelism.
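Roughly something like this for the node pass (just a sketch; the batch size is an example, and the remaining Renege properties from the original query would be included), with the relationship pass then reusing the non-parallel form sketched above:

// Pass 1: create only the Renege nodes. No shared nodes or relationships are touched,
// so parallel batches don't compete for the same locks.
CALL apoc.periodic.iterate('
  LOAD CSV WITH HEADERS FROM "file:///<path>/filename.csv" AS line RETURN line
','
  CREATE (r1:Renege {ID: line.SERID + "_" + line.XXXX + "_" + line.YYYY + "_" + line.ABC + "_" + line.MNO + "_" + line.NNN,
                     serid: toInteger(line.SERID), xxxx: toInteger(line.XXXX)})
', {batchSize:10000, parallel:true})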
Thanks
04-28-2020 03:14 AM
Hello guys,
I am new to Neo4j, and I am using Neo4j Enterprise on a server. I am running a query to load a CSV and create nodes, but the query is taking about 40 hours to run. My data is about 1.7 million records.
Here is my query.
Is there any way it can run faster?
:auto USING PERIODIC COMMIT 2000
LOAD CSV WITH HEADERS FROM "file:///data.csv" AS line
MERGE (s:State {State: line.State})
MERGE (p:Product {ProductId: line.ProductId, Price: line.ProductPrice, ProductName: line.ProductName})
MERGE (c:Category {Category: line.Category})
MERGE (cr:CategoryRollUp {CategoryRollUp: line.CategoryRollUp})
MERGE (cu:Customer {CustomerId: line.CustomerId, Grouping: line.NewGrouping})
MERGE (cu)-[:InteractsWith {Date: line.TransactionDate, Month: line.Month, EventScore: line.EventScore, Event: line.Event, Quantity: line.QtySold}]->(p)
MERGE (s)-[:HasUser]->(cu)
MERGE (c)-[:HasProduct]->(p)
MERGE (cr)-[:HasCategory]->(c)
04-28-2020 08:47 AM
Hello @shubham,
Welcome to Neo4j. It is an awesome product, as you will find out very soon.
I ran into the same issue you are running into, with the data load taking a very long time.
There is a technique you have to use while loading the data to make it go faster.
I came across a video on YouTube that you might want to go through before you spend a lot of time trying to load the data.
It is a 50-minute video, but well worth the time.
There are some good tips on how to load faster.
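In line with the index advice earlier in this thread, one concrete thing to check for your query is whether there are unique constraints on the properties the MERGE clauses look up. A sketch using the same constraint syntax shown earlier, assuming those properties really are unique keys (your post doesn't say):

CREATE CONSTRAINT ON (s:State) ASSERT s.State IS UNIQUE;
CREATE CONSTRAINT ON (p:Product) ASSERT p.ProductId IS UNIQUE;
CREATE CONSTRAINT ON (c:Category) ASSERT c.Category IS UNIQUE;
CREATE CONSTRAINT ON (cr:CategoryRollUp) ASSERT cr.CategoryRollUp IS UNIQUE;
CREATE CONSTRAINT ON (cu:Customer) ASSERT cu.CustomerId IS UNIQUE;

It may also help to MERGE each node on its key property only and set the remaining properties with ON CREATE SET, so each MERGE lookup can be a plain index seek, but whether that fits depends on your data.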
Thanks
07-03-2022 01:26 AM
Hi skmami, can you share the YouTube link for the video? I am having similar issues. Thanks.
04-30-2020 01:54 PM
Good to hear. Glad to help.
12-23-2022 02:23 AM
Hi Shubham, could you please share the video link here?