09-01-2018 11:48 PM
We have implemented the following: we want to create a fresh graph every day, but we also want to keep the previous version. What we do is split the whole dataset into CSV files and run LOAD CSV on each one. After the individual CSVs are loaded, we delete the third version (the previous version's previous), so we keep only the current and the previous version. The problem is that we get many transaction lock errors, which is not scalable. What is the best way to:
1. Build a new graph quickly.
2. Maintain the previous graph as well.
3. Delete the oldest, so we only ever keep two versions at a time.
Thank you!
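(For reference, one way to handle point 3 without holding locks in one huge transaction is to delete the oldest version's relationships in small batches. A minimal sketch, assuming the relationships carry a version property as in the load statements later in the thread; apoc.periodic.iterate and the $oldVersion parameter name are assumptions, not part of the original setup.)
// Sketch: drop the oldest version's relationships in batches to limit lock contention.
// Assumes each relationship stores its version in a `version` property;
// $oldVersion is a hypothetical parameter supplied by the client.
CALL apoc.periodic.iterate('
MATCH ()-[r:AFFINITY {version: $oldVersion}]->() RETURN r
','
DELETE r
', {batchSize:10000, params:{oldVersion:$oldVersion}})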
09-02-2018 02:39 AM
You have some options here.
Can you describe your graph structure, the CSVs, and the load scripts in more detail? Also, how are they executed?
09-02-2018 04:38 AM
I am creating around 30k-40k jobs, each running LOAD CSV on a different CSV file. I am retrying on failure, but it looks as if this causes a lot of rejections and bottlenecks.
Are labels more efficient than putting a property on the relationship?
09-02-2018 06:52 AM
Yes labels are definitely more efficient.
How large are your CSV files, individually and in total?
Perhaps it's easier to just run these from a queue. If you import all of them concurrently, you'd also run into deadlock issues even without removing the old graph data.
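To make the type idea concrete (a sketch, not code from the thread): relationships have types rather than labels, so one option is to encode the version in the relationship type itself, e.g. a hypothetical AFFINITY_20180902. A whole day's graph can then be matched or deleted by type alone, without filtering on a property:
// Sketch with hypothetical, per-day relationship type names.
// Note: relationship types cannot be parameterized in plain Cypher, so the type
// has to be baked into the query string client-side (or created via apoc.create.relationship).
MATCH (a:AffinityChannel {program_id: 1}), (b:AffinityChannel {program_id: 2})
CREATE (a)-[:AFFINITY_20180902 {network: 'abc', weight: 10}]->(b);

// Dropping the oldest version then only touches that one type:
MATCH ()-[r:AFFINITY_20180831]->()
DELETE r;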
09-02-2018 06:58 AM
Are your graphs from yesterday and today interconnected or disjoint? For example, is a node from two days ago likely to be in today's graph as well?
Also, it would help to know when you run into the locking problems; I guess it's when you're deleting the old data? Some sample Cypher here would help.
09-02-2018 10:59 AM
So what we are doing is this:
The nodes in the graph are constant and never deleted.
Each day we recalculate the relationships from scratch. The output is split into separate CSVs by node id.
Those CSVs are fed into LOAD CSV; each load does:
USING PERIODIC COMMIT
LOAD CSV FROM "#{s3_object}" AS csv WITH csv
MATCH (a:AffinityChannel {program_id: toInteger(csv[3]) })
MATCH (b:AffinityChannel {program_id: toInteger(csv[0]) })
MERGE (a)-[r:AFFINITY { network: csv[2], version: toInteger(csv[4]) }]->(b)
ON MATCH SET r.weight = toInteger(csv[1])
ON CREATE SET r.weight = toInteger(csv[1]);
All of these CSV loads are held in a queue processed by 10 parallel workers. We get deadlocks every time, sporadically.
09-02-2018 03:03 PM
If the data is new every day anyway, you can also use CREATE instead of MERGE to be faster. Or are there duplicate affinities across the files (on the network/version keys)?
Another option is to handle the retries yourself, without needing to reload the whole CSV for every issue:
Load the CSV in your client code and send batches (e.g. 10k rows) to a statement like this:
UNWIND $rows as csv
MATCH (a:AffinityChannel {program_id: toInteger(csv[3]) })
MATCH (b:AffinityChannel {program_id: toInteger(csv[0]) })
CREATE (a)-[r:AFFINITY { network: csv[2], version: toInteger(csv[4]) }]->(b)
SET r.weight = toInteger(csv[1])
If you run into a deadlock, just split the 10k batch into smaller ones (e.g. 1k) and retry until the transaction succeeds.
This should also happen automatically if you use the transaction functions (retryable transaction callbacks) of your driver.
apoc.periodic.iterate also has a retry option, but I still have to improve its behavior (i.e. splitting batches and adding exponential backoff):
call apoc.periodic.iterate('
LOAD CSV FROM $url AS csv RETURN csv
','
MATCH (a:AffinityChannel {program_id: toInteger(csv[3]) })
MATCH (b:AffinityChannel {program_id: toInteger(csv[0]) })
CREATE (a)-[r:AFFINITY { network: csv[2], version: toInteger(csv[4]) }]->(b)
SET r.weight = toInteger(csv[1])
', {batchSize:10000, iterateList:true, retries:3, params:{url:$s3_object}})
09-02-2018 09:16 PM
Thanks Michael.
Yes, the MERGE is because there could be another rel between the two, in the opposite direction.
Maybe I am tackling this the wrong way, because initially I loaded one single CSV file that was a few GB in size, and it blew up the machine's memory.
Could I create two DBs on the same machine, so that each version goes into a fresh one?
Otherwise I don't see this scaling.
09-03-2018 02:58 AM
Yes, you can run two servers on the same machine.
Even a multi-GB file shouldn't blow up memory. Do you still have the statement you used for that one?
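As a rough sketch (the ports and paths below are example values, not from the thread), a second Neo4j 3.x instance on the same machine mainly needs its own ports and data directory in its neo4j.conf:
# Example neo4j.conf overrides for a second instance (all values are assumptions)
dbms.connector.bolt.listen_address=:7688
dbms.connector.http.listen_address=:7475
dbms.connector.https.listen_address=:7476
dbms.directories.data=/var/lib/neo4j-instance2/data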
09-03-2018 07:55 PM
The statement is the same; the data is just split into many files instead of one big file.
09-04-2018 12:27 AM
Then it should not blow up. Can you share the query plan for the statement?
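For reference, a sketch of how to inspect that; it assumes an index on :AffinityChannel(program_id), which the two MATCH clauses rely on (syntax is for Neo4j 3.x):
// Index the MATCH clauses rely on (Neo4j 3.x syntax); assumed, not confirmed in the thread.
CREATE INDEX ON :AffinityChannel(program_id);

// Prefixing the statement with EXPLAIN shows the plan without running it:
EXPLAIN
LOAD CSV FROM "#{s3_object}" AS csv
MATCH (a:AffinityChannel {program_id: toInteger(csv[3]) })
MATCH (b:AffinityChannel {program_id: toInteger(csv[0]) })
MERGE (a)-[r:AFFINITY { network: csv[2], version: toInteger(csv[4]) }]->(b)
ON MATCH SET r.weight = toInteger(csv[1])
ON CREATE SET r.weight = toInteger(csv[1]);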