05-04-2021 08:23 AM
I'm new to Neo4j and working on a data set that contains 2.3M rows. I have the following question.
Query used:
LOAD CSV WITH HEADERS FROM 'file.csv' AS row
WITH toInteger(row.parentid) AS parentid, toInteger(row.childid) AS childid, row.name AS name
MERGE (c:childnode {id: childid, cname: name})
MERGE (p:parentnode {parentid: parentid, pname: name})
MERGE (p)-[:PARENT]->(c)
RETURN p, c LIMIT 100;
Problem: it runs endlessly without the LIMIT. Is there any way to make this query run in less time? How should I approach this?
Thank you for your help!!
05-04-2021 08:46 AM
Hello @Ankita_Rahavachari and welcome to the Neo4j community
CREATE CONSTRAINT child_id IF NOT EXISTS ON (n:childnode) ASSERT n.id IS UNIQUE
CREATE CONSTRAINT parent_id IF NOT EXISTS ON (n:parentnode) ASSERT n.parentid IS UNIQUE
You can use this query after creating the constraints:
USING PERIODIC COMMIT 1000 LOAD CSV WITH HEADERS FROM 'file.csv' AS row
WITH toInteger(row.parentid) AS pid, toInteger(row.childid) AS cid, row.name AS name
MERGE (p:parentnode {parentid: pid, pname: name})-[:PARENT]->(c:childnode {id: cid, cname: name})
Regards,
Cobra
05-05-2021 12:20 PM
Thank you for helping out! Your answer worked. However, I ran into one problem when I tried to load the file using periodic commit.
Error message:
Executing queries that use periodic commit in an open transaction is not possible.
Then I tried using the following, it worked.
:auto USING PERIODIC COMMIT 1000 LOAD CSV WITH HEADERS FROM 'file.csv' AS row
Best,
Ankita
05-05-2021 12:35 PM
No problem, I'm glad it worked.
How long did it take to load the 2.3 million rows?
05-06-2021 11:34 AM
Since I was concerned with analyzing the relationships, I didn't notice the time it took to run.
It was working fine for some time and I replied to you,
after that, the query terminated because one of the rows failed the constraint qualification.
I don't think I framed the second part of my question well initially, so I'm trying once again.
The data I'm dealing with looks similar to the data below
id | parent_id | name
1  | 1         | aaa
2  | 500       | bbb
3  | 1         | aaa
4  | 1000      | ccc
I'm trying to create a relationship between the parent_id and id in which case there can be only one instance of every id.
This will result in a tree structure (node 500 is the parent of id 2, and likewise).
I'm using the query below, also I tried it with LIMIT to check if it matched the requirement. It produced the correct result.
:auto using periodic commit 5000 load csv with headers from 'file.csv' as row
with toInteger(row.parent_id) AS parentid, toInteger(row.id) AS id, row.name AS name
merge (c:childnode {id: id})
on match set c.name = name
on create set c.name = name
merge (parent:parentnode {id: parentid})
merge (c)-[:PARENT]->(parent)
return (c)-[:PARENT]->(parent)
I came to know this query(posted 2yrs ago) when I searched for a similar scenario in this community.
So roughly 5,000 nodes get committed every 5 minutes. For 2.3 million rows, that would take more than 20 hours, if I'm correct.
Is this normal with Neo4j?
Or Am I committing some silly mistakes?
Best,
Ankita R
05-07-2021 07:23 AM
5 minutes to add 5000 nodes doesn't seem right.
Please run an EXPLAIN of the query (excluding :auto
) and add the query plan here.
Also, don't RETURN anything from a load query like this, you can always query for the loaded data later.
05-06-2021 11:51 AM
You put UNIQUE CONSTRAINTS on childnode and parentnode, and the query looks optimized, so everything looks good. One thing you can still do is increase the RAM allocated to your database.
You can use this tool to get the values of parameters based on the RAM of your machine.
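For illustration, the memory settings live in neo4j.conf and look something like the lines below. The values here are placeholders only; the sizing tool will suggest numbers based on your machine's actual RAM.

```
dbms.memory.heap.initial_size=4G
dbms.memory.heap.max_size=4G
dbms.memory.pagecache.size=2G
```

Setting initial and max heap to the same value avoids heap resizing pauses, and the page cache is what keeps your graph data in memory during the load.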
Regards,
Cobra
05-06-2021 11:56 AM
Ahh, that makes sense.
I will also check the link you have provided.
I appreciate your responses and timely help.
Thanks a lot.
Best,
Ankita R
05-07-2021 10:24 AM
Here is the Query Plan and the warning message:
Do let me know if you require another format of query plan instead of png.
Best,
Ankita R
05-07-2021 10:42 AM
Thanks, the query plan shows three things that are contributing to the slowness.
First, we see that label scans are being used for both parent and c, so you're missing indexes, making these lookups extremely expensive (it's scanning all :childnode and all :parentnode nodes for every row, not just once).
A follow-up question here: what is the difference between a :childnode type and a :parentnode type? If these are supposed to be the same type and the only difference is the relationship they have to one another (especially if a child node can be the parent node of another), then this is the wrong way to model your data. If they are of the same type, you should use the same label for both, and let the relationships you create between them express how they relate to each other, rather than the label. If you do change the labels for your data, make sure you have an index or a unique constraint created so your MATCH and MERGE operations on those nodes are fast.
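If you do collapse everything to a single label, the unique constraint would look something like this (a sketch using the childnode label and id property from earlier in the thread; adjust the names to your model):

```cypher
CREATE CONSTRAINT childnode_id IF NOT EXISTS
ON (n:childnode) ASSERT n.id IS UNIQUE
```

The constraint gives you the index backing fast MERGE lookups, and also guarantees there can be only one instance of every id, as required above.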
The other thing going on is that there is an Eager operator in the plan, and that's sabotaging the periodic commit behavior. We need to get rid of the Eager here, but first deal with the above issues.
05-07-2021 11:42 AM
I get the problem now. Yes, you are correct, they are supposed to be the same type. A child node can also be a parent of another. I will use the label childnode for both parent and c.
:auto using periodic commit 5000 load csv with headers from 'file:///node.csv' as row
with toInteger(row.parentid) AS parentid, toInteger(row.id) AS id, row.rank AS rank
merge (c:childnode {id: id})
on match set c.name = rank
on create set c.name = rank
merge (parent:childnode {id: parentid})
merge (c)-[:PARENT]->(parent)
This query now takes approximately 1 minute to add 5,000 nodes, which is significantly faster than the previous query. I hope this query matches your suggestion. Is this rate OK for this kind of load?
05-10-2021 12:49 PM
No, 1 minute for 5000 nodes is far too slow. This should be subsecond.
You didn't mention if you added indexes or not. If you didn't, please do, and then verify with an EXPLAIN of the query.
We also know there is an Eager in the query plan, and that will sabotage the periodic committing. The reason is the two MERGE operations.
We would recommend doing three passes through your CSV: one to only MERGE c, another to only MERGE parent, and a last pass to MATCH (instead of MERGE) both nodes and MERGE the relationship between them. That will remove the Eager.
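As a sketch, the passes could look like this, assuming the single childnode label and the file:///node.csv path used earlier in the thread (adjust column names to your CSV):

```cypher
// Pass 1: merge all child nodes (one MERGE per query, no Eager)
:auto USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM 'file:///node.csv' AS row
MERGE (c:childnode {id: toInteger(row.id)})
ON CREATE SET c.name = row.rank;

// Pass 2: merge all parent nodes
:auto USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM 'file:///node.csv' AS row
MERGE (p:childnode {id: toInteger(row.parentid)});

// Pass 3: MATCH both existing nodes, MERGE only the relationship
:auto USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM 'file:///node.csv' AS row
MATCH (c:childnode {id: toInteger(row.id)})
MATCH (p:childnode {id: toInteger(row.parentid)})
MERGE (c)-[:PARENT]->(p);
```

Because each query contains at most one MERGE on nodes, the planner no longer needs the Eager operator, and periodic commit can actually flush in batches.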
05-12-2021 12:18 PM
I've created an index on c.id
and the query was able to load 2,317,840 nodes within 2-3 minutes. Then I also followed the three passes and eliminated the Eager operator. Finally, all 2,317,840 relationships were created in 164,007 ms.
Thank you for your help