05-04-2021 08:23 AM
I'm new to Neo4j and working on a data set that contains 2.3M rows. I have the following question.
Query used:
LOAD CSV WITH HEADERS FROM 'file.csv' AS row
WITH toInteger(row.parentid) AS parentid, toInteger(row.childid) AS childid, row.name AS name
MERGE (c:childnode {id: childid, cname: name})
MERGE (p:parentnode {parentid: parentid, pname: name})
MERGE (p)-[:PARENT]->(c)
RETURN p, c LIMIT 100;
Problem: it runs endlessly without the LIMIT. Is there any way to make this query run in less time? How should I approach this?
Thank you for your help!!
05-04-2021 08:46 AM
Hello @Ankita_Rahavachari and welcome to the Neo4j community
CREATE CONSTRAINT child_id IF NOT EXISTS ON (n:childnode) ASSERT n.id IS UNIQUE
CREATE CONSTRAINT parent_id IF NOT EXISTS ON (n:parentnode) ASSERT n.parentid IS UNIQUE
You can use this query after creating the constraints:
USING PERIODIC COMMIT 1000 LOAD CSV WITH HEADERS FROM 'file.csv' AS row
WITH toInteger(row.parentid) AS pid, toInteger(row.childid) AS cid, row.name AS name
MERGE (p:parentnode {parentid: pid, pname: name})-[:PARENT]->(c:childnode {id: cid, cname: name})
Regards,
Cobra
05-05-2021 12:20 PM
Thank you for helping out! Your answer worked. However, I ran into one problem when I tried to load the file using periodic commit.
Error message:
Executing queries that use periodic commit in an open transaction is not possible.
Then I tried using the following, it worked.
:auto USING PERIODIC COMMIT 1000 LOAD CSV WITH HEADERS FROM 'file.csv' AS row
Best,
Ankita
05-05-2021 12:35 PM
No problem, I'm glad it worked.
How long did it take to load the 2.3 million rows?
05-06-2021 11:34 AM
Since I was concerned with analyzing the relationships, I didn't notice the time it took to run.
It was working fine for some time and I replied to you,
after that, the query terminated because one of the rows failed the constraint qualification.
I don't think I framed the second part of my question well initially, so I'm trying once again.
The data I'm dealing with looks similar to the data below
id | parent_id | name
1  | 1         | aaa
2  | 500       | bbb
3  | 1         | aaa
4  | 1000      | ccc
I'm trying to create a relationship between the parent_id and id in which case there can be only one instance of every id.
This will result in a tree structure (node 500 is the parent of id 2, and likewise).
I'm using the query below, also I tried it with LIMIT to check if it matched the requirement. It produced the correct result.
:auto using periodic commit 5000 load csv with headers from 'file.csv' as row
with toInteger(row.parent_id) AS parentid, toInteger(row.id) AS id, row.name AS name
merge (c:childnode {id: id})
on match set c.name = name
on create set c.name = name
merge (parent:parentnode {id: parentid})
merge (c)-[:PARENT]->(parent)
return (c)-[:PARENT]->(parent)
I came to know this query(posted 2yrs ago) when I searched for a similar scenario in this community.
So roughly 5,000 nodes get committed every 5 minutes. For 2.3 million rows, that would take more than 20 hours, if I'm correct.
Is this normal with Neo4j?
Or Am I committing some silly mistakes?
Best,
Ankita R
05-07-2021 07:23 AM
5 minutes to add 5000 nodes doesn't seem right.
Please run an EXPLAIN of the query (excluding :auto
) and add the query plan here.
Also, don't RETURN anything from a load query like this, you can always query for the loaded data later.
05-06-2021 11:51 AM
You put UNIQUE CONSTRAINTS on childnode and parentnode, and the query looks optimized, so everything looks good. One thing you can still do is increase the RAM allocated to your database.
You can use this tool to get the values of parameters based on the RAM of your machine.
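For illustration, the memory settings live in neo4j.conf and look something like the lines below. The values here are placeholders only; the sizing tool will suggest numbers based on your machine's actual RAM.

```
dbms.memory.heap.initial_size=4G
dbms.memory.heap.max_size=4G
dbms.memory.pagecache.size=2G
```

Setting initial and max heap to the same value avoids heap resizing pauses, and the page cache is what keeps your graph data in memory during the load.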
Regards,
Cobra
05-06-2021 11:56 AM
Ahh, that makes sense.
I will also check the link you have provided.
I appreciate your responses and timely help.
Thanks a lot.
Best,
Ankita R
05-07-2021 10:24 AM
Here is the Query Plan and the warning message:
Do let me know if you require another format of query plan instead of png.
Best,
Ankita R
05-07-2021 10:42 AM
Thanks, the query plan shows three things that are contributing to the slowness.
First, we see that label scans are being used for both parent and c, so you're missing indexes, making these lookups extremely expensive (it's scanning all :childnode and all :parentnode nodes for every row, not just once).
A follow-up question here: what is the difference between a :childnode type and a :parentnode type? If these are supposed to be the same type and the only difference is the relationship they have to one another (especially if a child node can be the parent node of another), then this is the wrong way to model your data. If they are of the same type, you should use the same label for both, and let the relationships you create between them express how they relate to each other, rather than the label. If you do change the labels for your data, make sure you have an index or a unique constraint created so your MATCH and MERGE operations on those nodes are fast.
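If you do collapse everything to a single label, the unique constraint would look something like this (a sketch using the childnode label and id property from earlier in the thread; adjust the names to your model):

```cypher
CREATE CONSTRAINT childnode_id IF NOT EXISTS
ON (n:childnode) ASSERT n.id IS UNIQUE
```

The constraint gives you the index backing fast MERGE lookups, and also guarantees there can be only one instance of every id, as required above.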
The other thing going on is that there is an Eager operator in the plan, and that's sabotaging the periodic commit behavior. We need to get rid of the Eager here, but first deal with the above issues.
05-07-2021 11:42 AM
I get the problem now. Yes, you are correct, they are supposed to be the same type. A child node can also be a parent of another. I will use the label childnode for both parent and c.
:auto using periodic commit 5000 load csv with headers from 'file:///node.csv' as row
with toInteger(row.parentid) AS parentid, toInteger(row.id) AS id, row.rank AS rank
merge (c:childnode {id: id})
on match set c.name = rank
on create set c.name = rank
merge (parent:childnode {id: parentid})
merge (c)-[:PARENT]->(parent)
This query now takes approximately 1 minute to add 5,000 nodes, which is significantly faster than the previous query. I hope this query matches your suggestion. Is this rate OK for this kind of load?
05-10-2021 12:49 PM
No, 1 minute for 5000 nodes is far too slow. This should be subsecond.
You didn't mention if you added indexes or not. If you didn't, please do, and then verify with an EXPLAIN of the query.
We also know there is an Eager in the query plan, and that will sabotage the periodic committing. The reason is the two MERGE operations.
We would recommend doing three passes through your CSV: one to only MERGE c, another to only MERGE parent, and a last pass to MATCH (instead of MERGE) both nodes and MERGE the relationship between them. That will remove the Eager.
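As a sketch, the passes could look like this, assuming the single childnode label and the file:///node.csv path used earlier in the thread (adjust column names to your CSV):

```cypher
// Pass 1: merge all child nodes (one MERGE per query, no Eager)
:auto USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM 'file:///node.csv' AS row
MERGE (c:childnode {id: toInteger(row.id)})
ON CREATE SET c.name = row.rank;

// Pass 2: merge all parent nodes
:auto USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM 'file:///node.csv' AS row
MERGE (p:childnode {id: toInteger(row.parentid)});

// Pass 3: MATCH both existing nodes, MERGE only the relationship
:auto USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM 'file:///node.csv' AS row
MATCH (c:childnode {id: toInteger(row.id)})
MATCH (p:childnode {id: toInteger(row.parentid)})
MERGE (c)-[:PARENT]->(p);
```

Because each query contains at most one MERGE on nodes, the planner no longer needs the Eager operator, and periodic commit can actually flush in batches.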
05-12-2021 12:18 PM
I've created an index on c.id
and the query was able to load 2,317,840 nodes within 2-3 minutes. Then I also followed the three passes and eliminated the Eager operator. Finally, all 2,317,840 relationships were created in 164,007 ms.
Thank you for your help