Neo4j

nikolai_mynkow · ‎03-11-2020

I have 2kk users with traditional sponsor relation. In other words a user can have one sponsor but a user could sponsor many other users. All the users are imported periodically and then the relations are created. Imagine it that every 2 hours you have to create the relations for all users. Till now the query was:

MATCH (u:User), (sponsor:User)
WHERE u.sponsorId = sponsor.userId
MERGE (sponsor)-[:SPONSORED]->(u)

This is very inefficient query in my mind because it creates a Cartesian product which is unnecessary because we know that a user can have only one sponsor. So I am trying to do something more efficient but I need help from someone to point me what I am doing wrong. Here is my latest attempt:

MATCH users =(begin:User)-[*]->(END )
FOREACH (n IN nodes(users)| MERGE (sp:User{userId:n.sponsorId})-[:SPONSORED]->(n) )

andrew_bowman · ‎03-12-2020

Have you done an EXPLAIN of the query?

The expensive part here is that if you're running this every 2 hours, it's looking at all nodes of one label or the other, matching to nodes that have likely already been processed and MERGing on existing relationships. The cost will just keep increasing.

Have you considered on import setting some temporary label, such as :Unprocessed, letting you match only to unprocessed nodes, match out to a sponsor, create the relationship, then remove the :Unprocessed label?

nikolai_mynkow · ‎03-12-2020

This is the explain for the currently working query. It is very slow and it takes 15-20 minutes to execute.

I am doing this in a docker container and the entire tree is created from scratch every 2 hours. The reason for this is because I have no idea what is the delta/what was changed during the 2 hours period.

What I really want to do is to iterate the all nodes only once. All nodes are created upfront and we know for sure that they are there. Should be something like:

foreach node in allNodes
{
     create relation (node.userId SPONSORED_BY node.sponsorId)
}

Where the node has two properties/fields:

userId
sponsorId

andrew_bowman · ‎03-12-2020

Can you give an EXPLAIN plan for this query, expanding all elements of the plan (double-down arrow in the lower right of the result pane)?

nikolai_mynkow · ‎03-13-2020

Sure, here is what I got:

andrew_bowman · ‎03-13-2020

You can see from the plan in the original query that it's doing exactly what you asked for:

MATCH (u:User), (sponsor:User) WHERE u.sponsorId = sponsor.userId MERGE (sponsor)-[:SPONSORED]->(u)

It does a label scan across all :User nodes, and for each one it does an index lookup of the user that is the sponsor.

You can use CREATE instead of MERGE for this, if no :SPONSORED relationships exist when this is run.

That said, the bottleneck is likely the sheer size of the transaction, since you're attempting to create all these relationships in a single transaction. That means all that state data must be manifested in the heap at once so it can be committed atomically.

You may be better off batching this so heap usage remains low.

With apoc.periodic.iterate() from APOC procedures, a batched query may look like:

CALL apoc.periodic.iterate("MATCH (u:User), (sponsor:User) 
WHERE u.sponsorId = sponsor.userId 
RETURN u, sponsor",
"MERGE (sponsor)-[:SPONSORED]->(u)", {}) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages

nikolai_mynkow · ‎03-13-2020

GODLIKE

thank you @andrew.bowman

PS: I had to install the apoc => https://stackoverflow.com/a/56227465

Neo4j

How to create simple sponsor relation efficiently?