cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Query Optimisation Help - Aggregate count of relationships between two related nodes

PeteM
Node Clone

Hi, I have a bit of a challenge around optimising a query to run more efficiently, as currently it takes too long.

Just some background: The database size is 90m nodes, 700m relationships - although this query is being tested on a much smaller sample database size of 1m nodes, 20m relationships and the issue is the same.

What I am trying to achieve is create an aggregate count of a specific relationship between two already related nodes. <- This is actually working, however the issue is it is taking an extremely long time even on the smaller database sample.

So for example, a Person has opened a Campaign 5 times - so there is 5 :OPENED relationships between Person and Campaign. The objective is to create a summary relationship that includes this total count as a property on the new relationship being added TOTAL_OPENED.

// Starting with this
(p:Person)-[:OPENED]->(c:Campaign) (x5 times on the same Person and Campaign node pairs)

// Ending with this
(p:Person)-[newRel:TOTAL_OPENED]->(c:Campaign) SET newRel.total = 5

Here is the shape of the model I am starting with: (note that this is just one sample node pair being returned - the query actually creates a new relationship for 1m node pairs)

3X_7_e_7ed836cb98c6f1b11b1434a243ccdf91158107d5.png

Here is how it ends up: (note the new relationship TOTAL_OPENED created)

3X_4_9_492ceddffa24ec392ecc80df1008a7953f39bcad.png

Here is the query being run:

PROFILE MATCH (p:Person)-[rel:OPENED]->(c:Campaign)
WHERE NOT (p)-[:TOTAL_OPENED]->(c)
WITH p, rel, c LIMIT 2000
WITH COUNT(c) AS total, p, c
CREATE (p)-[totalRel:TOTAL_OPENED]->(c)
SET totalRel.total = total
RETURN p.emailId, total, c.campaignId

This query is being run with apoc.periodic.commit in batches of 2,000.

I have tried it also with putting the COUNT(c) AS total BEFORE the LIMIT, however it seems to increase the number of DB hits, not reduce it.

Here is a profile summary of the query:

Using the apoc.periodic.commit in 2k batches it took 1 hour 6 mins to actually do this on the 1m sample database.

Appreciate any help and all suggestions, thanks.

1 ACCEPTED SOLUTION

apoc.periodic.commit() isn't the best way to handle this. After each batch, the query is executed again from the very beginning, so you will be MATCHing back to nodes you've already processed. Since you have WHERE NOT (p)-[:TOTAL_OPENED]->(c) it will eventually filter out the ones you've already done, but still the MATCHing of your original pattern and filtering will end up being performed on previously processed nodes over and over and over again. It is a lot of unnecessary work.

You should consider using apoc.periodic.iterate() instead.

Also if you're on Neo4j 4.1.x or higher, consider using subqueries to help constrain the aggregations.

One approach you could use:

CALL apoc.periodic.iterate("
MATCH (p:Person)
WHERE size((p)-[:OPENED]->()) > 1
RETURN id(p) as personId",
"MATCH (p)
WHERE id(p) = personId
CALL {
 WITH p
 MATCH (p)-[:OPENED]->(c:Campaign)
 WITH p, c, count(c) as total
 CREATE (p)-[totalRel:TOTAL_OPENED]->(c) // if this rel type already exists in the graph, use MERGE instead
 SET totalRel.total = total
 RETURN total
}
RETURN count(*) ", {batchSize:10000}) YIELD batches, total, errorMessages
RETURN batches, total, errorMessage

You may need to adjust your batch size, depending on the average number of distinct campaigns a single person may have opened. For example, if on average a single person has opened 10 campaigns, then a batchSize between 1000 and 10000 would be about right, since at a rare maximum (all 10000 persons in a batch opened 10 distinct campaigns each) it would result in 100000 new :TOTAL_OPENED relationships, which is around the max number of changes you want per batch.

View solution in original post

4 REPLIES 4

Instead for creating a new relation to accumulate the number of visit, increase a count property in your relation will make the whole thing drastically easier.

Adding a constraint to your Person node will help to improve the update query performance. If you have a lot of Person the summary will still take a while as you will anchor on each person and read a property value for each relation but will still be a lot faster.

PeteM
Node Clone

Thanks Gabriel - I am actually doing this already for future updates. Hence the newRel.total = 5 part, will be updated specifically using constraint against person.personId in a lookup.

However, the challenge is the data fed in already has these original relationships created prior to needing to introduce this single new relationship.

Am I just going to have to bite the bullet and run this query until it completes? I guess so. In that case, do you know of the best practice to setting the apoc.periodic.commit batch number - since I am currently using 2,000 batches.

apoc.periodic.commit() isn't the best way to handle this. After each batch, the query is executed again from the very beginning, so you will be MATCHing back to nodes you've already processed. Since you have WHERE NOT (p)-[:TOTAL_OPENED]->(c) it will eventually filter out the ones you've already done, but still the MATCHing of your original pattern and filtering will end up being performed on previously processed nodes over and over and over again. It is a lot of unnecessary work.

You should consider using apoc.periodic.iterate() instead.

Also if you're on Neo4j 4.1.x or higher, consider using subqueries to help constrain the aggregations.

One approach you could use:

CALL apoc.periodic.iterate("
MATCH (p:Person)
WHERE size((p)-[:OPENED]->()) > 1
RETURN id(p) as personId",
"MATCH (p)
WHERE id(p) = personId
CALL {
 WITH p
 MATCH (p)-[:OPENED]->(c:Campaign)
 WITH p, c, count(c) as total
 CREATE (p)-[totalRel:TOTAL_OPENED]->(c) // if this rel type already exists in the graph, use MERGE instead
 SET totalRel.total = total
 RETURN total
}
RETURN count(*) ", {batchSize:10000}) YIELD batches, total, errorMessages
RETURN batches, total, errorMessage

You may need to adjust your batch size, depending on the average number of distinct campaigns a single person may have opened. For example, if on average a single person has opened 10 campaigns, then a batchSize between 1000 and 10000 would be about right, since at a rare maximum (all 10000 persons in a batch opened 10 distinct campaigns each) it would result in 100000 new :TOTAL_OPENED relationships, which is around the max number of changes you want per batch.

PeteM
Node Clone

Perfect, thanks Andrew..

Interestingly I actually discovered apoc.periodic.iterate and it did perform better for the :TOTAL_CLICKED relationships because there were not as many to be loaded in the first part of the iterate method. I did need the subquery capability to do the TOTAL_OPENED because there is nearly 200m of them.

I'm going to go ahead and upgrade so I can include the subquery and I'll try with your query above.

Thanks