cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Aggregation of node triplets into tuples in parallel

Consider the model (a1:Account)-->(t:Transfer {amount})-->(a2:Account)

where, say, :Account represents a bank account and :Transfer represents a bank transfer containing the numeric property "amount".

What is the best option to create additional relationships representing the total amount flown between all accounts pairs?
And how to make it scale to a graph containing hundreds of millions of :Transfers?

Naively one could do:

MATCH (a1:Account)-->(t:Transfer {amount})-->(a2:Account)
WITH a1, a2, sum(t.amount) as total
CREATE (a1)-[f:total_flow]->(a2)
SET f.amount = total

But the above does not use any parallelism and on a very large graph it will take ages and use up lots of heap space.

I have been thinking of using apoc.periodic.iterate() in the following way:
CALL apoc.periodic.iterate(
"MATCH RETURN a",
"MATCH (a)-->(t:Transfer)-->(:a2)
WITH a, a2, sum(t.amount) as total
CREATE (a)-[f:total_flow]->(a2)
SET f.amount = total
)
However because of the locking acquired by the db on the node a2 some CREATE operations (concurrently trying to add relationships to a2) will fail, right?
I guess the failure rate could be very high if the node a2 has a lot of incoming flows from many other accounts (i.e., there will be many queries that try to add relationships to the node a2 in parallel).
Setting an arbitrarily high retry value in apoc.periodic.iterate() does not seem a clean option.

Has anyone tackle this kind of problem or has any better strategy to do this large scale aggregation?

Is cypher not the best way to go for this kind of operations? Would it be any better to create a user defined procedure that accumulates the total flows on a concurrent java data structure while scanning the graph and then writes all the new edges at the end of the accumulation?

2 REPLIES 2

No it will not fail as by default it's not running in parallel.

Interesting approach on the statement, I like the idea of having the accounts being the driving point for the aggregation as you otherwise would need to do global aggregation queries which are more expensive.

And add the relationship-types if possible!

CALL apoc.periodic.iterate(
"cypher runtime=slotted MATCH (a:Account) RETURN a",
"MATCH (a)-[:TRANSFER]->(t:Transfer)-[:TRANSFER]->(a2)
WITH a, a2, sum(t.amount) as total
CREATE (a)-[f:total_flow]->(a2)
SET f.amount = total
", {batchSize:10000, parallel:false})

You could further group by source or target nodes or run a clustering algo if you wanted to identify independent clusters.

What is the reason for creating those relationships? Just curious because for graph algorithms you don't necessarily need them.

Thanks @michael.hunger

My goal is indeed to use parallelism if possible.
But I think we agree that if I change the default params to
CALL apoc.periodic.iterate("...", "...", {parallel: true})
I will incur in write locks on a2, right?

any other alternative idea?
I'm not interested in graoph algo here, but maybe there is something in the algo codebase that can be adapted to my scope? (maybe the code behind cypher projections can come in hand? https://neo4j.com/docs/graph-algorithms/current/projected-graph-model/cypher-projection/#cypher-proj...). I'm happy to write a procedure if needed.

The motivation behind aggregation is purely performance-related.