
Heads up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Improve performance of apoc.refactor.mergeNodes

Peter_Lian
Node Clone

The following is the Cypher that I used to merge duplicate nodes:

###################

MATCH (n:User)
WITH n.user AS repeatuser, collect(n) AS nodes
WHERE size(nodes) > 1
CALL apoc.refactor.mergeNodes(nodes)
YIELD node
RETURN node

######################

Question: How can I run the above query faster? I tried the following:

 

CALL apoc.periodic.iterate(
  'MATCH (n:Process) RETURN n.pid AS repeatpid',
  'MATCH (n:Process {pid: repeatpid}) WITH repeatpid, collect(n) AS nodes WHERE size(nodes) > 1 CALL apoc.refactor.mergeNodes(nodes) YIELD node RETURN node',
  {batchSize: 10000, parallel: true}) YIELD total
 
Although it works, it still takes a lot of time (180,000 nodes in total, on a machine with 500 GB of RAM and a 44-core/88-thread CPU).
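One factor worth checking (not raised in the thread) is whether `pid` is indexed: the second statement of `apoc.periodic.iterate` re-matches `:Process` nodes by `pid` for every batch, which is a full label scan without an index. A sketch, assuming the labels and property names used above (the index names are hypothetical):

```cypher
// Hypothetical index names; adjust to your own schema.
// Speeds up the repeated lookup MATCH (n:Process {pid: repeatpid}).
CREATE INDEX process_pid IF NOT EXISTS FOR (n:Process) ON (n.pid);
CREATE INDEX user_user IF NOT EXISTS FOR (n:User) ON (n.user);
```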

Thanks.

 

1 ACCEPTED SOLUTION

How do you plan on running this?
'CALL {} IN TRANSACTIONS' only works with implicit (auto-commit) transactions. This requires prepending ':auto' when executing in the browser.

:auto
MATCH (n:Process)
WITH n.pid AS repeatpid, collect(n) AS nodes
WHERE size(nodes) > 1
CALL {
  WITH nodes
  CALL apoc.refactor.mergeNodes(nodes, {properties: 'combine'})
  YIELD node
  RETURN 5
} IN TRANSACTIONS OF 10000 ROWS

Can you remove the 'RETURN', or both the 'YIELD' and 'RETURN', or does it complain that neither is allowed?

https://neo4j.com/docs/cypher-manual/current/clauses/call-subquery/#_batching

How do you have so many duplicates?

View solution in original post

6 REPLIES

glilienfield
Ninja

Do you need to return the whole node or anything for that matter?  If not, try removing the return statement.  If it complains you can’t end with a call without returning anything, return a constant or a limited number of node properties. 

You could wrap the apoc procedure in a ‘call subquery in transaction’ clause, importing ‘nodes’ using ‘with’. This would batch the updates.

 In your implementation using ‘apoc.periodic.iterate’, you are matching twice to get the same nodes. I would suggest the first query create the collections and return them. The second query calls the apoc method for each collection of nodes created in the first query.  This would be similar to using ‘call subquery’. 
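A sketch of that last suggestion, assuming the `:Process`/`pid` schema from the question: the driving statement builds each duplicate group once, and `apoc.periodic.iterate` feeds the resulting `nodes` collections to the second statement in batches.

```cypher
// Sketch only: collect each duplicate group once in the first statement,
// then merge one collection per row in the second. Runs single-threaded
// to avoid lock contention on relationships shared between groups.
CALL apoc.periodic.iterate(
  'MATCH (n:Process)
   WITH n.pid AS repeatpid, collect(n) AS nodes
   WHERE size(nodes) > 1
   RETURN nodes',
  'CALL apoc.refactor.mergeNodes(nodes, {properties: "combine"}) YIELD node RETURN node',
  {batchSize: 1000, parallel: false}) YIELD total
RETURN total;
```

The batch size of 1000 is a starting point, not a recommendation; as noted below in the thread, larger batches can be faster but cost more memory.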

Thanks, the "call subquery" and "remove return" suggestions work. But for the last one (your suggestion), I tried the following:

 

CALL {
  CALL apoc.periodic.iterate(
    'MATCH (n:Process) WITH n.pid AS repeatpid, collect(n) AS nodes WHERE size(nodes) > 1 RETURN nodes',
    'CALL apoc.refactor.mergeNodes(nodes) YIELD node RETURN 5',
    {batchSize: 10, parallel: true}) YIELD total
}
 
It just keeps running and never reaches the end. Could you give me a hint as to where I went wrong? Thanks.

You should not need the call subquery. I suggested using ‘call subquery with transactions’ as an alternative to apoc.periodic.iterate.

I assume the nodes you are merging have relationships, which will be merged too. As such, you may get record-locking contention. Try not running it in parallel. Also, try increasing the batch size. You could try 10,000. Decrease it if you experience memory issues.

Excuse me, now I have 483,000 nodes (all labeled "Process") with 2,950,000 relationships (all typed "fork"). I tried the following:

(A): Call subquery with transactions

 

CALL {
  MATCH (n:Process)
  WITH n.pid AS repeatpid, collect(n) AS nodes
  WHERE size(nodes) > 1
  CALL {
    WITH nodes
    CALL apoc.refactor.mergeNodes(nodes, {properties: 'combine'})
    YIELD node
    RETURN 5
  }
}

 

(B): apoc.periodic.iterate (no parallel)

CALL apoc.periodic.iterate(
  'MATCH (n:Process) RETURN n.pid AS repeatpid',
  'MATCH (n:Process {pid: repeatpid}) WITH repeatpid, collect(n) AS nodes WHERE size(nodes) > 1 CALL apoc.refactor.mergeNodes(nodes) YIELD node RETURN 5',
  {batchSize: 10000, parallel: false}) YIELD total
It took 9 minutes for (A) and almost 50 minutes for (B). Is there any way to add a batch size to method (A), or anything else that could improve the performance further? In fact, my database will eventually have about 1 billion nodes that need to be merged...
 
Thanks a lot.

(See the accepted solution above.)

I tried what you showed, i.e.

 

:auto
MATCH (n:Process)
WITH n.pid AS repeatpid, collect(n) AS nodes
WHERE size(nodes) > 1
CALL {
  WITH nodes
  CALL apoc.refactor.mergeNodes(nodes, {properties: 'combine'})
  YIELD node
} IN TRANSACTIONS OF 10000 ROWS
 
To my surprise, it took about 30 minutes, which is slower than without "IN TRANSACTIONS OF 10000 ROWS" (9 minutes). Why isn't batching faster...?
 
By the way, may I ask what the difference is, and when I should choose apoc.periodic.iterate versus call subquery in transactions?
 
Moreover, it's because I'm examining security log data, which contains lots of duplicate information (like user, internet name, computer...).

I appreciate it.