Apoc iterated MergeNodes will not merge all matching nodes

I am trying to merge duplicate nodes using a combination of apoc.periodic.iterate and apoc.refactor.mergeNodes, but I get a strange result. The code runs and correctly merges the clusters it touches, but when it finishes some clusters of duplicates remain. When I run the code again, more duplicates are merged correctly, but it still does not cover the whole database.

The principle is as follows:
A central parent node (p) can have several child nodes (c) that sometimes are duplicates and should be merged. As merging criteria I am using a combination of

  1. link to the same parent node (p) with a unique number (p.number)
  2. identical property values on the linked child nodes (c.idstring)
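
The two criteria above can also be used to list the duplicate clusters that are still present after a run. This is only a sketch: it assumes the relationship type is HAS_LINK (as in the code below), leaves the direction open because it is not stated, and the LIMIT of 25 is arbitrary.

```cypher
// Group children by (parent number, child idstring) and keep only
// groups that contain more than one distinct Child node.
MATCH (p:Parent)-[:HAS_LINK]-(c:Child)
WITH p.number AS number, c.idstring AS idstring, collect(DISTINCT c) AS nodes
WHERE size(nodes) > 1
RETURN number, idstring, size(nodes) AS duplicates
ORDER BY duplicates DESC
LIMIT 25;
```

Running it before and after the merge shows which clusters were skipped.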

Neo4j 3.5.4 Enterprise edition
Apoc 3.5.0.3
Code:

CALL apoc.periodic.iterate(
"Match (p:Parent)
RETURN p",

"WITH p
MATCH (p:Parent)-(r:HAS_LINK)-(c:Child)
WITH c.idstring AS idstring, p.number AS number,
COLLECT p AS nodes
CALL apoc.refactor.mergeNodes (nodes, {properties: 'discard', mergeRels: true})
YIELD node
RETURN node",

{batchsize: 1000, parallel: true})
;


Your statement cannot work as posted. Please try each query with EXPLAIN first.

For example, COLLECT p needs parentheses, as in collect(p) AS nodes, and the relationship pattern needs square brackets, as in -[r:HAS_LINK]->.

I wouldn't do that in parallel, because the batches can step on each other.
You must also make sure that a batch boundary doesn't split parents that share a child.
It is probably better to do the match in the driving query and pass each collected group to the executing query.

Something like:

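CALL apoc.periodic.iterate(
// Driving query: one row per group of duplicate children under the same parent
"MATCH (p:Parent)-[:HAS_LINK]->(c:Child)
WITH c.idstring AS idstring, p.number AS number, collect(c) AS nodes
RETURN nodes",

// Executing query: merge each group into a single node
"CALL apoc.refactor.mergeNodes(nodes, {properties: 'discard', mergeRels: true})
YIELD node RETURN count(*)",

// Run serially so concurrent merges cannot contend for the same locks
{batchSize: 1000, parallel: false})
;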
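
One way to check whether batches are failing silently, for example because concurrent merges contend for locks, is to yield the statistics that apoc.periodic.iterate returns. The following is a sketch, not a confirmed fix. It assumes the children are the nodes to merge, that HAS_LINK points from parent to child, and that this APOC version exposes the failedBatches, failedOperations and errorMessages result columns.

```cypher
CALL apoc.periodic.iterate(
// Driving query: only groups that really contain duplicates
"MATCH (p:Parent)-[:HAS_LINK]->(c:Child)
WITH p.number AS number, c.idstring AS idstring, collect(DISTINCT c) AS nodes
WHERE size(nodes) > 1
RETURN nodes",
// Executing query: merge one group per row
"CALL apoc.refactor.mergeNodes(nodes, {properties: 'discard', mergeRels: true})
YIELD node RETURN count(*)",
{batchSize: 1000, parallel: false})
YIELD batches, total, failedBatches, failedOperations, errorMessages
RETURN batches, total, failedBatches, failedOperations, errorMessages;
```

A non-zero failedBatches together with the messages in errorMessages would point to batches that were rolled back rather than skipped. Note also that the config key is spelled batchSize; as far as I know, a key written batchsize is not recognised and the default batch size is used instead.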

The code I have been using works fine and does the job, but not all parent-child clusters are processed.

They won't step on each other since no child has more than one parent. Since the batch focusses on the parents only, I thought that all clusters in the database would be handled, 1000 clusters in every iteration. Or will the batch size include both parents and children, thus leaving some duplicate children unmerged..?

I only raised it because you aggregate on both parent and child information: if a p.number is shared between parents, you would get the effect I mentioned.

Ok, thanks. Any ideas about the batch content; will it only include parents or a mix of parents and linked child nodes?

If your trees are clearly separated and no parent.number is repeated, then your query should produce isolated batches of parents, whose children are then aggregated and merged in your query.
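
As far as I understand apoc.periodic.iterate, a batch is built from the rows returned by the first (driving) statement. With RETURN p, each batch would hold parents only, and the children are matched later inside the second statement. A rough way to sanity-check this, assuming a batch size of 1000, is to compare the expected number of batches with the batches and total values the procedure reports:

```cypher
// Rows the driving query produces, and how many batches of 1000 that implies.
MATCH (p:Parent)
WITH count(p) AS parentRows
RETURN parentRows, toInteger(ceil(parentRows / 1000.0)) AS expectedBatches;
```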

Perhaps you can try to reproduce with one of the missing merged nodes in a non-periodic-iterate example?
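
A minimal sketch of such a reproduction for a single parent follows. $number is a hypothetical parameter holding the p.number of a parent that still has duplicates, and the relationship direction is left open because it is an assumption either way.

```cypher
// Merge duplicate children of one parent, without apoc.periodic.iterate.
MATCH (p:Parent {number: $number})-[:HAS_LINK]-(c:Child)
WITH c.idstring AS idstring, collect(DISTINCT c) AS nodes
WHERE size(nodes) > 1
CALL apoc.refactor.mergeNodes(nodes, {properties: 'discard', mergeRels: true})
YIELD node
RETURN idstring, id(node) AS mergedNodeId;
```

If this merges the cluster correctly, the merge logic itself is fine and the difference lies in how the iterated run batches or commits the work.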

I have repeated the mergeNodes code without iteration on some of the remaining clusters, and then they are merged as they should be. This is a mystery to me.