Neo4j

jhellquist · ‎08-29-2019

I am trying to merge duplicate nodes using a combination of apoc.periodic.iterate and apoc.refactor.mergeNodes, but I get a strange result. The code runs and correctly merges the clusters it touches, but when it finishes some clusters of duplicates are still left. When I run the code again, more duplicates are merged correctly, but it still does not cover the whole database.

The principle is as follows:
A central parent node (p) can have several child nodes (c) that are sometimes duplicates and should be merged. As merging criteria I am using a combination of:

link to the same parent node (p) with a unique number (p.number)
identical property values on the linked child nodes (c.idstring)
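
The two criteria above can be checked before and after each merge run with a read-only query. This is a minimal sketch, not taken from the thread: it assumes the relationship points from Parent to Child (the direction used in the reply further down), and the LIMIT of 25 is an arbitrary choice.

```cypher
// List the (parent number, idstring) groups that still hold more than one child.
// Assumption: HAS_LINK is directed from Parent to Child.
MATCH (p:Parent)-[:HAS_LINK]->(c:Child)
WITH p.number AS number, c.idstring AS idstring, count(DISTINCT c) AS copies
WHERE copies > 1
RETURN number, idstring, copies
ORDER BY copies DESC
LIMIT 25;
```

If this returns no rows after a run, no duplicate clusters matching the criteria remain.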

Neo4j 3.5.4 Enterprise edition
Apoc 3.5.0.3
Code:
CALL apoc.periodic.iterate( "Match (p:Parent) RETURN p",

"WITH p
MATCH (p:Parent)-(r:HAS_LINK)-(c:Child)
WITH c.idstring AS idstring, p.number AS number,
COLLECT p AS nodes
CALL apoc.refactor.mergeNodes (nodes, {properties: 'discard', mergeRels: true})
YIELD node
RETURN node",

{batchsize: 1000, parallel: true})
;

michael_hunger · ‎08-30-2019

Your statement cannot work as written; please try each query with EXPLAIN.

For example, it has to be collect(p) AS nodes, and the relationship syntax is -[r:HAS_LINK]->

I wouldn't do that in parallel, because the batches can step on each other.
Also, you must make sure that the batching doesn't split parents that share a child across batches.
It is probably better to do the match in the driving query and pass the collected nodes to the executing query.

something like:

CALL apoc.periodic.iterate(
"MATCH (p:Parent)-[:HAS_LINK]->(c:Child)
WITH c.idstring AS idstring, p.number AS number, collect(p) AS nodes
RETURN nodes",

"CALL apoc.refactor.mergeNodes (nodes, {properties: 'discard', mergeRels: true})
YIELD node RETURN count(*)",

{batchSize: 1000, parallel: false})
;
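
Since the stated goal is to merge duplicate child nodes, a variant that collects c rather than p may be closer to what is wanted. The following is a hedged sketch under several assumptions, not a confirmed fix for the problem in this thread: it groups by the parent node itself instead of p.number (so children of different parents can never end up in one group), skips groups that hold only a single child, runs sequentially, and returns the failure counters of apoc.periodic.iterate so that failed batches become visible.

```cypher
CALL apoc.periodic.iterate(
  // Driving query: one row per group of duplicate children under the same parent.
  // Assumption: HAS_LINK is directed from Parent to Child.
  "MATCH (p:Parent)-[:HAS_LINK]->(c:Child)
   WITH p, c.idstring AS idstring, collect(DISTINCT c) AS nodes
   WHERE size(nodes) > 1
   RETURN nodes",
  // Executing query: merge each group into its first node.
  "CALL apoc.refactor.mergeNodes(nodes, {properties: 'discard', mergeRels: true})
   YIELD node
   RETURN count(*)",
  // Note the spelling batchSize; parallel is off so batches cannot contend for locks.
  {batchSize: 1000, parallel: false})
YIELD batches, total, failedBatches, failedOperations, errorMessages
RETURN batches, total, failedBatches, failedOperations, errorMessages;
```

A non-zero failedBatches or a non-empty errorMessages map would indicate batches that were rolled back, which is one possible reason for clusters being left unmerged.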

jhellquist · ‎08-30-2019

The code I have been using works fine and does the job, but not all parent-child clusters are processed.

They won't step on each other since no child has more than one parent. Since the batch focusses on the parents only, I thought that all clusters in the database would be handled, 1000 clusters in every iteration. Or will the batch size include both parents and children, thus leaving some duplicate children unmerged..?

michael_hunger · ‎08-30-2019

I only mentioned it because you aggregate on both parent and child information: if a p.number is shared between parents, you'd get the effect I described.

jhellquist · ‎08-31-2019

Ok, thanks. Any ideas about the batch content: will it only include parents, or a mix of parents and linked child nodes?

michael_hunger · ‎08-31-2019

If your trees are clearly separated and no parent.number repeats, then your query should have isolated batches of parents, which are then aggregated in your query and merged.

Perhaps you can try to reproduce with one of the missing merged nodes in a non-periodic-iterate example?
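
Such a reproduction could look like the sketch below, which runs the merge for a single parent in one ordinary transaction. The value 12345 is a hypothetical placeholder for the number of a parent that still has unmerged children, and the relationship direction is assumed as in the example above.

```cypher
// Merge the duplicate children of one specific parent, without apoc.periodic.iterate.
// 12345 is a placeholder; substitute the p.number of an affected cluster.
MATCH (p:Parent {number: 12345})-[:HAS_LINK]->(c:Child)
WITH p, c.idstring AS idstring, collect(DISTINCT c) AS nodes
WHERE size(nodes) > 1
CALL apoc.refactor.mergeNodes(nodes, {properties: 'discard', mergeRels: true})
YIELD node
RETURN node;
```

If this merges the cluster correctly, the grouping logic is sound and the remaining duplicates more likely stem from how the batches are executed.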

jhellquist · ‎09-04-2019

I have repeated the mergeNodes code without iteration on some of the remaining clusters, and then they are merged as they should... This is a mystery to me.

Apoc iterated MergeNodes will not merge all matching nodes