08-29-2019 06:16 AM
I am trying to merge duplicate nodes using a combination of apoc.periodic.iterate and apoc.refactor.mergeNodes, but I get a strange result. The code runs and correctly merges the clusters it processes, but when it finishes some clusters of duplicates still remain. When I run the code again, more duplicates are merged correctly, but it still does not cover the whole database.
The principle is as follows:
A central parent node (p) can have several child nodes (c) that sometimes are duplicates and should be merged. As merging criteria I am using a combination of
Neo4j 3.5.4 Enterprise edition
Apoc 3.5.0.3
Code:
CALL apoc.periodic.iterate(
"Match (p:Parent)
RETURN p",
"WITH p
MATCH (p:Parent)-(r:HAS_LINK)-(c:Child)
WITH c.idstring AS idstring, p.number AS number,
COLLECT p AS nodes
CALL apoc.refactor.mergeNodes (nodes, {properties: 'discard', mergeRels: true})
YIELD node
RETURN node",
{batchsize: 1000, parallel: true})
;
08-30-2019 03:07 AM
Your statement cannot work as posted; please try each query with EXPLAIN first.
For example, the aggregation has to be written as collect(p) AS nodes,
and the relationship pattern needs square brackets, e.g. -[r:HAS_LINK]->.
I wouldn't run that in parallel, because the batches can step on each other.
You must also make sure that a batch boundary doesn't split parents that share a child.
It is probably better to do the match in the driving query and pass the parent collection to the executing query,
something like:
CALL apoc.periodic.iterate(
"MATCH (p:Parent)-[:HAS_LINK]->(c:Child)
WITH c.idstring AS idstring, p.number AS number, collect(p) AS nodes
RETURN nodes",
"CALL apoc.refactor.mergeNodes(nodes, {properties: 'discard', mergeRels: true})
YIELD node RETURN count(*)",
{batchSize: 1000, parallel: false})
;
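For the case described in the question (duplicate children under a single parent), a variant that groups per parent and collects the children rather than the parents might look like the sketch below. It assumes that duplicates are children of the same parent that share the same c.idstring, and that HAS_LINK points from parent to child; neither assumption is confirmed in this thread.

```cypher
CALL apoc.periodic.iterate(
  // Driving query: one row per group of same-idstring children under one parent.
  // Assumption: HAS_LINK is directed from Parent to Child.
  "MATCH (p:Parent)-[:HAS_LINK]->(c:Child)
   WITH p, c.idstring AS idstring, collect(c) AS nodes
   WHERE size(nodes) > 1
   RETURN nodes",
  // Executing query: merge each group into the first node of its list.
  "CALL apoc.refactor.mergeNodes(nodes, {properties: 'discard', mergeRels: true})
   YIELD node
   RETURN count(*)",
  // Sequential batches, so no two merges run at the same time.
  {batchSize: 1000, parallel: false});
```

Because each driving row is already a complete duplicate group, a batch boundary cannot cut a group in half.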
08-30-2019 04:10 AM
The code I have been using works fine and does the job, but not all parent-child clusters are processed.
They won't step on each other since no child has more than one parent. Since the batch focusses on the parents only, I thought that all clusters in the database would be handled, 1000 clusters in every iteration. Or will the batch size include both parents and children, thus leaving some duplicate children unmerged..?
08-30-2019 05:37 PM
I just thought that, because you aggregate on both parent and child information, you'd get the effect I mentioned if a p.number is shared between parents.
08-31-2019 02:15 AM
Ok, thanks. Any ideas about the batch content; will it only include parents or a mix of parents and linked child nodes?
08-31-2019 03:25 AM
If your tree is clearly separated and there is no repeating parent.number, then your query should get isolated batches of parents, which are then aggregated in your query and merged.
Perhaps you can try to reproduce with one of the missing merged nodes in a non-periodic-iterate example?
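A rough sketch of how such a reproduction could be set up: the first statement lists parents that still have duplicate children, and the second merges the duplicates of one of them directly. It assumes duplicates are defined by a shared c.idstring under the same parent and that HAS_LINK points from parent to child; $number is a placeholder parameter, not something taken from the thread.

```cypher
// List parents that still have more than one child with the same idstring.
MATCH (p:Parent)-[:HAS_LINK]->(c:Child)
WITH p, c.idstring AS idstring, collect(c) AS nodes
WHERE size(nodes) > 1
RETURN p.number AS number, idstring, size(nodes) AS copies
ORDER BY copies DESC
LIMIT 25;

// Merge the duplicates of one such parent directly, without periodic.iterate.
// $number is a placeholder for a p.number value returned by the query above.
MATCH (p:Parent {number: $number})-[:HAS_LINK]->(c:Child)
WITH c.idstring AS idstring, collect(c) AS nodes
WHERE size(nodes) > 1
CALL apoc.refactor.mergeNodes(nodes, {properties: 'discard', mergeRels: true})
YIELD node
RETURN idstring, id(node) AS mergedNodeId;
```

If the direct merge succeeds where the batched run did not, the difference most likely lies in how the batches are executed rather than in the merge itself.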
09-04-2019 12:25 AM
I have repeated the mergeNodes code without iteration on some of the remaining clusters, and then they are merged as they should be... This is a mystery to me.
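One possible explanation, which this thread does not confirm, is that some batches failed during the parallel run, for example because of lock conflicts between concurrent merges. apoc.periodic.iterate reports failed batches in its result row instead of aborting the whole call, so it may be worth running sequentially and reading the counters it returns. A minimal sketch, under the same assumptions about idstring and the direction of HAS_LINK as above:

```cypher
CALL apoc.periodic.iterate(
  // Driving query: one row per parent, so each batch holds up to 1000 parents.
  "MATCH (p:Parent) RETURN p",
  // Executing query: group this parent's children by idstring and merge each group.
  "MATCH (p)-[:HAS_LINK]->(c:Child)
   WITH p, c.idstring AS idstring, collect(c) AS nodes
   WHERE size(nodes) > 1
   CALL apoc.refactor.mergeNodes(nodes, {properties: 'discard', mergeRels: true})
   YIELD node
   RETURN count(*)",
  {batchSize: 1000, parallel: false})
// Surface the failure counters rather than discarding the result row.
YIELD batches, total, failedBatches, failedOperations, errorMessages
RETURN batches, total, failedBatches, failedOperations, errorMessages;
```

A non-zero failedBatches together with the entries in errorMessages would indicate which clusters were skipped and why.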