09-24-2018 01:25 PM
Greetings all!
I have been struggling with what might be considered deduplication, except it's a matter of deleting similar nodes, rather than precise duplicates. Let me try and explain:
I have (:Author) nodes with [:WROTE] relationships to (:Book) nodes. Each (:Book) node has a unique ID property, as well as a varying number of relationships to (:Topic) nodes. However, I have duplicate nodes for some Books. The duplicates share: 1. the Author node which :WROTE them, and 2. the 'title' property, in many cases across several nodes.
What I wish to do is keep a single node per book (one per unique 'title' property), linked to the Author who wrote it, choosing the node with the MAXIMUM number of relationships to (:Topic) nodes. Essentially, I want to thin the database by purging "duplicates" with fewer Topic links. Is this possible? Easy?
Thank you,
Henry
09-24-2018 01:47 PM
To make sure I understand your data model correctly, this is your model:
(:Author)-[:WROTE]->(:Book)-[]->(:Topic)
Then you have some duplicates on the book nodes. There's a unique ID for the books, but the title is the natural key and is how you're determining whether a book has a duplicate? And you want to merge the duplicates, assuming the book with the most relationships is the one you want to keep?
Have you looked at the APOC merge procedures?
Do you really want to keep only the book node with the most relationships, or merge all the relationships onto a single node? I would think the latter, because then you can combine all the work that was done to assign topics to books onto the single book node. If the former, then write a query that collects the duplicates and unwinds through them to delete.
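For the "keep only the best-connected book" route, the collect-and-unwind idea might look roughly like the sketch below. It assumes duplicates are books with the same author and the same 'title' value, and that any outgoing relationship from a Book to a Topic counts as a topic link. Ties are broken arbitrarily, so try it on a copy of the data first.

```cypher
// Sketch only: assumes (:Author)-[:WROTE]->(:Book)-->(:Topic)
// and that same author + same title means "duplicate".
MATCH (a:Author)-[:WROTE]->(b:Book)
// Count each book's links to Topic nodes (0 if it has none)
WITH a, b, size([(b)-->(t:Topic) | t]) AS topicCount
// Sort so the book with the most topic links is collected first
ORDER BY topicCount DESC
WITH a, b.title AS title, collect(b) AS books
WHERE size(books) > 1
// Keep the head of each list, delete everything after it
UNWIND tail(books) AS duplicate
DETACH DELETE duplicate
```

Replacing the last two lines with `RETURN a, title, size(books) AS copies` turns this into a dry run that only lists the duplicate groups.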
09-24-2018 01:51 PM
You have the structure correct, yes. I think merging nodes would be a better solution, yes. Would that not create duplicate relationships? I'm not too familiar with the APOC merge procedures.
09-24-2018 02:18 PM
Yes, merging nodes will repoint the relationships from the node going away to the node that is staying. Once everything is consolidated, you can then clean up the duplicated relationships by merging them. Here's an older Stack Overflow post with some sample code.
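As a rough sketch of the merge route, assuming APOC is installed and that duplicates are again defined as same author plus same 'title': `apoc.refactor.mergeNodes` keeps the first node in the list and moves the other nodes' relationships onto it. With `mergeRels: true`, relationships of the same type between the same pair of nodes should be collapsed into one as part of the merge, which covers the duplicate-relationship concern. The `properties: 'discard'` setting here is an assumption chosen so that the surviving node keeps its own unique ID.

```cypher
// Sketch only: requires the APOC plugin.
MATCH (a:Author)-[:WROTE]->(b:Book)
WITH a, b.title AS title, collect(b) AS books
WHERE size(books) > 1
// The first node in `books` survives; the others are merged into it.
// 'discard' keeps the surviving node's properties (including its ID).
CALL apoc.refactor.mergeNodes(books, {properties: 'discard', mergeRels: true})
YIELD node
RETURN node.title AS title, size(books) AS mergedCount
```

If a book can have more than one author, the same node may show up in several groups, so it is worth checking the duplicate groups with a read-only query before running the merge.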