cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

WCC / Union.find algorithm - Help Requested (Large Database)

Consider the following graph which I created using the following:

#Create Nodes for testing
create (n1:person {name:'bob'})-[:SOME_RELATIONSHIP]->(c1:ARBITRARY_LABEL {id:'1'})
create (n11:person {name:'jane'})-[:SOME_RELATIONSHIP]->(c1)
create (n2:person {name:'jack'})-[:SOME_RELATIONSHIP]->(c2:ARBITRARY_LABEL {id:'2'})
create (n22:person {name:'steve'})-[:SOME_RELATIONSHIP]->(c2)
create (n23:person {name:'jane'})-[:SOME_RELATIONSHIP]->(c2)
create (n3:person {name:'joseph'})-[:SOME_RELATIONSHIP]->(c3:ARBITRARY_LABEL {id:'3'})
create (n4:person {name:'jack'})-[:SOME_RELATIONSHIP]->(c3)
return *

Graph Overview:

  1. People connect to a random green node via some arbitrary relationship.
  2. There are thousands of these connections within the current graph.
  3. It is apparent that there are duplicate people nodes with the same name connected throughout various parts of the graph.
  4. For example, Jane is connected to 1 and also to 2.
  5. Jack is connected to 2 and also to 3.
  6. The others however only have a single connection to the arbitrary green nodes.

Objective

  1. The objective is to perform a query with what I think is called (WCC - formerly union.find) or some other similar query that will dynamically search throughout the entire database to find other occurrences of each of the names and produce a single cluster for only those relationships where there is overlap.
  2. To better describe what the final outcome should look like let's try walking through a quick example. The algorithm will look at Bob and discover that there are no other occurrences of Bob in the graph and will then move on to Jane. Because Jane is connected to 1 and 2 it merges everyone connected to 1 and 2 together and points them all to a new UNIQUELY generated node using a property as a constraint with some randomly generated id . 1 and 2 are discarded after that point as there is no use for them in the graph anymore. The duplicate node for Jane is dropped as there should only be a single occurrence of her in the graph.
  3. At this point, bob, jane, steve, jack and jane all point to a new node. The algorithm then continues and finds that jack is connected to two separate nodes. The cycle repeats and the smaller set of names is merged into the one with the larger set of names. The final output should display a graph similar to the following:

3X_f_c_fc93f1d143117aeea2363ed92e4af36b5f52ec6a.png

Final Notes

  1. I currently use apoc.merge but find that a database consisting of millions of these, the process is quite tedious and slow. There are several presentations on union.find and WCC which I think might be the better way to go. Any help would be greatly appreciated.
1 REPLY 1

The WCC algorithm won't help you merge nodes. It might help you find the candidates for merging, but it seems you don't have a problem with that?

Nodes 2022
Nodes
NODES 2022, Neo4j Online Education Summit

All the sessions of the conference are now available online