cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Hashing values through the graph

We are comparing two very large tree graphs that are 95% identical. The goal is to flag the whole tree on the 2nd graph when a difference is found. I am hoping for a simpler solution using the hash function. I am struggling with updating one node, when it is part of a tree. That concept is just not sticking for me. In SQL this is a simple subquery. Neo4j, its probably simple but I'm not getting it. This is a variation on the fingerprinting example in the documentation, with the added complexity of needing to store the value.

To explain further, we organize by types. One of the important types is 'CallableUnit'. They represent blocks of nodes. If any node below "CallableUnit" changed (Red circle), the whole tree up to that CallableUnit needs to be marked. If any node has changed in a branch, I need to find the next highest "root" node (type:'CallableUnit') and mark it and all of its sub-nodes as a.changed = true. As you can see in this graphic, the whole path from the bottom to the CallableUnit node need have their attribute (a.changed) set to true. As you can also see from the Map of Application box, this is a decent sized application.

So we have two graphs, version 1 (original), and version 2 ( changed)

My plan is simple, but APOC is not cooperating. Every time a new graph is created, I was thinking of getting all of the CallableUnits (e.g. match (a:ProgNode {type:'CallableUnit')) and then iterate through that list of nodes (foreach) and calculate the hash value of that node (using a.text only), plus all of its subnodes (hash (a)-[*]->()) and storing it on the CallableUnit (set a.hash = ). That all gets done in batch and if it takes a bit, thats ok. Its still faster than a node by node comparison in Diff ( I think). Version 1 now has a hash value for every callableunit.

When the Version 2 graph is loaded, I would do the same load. Now both versions have a.hash set for every callableunit based on a.text

Then all i have to do is find CallableUnit nodes where the a.text = b.text, but the hash value (a.hash <> b.hash) is different. If I get a list of the CallableUnit roots that are different, I can mark the nodes from it to the bottom as a.changed = True. And just to make this MORE Complicated, I need to do it in batches because it is WAY too big to do in one shot.

CALL apoc.periodic.iterate(
"match 'b = (a:ProgNode {type:'CallableUnit', version:0})-[*]-()'", // get the sets of nodes below each callableunit
"foreach b ", // for each one of those
"c = apoc.hashing.fingerprint(b)", // calculate a hash
SET a.hash = c", // store it in the original
{batchSize:2000, iterateList:true, parallel:false, params: { nodes: $nodes_in } }
)

Note: for completeness, If the callableUnit has vanished in version 2, then it was deleted ( and thats ok). we account for that differently. If it did not exist in version 1, it was added. That is a change.

Note2: Yes, the hash code would have to be changed as the version changed. Easy enough to rerun.

2 REPLIES 2

Anyone have any thoughts here ? Thanks

2 weeks - no hits ! Could use the help on approach.