12-28-2022 11:54 AM
I have a graph with (:UID)-[]->(:PII) relationships, and I have a 'weight' property assigned to the PII nodes according to what kind of PII it is. For example, SSN: 10, PHONE: 5, IP: 1, etc.
I want to calculate node similarity between the (:UID) nodes based on the PII shared between them.
I want to use (:UID)-[]->(:PII)<-[]-(:UID) to create (:UID)-[:connected {weight: sum(PII_weights)}]-(:UID), where the strength of the relationship between two people is the sum of all PII weights shared between them.
Is node similarity the right function to use here? Can you provide some sample syntax on how to do this?
I asked ChatGPT this, and after prompting it a bit more, it came up with this:
CALL apoc.periodic.iterate(
  // Aggregate the total PII weight shared by each pair of UIDs.
  // Comparing ids avoids returning each pair twice (once per direction).
  "MATCH (uid1:UID)-[]->(pii:PII)<-[]-(uid2:UID)
   WHERE id(uid1) < id(uid2)
   RETURN uid1, uid2, sum(pii.weight) AS totalWeight",
  // uid1, uid2, and totalWeight are passed straight into the update query,
  // so they can be used in MERGE directly rather than wrapped in a map.
  "MERGE (uid1)-[r:CONNECTED]-(uid2)
   SET r.weight = totalWeight",
  {batchSize: 1000, iterateList: true}
)
which is similar to my current approach.
I was hoping to use GDS because I thought it would be more efficient than my current approach, which takes a very long time with millions of nodes.
Look forward to any feedback on this!
12-28-2022 12:32 PM
Node similarity currently supports two metrics: Jaccard and Overlap. https://neo4j.com/docs/graph-data-science/current/algorithms/node-similarity/
Neither of those metrics is the sum of weights for common neighbors. However, they might still be useful for matching UIDs based on shared PII. To use either of them with weights for different types of PII, you need to assign the weight to the relationships between UID and PII rather than storing it as a property on the PII nodes.
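For example, assuming the weight has been moved onto the relationships and projected under a relationship type like HAS_PII (the graph name, relationship type, and written property names below are placeholders, not something from your model), a weighted similarity run could look roughly like this:

// Project UID and PII nodes with the weighted relationship, undirected.
CALL gds.graph.project(
  'uid-pii',
  ['UID', 'PII'],
  {HAS_PII: {orientation: 'UNDIRECTED', properties: 'weight'}}
);

// Run node similarity (Jaccard by default) using the relationship weight,
// writing a SIMILAR relationship with the similarity score back to the database.
CALL gds.nodeSimilarity.write('uid-pii', {
  relationshipWeightProperty: 'weight',
  writeRelationshipType: 'SIMILAR',
  writeProperty: 'score'
});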
01-01-2023 11:55 AM
@nsmith_piano's answer is on point. You can refer to the documentation they linked for syntax and usage examples.
Another option, if you need even more scale, is using node embeddings + K-Nearest Neighbors (KNN). Here is an example with FastRP node embeddings + KNN. For graphs with tens of millions of nodes, you are better off using higher-dimensional embeddings (128, 256, or higher). As with node similarity, you will need to weight the relationships and project them with an "UNDIRECTED" orientation.
Embeddings + KNN can be faster at larger scales, but it is an unsupervised learning approach with a stochastic element to it, as opposed to the deterministic results of node similarity or a plain Cypher query.
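As a rough sketch of that pipeline (the graph name, relationship type, and parameter values here are illustrative assumptions rather than tested settings, and it assumes the same weighted, undirected projection as in the node similarity example above):

// Generate FastRP embeddings weighted by the PII weights and keep them in the projection.
CALL gds.fastRP.mutate('uid-pii', {
  embeddingDimension: 256,
  relationshipWeightProperty: 'weight',
  mutateProperty: 'embedding'
});

// Link each UID to its most similar UIDs based on embedding similarity.
CALL gds.knn.write('uid-pii', {
  nodeLabels: ['UID'],
  nodeProperties: ['embedding'],
  topK: 10,
  writeRelationshipType: 'SIMILAR',
  writeProperty: 'score'
});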
01-03-2023 12:09 PM
Thanks for both your inputs! This gives me a few different ways to approach this.
I was able to refactor the graph so the weights were on the relationships between UID and PII. However, the node similarity algorithm was extremely slow. After discussing further with @nsmith_piano offline, the plan is to cut down the dense nodes in my graph, which offer very little in identifying "interesting" connections. This should make things a lot faster.
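As a sketch of what "cutting down the dense nodes" could look like (the degree threshold is just an illustrative assumption to tune per dataset), the heavily shared PII values can be found with a query along these lines:

// Find PII values shared by an unusually large number of UIDs; these add many
// pairwise comparisons while carrying little identifying signal.
MATCH (u:UID)-[]->(p:PII)
WITH p, count(u) AS uidCount
WHERE uidCount > 1000  // illustrative threshold, not a recommendation
RETURN p, uidCount
ORDER BY uidCount DESC;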
Thanks @zach_blumenfeld for posting the embeddings + KNN approach! I will definitely be experimenting more with that.