12-28-2022 11:54 AM
I have a graph with (:UID)-[]->(:PII) relationships, and I have a 'weight' property assigned to the PII nodes according to what kind of PII it is. For example, SSN: 10, PHONE: 5, IP: 1, etc.
I want to calculate node similarity between the (:UID) nodes based on the PII shared between them.
I want to use (:UID)-[]->(:PII)<-[]-(:UID) to create (:UID)-[:connected {weight: sum(PII_weights)}]-(:UID), where the strength of the relationship between two people is the sum of all PII weights shared between them.
Is node similarity the right function to use here? Can you provide some sample syntax on how to do this?
I asked ChatGPT this, and after prompting it a bit more, it came up with this:
CALL apoc.periodic.iterate(
  // Aggregate the total PII weight shared by each pair of UIDs.
  // Comparing ids avoids returning each pair twice (once per direction).
  "MATCH (uid1:UID)-[]->(pii:PII)<-[]-(uid2:UID)
   WHERE id(uid1) < id(uid2)
   RETURN uid1, uid2, sum(pii.weight) AS totalWeight",
  // uid1, uid2, and totalWeight are passed straight into the update query,
  // so they can be used in MERGE directly rather than wrapped in a map.
  "MERGE (uid1)-[r:CONNECTED]-(uid2)
   SET r.weight = totalWeight",
  {batchSize: 1000, iterateList: true}
)
which is similar to my current approach.
I was hoping to use GDS because I thought it would be more efficient than my current approach, which takes a very long time with millions of nodes.
Look forward to any feedback on this!
12-28-2022 12:32 PM
Node similarity currently supports two metrics: Jaccard and Overlap. https://neo4j.com/docs/graph-data-science/current/algorithms/node-similarity/
Neither of those metrics is the sum of weights for common neighbors. However, they might still be useful for matching UIDs based on shared PII. To use either of them with weights for different types of PII, you need to assign the weight to the relationships between UID and PII rather than storing it as a property on the PII nodes.
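For example, assuming the weight has been moved onto the relationships and projected under a relationship type like HAS_PII (the graph name, relationship type, and written property names below are placeholders, not something from your model), a weighted similarity run could look roughly like this:

// Project UID and PII nodes with the weighted relationship, undirected.
CALL gds.graph.project(
  'uid-pii',
  ['UID', 'PII'],
  {HAS_PII: {orientation: 'UNDIRECTED', properties: 'weight'}}
);

// Run node similarity (Jaccard by default) using the relationship weight,
// writing a SIMILAR relationship with the similarity score back to the database.
CALL gds.nodeSimilarity.write('uid-pii', {
  relationshipWeightProperty: 'weight',
  writeRelationshipType: 'SIMILAR',
  writeProperty: 'score'
});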
01-01-2023 11:55 AM
@nsmith_piano's answer is on point. You can refer to the documentation they linked for syntax and usage examples.
Another option, if you need even more scale, is using node embeddings + K-Nearest Neighbors (KNN). Here is an example with FastRP node embeddings + KNN. For graphs with tens of millions of nodes, you are better off using higher-dimensional embeddings (128, 256, or higher). As with node similarity, you will need to weight the relationships and project them with an "UNDIRECTED" orientation.
Embeddings + KNN can be faster at larger scales, but it is an unsupervised learning approach with a stochastic element to it, as opposed to the deterministic results of node similarity or a plain Cypher query.
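As a rough sketch of that pipeline (the graph name, relationship type, and parameter values here are illustrative assumptions rather than tested settings, and it assumes the same weighted, undirected projection as in the node similarity example above):

// Generate FastRP embeddings weighted by the PII weights and keep them in the projection.
CALL gds.fastRP.mutate('uid-pii', {
  embeddingDimension: 256,
  relationshipWeightProperty: 'weight',
  mutateProperty: 'embedding'
});

// Link each UID to its most similar UIDs based on embedding similarity.
CALL gds.knn.write('uid-pii', {
  nodeLabels: ['UID'],
  nodeProperties: ['embedding'],
  topK: 10,
  writeRelationshipType: 'SIMILAR',
  writeProperty: 'score'
});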
01-03-2023 12:09 PM
Thanks for both your inputs! This gives me a few different ways to approach this.
I was able to refactor the graph so the weights were on the relationships between UID and PII. However, the node similarity algorithm was extremely slow. After discussing further with @nsmith_piano offline, the plan is to cut down the dense nodes in my graph, which offer very little in identifying "interesting" connections. This should make things a lot faster.
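As a sketch of what "cutting down the dense nodes" could look like (the degree threshold is just an illustrative assumption to tune per dataset), the heavily shared PII values can be found with a query along these lines:

// Find PII values shared by an unusually large number of UIDs; these add many
// pairwise comparisons while carrying little identifying signal.
MATCH (u:UID)-[]->(p:PII)
WITH p, count(u) AS uidCount
WHERE uidCount > 1000  // illustrative threshold, not a recommendation
RETURN p, uidCount
ORDER BY uidCount DESC;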
Thanks @zach_blumenfeld for posting the embeddings + KNN approach! I will definitely be experimenting more with that.