Neo4j

rc · ‎07-25-2022

Hello,

I am a Neo4j newbie and I am working on entity resolution in my graph.

I have person nodes with a first name, last name and date of birth. I want to create an [:IS_SIMILAR] relationship between person nodes that represent the same person.

I have 1M person nodes.

I am trying to use a cosine similarity function to resolve the entities. I have created the embeddings but my cosine similarity function is taking a very long time to run.
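
For reference, here is a minimal plain-Python sketch of the quantity being computed for each pair. It assumes the embeddings are equal-length lists of floats; the function name and the sample vectors are made up for illustration and are not part of GDS.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional embeddings, for illustration only.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Vectors pointing the same way score close to 1 and orthogonal vectors score 0, which is why a cutoff such as 0.8 keeps only pairs whose embeddings are nearly aligned.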

The code is provided below:

MATCH (p1:Person)
MATCH (p2:Person)
WHERE p1 <> p2
WITH p1 as person1, p1.embedding as p1Data, p2 as person2, p2.embedding as p2Data
WITH person1, person2, gds.similarity.cosine(
p1Data, p2Data
) AS cosineSimilarity
WHERE cosineSimilarity > 0.8
MERGE (person1) -[s:IS_SIMILAR]- (person2)
RETURN count(s)

I know this creates a Cartesian product and tries to evaluate the similarity of every pair of nodes. I would be very grateful if you could let me know how I can optimise this, as it is taking hours to run.

Any assistance will be greatly appreciated.

RC

rc · ‎07-25-2022

Thinking of using KNN instead.

Cobra · ‎08-07-2022

Hello @rc 😅

Here is a more optimized solution using the APOC plugin:

CALL apoc.periodic.iterate("
	MATCH (p1:Person) 
	MATCH (p2:Person) 
	WHERE id(p1) < id(p2) 
	WITH p1, p2, gds.similarity.cosine(p1.embedding, p2.embedding) AS cosineSimilarity 
	WHERE cosineSimilarity > 0.8 
	RETURN p1, p2, cosineSimilarity
	", "
	MERGE (p1)-[s:IS_SIMILAR]->(p2) 
	SET s.similarity = cosineSimilarity
	", 
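	// Note: with parallel: true, batches that MERGE relationships on the same nodes may deadlock; try parallel: false if that happens
	{batchSize: 10000, parallel: true}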
);

If you want to use KNN, you will first have to project your graph into memory and then run the algorithm on that projection. The in-memory approach should normally be faster.

Regards,
Cobra
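
To make the KNN idea concrete: instead of keeping every pair above a threshold, KNN keeps only each node's K most similar neighbours, so the number of relationships written is at most K per node. The sketch below is a plain-Python illustration under stated assumptions, not the GDS procedure. It is an exact brute-force version that still compares every pair, whereas the GDS KNN algorithm is, as far as I know, an approximate method that avoids the full comparison. The node names, vectors, and the `top_k_neighbours` helper are hypothetical.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_neighbours(embeddings, k, cutoff=0.0):
    """For each node, return up to k (neighbour, score) pairs with score >= cutoff."""
    result = {}
    for node, vec in embeddings.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in embeddings.items()
            if other != node
        ]
        best = heapq.nlargest(k, scored)  # highest scores first
        result[node] = [(other, score) for score, other in best if score >= cutoff]
    return result


# Hypothetical 2-dimensional embeddings: a/b are near-duplicates, as are c/d.
people = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.1, 0.9],
}
neighbours = top_k_neighbours(people, k=1, cutoff=0.8)
```

With k=1 and a cutoff of 0.8, each node ends up linked only to its near-duplicate, which is the shape of result an IS_SIMILAR relationship is meant to capture.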

glilienfield · ‎08-07-2022

I was going to mention it, but @Cobra has it in his solution. Try using 'id(p1) < id(p2)' instead of 'p1 <> p2', as the latter evaluates each pair of person nodes twice: in the Cartesian product, the pair (a, b) also appears as (b, a). As a result, your query computes twice as many similarities as necessary. Because your MERGE is undirected, the second evaluation should match the relationship created by the first rather than add a duplicate, but the comparison work is still doubled. The condition on node ids keeps only one of the two orderings.
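
The effect of the two predicates can be checked with a small plain-Python sketch. It uses integers as stand-ins for node ids (an assumption for illustration) and then applies the same formulas to 1M nodes.

```python
from itertools import product


def ordered_pairs(n):
    """Pairs kept by a != b: both (a, b) and (b, a) are counted."""
    return n * (n - 1)


def unordered_pairs(n):
    """Pairs kept by a < b: each pair is counted once."""
    return n * (n - 1) // 2


# Enumerate the Cartesian product for 4 stand-in node ids.
ids = range(4)
not_equal = [(a, b) for a, b in product(ids, repeat=2) if a != b]
less_than = [(a, b) for a, b in product(ids, repeat=2) if a < b]

assert len(not_equal) == ordered_pairs(4)    # 12 pairs
assert len(less_than) == unordered_pairs(4)  # 6 pairs

n = 1_000_000
print(ordered_pairs(n))    # 999999000000
print(unordered_pairs(n))  # 499999500000
```

Halving the work still leaves roughly 5 x 10^11 similarity computations for 1M nodes, so the inequality alone reduces the running time but does not change how it scales.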

rc · ‎08-08-2022

Thank you. I will try this and let you know how I get on.

Cosine similarity on 1M person nodes