

Cosine similarity on 1M person nodes

rc
Node Clone

Hello,

I am a Neo4j newbie and I am working on entity resolution in my graph.

I have person nodes with a first name, last name and date of birth. I am looking to create an [:IS_SIMILAR] relationship between person nodes that represent the same person.

I have 1M person nodes. 

I am trying to use a cosine similarity function to resolve the entities. I have created the embeddings but my cosine similarity function is taking a very long time to run. 

The code is provided below: 

MATCH (p1:Person)
MATCH (p2:Person)
WHERE p1 <> p2
WITH p1 as person1, p1.embedding as p1Data, p2 as person2, p2.embedding as p2Data
WITH person1, person2, gds.similarity.cosine(
p1Data, p2Data
) AS cosineSimilarity
WHERE cosineSimilarity > 0.8
MERGE (person1) -[s:IS_SIMILAR]- (person2)
RETURN count(s)

I know this creates a Cartesian product and evaluates the similarity of every pair of nodes. I would be very grateful if you could let me know how I can optimise this, as it is taking hours to run.

Any assistance will be greatly appreciated. 

RC 


rc
Node Clone

I am thinking of using KNN instead.

Hello @rc 😅

Here is a more optimized solution using the APOC plugin:

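// First statement: stream candidate pairs, visiting each unordered pair once.
// Second statement: write the relationships in batches of 10000.
// parallel is set to false because concurrent batches that create
// relationships on the same nodes can contend for locks and deadlock.
CALL apoc.periodic.iterate("
	MATCH (p1:Person) 
	MATCH (p2:Person) 
	WHERE id(p1) < id(p2) 
	WITH p1, p2, gds.similarity.cosine(p1.embedding, p2.embedding) AS cosineSimilarity 
	WHERE cosineSimilarity > 0.8 
	RETURN p1, p2, cosineSimilarity
	", "
	MERGE (p1)-[s:IS_SIMILAR]->(p2) 
	SET s.similarity = cosineSimilarity
	", 
	{batchSize: 10000, parallel: false}
);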

If you want to use KNN, you will first have to project your graph in memory and then run the algorithm on that projection. The in-memory approach should normally be faster.
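As a rough sketch of the KNN route, assuming the Graph Data Science library 2.x is installed: the graph name 'persons', the topK value of 10, and the 'similarity' property name are illustrative choices, not values from this thread. KNN is approximate and only keeps the top K neighbours per node, and the score it reports may not be on exactly the same scale as gds.similarity.cosine, so check the cutoff against the GDS documentation for your version.

```cypher
// 1. Project the Person nodes and their embedding property into memory.
//    'persons' is a hypothetical name for the in-memory graph.
CALL gds.graph.project(
  'persons',
  'Person',
  '*',
  {nodeProperties: ['embedding']}
);

// 2. Run KNN on the embeddings and write the results back to the database.
//    For list-of-float properties the default metric is cosine.
CALL gds.knn.write('persons', {
  nodeProperties: ['embedding'],
  topK: 10,                          // assumed value, tune for your data
  similarityCutoff: 0.8,             // drop neighbours scoring below this
  writeRelationshipType: 'IS_SIMILAR',
  writeProperty: 'similarity'
})
YIELD nodesCompared, relationshipsWritten
RETURN nodesCompared, relationshipsWritten;

// 3. Release the in-memory graph once the relationships are written.
CALL gds.graph.drop('persons');
```

KNN writes directed relationships, so a pair of similar persons may end up linked in both directions; query them with an undirected pattern if direction does not matter.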

Regards,
Cobra

I was going to mention it, but @Cobra has it in his solution. Try using 'id(p1) < id(p2)' instead of 'p1 <> p2', as the latter evaluates each pair of person nodes twice: in the Cartesian product, the pair (a, b) also appears as (b, a). As a result, your query computes twice as many similarities as necessary. Because your MERGE is undirected, the second evaluation should match the existing relationship rather than create a duplicate, so the cost is wasted computation. The node id inequality keeps only one of the two orderings.
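To put numbers on this for the 1M person nodes in the question, with $n$ the number of nodes:

```latex
\begin{align*}
\text{pairs with } p_1 \neq p_2 &: \quad n(n-1) = 10^{6} \times 999{,}999 = 999{,}999{,}000{,}000 \\
\text{pairs with } \mathrm{id}(p_1) < \mathrm{id}(p_2) &: \quad \frac{n(n-1)}{2} = 499{,}999{,}500{,}000
\end{align*}
```

Halving the work still leaves roughly $5 \times 10^{11}$ similarity evaluations, which is why an approach that avoids comparing every pair, such as KNN, is likely to matter more than the inequality alone.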

rc
Node Clone

Thank you. I will try this and let you know how I get on.