Cosine Similarity Example Unexpected Results

I'm getting unexpected results when working through the Cosine Similarity examples in the documentation (https://neo4j.com/docs/graph-data-science/current/alpha-algorithms/cosine/).

Using Neo4j developer edition 4.0.4 and GDS 1.3.

(1) The documentation seems to omit that you need a native projection to make the streaming examples work. You can supply one by passing nodeProjection:'*', relationshipProjection:'*' in the configuration map, or by using a pre-created named projection such as CALL gds.graph.create('blah', '*', '*') YIELD graphName, nodeCount, relationshipCount; either way, the documentation should probably show it.

(2) The code as presented returns mirrored results, for example "Praveena" "Karin" 1.0 and "Karin" "Praveena" 1.0. The algorithm is symmetric, and the documented examples don't show these duplicate entries, but I don't see a way of removing them other than some sort of comparison on id(node), which is a bit ugly.

(3) The results for Zhen - Anya and Zhen - Karin seem unexpected to me. Both should return 0, as the pairs have no dimensions in common. The documented example does show both returning 0, but in my results Zhen - Anya gives 0 when streaming and Zhen - Karin has no result at all. Passing empty vectors (indicating no dimensions in common) into gds.alpha.similarity.cosine() also generates an error instead of the expected 0.
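For checking point (3) by hand, here is a minimal Python sketch of the kind of manual calculation involved. It is not the GDS implementation. It assumes that a position is skipped when either weight is NaN (which is how the queries below encode a missing LIKES relationship) and that a pair with nothing left to compare should score 0, as expected above. The function name and all weights are made up for illustration.

```python
import math


def cosine_skip_nan(a, b):
    """Cosine similarity over the positions where neither weight is NaN.

    Assumption: a pair with no usable positions scores 0.0 rather than
    raising an error.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Keep only the dimensions that both vectors actually rate.
    pairs = [(x, y) for x, y in zip(a, b)
             if not (math.isnan(x) or math.isnan(y))]
    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


nan = float("nan")

# Hypothetical weight vectors, one slot per cuisine (not the documented data).
u = [10.0, 6.0, nan, nan]
v = [nan, nan, 9.0, 7.0]
w = [8.0, 9.0, 7.0, nan]

no_overlap = cosine_skip_nan(u, v)   # no shared dimensions -> 0.0
overlap = cosine_skip_nan(u, w)      # two shared dimensions
empty = cosine_skip_nan([], [])      # empty vectors -> 0.0 under the assumption
single_shared = cosine_skip_nan([3.0, nan], [5.0, nan])  # -> 1.0
```

Under these assumptions, two people who share exactly one rated dimension with positive weights always score 1.0, whatever the weights are, which would be consistent with a pair such as Praveena and Karin coming back as 1.0.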

Handy queries:

// Person name and data being passed into gds.alpha.similarity.cosine.stream
MATCH (p:Person), (c:Cuisine)
 OPTIONAL MATCH (p)-[likes:LIKES]->(c)
 WITH p, {item:id(p), weights: collect(coalesce(likes.score, gds.util.NaN()))} AS userData
 WITH p, collect(userData) AS data
 RETURN p.name, data

The query whose results differ from both my manual calculations and the documented results:

MATCH (p:Person), (c:Cuisine)
 OPTIONAL MATCH (p)-[likes:LIKES]->(c)
 WITH {item:id(p), weights: collect(coalesce(likes.score, gds.util.NaN()))} AS userData
 WITH collect(userData) AS data
 CALL gds.alpha.similarity.cosine.stream({nodeProjection:'*', relationshipProjection:'*', data: data})
 YIELD item1, item2, count1, count2, similarity
 RETURN gds.util.asNode(item1).name AS from, gds.util.asNode(item2).name AS to, similarity
 ORDER BY similarity DESC
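The mirrored rows from point (2) could also be dropped on the client side once the query has returned. This is a minimal Python sketch of that idea, assuming the rows arrive as dictionaries with item1, item2 and similarity keys. The helper name, node ids and scores are made up for illustration.

```python
def drop_mirrored_pairs(rows):
    """Keep the first row seen for each unordered (item1, item2) pair."""
    seen = set()
    kept = []
    for row in rows:
        # Order the two ids so that (1, 4) and (4, 1) map to the same key.
        key = (min(row["item1"], row["item2"]),
               max(row["item1"], row["item2"]))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical result rows; the ids stand in for id(node) values.
rows = [
    {"item1": 1, "item2": 4, "similarity": 1.0},
    {"item1": 4, "item2": 1, "similarity": 1.0},
    {"item1": 0, "item2": 2, "similarity": 0.5},
]

unique_rows = drop_mirrored_pairs(rows)
```

This is the same id comparison mentioned in point (2), just applied after the fact. Tracking the pairs already seen means a pair is still kept when only one of its two directions was returned.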
1 REPLY

Initial observation: this is an alpha-tier algorithm, so I don't think we should be too surprised by bugs in the algorithm or the documentation. I've seen issues with other alpha-status GDS algorithms. You might file a GitHub issue here: https://github.com/neo4j/graph-data-science

Also, there is a new issue with Neo4j 4.1 that I stumbled across while working through these examples: a change in behavior breaks this part of the Cypher query

collect(coalesce(likes.score, gds.util.NaN()))

After some investigation I narrowed it down to just this

RETURN likes.score

which throws a Neo.DatabaseError.Statement.ExecutionFailed error in Neo4j 4.1 (at least when likes comes back from an OPTIONAL MATCH). The error detail is blank, but the logs, together with the fact that the fix below works, indicate that the cause is an attempt to "reference a property on a null relationship".

So there are two obvious workarounds, though there are probably other ways...

HACK FIX1: at the relationship level, substitute a stand-in map with a null score property when the relationship is null (using an inline CASE expression)

(CASE
 WHEN likes IS NULL THEN {score: null}
  ELSE likes
END)

HACK FIX2: at the property level, return null directly when the relationship is null

(CASE
 WHEN likes IS NULL THEN null
  ELSE likes.score
END)

The full Cypher query now:

MATCH (p:Person), (c:Cuisine)
 OPTIONAL MATCH (p)-[likes:LIKES]->(c)
 WITH {item:id(p), weights: collect(coalesce((CASE
 WHEN likes IS NULL THEN null
  ELSE likes.score
END), gds.util.NaN()))} AS userData
 WITH collect(userData) AS data
 CALL gds.alpha.similarity.cosine.stream({data: data})
 YIELD item1, item2, count1, count2, similarity
 RETURN gds.util.asNode(item1).name AS from, gds.util.asNode(item2).name AS to, similarity
 ORDER BY similarity DESC