Neo4j

mariel · 10-27-2019

Hey there,
I'm working with Euclidean distance in Neo4j and have come across what appears to be an error in the docs. I'm hoping someone can shed some light on it.

See the Neo4j documentation here: https://neo4j.com/docs/graph-algorithms/current/labs-algorithms/euclidean/

According to the documentation, the lower the similarity score, the MORE similar the items: "a score of 0 would indicate that users have exactly the same preferences".

However, when I look closely at their examples, they say that "Zhen" and "Arya", with a similarity score of 0, are the closest in similarity. But when I look at their food rating scores, "Zhen" and "Arya" have not rated any of the same food types; they have nothing in common. However, "Praveena" and "Arya", with a similarity score of 8.0, actually both rated (Portuguese - 7) and they both at least rated (Mauritian).

Here, it seems as though the HIGHER the score, the MORE similar the users. Any thoughts on whether this is just a mistake in the documentation, or have I missed something?

Thanks,
Mariel

mark_needham · 10-28-2019

Hey,

So with Euclidean distance it is correct that the lower the score, the more similar they are, but you are correct that in that example a score of 0 means that they aren't similar at all. I'll update the documentation to sort that out.

Cheers, Mark

mariel · 10-28-2019

Hi Mark,
Thanks so much for the feedback. I'm still not convinced that a lower score means more similar. Perhaps it is my implementation of it.

I'll use an example. I'm using the graph to build a recommendation engine that recommends shows to users based on their tastes. Each Show in the graph is represented by a node, and has a relationship to a "dimension" node for genres such as "MUSICAL_THEATRE", "COMEDY" or "MUSICAL".

Currently, there is only 1 show in the database that matches the "MUSICAL_THEATRE" dimension, Show A "Every Silver Lining". When I query the database for:

MUSICAL_THEATRE = 15
Result is Show A "Every Silver Lining", with a similarity score of 13.

If I query a different dimension, such as MUSICAL = 15
Result is:
Show B "Literally Titanium" = 14
Show C "Tita Jokes" = 12
Show A "Every Silver Lining" = 11

If I attempt to do a query with both dimensions, I would expect the most similar show to be one that overlaps between both dimensions, which is Show A "Every Silver Lining". When I query both dimensions:

dimensions=MUSICAL_THEATRE,MUSICAL&weights=15,15

I get back the result

Show A "Every Silver Lining" = 17.029
Show B "Literally Titanium" = 14
Show C "Tita Jokes" = 13

As Show A matches both dimensions, "MUSICAL_THEATRE" and "MUSICAL", I would expect it to be the MOST similar match, rather than Show C, which has the lowest score and only matches the one dimension of MUSICAL.

Any thoughts on whether I'm missing something, or whether the way I've implemented it would explain why a larger score = MORE similar? Happy to provide more details if this doesn't make sense, and I really appreciate your assistance on this.
-Mariel

mark_needham · 10-31-2019

Hey,

If you want to calculate similarity based on overlap of items, I think the Jaccard Similarity algorithm might be a better choice - https://neo4j.com/docs/graph-algorithms/current/labs-algorithms/jaccard/
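To make the overlap idea concrete, here is a minimal Python sketch of Jaccard similarity (size of the intersection divided by size of the union). It is not the library's implementation, and the dimension sets below are hypothetical, loosely modelled on the shows in this thread; the COMEDY dimension on "Tita Jokes" in particular is made up for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return len(a & b) / len(union)

# Hypothetical dimension sets (assumptions, not data from the real graph).
query = {"MUSICAL_THEATRE", "MUSICAL"}
shows = {
    "Every Silver Lining": {"MUSICAL_THEATRE", "MUSICAL"},
    "Literally Titanium": {"MUSICAL"},
    "Tita Jokes": {"MUSICAL", "COMEDY"},
}

scores = {name: jaccard(query, dims) for name, dims in shows.items()}

# Higher means more overlap, so sort in descending order.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(name, round(score, 3))
```

With these made-up sets, the show that matches both dimensions ranks first, which is the ordering you were expecting. Note that Jaccard ignores the weights entirely; it only looks at which dimensions are present.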

The Euclidean (and Cosine/Pearson) algorithms would make more sense if you were doing something like comparing the ratings that different people had given to shows so that you can compare their tastes.
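One observation on the numbers in your two-dimension query: the 17.029 for Show A is numerically what you get by combining its two single-dimension scores (13 and 11) like the sides of a right triangle. Whether your query actually computes it that way is an assumption on my part, since I haven't seen the query, but it would fit how Euclidean distance works: every extra dimension with a non-zero difference adds to the total, so matching more dimensions can push the distance up rather than down.

```python
import math

# Scores reported for Show A in the two single-dimension queries above.
musical_theatre_only = 13
musical_only = 11

# Assumption: the two-dimension result combines them as a Euclidean norm,
# i.e. sqrt(13^2 + 11^2) = sqrt(290).
combined = math.hypot(musical_theatre_only, musical_only)
print(round(combined, 3))  # 17.029
```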

From playing around with Euclidean, I realise that a score of 0 could mean the scores are identical but it could also mean there's no overlap at all.

RETURN algo.similarity.euclideanDistance([7,7,7], [7,7,7]) AS similarity1,
       algo.similarity.euclideanDistance([7,8,7], [8,8,7]) AS similarity2,
       algo.similarity.euclideanDistance([1,1,1], [7,7,7]) AS similarity3

You can also see in the results below that similarity3, where the scores are less similar, has a higher value than the other two comparisons:

╒═════════════╤═════════════╤══════════════════╕
│"similarity1"│"similarity2"│"similarity3"     │
╞═════════════╪═════════════╪══════════════════╡
│0.0          │1.0          │10.392304845413264│
└─────────────┴─────────────┴──────────────────┘
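These values can be checked by hand with the textbook formula, the square root of the sum of squared differences. The function below is a plain Python sketch of that formula, not the library's code:

```python
import math

def euclidean_distance(xs, ys):
    """Square root of the sum of squared element-wise differences."""
    if len(xs) != len(ys):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))

similarity1 = euclidean_distance([7, 7, 7], [7, 7, 7])  # identical vectors
similarity2 = euclidean_distance([7, 8, 7], [8, 8, 7])  # one rating differs by 1
similarity3 = euclidean_distance([1, 1, 1], [7, 7, 7])  # every rating differs by 6

print(similarity1, similarity2, similarity3)
```

The third value is sqrt(3 * 6^2) = sqrt(108), roughly 10.39, which matches the table.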

And then this one has no overlap:

WITH [
	{item: 1, weights: [algo.NaN(),2,algo.NaN()]},
	{item: 2, weights: [1,algo.NaN(),4]}
] AS data

CALL algo.similarity.euclidean.stream(data)
YIELD item1, item2, count1, count2, similarity
RETURN item1 AS from, item2 AS to, similarity
ORDER BY similarity

It also has a score of 0:

╒══════╤════╤════════════╕
│"from"│"to"│"similarity"│
╞══════╪════╪════════════╡
│1     │2   │0.0         │
└──────┴────┴────────────┘

We probably need to see what other libraries do about this type of situation - should we be returning a null similarity if there's no overlap between the arrays? I'm not sure!
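To illustrate the two options, here is a small Python sketch. It assumes that positions where either vector holds NaN are skipped before the distance is computed, which would be consistent with the 0.0 above, though I haven't confirmed that is exactly what the procedure does internally. The function name and the `none_if_no_overlap` flag are made up for this example.

```python
import math

def euclidean_skip_nan(xs, ys, none_if_no_overlap=False):
    """Euclidean distance over positions where both values are present.

    Assumption: NaN marks a missing value and such positions are ignored.
    """
    pairs = [
        (x, y) for x, y in zip(xs, ys)
        if not (math.isnan(x) or math.isnan(y))
    ]
    if not pairs and none_if_no_overlap:
        return None  # no shared positions: report "undefined" instead of 0.0
    # With no shared positions the sum is empty, so the result is 0.0.
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))

nan = float("nan")
item1 = [nan, 2, nan]
item2 = [1, nan, 4]

no_overlap_as_zero = euclidean_skip_nan(item1, item2)
no_overlap_as_none = euclidean_skip_nan(item1, item2, none_if_no_overlap=True)
identical = euclidean_skip_nan([7, 7, 7], [7, 7, 7], none_if_no_overlap=True)

print(no_overlap_as_zero, no_overlap_as_none, identical)
```

With the flag off, "no overlap" and "identical" both come back as 0.0 and can't be told apart. With it on, only the identical vectors score 0.0 and the no-overlap pair comes back as None, which a caller could filter out before ranking.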

Euclidean Distance Similarity Question