Neo4j

awesomeanonymou · ‎10-05-2021

Let us assume there are millions of nodes with a certain label, say Person. Every Person node has a property called fullName. I want to return the top 5 matching nodes for each node by comparing each Person's fullName with all the others. For example, person A has fullName 'Michaels', person B 'Michael', person C 'Michel', and so on. Using an APOC text function, I can return the top matching names ranked by score. I want these top matches for every Person node (across millions of nodes). I tried to write a Cypher query for this, but it takes so long that it never returns results. It would be very helpful if this could be done in an efficient and quick way. Thanks

michael_hunger · ‎10-05-2021

You can index the fullName with a fulltext index and then use fuzzy matching on that.

Something along the lines of:

MATCH (p:Person) 
CALL db.index.fulltext.queryNodes("name-index",'"'+p.fullName+'"~1', {limit: 5}) yield node, score
RETURN count(*)

Otherwise you can store a phonetic version of the name and aggregate/search on that.
With the APOC text functions you get text similarities, but you are basically computing a cross product.
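
To make the phonetic idea concrete, here is a minimal Python sketch, outside Cypher, of grouping names by a phonetic key so that comparisons only happen inside each group rather than across all pairs. The key below is a simplified Soundex-style code written for illustration (it skips digits and does not implement the full Soundex rules); `phonetic_key` and `group_by_key` are hypothetical names, not APOC functions.

```python
from collections import defaultdict

# Soundex-style consonant classes; vowels and h, w, y carry no code.
_CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
        "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}


def phonetic_key(name: str) -> str:
    """Simplified Soundex-style key: first letter plus three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        code = _CODES.get(c, "")
        # Emit a digit only when the consonant class changes.
        if code and code != prev:
            key += code
        prev = code
    # Pad with zeros and truncate to a fixed length of four.
    return (key + "000")[:4]


def group_by_key(names):
    """Bucket names by phonetic key; only names in a bucket get compared."""
    groups = defaultdict(list)
    for name in names:
        groups[phonetic_key(name)].append(name)
    return dict(groups)
```

With the names from the question, Michael and Michel share the key M240, while Michaels picks up an extra code for the trailing s (M242) and lands in a different group, which shows the limits of this kind of blocking.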

awesomeanonymou · ‎10-06-2021

Thanks for the response, Michael! However, for my use case, I want to find similar matches even if the entire text does not match. For example, let's take 4 nodes:
Person A has fullName - 'michael123'
Person B - 'michael678'
Person C - 'michel124'
Person D - 'shawn456'

In this case, if I query using Person A's fullName 'michael123', I won't get nodes B and C, which also have similar names. Ideally, I would want A to be matched with both B and C, with a higher and a lower score respectively. I don't want to use APOC text similarity because it is too slow, so it would be helpful if this could be solved some other way.

ameyasoft · ‎10-06-2021

Try this:

match (a:Person {fullName: "michael123"})
match (b:Person) where b.fullName <> a.fullName

with a, b, apoc.text.clean(a.fullName) as norm1, apoc.text.clean(b.fullName) as norm2
with toInteger(apoc.text.jaroWinklerDistance(norm1, norm2) * 100) as similarity, a, b
with a, b,similarity where similarity >= 80 
return a.fullName as aname, b.fullName as bname, similarity

Result:

With person fullName = "michael678" instead of "michael1678" the similarity drops down to 88.

awesomeanonymou · ‎10-08-2021

Hi @ameyasoft
Thank you for your response. This approach works well for small amounts of data.

But it takes far too long when it is run on a large number of nodes (millions of nodes).
That's why I mentioned that I didn't want to use the APOC text similarity functions.
It would be great if there were another way to quickly find the top matching nodes for each node among millions of nodes.
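
One common way to avoid comparing every pair, sketched here in Python rather than Cypher, is to index character trigrams and only score names that share at least one trigram with the query name. This is a sketch under assumptions, not a Neo4j feature: the function names are hypothetical, the score is the Jaccard overlap of trigram sets rather than Jaro-Winkler, and at millions of names very common trigrams would likely need pruning.

```python
import heapq
from collections import Counter, defaultdict


def trigrams(text):
    """Set of character trigrams, padded so prefixes weigh a bit more."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def build_index(names):
    """Inverted index: trigram -> set of positions in `names`."""
    index = defaultdict(set)
    for pos, name in enumerate(names):
        for gram in trigrams(name):
            index[gram].add(pos)
    return index


def top_matches(pos, names, index, k=5):
    """Top-k (score, name) pairs for names[pos], best first."""
    grams = trigrams(names[pos])
    shared = Counter()
    # Only names sharing at least one trigram are ever touched.
    for gram in grams:
        for other in index[gram]:
            if other != pos:
                shared[other] += 1
    scored = []
    for other, n_shared in shared.items():
        union = len(grams) + len(trigrams(names[other])) - n_shared
        scored.append((n_shared / union, names[other]))
    return heapq.nlargest(k, scored)
```

On the four example names from earlier in the thread, michael123 is matched to michael678 first and michel124 second, and shawn456 is never scored because it shares no trigram with the query.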

lingvisa · ‎10-09-2021

What about training a node embedding based on the node's fullName property, and then finding the top N nodes based on node similarity?

awesomeanonymou · ‎10-10-2021

How would I train embeddings on nodes whose properties hold string values (as any GDS algorithm would work only with numbers)?

Also, would node embeddings work for disconnected nodes (as the emails are not connected)?

michael_h_schoe · ‎11-30-2021

I would like to know a solution to this as well. One approach I took (though it doesn't solve this exactly) was to create a new FullName node and link my accounts to that node. The same could be done with a set of Metaphone / Soundex nodes, though I'm not sure whether that works with numbers. Then one could maybe collapse a path between accounts that share a phonetic encoding and treat them as having similar names?

lingvisa · ‎10-10-2021

What about using a transformer to get a vector representation of your fullName string, and then storing that as a node property? Are your Person nodes disconnected?
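
A transformer model needs an external library, so as a stand-in the Python sketch below builds a much simpler vector: a hashed bag of character trigrams, normalized so that a dot product equals cosine similarity. It only illustrates the idea of turning a string into a fixed-length numeric vector that does not depend on relationships between nodes; `ngram_vector`, `cosine` and the dimension of 256 are illustrative choices, and a learned embedding would behave differently.

```python
import math
import zlib


def ngram_vector(text, n=3, dim=256):
    """Unit-length hashed bag of character n-grams (not a learned embedding)."""
    vec = [0.0] * dim
    padded = f" {text.lower()} "
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        # crc32 is stable across runs, unlike Python's built-in hash().
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(u, v):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(a * b for a, b in zip(u, v))
```

Because the vector is computed from the string alone, it can be produced for every Person node even when the nodes are disconnected, and the resulting list of numbers is the kind of value a nearest-neighbour search could work on.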

Neo4j Cypher query to quickly find nodes with similar text property value