10-05-2021 12:41 AM
Let us assume there are millions of nodes with a certain label, say Person. Every Person node has a property called fullName. I want to return the top 5 matching nodes for each node by comparing each Person's fullName with all the others. Example: Person A has fullName 'Michaels', Person B 'Michael', Person C 'Michel', and so on. Using an apoc text function, I can return the top matching names based on their scores. I want these top matches for every Person node (millions of nodes). I tried to write a Cypher query for this, but it takes so long that it never returns results. It would be very helpful if this could be done efficiently. Thanks
10-05-2021 04:14 AM
You can index the fullName with a fulltext index and then use fuzzy matching on that.
Something along the lines of:
MATCH (p:Person)
CALL db.index.fulltext.queryNodes("name-index", '"' + p.fullName + '"~1', {limit: 5}) YIELD node, score
RETURN count(*)
Otherwise you can store a phonetic version of the name and aggregate/search on that.
With the apoc text functions you get text similarities, but basically as a cross product.
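To illustrate the phonetic idea outside the database, here is a minimal Python sketch. It assumes the classic American Soundex rules as the phonetic encoding; the function names `soundex` and `block_by_key` are made up for this example and are not part of APOC or Neo4j. Names that share a key land in the same block, so the expensive similarity scoring only has to run inside each block instead of across all pairs.

```python
from collections import defaultdict

# Soundex digit for each coded consonant; vowels, y, h and w have no code.
_CODES = {}
for _digit, _letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                         "4": "l", "5": "mn", "6": "r"}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return a 4-character Soundex key; non-letters (digits etc.) are ignored."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (key + "000")[:4]


def block_by_key(names):
    """Group names by phonetic key (hypothetical helper for this sketch)."""
    blocks = defaultdict(list)
    for name in names:
        blocks[soundex(name)].append(name)
    return dict(blocks)


names = ["michael123", "michael678", "michel124", "shawn456"]
blocks = block_by_key(names)
```

Phonetic keys are coarse: 'Michaels' encodes to a different key than 'Michael' because of the trailing 's', so this kind of blocking can miss pairs that an edit-distance score would catch.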
10-06-2021 01:47 PM
Thanks for the response, Michael! However, for my use case, I wanna find similar matches even if the entire text is not matched. For example, let's have 4 nodes,
Person A has fullName - 'michael123'
Person B - 'michael678'
Person C - 'michel124'
Person D - 'shawn456'
In this case, if I query using Person A's fullName 'michael123', I won't get nodes B and C, which also have similar names. Ideally, I would want A to be matched with both B and C, with a higher score for B and a lower score for C. I don't want to use apoc text similarity because it is too slow, so it would be helpful if this could be solved another way.
10-06-2021 03:02 PM
Try this:
match (a:Person {fullName: "michael123"})
match (b:Person) where b.fullName <> a.fullName
with a, b, apoc.text.clean(a.fullName) as norm1, apoc.text.clean(b.fullName) as norm2
with toInteger(apoc.text.jaroWinklerDistance(norm1, norm2) * 100) as similarity, a, b
with a, b, similarity where similarity >= 80
return a.fullName as aname, b.fullName as bname, similarity
Result:
With person fullName = "michael678" instead of "michael1678" the similarity drops down to 88.
10-08-2021 05:05 PM
Hi @ameyasoft
Thank you for your response. This approach works well for a small amount of data.
But it takes far too long on a large number of nodes (millions of nodes).
That's why I mentioned that I didn't want to use the apoc text similarity functions.
It would be great if there were another way to quickly find the top matching nodes for each node among millions of nodes.
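One common way to avoid scoring every pair is to generate candidates from shared character n-grams first and only score those. The Python sketch below is an illustration under assumptions, not something proposed in this thread: it uses padded trigrams, an in-memory inverted index, and Jaccard similarity as the score. The names `trigrams`, `top_matches`, and `min_shared` are made up for the example, and names are assumed to be unique (in practice you would key by node id).

```python
import heapq
from collections import Counter, defaultdict


def trigrams(text):
    """Set of character trigrams, padded so short prefixes still produce grams."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def top_matches(names, k=5, min_shared=2):
    """For each name, return up to k (other_name, score) pairs, best first."""
    grams = {name: trigrams(name) for name in names}

    # Inverted index: trigram -> names containing it.
    index = defaultdict(set)
    for name, gs in grams.items():
        for g in gs:
            index[g].add(name)

    result = {}
    for name, gs in grams.items():
        # Count shared trigrams only for names that share at least one.
        shared = Counter()
        for g in gs:
            for other in index[g]:
                if other != name:
                    shared[other] += 1
        # Jaccard similarity = |intersection| / |union| of the trigram sets.
        scored = [
            (count / (len(gs) + len(grams[other]) - count), other)
            for other, count in shared.items()
            if count >= min_shared
        ]
        result[name] = [(other, round(score, 2))
                        for score, other in heapq.nlargest(k, scored)]
    return result


names = ["michael123", "michael678", "michel124", "shawn456"]
matches = top_matches(names)
```

Here 'michael123' is matched with 'michael678' first and 'michel124' second, while 'shawn456' gets no candidates at all. Very common trigrams produce long posting lists, so on millions of names you would likely need to cap or skip them.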
10-09-2021 09:51 AM
What about training a node embedding based on the node's fullName property, and then finding the top-N nodes based on node similarity?
10-10-2021 03:55 AM
How will I train embeddings on nodes having properties holding string values? (As any GDS algorithm would work only with numbers)
Also, would node embeddings work for disconnected nodes? (as the emails are not connected.)
11-30-2021 07:42 AM
I would like to know a solution to this as well. One approach I took (but it doesn't really solve this exactly) was to create a new FullName node (which could also be done as a set of Metaphone / Soundex nodes -- not sure if that works with numbers or not) and link my accounts to that node... then could maybe collapse a path between those that share a phonetic encoding as similar name?
10-10-2021 08:20 AM
What about using a transformer to get a vector representation of your fullName string, and then storing that as a node property? Are your Person nodes disconnected?
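The general shape of that idea (string to vector, then nearest vectors) can be sketched in Python. To be clear about the substitution: this example does not use a transformer. It uses plain character-bigram counts as a stand-in vector so that it runs without any model, and the names `bigram_counts`, `cosine`, and `top_n` are made up for illustration. Because the vector is computed from the string alone, the approach does not depend on the nodes being connected.

```python
import math
from collections import Counter


def bigram_counts(text):
    """Sparse stand-in 'embedding': counts of character bigrams in the string."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def top_n(query, vectors, n=5):
    """Names of the n vectors most similar to the query's vector (brute force)."""
    scored = [(cosine(vectors[query], vec), name)
              for name, vec in vectors.items() if name != query]
    return [name for _, name in sorted(scored, reverse=True)[:n]]


names = ["michael123", "michael678", "michel124", "shawn456"]
vectors = {name: bigram_counts(name) for name in names}
nearest = top_n("michael123", vectors, n=2)
```

The `top_n` function here still compares the query against every other vector, so at the scale of millions of nodes it would need an approximate nearest-neighbor index rather than this brute-force loop. The ranking also depends on the representation chosen, so a real encoder may order the candidates differently.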