Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
09-19-2019 11:44 AM
Hi guys! I'm new to neo4j and I need some help with the current task.
I have a large number of nodes (for example 20000 or more) in one group, for example, People
. Each node in the group has a text property called name
. We can describe each node with such a JSON document:
{ "name": "Andrew smth else" }
.
The task is to remove all nodes with a similar name
value, for comparison I have to use a function from the apoc module (for example apoc.text.levenshteinSimilarity).
The problem is that my query contains a cartesian product and its execution time for 20000 nodes is extremely long.
My current cypher request:
match (p1:People)
with p1
match (p2:People)
with p1, p2, apoc.text.levenshteinSimilarity(p1.name, p2.name) as lds
where p1 <> p2 and lds > 0.8
delete p2
Explain:
Maybe there is another way of such a comparison? Or a way to speed up this query?
I will be glad for any help! Thanks and have a good day!
Solved! Go to Solution.
09-19-2019 11:52 AM
You'll end up with a cartesian product either way (that's the only real way to compare every node with every other node in the set) but there is a more efficient way to end up with that result set rather than doing a full person match per person. If you collect the person nodes and UNWIND twice, you'll end up with the cartesian product in what should be a more efficient way:
match (p:People)
with collect(p) as people
unwind people as p1
unwind people as p2
with p1, p2, apoc.text.levenshteinSimilarity(p1.name, p2.name) as lds
where p1 <> p2 and lds > 0.8
delete p2
09-19-2019 11:52 AM
You'll end up with a cartesian product either way (that's the only real way to compare every node with every other node in the set) but there is a more efficient way to end up with that result set rather than doing a full person match per person. If you collect the person nodes and UNWIND twice, you'll end up with the cartesian product in what should be a more efficient way:
match (p:People)
with collect(p) as people
unwind people as p1
unwind people as p2
with p1, p2, apoc.text.levenshteinSimilarity(p1.name, p2.name) as lds
where p1 <> p2 and lds > 0.8
delete p2
09-20-2019 02:12 AM
Hi Andrew!
Sorry to hear that, but thanks anyway for your reply and sample request!
09-22-2019 07:57 PM
One other note, to avoid mirrored results where the same nodes are used, but the variables switched (which would result in deleting all pairs for that query) you should use WHERE id(p1) < id(p2)
.
09-23-2019 01:57 PM
yes, I saw this expression in some posts, I use it too, thanks.
But I found another question, what if there are a lot more nodes in my database? For example 8M? collect(p)
killed my machine =(
Is there a way to bypass this? I have 16Gb of RAM and the heap size set to 10Gb
Or do I need to get a server with more computation capabilities?
09-23-2019 02:31 PM
Ah, in this case you'll either have to up your RAM and heap size, or revert to your previous version of the query with the two MATCHes. No way around that if you have more person nodes than your heap can cope with at the same time.
09-23-2019 03:52 PM
That is, two MATCH statements for the People group can help with this, but with this dataset it will be a extremely long, am I right? + increase RAM with heap size
All the sessions of the conference are now available online