Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Comparison on concatenated string extremely slow

I have Person nodes with basic string fields such as firstName, lastName, fatherName, and motherName, and I am trying to link nodes based on those fields.

A simple query where I compare motherName to the concatenation of firstName and lastName, such as

match (p1:Person) match (p2:Person) where p1.motherName = p2.firstName + ' ' + p2.lastName return p1, p2 limit 500

takes a really long time (removing the ' ' from the concatenation makes no difference), while a query comparing exact fields, such as

match (p1:Person) match (p2:Person) where p1.motherName = p2.firstName return p1, p2 limit 500

only takes a few seconds.

I have noticed something peculiar regarding transaction memory: for the first query, estimatedUsedHeapMemory is always 2097152 and currentQueryAllocatedBytes is 64, yet I can see the database consuming around 7.5 GB of memory.

When running the second query, the reported heap and query memory usage is much larger.

I had successfully run a query to link persons and fathers, matching on exact fields, which took 2.5 hours, while the query for the mothers, which needs to compare concatenated strings, was still running after 9 hours with no result.

This is the father-linking query, which was successful:

CALL apoc.periodic.iterate(
    "match (p1:Person) match (p2:Person) where p1.fatherName=p2.firstName and p1.lastName=p2.lastName and p1.registryTown=p2.registryTown and p1.registryKadaa=p2.registryKadaa and p1.registryMouhafaza=p2.registryMouhafaza and p1.registryNumber=p2.registryNumber and p1.dateOfBirth>p2.dateOfBirth return p1,p2",
    "MERGE (p1)-[:CHILD_OF {parentRelationship:'FATHER'}]->(p2)",
    {batchSize:5000}
)

I have 4 million nodes and my DB size is 3.14 GB; these are my memory settings:

- NEO4J_server_memory_heap_max__size=5G
- NEO4J_server_memory_heap_initial__size=5G
- NEO4J_server_memory_pagecache_size=7G

1 ACCEPTED SOLUTION

Looking at the query plan for the 'fast' query, it is executed with a hash join between the two sets of nodes, thanks to the simple equality in the 'where' clause. I suspect you have indexes on these properties?

plan2.png

On the other hand, the 'slow' query needs to form a Cartesian product and then filter each pair of nodes with the 'where' clause predicate. It looks like it is computing the concatenation of firstName and lastName each time a pair of nodes is evaluated. This means that if you have N nodes in your data set, each node has its firstName and lastName concatenated N times (since you did not exclude p1 = p2), for a total of N² string concatenations.

It seems you could get performance similar to the 'fast' query if you create a property that stores the concatenation of firstName and lastName and add an index on that property. If you don't want to maintain the property in your data, you can create it temporarily before executing the query. You would then compute the concatenation only N times, once when storing the data, instead of N² times every time the query is executed.
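As a sketch, the precomputation and index could look like the following (the property name fullName and the batch size are illustrative assumptions, not from the original post):

```cypher
// Compute the concatenation once per node (N operations total),
// in batches, then index the new property. fullName is an
// assumed name; pick whatever fits your model.
CALL apoc.periodic.iterate(
  "MATCH (p:Person) RETURN p",
  "SET p.fullName = p.firstName + ' ' + p.lastName",
  {batchSize: 10000});

CREATE INDEX person_fullName IF NOT EXISTS
FOR (p:Person) ON (p.fullName);
```

With the index in place, an equality comparison against fullName can be planned as a hash join instead of a filtered Cartesian product.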

plan.png

This is pure speculation, but maybe the increased memory use of the 'fast' query is due to loading the indexes and the associated data structures needed to perform the hash join.

View solution in original post

7 REPLIES

Use MATCH (p1), (p2) WHERE ...

NOT MATCH (p1) MATCH (p2); that will cause a Cartesian product, so 4 million × 4 million.

I had tried using MATCH (p1), (p2), but got a warning from the Neo4j Desktop interface that writing the query this way would produce a Cartesian product. When writing it as MATCH (p1) MATCH (p2) I did not get this warning.

Also, I had done some minor tests using both forms, and the query was similarly slow in both cases.

If you look at the query plan for both, you should see that each forms a Cartesian product.

The only difference between the two forms appears when you have patterns with paths.
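You can confirm this by prefixing either form with EXPLAIN, which shows the plan without executing the query; both forms should contain a CartesianProduct operator:

```cypher
// EXPLAIN returns the plan only; look for a CartesianProduct
// operator in the output for both forms of the query.
EXPLAIN
MATCH (p1:Person), (p2:Person)
WHERE p1.motherName = p2.firstName + ' ' + p2.lastName
RETURN p1, p2 LIMIT 500
```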

Thank you, will keep that in mind 


This is what I ended up doing. I created a new property which is the result of firstName + ' ' + lastName and ran the comparison query on it.

As for the indexes, I initially had indexes on firstName and dateOfBirth, but when I first ran the father-linking query I had the same slowness and memory-limit issues. I had to drop all the indexes on Person for that query to work.
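The rewritten mother-linking query would then mirror the father-linking one. A hypothetical sketch, assuming the new property is named fullName and reusing the relationship convention from the father query:

```cypher
// Hypothetical sketch: same structure as the father-linking query,
// but matching the precomputed fullName property instead of
// concatenating firstName and lastName on every comparison.
CALL apoc.periodic.iterate(
  "MATCH (p1:Person) MATCH (p2:Person)
   WHERE p1.motherName = p2.fullName
   RETURN p1, p2",
  "MERGE (p1)-[:CHILD_OF {parentRelationship:'MOTHER'}]->(p2)",
  {batchSize:5000})
```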

In the end, you want a ValueHashJoin and not a huge Cartesian product.