Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
02-13-2020 07:04 AM
I have the following query which runs very slowly:
However, removing a single label from the left-most matched node massively increases the speed due to preventing a cartesian join:
I don't understand why the label makes such a (terrible) difference to the query plan.
02-13-2020 07:37 AM
To avoid cartesian product, you are better to constrain the labels after separately and compare those constraints. so match (vi) where vi:first_label and vi:second_label return vi;
But obviously add the rest of your query around this. there is an article on this somewhere i will try and find, which probably explains it better
02-13-2020 07:49 AM
Found it 🙂 https://neo4j.com/developer/kb/how-do-i-report-on-nodes-with-multiple-labels/ Hope this helps. Let me know if you have anymore questions
02-13-2020 08:43 AM
Hi thanks for the quick response. I tried the suggested change which appears to have no effect on the query plan.
The following 3 variants run exactly the same query plan:
PROFILE
MATCH (vi:VI_PS_xxxx:VI_PH_xxxx)-[:VI_PH]->(ph:PH_VI_xxxx)
RETURN count(DISTINCT ph) AS PH_Id
PROFILE
MATCH (vi)-[:VI_PH]->(ph)
WHERE vi:VI_PS_xxxx AND vi:VI_PH_xxxx AND ph:PH_VI_xxxx
RETURN count(DISTINCT ph) AS PH_Id
PROFILE
MATCH (vi) WHERE vi:VI_PS_xxxx AND vi:VI_PH_xxxx
WITH vi
MATCH (vi)-[:VI_PH]->(ph:PH_VI_xxxx)
RETURN count(DISTINCT ph) AS PH_Id
02-13-2020 08:45 AM
Neo 3.5.3 Enterprise
02-13-2020 08:49 AM
also as I'm not familiar with the data model it you might be able to further improve this performance. In both profiles you will see a Expand(all) block, which basically has to iterate over all the relationships for each starting node. For example if you have 10 nodes that match the first MATCH
and each has 100 relationships named :VI_PH
then its 10 x 100 operations to count. However there is metadata at each node which is a precomputed value representing the number of relationships by direction. For example if there a node which has 5 relationships associated named :VI_PH
that have a outgoing direction and 12 relationships associated named :VI_PH
that have a incoming direction then this data is recorded in the metadata and thus does not require you to iterate over each relationship and count 1 by 1. To access the metadata this would be accomplished via a return SIZE ( (:VI_PS_?????)-[:VI_PH]->() ) ;
When using this shortcut the PROFILE will include a 'GetDegree` function, for example
Projection
15,636
db hits
n, size ( (n)-[:IS_ASSIGNED_
TO]->() )
{size ( (n)-[:IS_ASSIGNED_TO
]->() ) : GetDegree(Variable
(n),Some(RelTypeName(IS_
ASSIGNED_TO)),OUTGOING)}
See https://neo4j.com/developer/kb/how-do-i-improve-the-performance-of-counting-number-of-relationships-... for more details
02-13-2020 10:35 AM
The query you're executing is using COMPILED runtime. I'm curious if prefixing it with CYPHER runtime=SLOTTED
or CYPHER runtime=INTERPRETED
would get you a better plan. Can you let us know?
02-13-2020 11:05 AM
The plans for SLOTTED and INTERPRETED are identical except that the Expand(Into)
portion has 112,173,677 db hits instead of 8,062,221 db hits for COMPILED
02-13-2020 11:57 AM
Thanks, so it's not a compiled runtime issue.
You are using an older version of 3.5.x. Do you have any ability to test this out on a copied and upgraded db? 3.5.14 is the latest patch release.
02-17-2020 07:37 AM
Ok, just tried this with 3.5.14 with exactly the same results
02-18-2020 08:11 AM
Can you also check if the results are the same on 3.5.14 when prefixed with CYPHER runtime=SLOTTED
or CYPHER runtime=INTERPRETED
?
02-18-2020 09:17 AM
Just tried it - same results with both.
02-19-2020 01:04 AM
For Cypher to think a cartesian project and expand into is the most efficient plan must mean that it thinks the cardinality of the combined labels is very low. It could think this if the individual label cardinality is relatively low, and it has no idea of correlation between cardinalities. For example, if the cardinality of (n:A) is 10% of all nodes, and the cardinality of (n:B) is also 10%, we would model the cardinality of (n:A:B) as 1% (ie. 0.1*0.1=0.01). But if many :A are also :B, then this calculation is very inaccurate, and for full correlation between :A and :B we should use the minimum of the two, which in my example would be 10% instead of 1%.
Could you let us know what the relative cardinalities are? Is there a correlation between nodes of the two types?
I have two more suggestions for making your query faster. One is to try your trick with only one label, and then before the RETURN, add an extra WITH vi, ph, 1 AS ignore WHERE vi:VI_PH_xxxx RETURN DISTINCT(ph).
The other suggestion is based on the fact that you seem to only want to know the number of unique ph for which there exists any relationship of that pattern, but are not interested in the count of relationships themselves. For this there is a specific syntax:
MATCH (ph:PH_VI_xxxx)
WHERE (vi:VI_PS_xxxx:VI_PH_xxxx)-[:VI_PH]->(ph)
RETURN count(DISTINCT ph) AS PH_Id
02-19-2020 03:11 AM
Your last suggestion - putting the relationship in the WHERE clause seems to massively improve the query plan. I will see if we can generalise this approach in our query builder code.
02-19-2020 03:04 AM
All nodes 13,601,888
:VI_PH_xxxx 380
:VI_PS_xxxx 480
:VI_PH_xxxx:VI_PS_xxxx 380
03-02-2020 01:19 PM
I'm glad the suggestion helped. I noticed also that your labels are strongly correlated. Every PS node is also a PH node. The planner certainly does not make good use of that information. We have plans to teach it that trick (or rather to maintain the necessary graph statistics to make that observation), but not sure when that will be done.
In the meantime, I'm very happy the last suggestion helped you get a faster query. The phrase 'massively improve' sounds like music to my ears!
All the sessions of the conference are now available online