Neo4j

neo4j7 · ‎10-30-2020

TL;DR:
Slow Cypher Query For Large Dataset.. Why?

QUERY:

EXPLAIN
WITH "(?i).test.*" as target
MATCH (AA:A)-[r:B]-(CC:C)
WHERE ANY(host IN AA.D WHERE host =~ target)

OPTIONAL MATCH (DD:D) WHERE CC.id =~ DD.CCid
OPTIONAL MATCH (EE:E) WHERE CC.id =~ EE.CCid
OPTIONAL MATCH (FF:F) WHERE CC.id =~ FF.CCid
OPTIONAL MATCH (GG:G) WHERE CC.id =~ GG.CCid
OPTIONAL MATCH (HH:H) WHERE CC.id =~ HH.CCid

WITH COLLECT(AA) + COLLECT(CC) + COLLECT(DD)
   + COLLECT(EE) + COLLECT(FF) + COLLECT(GG)
   + COLLECT(HH) as data
UNWIND data as datum
RETURN DISTINCT datum;

CONTEXT:

Without Neo4J Enterprise (procurement delays), I've been trying to performance tune some queries by hand using EXPLAIN and PROFILE, with limited success.

I'd like to leverage the wisdom of crowds at this point, if feasible..

Any insights to be shared about the performance heuristics (or obvious nonperformant clauses) of the query above would be immensely appreciated.

Thanks,
Neo4j User

andrew_bowman · ‎10-30-2020

First, none of these will leverage indexes for lookup, since the =~ regex match operator doesn't use the indexes. If you know the case of the property, then you can use the STARTS WITH or CONTAINS keywords, which will leverage indexes if they are present.

Next, you have a multiplicative cardinality issue from all those back to back optional matches. We recommend you collect as soon as possible after you've matched the thing, that way cardinality will reset (since the grouping key becomes distinct). The way you have it now, the row cardinality will multiply out with each successive OPTIONAL MATCH, which means your query is doing more and more unnecessary work.

so the pattern will be something like:

...
OPTIONAL MATCH ... WHERE ...
WITH <variables already in scope>, collect(<whatever you're matching to this time>) as <variablename>
OPTIONAL MATCH ... WHERE ...
etc.
...

Neo4j

Optimizing a Query