07-02-2020 06:01 AM
Hi,
We have a simple setup with two types of nodes: a base node, and an actor node that links base nodes together.
All I want to do, from a given base node, is to count how many other base nodes I can reach that have a particular property set.
MATCH (n:Base {baseNumber:"123456"})-[*..2]-(n2:Base) WHERE n2.property > 2 RETURN COUNT(n2)
The query works, but it's not fast: in some cases it takes 4-5 seconds to run, because it's processing millions of nodes. The profile shows no difference whether I add an index on "property" or not.
So can this be tuned further, or must I change the model somehow? I've checked the database size etc. and it all ought to fit in RAM. When I batch-run these queries the system sits at 100% CPU - it gets through them without too much pain, but they're just taking too long.
To clarify: I can query the data for 2000 nodes (as individual queries) in between 7 and 13 minutes.
07-02-2020 08:00 AM
Indexes are only used to find the starting nodes in a match, like the (n) in your query.
The problem is likely in your undirected, variable-length relationship. Are your relationships really directionless? Without a type on the relationship, the query considers every relationship from that node.
It also helps to profile the query by adding PROFILE in front of it. The plan shows exactly how the query is processed and where cardinality has exploded and slowed the query down.
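For illustration only (this assumes a hypothetical relationship type LINKS and a direction, neither of which may match your model), a typed, directed version of your query with PROFILE in front would look something like:
PROFILE
MATCH (n:Base {baseNumber:"123456"})-[:LINKS*..2]->(n2:Base)
WHERE n2.property > 2
// DISTINCT so a node reachable by several paths is only counted once
RETURN COUNT(DISTINCT n2);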
07-02-2020 11:45 PM
Sadly in this case yes - we're just looking for any connection, hence I do want any relationship from that node.
Yes, the PROFILE did show the explosion.
So, I suspect this is wrong, but could I resolve this by creating the relationships based on the node attribute I'm filtering on? Then I wouldn't have to reach the node before "filtering" it. It would lead to some weird relationship names, and it feels wrong, but at least you wouldn't have to visit all the nodes at the final level; you'd still need to visit all the nodes at the first level.
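Purely as a sketch of that idea (the LINKS_HIGH relationship type and the one-off rewrite pass are made up for illustration, and it only covers the hop through the actor node): you would materialise the filter into a relationship type once, and then count by type without ever reading n2.property:
// Hypothetical one-off pass: tag actor-to-Base links whose Base node passes the filter
MATCH (actor)--(b:Base)
WHERE b.property > 2 AND NOT actor:Base
MERGE (actor)-[:LINKS_HIGH]->(b);

// The per-node count then filters on the relationship type instead of the node property
MATCH (n:Base {baseNumber:"123456"})--(actor)-[:LINKS_HIGH]->(n2:Base)
WHERE n2 <> n
RETURN COUNT(DISTINCT n2);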
07-03-2020 08:43 AM
Reducing cardinality in the query is a delicate balancing act, but a critical exercise in making queries like yours return in a reasonable time. There are many presentations on Cypher Query Optimization.
Since you are really only interested in the presence of one or two relationships between the n and n2, match the two endpoints and filter on the existence of the relationships.
PROFILE MATCH (n:Base {baseNumber:"123456"})
MATCH (n2:Base)
WHERE n2.property > 2
AND ( (n)--(n2) OR (n)--(:Base)--(n2) )
RETURN COUNT(n2);
In the PROFILE output you'll still see a huge expansion for the second predicate. Anything longer than two relationships should probably use apoc.path.expand(), or one of the other apoc.path functions, depending on your data model. It may still be a worthwhile exercise to try apoc.path.expand() on your data set (see the sketch below).
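As a sketch only (the filter arguments here are guesses, not tuned to your model), an apoc.path.expand version of the original count could look like this:
// Expand 1 to 2 hops from the start node, over any relationship type, in either direction
MATCH (n:Base {baseNumber:"123456"})
CALL apoc.path.expand(n, null, null, 1, 2) YIELD path
WITH last(nodes(path)) AS n2
// Keep only Base end nodes that pass the property filter; DISTINCT dedupes multiple paths to the same node
WHERE n2:Base AND n2.property > 2
RETURN COUNT(DISTINCT n2);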