cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

How the cardinality of the pattern (a:A)-[r:R]->(b:B) is estimated?

wdmzssyh
Node Link

I have tried profiling the following simple cypher query:

profile match (a:A)-[r:R]->(b:B) return *

The plan is as follows:

  1. NodeByLabelScan. According to neo4j statistics, it is clear the first step, NodeByLabelScan cardinality can be estimated with an accurate value.
  2. ExpandAll. The count of nodes from a specific label through a given edge label is stored as well. So this cardinality is accurate as well.
  3. Filter. In this query, the estimation is accurate as well. The actual cardinality is 1 as well. But I tried other queries, the estimation is not accurate. I am curious how the filter step cardinality is estimated.

It will be helpful you can point me to the source code location as well.

4 REPLIES 4

In most cases estimated rows isn't very useful for user-driven debugging or tuning. It's usually there to help the query planner have a ballpark estimate for some operations. Once you get into traversing and filtering on any realistic graph the estimated rows will likely not be accurate. In order to get accurate row counts, you need to actually run the query (such as via PROFILE).

As for how the filter step cardinality is estimated, the graph knows you are traversing all :R relationships from :A nodes, so it likely looks at the counts store for the count of :R relations incoming to :B nodes. The equivalent of:

MATCH ()-[:R]->(:B)
RETURN count(*) as degree

My guess would be that this is only accurate for this query because you're starting from a label scan without filtering, so it knows all (:A)-[:R]->() relationships are going to be traversed, so it can make a reasonable guess from the counts store for the relationships going to :B nodes.

Of course, if the :R relationships going to :B nodes do not originate from :A nodes, then you may well see the estimate going off at that point, since the counts store does not track relationships that include both the labels of the start and end nodes ((:A)-[:R]->(:B) and patterns like it are not in the counts store).

Thanks a lot!

This "guess" is what I am interested in. It should rely on some statistical information. Could you point me to the source code that compute this guess?

I'm afraid I don't know the answer to that. The codebase is fairly massive and complex (here be monsters), I'm not fluent in Scala, and I'm not a developer on the codebase.

Looking at the 3.5 community code, this package and this one likely have relevant code, but these are but components in a huge machine, I think it would take a non-trivial amount of effort to grok how that's all put together and working.

What may be more useful to you is to learn more about what information lies in the count store. You already have an overview from the stats link you posted earlier.

If you have APOC Procedures, you can use CALL apoc.meta.stats() to get a dump of the count store statistics that are used during query planning and help drive the estimated rows and cost planning.

If you don't have APOC, then you can use CALL db.stats.retrieve('GRAPH COUNTS') , which should get you the same statistics in a different format, but it also includes constraints and index info.

And as for the guess, simply knowing what the count store knows we can infer it, at least to some degree:

It knows the counts for (:A), so the label scan estimates should be spot on.

Because we're looking at all :A nodes, it can correctly assume cardinality for (:A)-[:R]->(), which is in the count store.

The missing piece though is to what kind of nodes those :R relationships are going to. The count store doesn't store counts of patterns with labels on both ends, so this: (:A)-[:R]->(:B) is unknown.

But it does know the count of ()-[:R]->(:B), and it knows the count of ()-[:R]->(). If the counts of both of those are the same (meaning all :R relationships go to :B nodes), then it should be able to get an accurate estimate. Otherwise, that's where that guesswork comes in, and that's where I lose certainty. My best guess would be it's using the ratio of :R relationships going to :B nodes to the number of total :R relationships, and multiplying that against the estimated rows so far.