08-11-2020 01:44 AM
Hi all,
Neo4j 4.1 Community Edition
I am trying to get a better understanding of the terms inside the profile operators:
I understand (1) well enough, however I have trouble fully understanding (2-6).
For (2) Estimated rows, I don't understand how this is determined, i.e. the number of estimated rows. I also want to understand if it is always better to reduce this number or if there are any specific cases where that would be the exception?
For (3) DB hits, I understand that this refers to the number of database hits and that for tuning a query this number should be minimized. What I don't understand is why this value is sometimes higher or lower than I expect in specific cases (assuming no indexes).
Example: I have created 5000 nodes, then when I try to match one of the nodes,
PROFILE MATCH (t:Test1)
WHERE t._uuid = 'test1'
RETURN t
I don't understand why there are 5001 db hits in the NodeByLabelScan. Shouldn't this value be 5000? I also don't understand why there are 10000 db hits for the Filter.
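For anyone who wants to reproduce this, here is a minimal sketch of a comparable setup. The thread does not show how the nodes were created, so the `id` property and the `'test' + number` naming scheme are assumptions made for illustration; exact db hit counts may also differ between Neo4j versions.

```cypher
// Assumed setup: 5000 :Test1 nodes, each with an id and a _uuid property.
// The property values are hypothetical; only the label and _uuid come from the post.
UNWIND range(1, 5000) AS i
CREATE (:Test1 {id: i, _uuid: 'test' + toString(i)});

// The query from the post. With no index on _uuid, the plan should show
// a NodeByLabelScan feeding a Filter.
PROFILE MATCH (t:Test1)
WHERE t._uuid = 'test1'
RETURN t;
```

Run the two statements one at a time if your client does not accept multiple statements in one submission.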
For (4) Cache Hits, I rarely see this and I am not sure under what circumstances it appears.
For (5) Cache Misses, I rarely see this, and when I do it is always 0, which I assume is the ideal for tuning. If I do run into non-zero cache misses, how would I solve or avoid them?
For (6) Memory, I understand that this should be minimized for tuning. However, I am not sure what exactly affects memory (other than the size of the search) or how to go about minimizing it.
Thanks in advance.
08-25-2020 01:29 PM
Here's my take on this:
Rows
This is arguably the most important thing in a PROFILE plan, since Cypher operations execute per row. The idea with query tuning is to reduce the work done as much as possible while still getting the correct answer, so streamlining a query to reduce unnecessary rows during query execution is usually a win. Watch for where the rows spike in your query; that may be a good opportunity to tune.
Estimated rows
These are calculated/estimated from graph statistics, and are usually important for the query planner when formulating and comparing plans. I haven't found them to be very useful for tuning myself, as these are often ballpark figures.
DB hits
As Sameer says, these are abstract units of storage engine work, counted whenever data in the database is touched, and as such db hits are not necessarily equal to each other 1:1. You can treat this as a ballpark figure: smoke that draws your attention to where there may be a fire to put out, though not always. Queries have to do work to deliver correct results, and that requires db hits. Watch for massive db hit spikes, and where you see one, let it draw your attention to the rows flowing between operations near that point of the query.
4., 5., and 6. (cache hits, cache misses, and memory)
This has to do with the pagecache, which is the in-memory cache of the graph. Whenever possible the pagecache is used for db operations, to avoid the I/O cost of having to access the graph on disk. High cache hits and low cache misses indicate good utilization of the pagecache. If you start seeing more cache misses and fewer cache hits, it may mean that your pagecache isn't big enough to hold the whole graph in memory, so you may need to look at adjusting your memory allocation or increasing the memory in the system.
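As a rough sketch of how to check what the pagecache is currently set to, you can list the matching configuration from Cypher. This assumes a 4.x server, where the setting is named `dbms.memory.pagecache.size`, and a user that is allowed to call `dbms.listConfig`; neither detail comes from the thread itself.

```cypher
// Lists configuration settings whose name contains 'pagecache'.
// An empty value for dbms.memory.pagecache.size usually means the
// server is falling back to its built-in default sizing.
CALL dbms.listConfig('pagecache')
YIELD name, value
RETURN name, value;
```

If the value turns out to be smaller than the store files on disk, that is consistent with the cache-miss symptom described above, and the memory resources linked in this post cover how to size it.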
Here are some resources for memory:
https://neo4j.com/docs/operations-manual/3.5/performance/memory-configuration/
https://neo4j.com/docs/operations-manual/3.5/tools/neo4j-admin-memrec/
It's also important to check in PROFILE and EXPLAIN plans how nodes are looked up. In your example, for instance, we see a NodeByLabelScan followed by a Filter on the property. It's more efficient to create an index on :Test1(_uuid) so that the index can be used for a quick lookup. You'd then see a NodeIndexSeek instead, with far fewer db hits and no need to filter across many rows.
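A minimal sketch of that change, assuming the Neo4j 4.x index syntax (the poster is on 4.1); the index name `test1_uuid` is made up for this example:

```cypher
// Create an index on :Test1(_uuid). The name test1_uuid is hypothetical.
CREATE INDEX test1_uuid FOR (t:Test1) ON (t._uuid);

// Wait for the index to finish populating before profiling.
CALL db.awaitIndexes();

// Same query as before. The plan should now start with a NodeIndexSeek
// rather than a NodeByLabelScan followed by a Filter.
PROFILE MATCH (t:Test1)
WHERE t._uuid = 'test1'
RETURN t;
```

If `_uuid` is meant to be unique per node, a uniqueness constraint would serve the same lookup purpose, since it is backed by an index as well.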
The operators section of the Cypher manual may be a helpful reference; most of the node lookup operators are at the top:
https://neo4j.com/docs/cypher-manual/current/execution-plans/operators/
In general, lookup via index is going to be more performant than a label scan (which requires looking at all nodes in the label), which in turn is going to be more performant than an all nodes scan (which has to look at all nodes in the graph).
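To see the two scan variants side by side, you can compare plans with EXPLAIN, which shows the plan without running the query. This is only a sketch against the poster's `:Test1` label and `_uuid` property, and it assumes no index exists on that property:

```cypher
// No label in the pattern: the planner has to start from an AllNodesScan.
EXPLAIN MATCH (n)
WHERE n._uuid = 'test1'
RETURN n;

// Label but no index: NodeByLabelScan over the :Test1 nodes, then a Filter.
EXPLAIN MATCH (t:Test1)
WHERE t._uuid = 'test1'
RETURN t;
```

With an index on :Test1(_uuid) in place, the second query's plan should switch to a NodeIndexSeek.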
08-11-2020 01:47 AM
08-11-2020 01:57 AM
Hello @Cobra
I took a look at the documentation. I now understand (2) Estimated rows, (4) Cache hits and (5) Cache Misses better. However, I still don't fully understand DB hits as they apply to my example. Furthermore, I don't see anything on memory.
08-11-2020 02:05 AM
Andrew gave an explanation about DbHits here
There is an article about memory consumption
Other links about memory:
08-11-2020 02:26 AM
Thanks, the memory articles you gave were very useful for understanding it better. However, I still don't fully understand DB hits. Is my explanation right for the above example: for the 2nd operator I get 10000 hits because there are 2 properties on each node (id and _uuid)? Even so, I still can't understand why the 1st operator has 5001 hits instead of 5000.
08-11-2020 02:35 AM
My knowledge is very limited. I just know it's better when the DbHits value is low, and that you can add indexes or constraints to reduce it. I don't know exactly how it's computed.
08-11-2020 02:48 AM
Your knowledge is still very wide. At least wider than mine.
I guess I will have to do more research into this. Thanks a lot for the help 🙂
08-11-2020 11:03 PM
Each operator sends requests to the storage engine to do work such as retrieving or updating data. A database hit is an abstract unit of this storage engine work. It is a conceptual measure, and if you want to know how it translates to actual work you can review the Neo4j architecture in detail.