
Extremely slow query when profile looks very good?

massung
Node Clone

I must admit I'm quite baffled by this. I've gone through quite a bit of work adding some extra nodes so that what would otherwise be a really bad range query profiles quite nicely. But it runs incredibly slowly the first time, and subsequently quite fast. I presume this is due to disk access and loading the page cache, but still, the number of nodes being examined is quite small.

The example query is at the end of this post along with the plan output from the profile.

To give some context, this is a large graph (~5B nodes and ~20B relationships) of genomics data. About 2 TB of data total in the store.

The :OverlapRegion is just a fixed, 100K region of the genome (indexed by chromosome) that other nodes can be linked from. Since Neo4j doesn't have true compound indexes, this allows us to quickly zero in on a region and then fan out to only the features within that region.
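(For concreteness, a minimal sketch of the scheme - the creation pattern here is illustrative rather than our actual load code:)

// Index so a chromosome's regions can be found quickly.
CREATE INDEX ON :OverlapRegion(chromosome);

// Each region node covers a fixed 100K window of one chromosome.
CREATE (:OverlapRegion {chromosome: '11', start: 11000000, end: 11100000});

// Link each variant to the region its position falls in
// (assumes variants carry a chromosome property for the match).
MATCH (o:OverlapRegion {chromosome: '11'}), (v:Variant {chromosome: '11'})
WHERE o.start <= v.position < o.end
CREATE (o)-[:OVERLAPS_VARIANT]->(v);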

In the query below, an index on chromosome correctly narrows the search down to just 1,351 :OverlapRegion nodes, finding the 6 that are either within, or envelop, the range we care about. That should be extremely fast, and it is.

Next, it just finds all the variant relationships (also limited to the region in question). There aren't that many variants linked to those regions: it went through 11,211 relationships to find the 9,837 that were within the bounds in question.

I would expect this to be blazing fast on a 16-CPU machine with 64 GB of RAM. Admittedly, this is community edition, so only 4 CPUs are being used, but still. Filtering 1,351 + 11,211 = 12,562 nodes wouldn't take 10 seconds on a 486DX even if I forgot to press the turbo button.

Again, every subsequent run of the query takes ~20 ms. So I have to believe this is a caching/disk-access issue. There are over 50 million :Variant nodes in the graph, so it's possible Neo4j is seeking all over the place just to read the 11,211 it needs to. But I'm unsure how to prove that (or - more importantly - how to rectify it if so). Similarly, once it's cached, all I have to do is run the same query on a different chromosome and it's ungodly slow again. Even going back to the first chromosome, it's as if Neo4j unloaded what it knew about previously and is slow again.

And our configuration is:

dbms.memory.heap.initial_size=23g
dbms.memory.heap.max_size=23g
dbms.memory.pagecache.size=27400m

With 27 GB of page cache, I should be able to have every :OverlapRegion, every :Variant, and more loaded into memory without a second thought and plenty of room to spare for whatever else comes along. Assuming this is a cache issue, is there a way for me to tell Neo4j to pre-warm the cache and to load - and keep loaded - particular node labels in memory?

Any insights here are very much appreciated. Also, if anyone could comment as to whether this is an issue that would be magically solved by using the Enterprise version, I'd love to know that (and what those features would be), as we're considering making the leap in the coming months.

Thanks!

match (o:OverlapRegion)
where (o.chromosome = '11') and
      ((o.start >= 11000000 and o.start < 11500000) or
       (o.end >= 11000000 and o.end < 11500000) or
       (o.start < 11000000 and o.end >= 11500000))

match (o)-[:OVERLAPS_VARIANT]->(v:Variant)
where 11000000 <= v.position < 11500000

return count(v)


A couple of thoughts.
Via the browser, if you run :sysinfo it should report 'Page Cache' details. Page cache faults would be indicative of getting the data from the file system as opposed to RAM/page cache. So if you capture the output of :sysinfo and then run the query for the first time, I would expect to see that Page Cache Faults increases.

But as to

With 27 GB of page cache, I should be able to have every :OverlapRegion, every :Variant, and more loaded into memory without a second thought and plenty of room to spare for whatever else comes along. Assuming this is a cache issue, is there a way for me to tell Neo4j to pre-warm the cache and to load - and keep loaded - particular node labels in memory?

With a 2TB graph and only 27GB of pagecache, only about 1% of the graph fits in the pagecache. Further, Neo4j automatically keeps track of what is in the pagecache at all times, such that on restart it will auto-warm the cache with the data that was in the cache prior to the last stop of Neo4j.
I think the real issue is that your pagecache of 27GB is small, such that data is constantly getting cycled in and out.

Thanks for the info! When I run :sysinfo it gives me no page cache metrics:

Page Cache
Faults	-
Evictions	-
File Mappings	-
Bytes Read	-
Flushes	-
Eviction Exceptions	-
File Unmappings	-
Bytes Written	-

I'm guessing this is an Enterprise-only feature? Or could this indicate that perhaps the page cache isn't being used?

Apologies, I missed that this was a community edition install.

  1. :sysinfo will not display the Page Cache details
  2. the auto warm-up of the pagecache over Neo4j restarts is also enterprise-only functionality
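One workaround, if you have the APOC plugin installed, is to warm the page cache manually after a restart (a sketch; apoc.warmup touches the node and relationship store pages to pull them into the cache):

CALL apoc.warmup.run();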

This reply is mostly going to be a stream-of-consciousness as I try a few things and post the results, so that anyone else who comes across this thread can benefit as well...

So, let's do the math... the profile in the original post took ~7 seconds to run the first time for that chromosome, on an AWS EC2 instance with an EBS volume. EBS sustains 10k IOPS. So, let's add it up, assuming everything had to be fetched from disk:

1,300 + 11,200 = 12,500 node reads; at 10,000 IOPS that's ~1.2s of nodestore seeks
14,500 + 33,600 = 48,100 property reads; at 10,000 IOPS that's ~4.8s of propertystore seeks

That's now 6s of just seeking, assuming no co-located data (unlikely for some of it), which leaves ~1s for random other stuff. That's starting to get into the realm of "okay, I can believe that." Although, that "random other stuff" should be on the order of ms at worst, given how little data is being looked at.

Since the biggest problem here is the property store seeks, let's see if our losses can be limited. First up, what if I just remove the position test on the :Variant node (which would make the query invalid, but it's a good test of the theory):

Well, the 33k property hits went down to 10,600. So we should have shaved ~2s off our seek times. And - roughly - it did.

To my surprise, though, we still have 10,600 property (or node) hits. I didn't expect that the node label test was resulting in a seek/data load (that wasn't already present). Since - for this query - [:OVERLAPS_VARIANT] can only link to a :Variant node, it's not even worth testing for the label. So, let's remove that too, which should shave another second off.
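Reconstructing the modified query (the region predicate is unchanged; both the position test and the :Variant label are now gone):

match (o:OverlapRegion)
where (o.chromosome = '11') and
      ((o.start >= 11000000 and o.start < 11500000) or
       (o.end >= 11000000 and o.end < 11500000) or
       (o.start < 11000000 and o.end >= 11500000))

match (o)-[:OVERLAPS_VARIANT]->(v)

return count(v)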

[screenshot: PROFILE plan of the modified query]

And not only is the test gone, but another 900ms was shaved off the query time. So I was able to shave 3 seconds off by removing a couple of tests.

That leads me to the first filter, which consists of 7 compares across 3 properties. One of those is required, but the other 6 can potentially get knocked down to just 2 for large enough regions (see the collapsed form below). Doing so takes the 20k-db-hit filter down to 3,700, further reducing the execution time by ~1.5 seconds.
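The collapsed form is just a standard interval-overlap test (this is the shape used in query #2 further down; the two forms return identical nodes):

where (o.chromosome = '11') and
      (o.start <= 11500000 and o.end > 11000000)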

Furthermore, I could get the number of db hits even lower by re-indexing :OverlapRegion by start instead of chromosome. This means I access more nodes but incur fewer db hits, which yielded faster results - a bit counter-intuitive.
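(If we went that route, it would just be a matter of swapping the index - 3.x syntax; whether to also keep the chromosome index depends on our other queries:)

DROP INDEX ON :OverlapRegion(chromosome);
CREATE INDEX ON :OverlapRegion(start);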

It'd be great if some of this knowledge were documented somewhere, along with explanations as to why.

--

At this point, I'm comfortable saying that a chunk of this is disk seeking and the page cache. But there are still a lot of unanswered questions about what's happening under the hood, because:

  • Once x.foo has been loaded, (x.foo > 1000) and (x.foo < 2000) shouldn't take almost 2x the execution time of (x.foo > 1000) alone.

  • Getting the label of a node - after already fetching the node - results in another fetch operation?

  • Every compare counts as a db "hit", and each "hit" (whatever unit of work that is) adds to the overall cost. I'd think that once a node is loaded, I could use any of its properties - especially for a small node - without incurring significant further cost.

For EBS volumes, one can assume 10k db "hits" == ~1 second of time spent looking things up, unless it's already in cache. Obviously this could be improved with SSDs, but the big wins are to be had by getting rid of "hits" altogether. Even removing label names from queries can significantly reduce a query's run time (note: I find this sad, as label names help make queries self-documenting).

Can you supply PROFILE plans with all nodes expanded (double-down arrow in the lower right corner of the result pane)? With what you've given us, we can't be fully sure of the order the planner has used, and which operations in the plan correspond to which parts of the query.

massung
Node Clone

Sure thing. I'll post 3 queries and the profile plans for each.

With the first 2, I would expect the db hits for each of these plans to be identical. The work done may be slightly different, but the result is the same and the properties that need loading are the same as well. So, if they aren't identical, it'd be great to know why.

Sure, here's query #1 and its plan:

profile

match (o:OverlapRegion)
where (o.chromosome = '3') and
      ((o.start >= 11000000 and o.start < 11500000) or
       (o.end >= 11000000 and o.end < 11500000) or
       (o.start < 11000000 and o.end >= 11500000))

match (o)-[:OVERLAPS_VARIANT]->(v:Variant)
where 11000000 <= v.position < 11500000

return count(v)

And its profile:

And the second query:

profile

match (o:OverlapRegion)
where (o.chromosome = '3') and
      (o.start <= 11500000 and o.end > 11000000)

match (o)-[:OVERLAPS_VARIANT]->(v)
where 11000000 <= v.position < 11500000

return count(v)

And its profile:

Because it might take a second to notice the comparison difference in the second query, I promise the resulting nodes found are identical.

So, my questions given these:

  1. The first query consists of 21,450 db hits when performing the start/end comparisons, while the second has only 2,930. If it's just counting the number of property comparisons taking place, that makes sense. But there should be next to no difference in computation time between the two.

  2. The first query requires 31,518 db hits on the second filter, while the second query only needs 20,842. Yet the filter being performed is identical. This makes me think that it's not the number of comparisons taking place, but rather something like db reads (to be continued...).

  3. I have no idea why the first filter is able to use an AUTOINT while the second uses an AndedPropertyInequalities expression. I'm guessing there is some inefficiency there?

And for our third query, we'll simply change the condition for the :Variant node. It'll be wrong, but it's just to test a theory...

where v.position >= 11000000

As you can see, the resulting plan is almost identical, except the second filter can now use an AUTOINT and it has exactly 1/2 the db hits: 10,421 for exactly 1/2 the comparisons.

If I change it to v.position >= 11000000 and v.position < 11500000, then it goes back to the AndedPropertyInequalities expression and runs 1s slower.

Again, ignoring the "wrong" query, I would have expected all of these to perform nearly identically (within ms of each other). Any extra cost to a query should - from past experience - be driven by the number of nodes/records searched and the number of disk reads required. The number of nodes hit in all 3 of these is exactly the same, and the properties being queried are exactly the same in all cases as well; there should be no difference in the number of reads required in each case.

It would be great to have an actual definition of what a db "hit" is, too. Is it a property comparison? A lookup (which may be cached)? A disk read? A bucket term that could mean just about anything?

Thanks for looking!

Hi there,

A db hit is a bit of a bucket term, an interaction with the db (whether it's through the pagecache or hitting disk), as opposed to operations on projected variables already in memory. As such, db hits are not equal to each other in terms of time or effort for execution. They are guides for places in the query that may need attention, and they should draw your eye, but use them in conjunction with other clues when optimizing. Often (but not in all cases) looking at the rows flowing between operations can be more important (since Cypher operators execute per row).

We can do some checking to see if multiple access to the same properties are happening and see if there's a better way.

For query #1, try this and see how it compares: we'll project out both properties after our MATCH (the index lookup is on chromosome, not the ranges), then do the comparisons.

profile

match (o:OverlapRegion)
where (o.chromosome = '3')
WITH o, o.start as start, o.end as end
WHERE (11000000 <= start < 11500000) or
      (11000000 <= end < 11500000) or
      (start < 11000000 and end >= 11500000)

match (o)-[:OVERLAPS_VARIANT]->(v:Variant)
WITH v, v.position as position
where 11000000 <= position < 11500000

return count(v)

And as for testing AUTOINT vs AndedPropertyInequalities, you can try doing a projection between each inequality check. I'm unsure what kind of difference you may or may not see:

...
match (o)-[:OVERLAPS_VARIANT]->(v)
WITH v, v.position as position
WHERE 11000000 <= position
WITH v, position, 1 as ignored
WHERE position < 11500000

Compare that against just projecting out v.position and using it in the range comparison with both bounds (sketched below). If you're seeing a major difference, then perhaps that's something I can pass along on my end for investigation.
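That is, something like:

...
match (o)-[:OVERLAPS_VARIANT]->(v)
WITH v, v.position as position
WHERE 11000000 <= position < 11500000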

massung
Node Clone

Bump. Curious about any insights from the Neo4j team on this and the questions posed. Thanks!

You may wish to read Compound index no longer being used

massung
Node Clone

Andrew,

Projecting those values does - in fact - make a considerable improvement. Obviously there's a single db hit per property being projected, but once projected, using the value is much more efficient (it would seem).

[screenshot: PROFILE plan of the query with projected properties]

At the beginning, projecting the start and end allowed for only 2k db hits instead of the original 14.5k for the same set of properties. Similarly, when it gets to the variant, projecting the position takes 11k db hits as opposed to the original 33k, and the filter was then able to optimize to the AUTOINT comparison.

The end result is that the query now runs (worst case for the example above) in 2 seconds, almost a 4x improvement over the original worst case.
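For reference, here's the final shape of the query, combining the simplified overlap test with the projections (reconstructed from the steps above; the label is kept off the variant node per the earlier finding):

profile

match (o:OverlapRegion)
where o.chromosome = '3'
with o, o.start as start, o.end as end
where start <= 11500000 and end > 11000000

match (o)-[:OVERLAPS_VARIANT]->(v)
with v, v.position as position
where 11000000 <= position < 11500000

return count(v)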

My gut tells me that there's something wrong under the hood. The profile of the final query is more closely aligned with my mental model of what I would expect to happen in the original case as well: read/fetch the properties that are going to be needed, cache them within the query, and then use them at will as often as needed, as opposed to a potential fetch per use.

Thanks for helping me come up with a temporary work-around. Here's hoping there's just a simple bug that can be fixed - or improvement that can be made - rather easily so this hand-rolled projecting of properties won't be necessary in the future.

Glad to hear you're seeing good improvements. Sad to hear the inequality predicates aren't quite working optimally; I had thought that they were.

I do know that planner improvements are happening that address this exact issue with property caching (on a wider scale within a query). They'll be somewhere in the 4.x.x releases; I just want to let you know that these improvements have already been planned out and are in the pipeline.
