07-09-2019 10:19 AM
I have an 18GB database, on 3.5.4 Enterprise.
Simple index queries such as match (n:Product {pk:"id"}) return n, which run at <5 ms over a single connection, slow down to >1000 ms when handling >10 simultaneous connections, with most of the queries sitting in the planning state.
Testing on a variety of systems, from 8 to 64 available cores, CPU usage has always been low, with no obvious memory issues.
To alleviate the issue, I increased the page cache and heap memory, as well as the Bolt thread pool and query cache sizes, to no avail:
dbms.memory.heap.initial_size=25G
dbms.memory.heap.max_size=25G
dbms.memory.pagecache.size=30G
dbms.connector.bolt.thread_pool_min_size=50
dbms.connector.bolt.thread_pool_max_size=1000
dbms.connector.bolt.thread_pool_keep_alive=10s
dbms.query_cache_size=100000
We typically restore the graphs from file backups into docker containers for deployment with:
neo4j-admin restore --from=/data/payload --force=true
I do see in debug.log that there are lots of entries such as:
2019-07-09 17:01:24.434+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=2102, gcTime=2149, gcCount=1}
2019-07-09 17:05:57.297+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=2160, gcTime=2242, gcCount=1}
2019-07-09 17:10:24.668+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=2189, gcTime=2272, gcCount=1}
2019-07-09 17:14:28.038+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=2210, gcTime=2263, gcCount=1}
2019-07-09 17:18:20.321+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=2254, gcTime=2251, gcCount=1}
2019-07-09 17:22:27.098+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=2226, gcTime=2286, gcCount=1}
This can't be normal since the DB isn't that big. Any ideas?
07-09-2019 11:08 AM
That's strange, can you PROFILE the query and add it here? Also, can you try parameterizing your query instead of using string literals? That may help as no replanning will be needed, all similar queries with different parameters will use the same cached query plan.
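For example, something like this in Browser or cypher-shell exercises the parameterized form and shows the plan (a sketch; the pk value is just a placeholder):
:param pk => "some-id"
PROFILE match (n:Product {pk:$pk}) return n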
Was anything else executing at the time? Any triggers setup with your db?
07-09-2019 11:32 AM
The profile shows exactly one hit; I've optimised my queries. The actual calls are more complex, with a mix of queries, but the primary slowdown comes from concurrent mixed calls like this:
with N4J_DRIVERS['product'].session() as sess:
    row = sess.run('match (n:Product {pk:$d}) return n', d=arg)

with N4J_DRIVERS['product'].session() as sess:
    row = sess.run('match (n:Product {pk:$d})-[:HAS_CONTEXT]-(c:CloneID) return c.name', d=arg)
For example, I'll see performance degrade quickly from 200 items/s to 30 items/s with 128 threads.
I've tuned -XX:NewRatio=1. Still the same thing. I'm on GCP with 8 vCPUs and 52 GB memory, btw.
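In case it matters, the flag is applied via neo4j.conf, roughly:
dbms.jvm.additional=-XX:NewRatio=1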
07-09-2019 12:00 PM
Do you have query logging enabled? If so, could you provide your query.log, or some excerpts from it? We're looking for timing breakdown info (you may need to enable that in your config)
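If it isn't enabled yet, something along these lines in neo4j.conf should produce a query.log with the timing and page-cache breakdown (3.5-era setting names; a threshold of 0 logs every query):
dbms.logs.query.enabled=true
dbms.logs.query.threshold=0
dbms.logs.query.time_logging_enabled=true
dbms.logs.query.page_logging_enabled=true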
07-09-2019 12:26 PM
Is there a way you can test with 3.5.7? We had a fix in 3.5.6 regarding our query collector which runs in the background, as it was hogging memory. It may be related, if it was contributing to heap usage and triggering GCs. At the least we can rule out any bugs fixed since 3.5.4.
07-09-2019 02:49 PM
I upgraded to 3.5.7, seems to help
07-10-2019 07:33 AM
@andrew.bowman
The upgrade helped, but I'm still running into delays on long-running jobs. Queries on 3.5.7:
/35.202.116.199:7687> conversion - match (p:Product)-[:HAS_CONTEXT]-(c:CloneID) where p.pk="xxx" return c.name as clone_id - {} - {}
2019-07-10 14:29:25.074+0000 INFO 129 ms: (planning: 92, waiting: 0) - 7 page hits, 0 page faults - bolt-session bolt conversion neobolt/1.7.13 Python/2.7.15-final-0 (linux2) client/35.209.86.97:50436 server/35.202.116.199:7687> conversion - match (s:Product) where s.pk="xxx" return s - {} - {}
2019-07-10 14:29:25.078+0000 INFO 0 ms: (planning: 0, waiting: 0) - 91 page hits, 0 page faults - bolt-session bolt conversion neobolt/1.7.13 Python/2.7.15-final-0 (linux2) client/35.209.86.97:50436 server/35.202.116.199:7687> conversion - match (s:Product)<-[r:HAS_CONTEXT]->(t) where s.pk=$pk return s,r,t - {pk: 'xxx'} - {}
2019-07-10 14:29:25.313+0000 INFO 147 ms: (planning: 90, waiting: 0) - 13 page hits, 0 page faults - bolt-session bolt conversion neobolt/1.7.13 Python/2.7.15-final-0 (linux2) client/35.209.150.197:57560 server/35.202.116.199:7687> conversion - match (p:Product)-[:HAS_CONTEXT]-(c:CloneID) where p.pk="xxx" return c.name as clone_id - {} - {}
2019-07-10 14:29:25.464+0000 INFO 129 ms: (planning: 89, waiting: 0) - 7 page hits, 0 page faults - bolt-session bolt conversion neobolt/1.7.13 Python/2.7.15-final-0 (linux2) client/35.209.150.197:57560 server/35.202.116.199:7687> conversion - match (s:Product) where s.pk="xxx" return s - {} - {}
2019-07-10 14:29:25.467+0000 INFO 0 ms: (planning: 0, waiting: 0) - 49 page hits, 0 page faults - bolt-session bolt conversion neobolt/1.7.13 Python/2.7.15-final-0 (linux2) client/35.209.150.197:57560 server/35.202.
So the majority of the time was spent on planning
07-17-2019 09:23 AM
You can save on planning time by using parameters instead of literals in your queries; the same cached plan can then be reused as long as only the parameter values differ between calls.
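On the driver side that just means keeping the query text constant and passing the value separately, e.g. (a sketch reusing the N4J_DRIVERS setup from your earlier snippet; pk_value is a placeholder):
with N4J_DRIVERS['product'].session() as sess:
    # literal baked into the query string: every distinct pk value yields a new
    # query text, so each call pays the ~90 ms planning cost seen in query.log
    sess.run('match (s:Product) where s.pk="' + pk_value + '" return s')
    # parameterized: the query text never changes, so the cached plan is reused
    # (planning: 0, as in the $pk entry above)
    sess.run('match (s:Product) where s.pk = $pk return s', pk=pk_value)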
10-22-2020 06:42 AM
Hi guys, I still have the same problem although I upgraded to 4.1.0.
The more concurrent connections I send to Neo4j (e.g. 500 concurrent connections), the slower the performance I get.
I have tried some solutions: increasing the heap size and using parameters with my queries.
My queries are Cypher projection queries used to create subgraphs in the graph catalog.
So what should I try?