Read / write performance dramatically degrades with concurrent queries
‎12-03-2019 09:55 PM
- Neo4j version: 3.5.13 enterprise
- Driver: Java Bolt driver 1.7.5
- Server: Single node - 8vCPUs - 64 GB RAM - Ubuntu 18.04
- Heap: 24 GB
- Pagecache: 28 GB
I am noticing that the performance of my reads / writes degrade very quickly when running concurrent queries over an extended period of time. I have a single node Neo4j server and 2 workers that run on different servers. Both the workers run 10 threads each, and process a steady stream of transactions. Ideally I would like to have more workers. For each transaction, a thread runs a few Cypher queries, a mix of read and write queries. The database size is currently ~3.5 GB (4M nodes, 10M relationships).
I have the slow query log enabled (threshold 100ms) and notice some read queries start showing in there pretty quickly. The same queries go from 150-200ms to more than 1s in the slow query log over a short period of time (a few minutes), and then stay in the 1.5s - 2s range. If I increase the number of worker threads this goes up even higher.
One of the example queries is the following:
match (ps:Persona)<-[:BELONGS_TO]-(n:Card {card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=93159 return id(n) as node_id
Here is the above query's profile:
Can using where id(ps)=x
be the culprit here, although I am not sure why that should be an issue performance wise? My goal is to have a throughput of around 500-600/s but I am only reaching 8-10/s with the current performance.
‎12-04-2019 12:26 AM
There's for sure something wrong here. Getting 8/10 req/s out of this configuration is close to nothing. A target throughput of 500-600 req/s should be easy to reach.
id(x)
should not cause any issues. It's a O(1)
operation that basically performs a seek operation on your IO subsystem.
Couple of questions to understand the issue better:
- What's the cardinality? How many
Cards
does aPersona
have? Are there "superpresonas" with a huge number of cards? - How many properties to you store on your
Card
nodes? - Can you check using htop (or similar) how busy your CPUs are going while you increase the load?
- Did you do any config changes with respect to bolt serverside pool sizes?
- Can you rule out network latency of being an issue? Are your clients based in the same physical location than your server?
- Do you have any heap extensive operations (large writes, heavy read operations like
ORDER BY
,DISTINCT
, ...) ? If not I think heap of 24G might be too large.
On a different notice: if you're on AWS and use Ubuntu, consider enabling ENA, see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking-ena.html. For the given issue I suspect this won't make a big difference, but worth a try.
‎12-04-2019 10:13 AM
- On an average a
Persona
has 1-2 cards. There are some anomalies where aPersona
would have>50
cards but that is highly uncommon. I get similar slowness on other similar queries too with other Labels in place ofCard
. The cardinality on the other ones is even lower than theCards
- A
Card
only has 2 properties - The CPU usage stays between 25-40% consistently
- I increased the minimum pool size hoping it would help, but I had problems with the default configuration too. Right now I have it as:
dbms.connector.bolt.thread_pool_min_size=20, dbms.connector.bolt.thread_pool_max_size=400, dbms.connector.bolt.thread_pool_keep_alive=10m - The clients are in the same zone as the server. I am on DigitalOcean and the clients connect to the server using private IP. Also, since I see the queries slowing down in the slow query log (the same queries have their execution times go up), network latency might not be an issue. On that thought, can disk I/O be an issue instead?
- My reads and writes are quite simple. The most work a read does would be a
collect(n)
on around 10-15 nodes with not many properties. I can try lowering the heap size and see if that helps
Are there any specific things I should be trying to look for in the logs? I have GC logs enabled too. Another thing I am noticing is that the throughput starts high, keeps going down for a while, and then goes back up for a short period of time, and then starts going down again (this pattern basically repeats a lot, see chart below):
‎12-04-2019 01:07 PM
Here is a small sample from my query.log
. It looks like the queries are spending a lot of time in the planning phase. But also the time taken apart from planning seems to be quite high. I would expect these queries to take not more than a few ms:
2019-12-04 20:56:17.368+0000 INFO 1556 ms: (planning: 1095, waiting: 0) - 6 page hits, 0 page faults - bolt-session bolt neo4j neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6 client/10.132.44.126:51654 server/10.132.139.216:7687> neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:ConsumerInternalId{consumer_internal_id: {consumer_internal_id},client_id: {client_id}}) where id(ps)=5343875 return id(n) as node_id - {}
2019-12-04 20:56:17.564+0000 INFO 1667 ms: (planning: 1282, waiting: 0) - 6 page hits, 0 page faults - bolt-session bolt neo4j neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6 client/10.132.85.52:42346 server/10.132.139.216:7687> neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:ConsumerInternalId{consumer_internal_id: {consumer_internal_id},client_id: {client_id}}) where id(ps)=5343741 return id(n) as node_id - {}
2019-12-04 20:56:17.952+0000 INFO 1552 ms: (planning: 1246, waiting: 0) - 6 page hits, 0 page faults - bolt-session bolt neo4j neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6 client/10.132.44.126:51648 server/10.132.139.216:7687> neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:ConsumerInternalId{consumer_internal_id: {consumer_internal_id},client_id: {client_id}}) where id(ps)=5343896 return id(n) as node_id - {}
2019-12-04 20:56:17.967+0000 INFO 1686 ms: (planning: 1319, waiting: 0) - 6 page hits, 0 page faults - bolt-session bolt neo4j neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6 client/10.132.85.52:42326 server/10.132.139.216:7687> neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:ConsumerInternalId{consumer_internal_id: {consumer_internal_id},client_id: {client_id}}) where id(ps)=5343838 return id(n) as node_id - {}
2019-12-04 20:56:17.967+0000 INFO 1777 ms: (planning: 1385, waiting: 0) - 32 page hits, 0 page faults - bolt-session bolt neo4j neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6 client/10.132.85.52:42334 server/10.132.139.216:7687> neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:IP{ip_address: {ip_address}}) where id(ps)=5343939 return id(n) as node_id - {}
2019-12-04 20:56:18.067+0000 INFO 1857 ms: (planning: 1467, waiting: 0) - 6 page hits, 0 page faults - bolt-session bolt neo4j neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6 client/10.132.44.126:51650 server/10.132.139.216:7687> neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:ConsumerInternalId{consumer_internal_id: {consumer_internal_id},client_id: {client_id}}) where id(ps)=5343921 return id(n) as node_id - {}
2019-12-04 20:56:18.091+0000 INFO 1393 ms: (planning: 1085, waiting: 0) - 6 page hits, 0 page faults - bolt-session bolt neo4j neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6 client/10.132.85.52:42348 server/10.132.139.216:7687> neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:ConsumerInternalId{consumer_internal_id: {consumer_internal_id},client_id: {client_id}}) where id(ps)=5344009 return id(n) as node_id - {}
2019-12-04 20:56:18.094+0000 INFO 1231 ms: (planning: 953, waiting: 0) - 6 page hits, 0 page faults - bolt-session bolt neo4j neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6 client/10.132.44.126:51662 server/10.132.139.216:7687> neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:ConsumerInternalId{consumer_internal_id: {consumer_internal_id},client_id: {client_id}}) where id(ps)=5343862 return id(n) as node_id - {}
2019-12-04 20:56:18.286+0000 INFO 1652 ms: (planning: 1018, waiting: 0) - 14 page hits, 0 page faults - bolt-session bolt neo4j neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6 client/10.132.85.52:42344 server/10.132.139.216:7687> neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:Card{card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=5333585 return id(n) as node_id - {}
2019-12-04 20:56:18.288+0000 INFO 1687 ms: (planning: 1245, waiting: 0) - 6 page hits, 0 page faults - bolt-session bolt neo4j neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6 client/10.132.44.126:51652 server/10.132.139.216:7687> neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:Card{card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=5343980 return id(n) as node_id - {}
2019-12-04 20:56:18.353+0000 INFO 1762 ms: (planning: 1376, waiting: 0) - 26 page hits, 0 page faults - bolt-session bolt neo4j neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6 client/10.132.44.126:51664 server/10.132.139.216:7687> neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:IP{ip_address: {ip_address}}) where id(ps)=5343781 return id(n) as node_id - {}
2019-12-04 20:56:18.421+0000 INFO 1727 ms: (planning: 1418, waiting: 0) - 48 page hits, 0 page faults - bolt-session bolt neo4j neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6 client/10.132.85.52:42340 server/10.132.139.216:7687> neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:ConsumerInternalId{consumer_internal_id: {consumer_internal_id},client_id: {client_id}}) where id(ps)=5344063 return id(n) as node_id - {}
2019-12-04 20:56:18.432+0000 INFO 1840 ms: (planning: 1211, waiting: 0) - 16 page hits, 0 page faults - bolt-session bolt neo4j neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6 client/10.132.85.52:42342 server/10.132.139.216:7687> neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:ConsumerInternalId{consumer_internal_id: {consumer_internal_id},client_id: {client_id}}) where id(ps)=5276381 return id(n) as node_id - {}
2019-12-04 20:56:18.456+0000 INFO 1746 ms: (planning: 1293, waiting: 0) - 6 page hits, 0 page faults - bolt-session bolt neo4j neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6 client/10.132.85.52:42338 server/10.132.139.216:7687> neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:Card{card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=5344098 return id(n) as node_id - {}
2019-12-04 20:56:18.474+0000 INFO 1870 ms: (planning: 1437, waiting: 0) - 6 page hits, 0 page faults - bolt-session bolt neo4j neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6 client/10.132.85.52:42330 server/10.132.139.216:7687> neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:Card{card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=5343702 return id(n) as node_id - {}
2019-12-04 20:56:18.478+0000 INFO 1768 ms: (planning: 1240, waiting: 0) - 8 page hits, 0 page faults - bolt-session bolt neo4j neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6 client/10.132.85.52:42328 server/10.132.139.216:7687> neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:Card{card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=5343999 return id(n) as node_id - {}
2019-12-04 20:56:18.770+0000 INFO 1676 ms: (planning: 1293, waiting: 0) - 9 page hits, 0 page faults - bolt-session bolt neo4j neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6 client/10.132.44.126:51656 server/10.132.139.216:7687> neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:Card{card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=5343929 return id(n) as node_id - {}
2019-12-04 20:56:18.831+0000 INFO 1577 ms: (planning: 1133, waiting: 0) - 12 page hits, 0 page faults - bolt-session bolt neo4j neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6 client/10.132.44.126:51660 server/10.132.139.216:7687> neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:Card{card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=5276381 return id(n) as node_id - {}
2019-12-04 20:56:19.099+0000 INFO 1795 ms: (planning: 1336, waiting: 0) - 8 page hits, 0 page faults - bolt-session bolt neo4j neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6 client/10.132.44.126:51646 server/10.132.139.216:7687> neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:Card{card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=2386925 return id(n) as node_id - {}
2019-12-04 20:56:19.144+0000 INFO 1715 ms: (planning: 1286, waiting: 0) - 8 page hits, 0 page faults - bolt-session bolt neo4j neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6 client/10.132.44.126:51658 server/10.132.139.216:7687> neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:Card{card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=5343975 return id(n) as node_id - {}
2019-12-04 20:56:19.227+0000 INFO 1618 ms: (planning: 1287, waiting: 0) - 6 page hits, 0 page faults - bolt-session bolt neo4j neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6 client/10.132.85.52:42346 server/10.132.139.216:7687> neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:Card{card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=5344156 return id(n) as node_id - {}
2019-12-04 20:56:19.266+0000 INFO 1849 ms: (planning: 1508, waiting: 0) - 25 page hits, 0 page faults - bolt-session bolt neo4j neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6 client/10.132.44.126:51654 server/10.132.139.216:7687> neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:Card{card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=5211232 return id(n) as node_id - {}
‎12-04-2019 04:58 PM
You really should be using parameters in your queries.
Currently, since the values you're using in the queries are hardcoded, each is treated as distinct query in the query cache, so you aren't able to take advantage of that and the query needs to be replanned each time.
If you parameterize (with something like where id(ps)={psId}
) then it should look like the same query no matter how the parameters change, allowing you to skip the planning and reuse the query from the cache.
‎12-04-2019 05:20 PM
Yes you are right!! I can't believe I was missing that. Every other aspect of my queries are parameterized, for some reason I totally forgot to parameterize the id
part. My bad. Thanks for pointing that out.
Also, removing the query planning time, it looks like the queries spend ~300-400ms
after the planning phase (not sure if there are other things to consider there too). Is that normal? I would expect it to run a lot faster since the profile shows that it's an index lookup and it's almost always a pagecache hit.
‎12-04-2019 05:48 PM
Can you show an expanded query plan of a query for us to see?
‎12-04-2019 10:43 PM
Here is the query plan for one of the queries that shows up the most in the slow query log. The other queries are very similar to this one, just use different labels and properties (indexed).
‎12-04-2019 11:38 PM
Can you please switch off detailed query logging? Maybe put a treshold of 1sec and make your you've switched off all the detailed options from https://neo4j.com/docs/operations-manual/current/monitoring/logging/query-logging/ - those do impact performance.
Are your CPUs saturated once you run the test with these changes? Would be also interesting how much you spend in "IO wait".
Another thing you can try as a experiment:
- limit the bolt thread pool to ~= number of cpu cores:
dbms.connector.bolt.thread_pool_max_size=10
- and establish a queue for buffering, note this is a experimental feature:
dbms.connector.bolt.unsupported_thread_pool_queue_size=10000
That should remove frequest CPU context switches (which are overhead) and increase throughput. If you see client side bolt errors, consider to increase the queue size parameter.
‎12-10-2019 10:15 AM
Have you tried another cloud provider for comparison?
I had terrible performance on Digital Ocean, evidently they over deployed on certain instances, and after the latest outage, I moved to GKE ON Google cloud that solved performance and reliability altogether.
Your setup looks pretty normal to me and should run way faster than that so I would double check if a quick test on gcp and Aws runs better or worse.
‎12-19-2019 09:00 AM
I had very recently a very similar encounter with extremely slow concurrent write operations.
Apparently, neo4j has a hard time with mixed read/write concurrent operations. In my case, 7 concurrent clients degraded performance by over a 100%. Beyond that, neo was effectively dead past 10 concurrently writing clients with latency & throughput reaching outrageous levels. However, a few things helped me to make things better:
- Centralize write operations through an API Gateway with one DB client handling all DB operations.
- Batch write operations(!)
- Sequential everything possible that does not occur in parallel.
Fixing this issue was a very sobering exercise, but it occurs to me that Java's concurrency underlying neo4j seems to work way better with batches & queues. There is an article discussing the details of tuning neo for concurrent write operations but the very existence of that article left me wondering why I actually need to batch write operations on the application side?
It is particularly hard to swallow when recognizing that DGraph easily handles 500 concurrent connections with each pushing thousands of write operations while neo essentially dies way earlier.
Is there anyone from the Neo4J engineering team around who is willing to share some insights of how to handle hundreds of concurrent client connections with each having thousands of write operations?
Also, why do we we need to batch write operations in the first place to get reasonable performance?
‎12-30-2019 03:22 PM
Marvin,
it would be great to understand more details on your issue. Would you mind posting it as a new topic with more details neo4j version, setup & configuration, (the type of your read/write queries) PROFILE output for either, query log, and esp. also outputs of call dbms.listQueries()
which also show potential contention.
Thank you!
‎01-19-2020 04:06 PM
@michael.hunger sorry, no way.
The problems became so bad that we already migrate everything away from neo. For us, neo was a complete waste of time and not worth fixing.
‎09-13-2020 01:58 PM
Hello,
I have the same problem with Neo4j: many errors, many side effects and low performance...
May I ask you about alternative of Neo4j? I try to find out the same graph database, but I have not a good alternative...
Best regards
‎07-17-2022 12:59 PM
@marvin-hansen I am running into a similar issue. Could you please share what did you end up using and how has it been with you after 2 years?
Thanks
‎12-19-2019 01:04 PM
Thanks so much for the all the ideas guys. I think the main bottleneck here was with the query cache. I parameterized all of my queries, disabled detailed query logging and put a higher threshold on the slow query log (500 ms). With that I am able to reach a throughput of well above 1000 req/s. I realized how important query cache is and how important it is to parameterize queries.