01-08-2020 07:18 AM
I have a read query (call it Query Y) that has been running for over 8 hours. My dataset is large, but not so large that a query should take over 8 hours. A similar query (call it Query X) completed in 61 minutes, before I started this one.
The query starts by looking up a Person node with a given id (an index is already applied on the id field). It then locates friends and friends-of-friends of that Person, and goes on to locate all Forum nodes that those friends are members of, via a hasMember relationship, such that the joinDate property on the relationship is after a given start date. Lastly, it needs to return the count of all Post nodes, for posts which were created by these friends, that are contained in those forums.
All nodes are indexed on their id field.
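For reference, Query Y has roughly this shape (a simplified sketch; the knows, containerOf, and hasCreator names below are stand-ins for my actual relationship types):

MATCH (p:Person {id: $personId})-[:knows*1..2]-(friend:Person)
WHERE friend <> p
WITH DISTINCT friend
MATCH (forum:Forum)-[m:hasMember]->(friend)
WHERE m.joinDate > $startDate
MATCH (forum)-[:containerOf]->(post:Post)-[:hasCreator]->(friend)
RETURN count(post) AS postCount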
Here's the query plan with EXPLAIN (I could not get the PROFILE results because the query never completed) -
When I looked at the debug.log file, I noticed a lot of Stop-the-World (STW) pauses sprinkled throughout the execution of these two long-running queries (X and Y). From what I've read, these pauses occur when the JVM GC pauses all other application threads.
Here is the relevant portion from the debug.log file -
2020-01-07 15:39:11.648+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=180, gcTime=192, gcCount=1}
2020-01-07 15:39:11.824+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=130, gcTime=0, gcCount=0}
2020-01-07 15:49:57.542+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=189, gcTime=212, gcCount=1}
2020-01-07 15:50:01.690+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=119, gcTime=130, gcCount=1}
2020-01-07 16:01:23.829+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=162, gcTime=173, gcCount=1}
2020-01-07 16:21:25.807+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=130, gcTime=139, gcCount=1}
2020-01-07 16:21:25.995+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=109, gcTime=0, gcCount=0}
2020-01-07 16:22:28.916+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=125, gcTime=188, gcCount=1}
2020-01-07 16:22:30.558+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=314, gcTime=0, gcCount=0}
2020-01-07 16:25:31.980+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=135, gcTime=144, gcCount=1}
2020-01-07 16:25:32.231+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=169, gcTime=0, gcCount=0}
The complete set of STW logs from while the two queries were running is in this github gist, and the complete debug.log file is here. For reference, the relevant logs are at the very end, on 2020-01-07 (when I executed these two queries).
What can I do in this case? I had already stopped and restarted both queries X and Y when they were taking too long. While query X completed in 61 minutes the second time, query Y ran for over 8 hours and never completed. For comparison, the other queries I execute take only 1-3 minutes on average for this dataset. My memory settings in the configuration file are heap (initial) - 4G, heap (max) - 8G, and pagecache - 6G, based on the recommendations suggested here. I don't have the exact recommendations from the memrec tool, because I'm on JDK 11 and the tool seems to work only with JDK 8.
I have 16GB of RAM; should I set the heap size (both initial and max) to maybe 12G, based on the suggested config -
The heap memory size is determined by the parameters dbms.memory.heap.initial_size and dbms.memory.heap.max_size. It is recommended to set these two parameters to the same value. This will help avoid unwanted full garbage collection pauses.
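For concreteness, the change I'm considering in neo4j.conf would look something like this, with dbms.memory.pagecache.size left at 6G (the 12G figure is just my guess from the question above, not a vetted recommendation):

dbms.memory.heap.initial_size=12g
dbms.memory.heap.max_size=12g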
01-08-2020 12:44 PM
Thanks for the debug.log. A couple of issues:
We recommend setting dbms.memory.heap.initial_size and dbms.memory.heap.max_size to the same value. From your debug.log I see, for example:
2020-01-04 00:08:44.292+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.memory.pagecache.size=6G
2020-01-04 00:08:44.293+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.memory.heap.max_size=8G
2020-01-04 00:08:44.293+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.jvm.additional=-XX:+UseG1GC
so if anything, let's set them to the same value.
As stated in your initial problem description:
I have 16GB of RAM; should I set the heap size (both initial and max) to maybe 12G, based on the suggested config -
and yes, this is evident from
2020-01-04 00:08:45.884+0000 INFO [o.n.i.d.DiagnosticsManager] Total Physical memory: 16.00 GB
But the size of your graph is at least some 34G, as evidenced by
2020-01-01 23:24:53.604+0000 INFO [o.n.i.d.DiagnosticsManager] Total size of mapped files: 35.74 GB
so it's going to be difficult to get the entire graph into RAM (i.e. into dbms.memory.pagecache.size), given that the size of your graph is more than 2x the total RAM of the instance. Neo4j can still operate when the entire graph does not fit in RAM.
From the logs it appears you have set the dbms.memory.pagecache.size as
2020-01-04 00:39:58.522+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.memory.pagecache.size=6G
so if 6G goes to the pagecache and we set 8G for the heap (both initial and max), we would still have 2G left over for other OS processing. But that still means only 6G of the ~34GB graph is in RAM, i.e. less than 25% of the graph. If you had more RAM, we could increase the pagecache and get more of the graph in RAM.
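In neo4j.conf terms, that would look something like the following (a sketch for your 16GB machine, not a memrec-verified recommendation):

dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g
dbms.memory.pagecache.size=6g

You can verify what the running instance actually picked up with CALL dbms.listConfig('dbms.memory') from the browser or cypher-shell.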
Finally, in looking at the debug.log, you appear to have the default settings for transaction log retention: each log grows to roughly 250MB before rotating, and 7 days' worth of these logs are kept. You will see dozens and dozens of these logs reported in the debug.log, for example:
2020-01-04 00:15:29.964+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.77: 2019-12-26T20:28:22-0500 - 250.02 MB
2020-01-04 00:15:29.964+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.78: 2019-12-26T20:31:00-0500 - 250.03 MB
2020-01-04 00:15:29.965+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.79: 2019-12-26T20:32:46-0500 - 250.02 MB
2020-01-04 00:15:29.965+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.8: 2019-12-26T19:31:14-0500 - 250.03 MB
2020-01-04 00:15:29.965+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.80: 2019-12-26T20:33:25-0500 - 250.01 MB
2020-01-04 00:15:29.966+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.81: 2019-12-26T20:33:53-0500 - 250.01 MB
2020-01-04 00:15:29.966+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.82: 2019-12-26T20:34:19-0500 - 250.02 MB
2020-01-04 00:15:29.966+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.83: 2019-12-26T20:34:44-0500 - 250.03 MB
2020-01-04 00:15:29.966+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.84: 2019-12-26T20:35:11-0500 - 250.02 MB
2020-01-04 00:15:29.966+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.85: 2019-12-26T20:35:35-0500 - 250.03 MB
2020-01-04 00:15:29.967+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.86: 2019-12-26T20:36:02-0500 - 250.01 MB
2020-01-04 00:15:29.967+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.87: 2019-12-26T20:36:29-0500 - 250.04 MB
2020-01-04 00:15:29.967+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.88: 2019-12-26T20:36:55-0500 - 250.02 MB
2020-01-04 00:15:29.967+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.89: 2019-12-26T20:37:33-0500 - 250.03 MB
2020-01-04 00:15:29.967+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.9: 2019-12-26T19:31:37-0500 - 250.01 MB
2020-01-04 00:15:29.967+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.90: 2019-12-26T20:37:58-0500 - 250.01 MB
2020-01-04 00:15:29.968+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.91: 2019-12-26T20:38:28-0500 - 250.02 MB
2020-01-04 00:15:29.968+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.92: 2019-12-26T20:40:04-0500 - 250.00 MB
2020-01-04 00:15:29.968+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.93: 2019-12-26T20:40:29-0500 - 250.04 MB
2020-01-04 00:15:29.968+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.94: 2019-12-26T20:40:56-0500 - 250.04 MB
2020-01-04 00:15:29.969+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.95: 2019-12-26T20:41:21-0500 - 250.03 MB
2020-01-04 00:15:29.969+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.96: 2019-12-26T20:42:59-0500 - 250.03 MB
2020-01-04 00:15:29.969+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.97: 2019-12-26T20:43:26-0500 - 250.00 MB
2020-01-04 00:15:29.970+0000 INFO [o.n.i.d.DiagnosticsManager] neostore.transaction.db.98: 2019-12-26T20:43:52-0500 - 250.01 MB
Although the retention settings of the transaction logs will not hurt/improve query performance, you might want to reconsider the retention duration. Please see https://neo4j.com/docs/operations-manual/3.5/configuration/transaction-logs/ for details on configuring retention.
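For example, to keep far less transaction-log history, you could set something like the following in neo4j.conf (the values below are illustrative only; pick a retention window that fits your backup strategy):

# keep only the most recent ~1GB of transaction logs
dbms.tx_log.rotation.retention_policy=1G size
# or, alternatively, keep only the last 2 days' worth
# dbms.tx_log.rotation.retention_policy=2 days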
01-08-2020 01:04 PM
Thank you for the detailed response. I really appreciate it.
I set the heap size (both initial and max) to 10G, and started the query. It's been over two hours and it is still running, but this time I don't see any stop-the-world pause logs in debug.log. The latest entry in debug.log is -
INFO [o.n.k.i.p.PageCacheWarmer] Page Cache warmup completed. 783341 pages loaded. Duration: 40m 30s 860ms.
Regarding this -
My initial dataset (csv files) is around 30GB, which might explain why the total size of mapped files is 35.74 GB, but the entire database size in Neo4j (after loading all the nodes and relationships) comes to around 240GB. I'm not sure if that number should be used for the pagecache size calculation instead of 35.74 GB.
Either way, I now understand what pagecache size actually means in terms of how the graph fits in memory, thanks to your explanation - I wasn't quite clear on this from reading the docs.
I'll keep the query running this time, and hope it completes at some point. I will keep you posted.
01-08-2020 01:23 PM
My initial dataset (csv files) is around 30GB, which might explain why the total size of mapped files is 35.74 GB, but the entire database size in Neo4j (after loading all the nodes and relationships) comes to around 240GB.
240GB? If so, that's surprising, and if you only have 16GB of total RAM on the instance, then less than 10% of the graph would be in RAM.
How did you determine the database size is 240GB?
Is this inclusive of the transaction log files (i.e. neostore.transaction*)? Those should not be counted toward database size, since they are not loaded into the pagecache, which is why my last response also said you might want to change your transaction log retention parameters.
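One quick way to check (assuming GNU du; adjust the path to wherever your graph.db directory lives) is to measure the store with and without the transaction logs:

du -sh /path/to/graph.db
du -sh --exclude='neostore.transaction*' /path/to/graph.db

The difference between the two is the space consumed by the transaction logs alone.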
01-08-2020 02:39 PM
This is the size of the database folder corresponding to my project in Neo4j Desktop. I got the folder path from the solution recommended here. It is located under neo4jDatabases/database-xxxx-xxxx-xxxx-xxxx, as shown in this screenshot -
This is also the size that used to be shown, up until a few months ago, in Neo4j Browser itself, as you can see here -
But after the last few updates to Neo4j Desktop this size is no longer shown, so I'm guessing it must have been removed.
Since I'm taking the size of the entire database-xxxx folder, I think it does include all the transaction log files, indexes, etc.
In fact, I've been meaning to ask this for quite some time now - all my datasets (csv files) expand to about 7 times their size when completely loaded into Neo4j. For instance, my 1GB dataset occupies roughly 7GB of disk space when imported into Neo4j, a 3GB dataset occupies around 23GB, and so on. I'm not sure why this is the case.
The graph is on a spinning disk.
01-08-2020 02:56 PM
ok... a spinning disk might account for the slower performance, as well as the fact that not much of the graph is in RAM (i.e. dbms.memory.pagecache.size).
Database size is not simply determined by the total size of the graph.db filesystem. I could create 10 million nodes every day and retain transaction logs for 30 days, versus a system where I create 10 million nodes every day but retain transaction logs for only 3 days. Both end up with the same total number of nodes after 30 days, but the graph.db sizes will be greatly different based on the transaction log retention differences.
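To put rough numbers on it: if a day's writes fill, say, four 250MB log files, then 30 days of retention holds roughly 30GB of transaction logs on disk, while 3 days of retention holds roughly 3GB, for exactly the same graph content.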
01-08-2020 03:22 PM
Also, there's some context as to why I'm running the project on a spinning disk (my personal laptop) instead of an SSD. I've been working on my thesis with Neo4j for the last couple of years (back when Neo4j was at v2), and during this time I've tried several times to run it on AWS EC2 with much more powerful instances, like the m3.large, etc. But since I was using the Community Edition then, I ran into a lot of issues, like db crashes and hangs/freezes, which would require re-running the same query multiple times, or sometimes starting over with importing the data from scratch.
During this time, I've logged some of these issues in the neo4j github repository, like this and this.
Since I'm just a student, using powerful EC2 instances for all of this would have dramatically increased the cost, especially since there was no way of knowing whether the current operation would complete successfully or not.
In fact, people still report issues like extremely long query times or unresponsiveness while using EC2 instances like the m5.2xlarge, as seen in this post.
Regarding this, is there some resource I can read up on to understand how to effectively change the transaction log retention parameters?
Thank you once again.
01-08-2020 01:25 PM
Also -
INFO [o.n.k.i.p.PageCacheWarmer] Page Cache warmup completed. 783341 pages loaded. Duration: 40m 30s 860ms.
this took 40 minutes, which appears a bit slow. Is your graph on SSD drives or a spinning disk?