Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
10-15-2018 02:29 AM
Last night both our servers crashed (or stopped working for about 10 minutes) after a GC error.
This is the GC log record from the last incident this morning:
2018-10-15T11:10:46.108+0200: 1367360.645: [Full GC (Allocation Failure) 23G->21G(23G), 85.9388715 secs] [Eden: 0.0B(1208.0M)->0.0B(1208.0M) Survivors: 0.0
The logs from both servers are in this drive: https://drive.google.com/drive/folders/1PhwMXwImOYjBMzFvGqk3REopX9hxnHki?usp=sharing
We restarted N2 after the incident; this morning N1 froze again on the same error.
I guess our memory settings are not optimized.
dbms.memory.heap.initial_size=24200m
dbms.memory.heap.max_size=24200m
dbms.memory.pagecache.size=28100m
Both servers are running with 6 CPUs and 64 GB of RAM.
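For context: those settings commit roughly 24 GB of heap plus 28 GB of page cache, about 52 GB of the 64 GB, before JVM native overhead and the OS are accounted for. A more conservative split might look like the sketch below (illustrative numbers only, not a recommendation; on 3.4+ `neo4j-admin memrec` can propose values for a given machine):

```properties
# Illustrative split for a 64 GB box: leave headroom for the OS,
# JVM native memory (threads, metaspace) and other processes.
dbms.memory.heap.initial_size=16g
dbms.memory.heap.max_size=16g
dbms.memory.pagecache.size=28g
```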
10-16-2018 01:39 PM
Can you enable query logging and GC logging, and share the query logs and GC logs?
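For reference, in Neo4j 3.x both can be turned on in neo4j.conf; a sketch (setting names as of 3.4, query logging requires Enterprise):

```properties
# Query logging: writes to logs/query.log
dbms.logs.query.enabled=true
# Only log queries slower than this threshold (0 logs everything)
dbms.logs.query.threshold=500ms
# GC logging: writes to logs/gc.log
dbms.logs.gc.enabled=true
```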
10-16-2018 01:53 PM
Hi Michael,
GC logs are in the Google drive folder already. Together with the debug logs and neo log.
I don't think query logging will be of much use; we run thousands of queries a minute.
10-20-2018 06:24 PM
Query logs would still be helpful.
You might choose to add a threshold, but then it might filter out some relevant bits.
10-20-2018 06:49 PM
Looking at the logs, there are a lot of resource exhaustions;
the bolt thread pool seems to be full (you might want to increase the pool size), leading to a lot of rejected/aborted queries.
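As an aside, from Neo4j 3.4 the Bolt worker pool can be sized in neo4j.conf; a sketch with illustrative values (check the defaults for your exact minor version before changing anything):

```properties
# Bolt worker thread pool (Neo4j 3.4+).
# Raising the max lets more concurrent queries run instead of being rejected,
# at the cost of more threads and heap pressure.
dbms.connector.bolt.thread_pool_min_size=10
dbms.connector.bolt.thread_pool_max_size=1000
```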
Also, heap utilization is almost at the maximum the whole time (the GC log starts at 21G of 23G used) and creeps upwards to 23G.
Already at startup the store uses almost all of the heap, which is really odd.
Just out of curiosity: why are you still using HA with 3.4? It's slated to be removed in 4.0, so you might want to consider migrating.
Would it be possible to start a test instance on a copy of the store with the same page-cache setting but less heap, e.g. 4G or 6G, and share debug.log? And then also take a heap dump to figure out what takes up the initial memory: jmap -dump:file=myheap.bin {pid of the JVM}
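Expanding that command slightly, a hedged ops sketch (the pid lookup is an assumption about how the process is named on your box; the dump file will be roughly heap-sized):

```shell
# Find the Neo4j JVM pid (adjust the pattern to your setup), then dump the heap
pid=$(pgrep -f neo4j)
jmap -dump:format=b,file=myheap.bin "$pid"
# Analyze the dump offline, e.g. with Eclipse MAT or jvisualvm
```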
10-22-2018 07:45 AM
Hi Michael,
> The bolt thread pool seems to be full (you might want to increase the pool size) leading to a lot of rejected/aborted queries.
> Just out of curiosity: why are you still using HA with 3.4? It's meant to go away in 4.0 so you might want to consider migrating.
> Would it be possible to start a test instance on a copy of the store with the same page-cache setting but less heap e.g. 4G or 6G and share debug.log? And then also take a heapdump to figure out what takes the initial memory. jmap -dump:file=myheap.bin {pid of the JVM}
https://drive.google.com/drive/folders/1InbtELmbuThBCl-PXtpDvtTId64ZT6j2?usp=sharing
Thanks again for looking into it.
10-22-2018 03:39 PM
Unfortunately, as you can see from the logs, that server has very different startup behavior: the heap is all free at startup and is also freed during GC.
# debug.log
Memory Pool: G1 Old Gen (Heap memory): committed=5.55 GB, used=0.00 B, max=5.86 GB, threshold=0.00 B
# gc.log
[Eden: 2486.0M(2486.0M)->0.0B(3536.0M) Survivors: 64.0M->64.0M Heap: 2617.4M(6000.0M)->130.4M(6000.0M)]
So I'm not really sure how to continue, except by trying to take a heap dump of that prod server.
This is definitely something that should be a support issue.
10-22-2018 04:43 PM
Hey Michael,
I can take a heap dump from the prod server. Should I do that right after a restart, or now while it's running?
10-23-2018 03:09 PM
Take it while it's running, once the debug log or GC log shows close to 23G used of 23G, i.e. the heap is almost full.