
Server crashes on Full GC Failure

Last night both our servers crashed (or stopped working for about 10 minutes) after a GC error.

This is the GC log record from the last incident this morning:

2018-10-15T11:10:46.108+0200: 1367360.645: [Full GC (Allocation Failure) 23G->21G(23G), 85.9388715 secs] [Eden: 0.0B(1208.0M)->0.0B(1208.0M) Survivors: 0.0

  • Ubuntu 14 LTS
  • Neo4j 3.4.7 Enterprise in HA clustering
  • BOLT

The logs from both servers are in this drive: https://drive.google.com/drive/folders/1PhwMXwImOYjBMzFvGqk3REopX9hxnHki?usp=sharing

We restarted N2 after the incident; this morning N1 froze again with the same error.

I guess our memory settings are not optimized.

dbms.memory.heap.initial_size=24200m
dbms.memory.heap.max_size=24200m
dbms.memory.pagecache.size=28100m

Both servers are running with 6 CPUs and 64 GB of RAM.
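
For reference, roughly how those settings map onto the 64 GB (just back-of-the-envelope arithmetic on the values above, not a tuning recommendation):

# heap: 24200m ≈ 23.6 GB
dbms.memory.heap.initial_size=24200m
dbms.memory.heap.max_size=24200m
# page cache: 28100m ≈ 27.4 GB
dbms.memory.pagecache.size=28100m
# 23.6 GB + 27.4 GB ≈ 51 GB, which leaves only ~13 GB for the OS,
# Bolt worker threads, transaction state and query execution

(neo4j-admin memrec should print a suggested split for a given machine, if it is available in this install.)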


Can you enable query logging and gc logging and share the query logs and gc logs?
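
If they aren't already switched on, a minimal neo4j.conf sketch for both (3.4 setting names; log locations and rotation can be tuned separately):

# write executed queries to query.log
dbms.logs.query.enabled=true
# write GC events to gc.log
dbms.logs.gc.enabled=true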

Hi Michael,

The GC logs are already in the Google Drive folder, together with the debug logs and the neo4j log.

I don't think query logging will be of much use. We run thousands of queries a minute.

Query logs would still be helpful.
You might choose to add a threshold, but then it might filter out some relevant bits.
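
A sketch of what that would look like, building on the setting above (the 500ms value is only an example):

dbms.logs.query.enabled=true
# only log queries that take longer than the threshold
dbms.logs.query.threshold=500ms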

Looking at the logs, there are a lot of resource exhaustions:

The Bolt thread pool seems to be full (you might want to increase the pool size), leading to a lot of rejected/aborted queries.
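
If you want to experiment with the pool size, these are the relevant settings in 3.4 as far as I know (the values below are just placeholders, not a recommendation):

dbms.connector.bolt.thread_pool_min_size=5
dbms.connector.bolt.thread_pool_max_size=400
dbms.connector.bolt.thread_pool_keep_alive=5m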

Also, heap utilization is almost always near the top (the GC log starts at 21G of 23G) and creeps upwards to 23G.
Already at startup the store uses almost all of the heap, which is really odd.

Just out of curiosity, why are you still using HA with 3.4? It's going away in 4.0, so you might want to consider migrating.

Would it be possible to start a test instance on a copy of the store with the same page-cache setting but less heap, e.g. 4G or 6G, and share debug.log? And then also take a heap dump to figure out what takes up the initial memory: jmap -dump:file=myheap.bin {pid of the JVM}
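
For completeness, the full sequence on the test instance would be something like this (paths are assumptions, adjust to your setup):

# find the pid of the test instance's JVM
pgrep -f neo4j
# binary dump of the whole heap; prepend "live," to the options to dump only reachable objects
jmap -dump:format=b,file=/tmp/myheap.bin <pid>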

Hi Michael,

The Bolt thread pool seems to be full (you might want to increase the pool size), leading to a lot of rejected/aborted queries.

  • We will increase the pool size and let you know if that helps.

Just out of curiosity, why are you still using HA with 3.4? It's going away in 4.0, so you might want to consider migrating.

  • We will move to Causal Clustering sometime in the next month. Until now we used a 2-server setup with an arbiter. Since that is not an option with CC, we first needed to move to a new multi-DC hosting cluster.

Would it be possible to start a test instance on a copy of the store with the same page-cache setting but less heap, e.g. 4G or 6G, and share debug.log? And then also take a heap dump to figure out what takes up the initial memory: jmap -dump:file=myheap.bin {pid of the JVM}

  • We just restarted the server with a 6G heap, and after the server came online I took the heap dump. The dump and all the logs are in the Google Drive folder:

https://drive.google.com/drive/folders/1InbtELmbuThBCl-PXtpDvtTId64ZT6j2?usp=sharing

Thanks again for looking into it.

Unfortunately, as you can see from the logs, that server has very different startup behavior: the heap memory is all free at startup and is also freed during GC.

# debug.log
Memory Pool: G1 Old Gen (Heap memory): committed=5.55 GB, used=0.00 B, max=5.86 GB, threshold=0.00 B

# gc.log
[Eden: 2486.0M(2486.0M)->0.0B(3536.0M) Survivors: 64.0M->64.0M Heap: 2617.4M(6000.0M)->130.4M(6000.0M)]

So I'm not really sure how to continue, except to try taking a heap dump of that prod server.
This is definitely something that should be a support issue.

Hey Michael,

I can take a heap dump from the prod server. Do I need to do that after a restart, or right now while it's running?

While it's running, once the debug log shows 23G used of 23G, or the GC log shows the heap almost 23G full.
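
Roughly like this (the gc.log path is an assumption based on a default package install; note that the dump pauses the JVM while it is written):

# check current heap occupancy first
tail -n 5 /var/log/neo4j/gc.log
# then dump the running prod JVM
jmap -dump:format=b,file=/tmp/prod-heap.bin <pid>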