Neo4j

bill_dickenson · ‎02-05-2021

Our AWS Neo4j instance is down again with a "server refused connection" error.

We were cleaning up some nodes (match (a: {}) detach delete) in batches of 1000 nodes every 5 minutes, so nothing extraordinary. About 30 seconds into the 5th batch, the process errored out and issued a "server refused" message.
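
For reference, a batched delete of this kind is usually written with an explicit LIMIT so that each transaction stays bounded. The following is a minimal sketch, not the query that was actually run: the label `Stale` and the helper name `build_batch_delete` are hypothetical (the post elides the real label), and the function only builds a Cypher string without contacting the database.

```python
import re


def build_batch_delete(label, batch_size=1000):
    """Return a Cypher statement that deletes at most batch_size nodes.

    `label` is a hypothetical node label; it is validated so that it can be
    safely interpolated into the query text.
    """
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", label):
        raise ValueError(f"unsafe label: {label!r}")
    if not isinstance(batch_size, int) or batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    # WITH ... LIMIT caps how many nodes one transaction touches.
    return f"MATCH (a:{label}) WITH a LIMIT {batch_size} DETACH DELETE a"
```

Calling `build_batch_delete("Stale", 1000)` gives one statement per batch; the caller would repeat it until no nodes remain.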

Any further attempts issued the same message.

We were able to log in to Linux and check Neo4j, which appears to be running, but no one can connect. We restarted Neo4j: same result. We restarted Linux and Neo4j: same result. At this stage, we can't tell whether Bolt or Neo4j itself is refusing us, but in any event, no connection.
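
One way to narrow down where the refusal comes from is to test each port at the TCP level, both from the server itself and from a client machine. The sketch below assumes only the port numbers listed in this post; the function names are made up for illustration. A refused TCP connection generally means nothing is listening on that address and port (or a firewall is rejecting it), which is a different situation from a listener that accepts the connection and then fails later.

```python
import socket


def port_accepts_connections(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_neo4j_ports(host, ports=(7474, 7473, 7687)):
    """Check the HTTP, HTTPS, and Bolt ports mentioned in the post."""
    return {port: port_accepts_connections(host, port) for port in ports}
```

Running `check_neo4j_ports("127.0.0.1")` on the server and then the same call with the private and public addresses can show whether the listener is bound to the address the clients are actually using.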

We cleared caches etc. on local machines.

This is the 2nd time in 4 months; last time was also a delete, but of a bigger set.

Obviously this can't happen in production. How do we restart?
Help

Firewall has port 7473 open for HTTPS access, port 7474 open for HTTP access, and 7687 open for BOLT access. Port 1337 is open for Cypher Shell connections. Ports 80 and 443 are open for normal HTTP/HTTPS.

dbms.default_listen_address=<the AWS server address 172.xxx.xxx.xxx>

dbms.default_advertised_address=<the public URL xxx.someplace.net>

# Bolt connector
dbms.connector.bolt.enabled=true
dbms.connector.bolt.tls_level=OPTIONAL
#dbms.connector.bolt.listen_address=:7687 (deprecated, commented out)
dbms.connector.bolt.advertised_address=:7687

# HTTP Connector. There can be zero or one HTTP connectors.
dbms.connector.http.enabled=true
dbms.connector.http.listen_address=0.0.0.0:7474
dbms.connector.http.advertised_address=:7474

# HTTPS Connector. There can be zero or one HTTPS connectors.
dbms.connector.https.enabled=true
dbms.connector.https.listen_address=0.0.0.0:7473
dbms.connector.https.advertised_address=:7473

# SSL policies
dbms.ssl.policy.bolt.base_directory=/var/lib/neo4j/certificates/bolt
dbms.ssl.policy.https.base_directory=/var/lib/neo4j/certificates/https

**Confirmed that the SSL policies have symlinks to the appropriate certificates.**

neo4j.cert -> /etc/letsencrypt/live/<our.server.name>/fullchain.pem
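
A symlink being present does not by itself show that the target exists or that the account running Neo4j can read it. The sketch below is one way to check that; the function name `inspect_certificate_link` is made up here, and the result only reflects the permissions of whichever user runs the script, so it would need to be run as the Neo4j service user to be meaningful.

```python
import os


def inspect_certificate_link(path):
    """Report where a certificate symlink points and whether it is readable."""
    target = os.path.realpath(path)  # follows the whole symlink chain
    exists = os.path.exists(target)
    return {
        "path": path,
        "is_symlink": os.path.islink(path),
        "target": target,
        "exists": exists,
        # Readability is evaluated for the user running this script.
        "readable": exists and os.access(target, os.R_OK),
    }
```

For example, `inspect_certificate_link("/var/lib/neo4j/certificates/bolt/neo4j.cert")` would report on the Bolt certificate link, assuming the file sits directly in the base directory from the config above.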

david_allen · ‎02-05-2021

There are several possibilities of what could be causing this. You'll need to consult your debug.log to troubleshoot further, and maybe dump some logs from the time period where this occurred.

Insufficient heap; possibly some of your concurrent transactions are getting big enough that the database doesn't have enough memory and becomes unresponsive in advance of an "Out of Memory Error"
Long checkpointing - check the logs for evidence
"Stop the world GC pauses" - again, you'd need to spot this in the logs, but it's a memory tuning issue

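The three checks above can be roughly automated by scanning debug.log for related keywords. This is a sketch under stated assumptions: the search patterns are guesses at what the relevant log lines contain, not an authoritative list of Neo4j messages, and the function names are invented for illustration. "GC" here means the JVM's garbage collection. The path to debug.log depends on the installation.

```python
import re
from collections import defaultdict

# Assumed keywords for each suspected cause; adjust to what the log contains.
PATTERNS = {
    "heap": re.compile(r"OutOfMemoryError|Java heap space", re.IGNORECASE),
    "gc_pause": re.compile(r"stop-the-world", re.IGNORECASE),
    "checkpoint": re.compile(r"checkpoint", re.IGNORECASE),
}


def scan_log_lines(lines):
    """Group matching lines by suspected cause as (line_number, text) pairs."""
    hits = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append((number, line.rstrip("\n")))
    return dict(hits)


def scan_log_file(path):
    """Scan a log file on disk, tolerating undecodable bytes."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return scan_log_lines(handle)
```

An empty result does not rule these causes out; it only means none of the assumed keywords appeared, so the timestamps around the failure are still worth reading by hand.
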
bill_dickenson · ‎02-05-2021

Thank you;

Heap: Grepped the log for 'insufficient' and 'eap' (to cover both lower and upper case); other than the change at startup, no indication.
Checkpointing: Oddly enough, there is a message. Token Manager exception, client triggered unexpected error [Neo.DatabaseError.general.unknownError]:lexical error @ line 1, col 2 - EOF <> after "," - But no checkpoint errors.
GC? General Config? Nothing that looks like it.

No obvious errors.


Neo4j refuses connection after delete