02-21-2020 09:00 AM
Hi All,
To keep it crisp:
I have an application with a three-node cluster, and Neo4j is running on two of the nodes.
The setup was working until an interruption caused both nodes to go into the FOLLOWER role.
I restarted the services on both nodes, and since then my Neo4j instances have not come up.
debug.log does not show any issues, and the neo4j-out logs show that both nodes are stuck in the discovery phase and never move forward:
2020-02-21 11:14:05.236-0500 INFO Bolt enabled on 21.0.0.100:7687.
2020-02-21 11:14:05.246-0500 INFO Initiating metrics...
2020-02-21 11:14:05.538-0500 INFO My connection info: [
Discovery: listen=21.0.0.100:5000, advertised=21.0.0.100:5000,
Transaction: listen=21.0.0.100:6000, advertised=21.0.0.100:6000,
Raft: listen=21.0.0.100:7000, advertised=21.0.0.100:7000,
Client Connector Addresses: bolt://21.0.0.100:7687,http://21.0.0.100:7474,https://21.0.0.100:7473
]
2020-02-21 11:14:05.539-0500 INFO Discovering cluster with initial members: [21.0.0.104:5000, 21.0.0.100:5000]
2020-02-21 11:14:05.544-0500 INFO Attempting to connect to the other cluster members before continuing...
The tricky part is that, even locally, Neo4j is opening its connections on the relevant ports.
Also, I am not sure why it's not able to form a connection with the other Neo4j node.
I changed the relevant files on my server to start it in standalone mode, but then I am not able to add other nodes to the cluster due to an exception.
I have tried changing "causal_clustering.expected_core_cluster_size=2" in neo4j.conf to a value of three.
But after the restart I see it set back to 2 again.
I have tried unbinding graphdb and restarting the services; still no go.
My question is: how can I get one of the nodes to skip checking the other nodes during discovery and get past that phase?
Secondly, if it's not able to reach the other node, why would it not start the service locally so that the relevant ports at least respond locally?
Starting Nping 0.7.60 ( https://nmap.org/nping ) at 2020-02-21 10:46 EST
SENT (0.0014s) Starting TCP Handshake > 21.0.0.100:7474
RCVD (0.0014s) Possible TCP RST received from 21.0.0.100:7474 --> Connection refused
SENT (1.0026s) Starting TCP Handshake > 21.0.0.100:7687
RCVD (1.0026s) Possible TCP RST received from 21.0.0.100:7687 --> Connection refused
SENT (2.0036s) Starting TCP Handshake > 21.0.0.100:7474
RCVD (2.0037s) Possible TCP RST received from 21.0.0.100:7474 --> Connection refused
SENT (3.0048s) Starting TCP Handshake > 21.0.0.100:7687
RCVD (3.0048s) Possible TCP RST received from 21.0.0.100:7687 --> Connection refused
SENT (4.0059s) Starting TCP Handshake > 21.0.0.100:7474
RCVD (4.0060s) Possible TCP RST received from 21.0.0.100:7474 --> Connection refused
SENT (5.0071s) Starting TCP Handshake > 21.0.0.100:7687
RCVD (5.0071s) Possible TCP RST received from 21.0.0.100:7687 --> Connection refused
Any help would be appreciated to recover my neo4j from this phase without a rebuild.
Thanks,
Akshay
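For anyone repeating the nping test above without nping installed, here is a minimal Python sketch of the same TCP handshake check. The function names `probe` and `probe_all` are made up for illustration, and the demo only talks to a throwaway listener on 127.0.0.1, so it does not reflect the state of the poster's nodes.

```python
import socket


def probe(host, port, timeout=1.0):
    """Attempt a TCP handshake and report 'open', 'refused', or 'unreachable'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a TCP RST: nothing is listening on that port.
        return "refused"
    except OSError:
        # Covers timeouts, missing routes, and name resolution failures.
        return "unreachable"


def probe_all(host, ports, timeout=1.0):
    """Probe several ports on one host and return {port: status}."""
    return {port: probe(host, port, timeout) for port in ports}


if __name__ == "__main__":
    # Demo against a temporary local listener so the sketch runs anywhere.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        listener.listen(1)
        demo_port = listener.getsockname()[1]
        print(probe_all("127.0.0.1", [demo_port]))
```

Against the node in this thread, the call would be something like `probe_all("21.0.0.100", [5000, 6000, 7000, 7474, 7687])`, using the ports from the "My connection info" log lines.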
02-22-2020 02:59 AM
Can you please try setting causal_clustering.minimum_core_cluster_size_at_formation=2?
See https://neo4j.com/docs/operations-manual/current/reference/configuration-settings/#config_causal_clu...
02-25-2020 08:02 AM
Hi Stefan,
I tried that option, but it looks like the neo4j.conf file is getting overwritten with the old content at startup.
I am checking internally to see why that is happening.
But I have noted your suggestion.
Thanks,
Akshay
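One way to confirm whether an edit to neo4j.conf survived a restart is to read the setting back from the file before and after starting the service. The sketch below is only an illustration: `setting_values` is a hypothetical helper, it assumes plain `key=value` lines with `#` comments, and the sample text is invented rather than taken from the poster's file.

```python
def setting_values(conf_text, key):
    """Return every value assigned to `key`, in file order, skipping comments."""
    values = []
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out setting
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            values.append(value.strip())
    return values


if __name__ == "__main__":
    # Illustrative snippet only, not the poster's real configuration.
    sample = (
        "# cluster settings\n"
        "causal_clustering.expected_core_cluster_size=2\n"
        "#causal_clustering.expected_core_cluster_size=3\n"
    )
    print(setting_values(sample, "causal_clustering.expected_core_cluster_size"))
```

Running it on the real file contents (for example, the text of /var/lib/neo4j/conf/neo4j.conf, the path shown later in this thread) before and after a restart would show whether the value is being reset by something outside Neo4j.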
02-25-2020 01:10 PM
If you're on Docker, you need to use --env NEO4J_<setting>=<value>
instead; see https://neo4j.com/docs/operations-manual/current/docker/configuration/#docker-environment-variables
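As a rough sketch of what `NEO4J_<setting>` looks like in practice: this assumes the naming convention described in the linked Docker documentation, where the setting name gets the `NEO4J_` prefix, underscores are doubled, and periods become single underscores. The helper name `docker_env_name` is made up for illustration.

```python
def docker_env_name(setting):
    """Translate a neo4j.conf setting name into a Docker environment variable name.

    Assumed convention: prefix with NEO4J_, double every underscore,
    then replace each period with a single underscore.
    """
    return "NEO4J_" + setting.replace("_", "__").replace(".", "_")


if __name__ == "__main__":
    name = docker_env_name("causal_clustering.minimum_core_cluster_size_at_formation")
    # Builds the --env argument for the setting suggested earlier in the thread.
    print("--env " + name + "=2")
```

The order of the two replacements matters: doubling the underscores first keeps the underscores that come from periods single.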
02-26-2020 09:59 AM
Hi Stefan, I am not in a Docker environment.
The parameter that you provided is not supported on the Neo4j version that I am running.
Is there anything else I can try to get the services up?
Thanks,
Akshay
02-26-2020 10:33 AM
So which version are you on?
03-03-2020 12:48 PM
03-03-2020 01:19 PM
I see these lines in the file you have attached.
\f0\fs24 \cf0
POD-1-vManage1:/var/lib/neo4j/conf#
POD-1-vManage1:/var/lib/neo4j/conf# cat neo4j.conf \
This file seems to be associated with some type of containerization.
Are you using some type of VMs to run the cluster?
Can you please provide a bit more detail about the environment where you are trying to run the cluster?
03-03-2020 01:28 PM
It's running as a service on an application.
Each node is a different VM.
03-03-2020 02:06 PM
Can you please check these things?
Thanks
Ravi
03-03-2020 02:12 PM
The extra characters were due to me copying the file incorrectly. The file is clean.
There is no firewall in between the nodes. This was working fine before the cluster went down due to an outage.
Telnet to port 5000, even from the local node, gives me connection refused.
I see the lines below in the logs all the time. It never gets past this phase:
2020-02-26 15:14:29.257-0500 INFO My connection info: [
Discovery: listen=21.0.0.100:5000, advertised=21.0.0.100:5000,
Transaction: listen=21.0.0.100:6000, advertised=21.0.0.100:6000,
Raft: listen=21.0.0.100:7000, advertised=21.0.0.100:7000,
Client Connector Addresses: bolt://21.0.0.100:7687,http://21.0.0.100:7474,https://21.0.0.100:7473
]
2020-02-26 15:14:29.259-0500 INFO Discovering cluster with initial members: [21.0.0.104:5000, 21.0.0.100:5000]
2020-02-26 15:14:29.259-0500 INFO Attempting to connect to the other cluster members before continuing...
03-03-2020 02:58 PM
Can you run this command and see if the ports are open?
sudo firewall-cmd --list-ports
If you don't see the ports, then you need to add them to the firewall.
03-03-2020 07:23 PM
Hi,
The firewall-cmd command did not work.
But I tried the command below and can see port 5000 listening.
POD-1-vManage1:/home/admin# netstat -an | grep 5000 | grep -i listen
tcp6 0 0 21.0.0.100:5000 :::* LISTEN
03-03-2020 07:25 PM
These are all the ports it's listening on for the cluster communication link:
POD-1-vManage1:/home/admin# netstat -an | grep -i listen | grep 21.0.0.100
tcp6 0 0 21.0.0.100:5000 :::* LISTEN
tcp6 0 0 21.0.0.100:9200 :::* LISTEN
tcp6 0 0 21.0.0.100:9300 :::* LISTEN
tcp6 0 0 21.0.0.100:7000 :::* LISTEN
03-04-2020 05:16 AM
You are missing ports 6000, 7474 and 7687 here.
Also, what about the other servers? Are they also listening on those ports?
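The comparison above can be repeated on each server with a short script. This is a minimal sketch that uses the netstat lines posted earlier as sample input; `listening_ports` and `missing_ports` are hypothetical helper names, and the expected port list is taken from the "My connection info" log lines (HTTPS port 7473 is left out to match the ports named in this reply).

```python
# netstat lines copied from the earlier post in this thread.
NETSTAT_OUTPUT = """\
tcp6 0 0 21.0.0.100:5000 :::* LISTEN
tcp6 0 0 21.0.0.100:9200 :::* LISTEN
tcp6 0 0 21.0.0.100:9300 :::* LISTEN
tcp6 0 0 21.0.0.100:7000 :::* LISTEN
"""

# Discovery, transaction, Raft, HTTP, and Bolt ports from the startup log.
EXPECTED_PORTS = [5000, 6000, 7000, 7474, 7687]


def listening_ports(netstat_text, address):
    """Collect the ports that `address` is listening on in `netstat -an` output."""
    ports = set()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected columns: proto, recv-q, send-q, local address, foreign address, state
        if len(fields) < 6 or fields[5].upper() != "LISTEN":
            continue
        host, _, port = fields[3].rpartition(":")
        if host == address and port.isdigit():
            ports.add(int(port))
    return ports


def missing_ports(netstat_text, address, expected):
    """Return the expected ports that do not appear as listening, in order."""
    found = listening_ports(netstat_text, address)
    return [port for port in expected if port not in found]


if __name__ == "__main__":
    print(missing_ports(NETSTAT_OUTPUT, "21.0.0.100", EXPECTED_PORTS))
    # prints [6000, 7474, 7687]
```

Pasting the netstat output from the other servers into `NETSTAT_OUTPUT`, with their own addresses, would answer the second question in the same way.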
03-04-2020 06:32 AM
That's correct.
But when I start the service, it doesn't move any further than the point where it initializes service discovery.
So I am not sure why the service won't start by itself if the other members are not available.
Are there any logs that we can enable to find out why the service is not opening the relevant ports at startup?
03-04-2020 07:11 AM
Also, I can successfully nping the cluster IP on ports 5000 and 7000, but not on any other port.
05-15-2021 04:28 AM
Hi Akshay
Please check the memory configuration in your environment variables. The memory configuration should match the version of Neo4j you are trying to support. If you have upgraded from an earlier version of Neo4j, then you must follow the Neo4j knowledge base (KB) articles for additional help.
Yours faithfully
Sameer S Gijare