Neo4j

danjou_philippe · ‎10-16-2020

Hi, I have tried this 3 times now from scratch, following the documentation. I have 3 nodes in a LAN, with no firewalls.

Relevant config lines:

dbms.mode=CORE

# Expected number of Core servers in the cluster at formation
causal_clustering.minimum_core_cluster_size_at_formation=3

# Minimum expected number of Core servers in the cluster at runtime.
causal_clustering.minimum_core_cluster_size_at_runtime=3

# A comma-separated list of the address and port for which to reach all other members of the cluster. It must be>
# host:port format. For each machine in the cluster, the address will usually be the public ip address of that m>
# The port will be the value used in the setting "causal_clustering.discovery_listen_address".
causal_clustering.initial_discovery_members=10.4.0.100:5000,10.4.0.101:5000,10.4.0.102:5000

# Host and port to bind the cluster member discovery management communication.
# This is the setting to add to the collection of address in causal_clustering.initial_core_cluster_members.
# Use 0.0.0.0 to bind to any network interface on the machine. If you want to only use a specific interface
# (such as a private ip address on AWS, for example) then use that ip address instead.
# If you don't know what value to use here, use this machines ip address.
causal_clustering.discovery_listen_address=10.4.0.100:5000

# Network interface and port for the transaction shipping server to listen on.
# Please note that it is also possible to run the backup client against this port so always limit access to it v>
# firewall and configure an ssl policy. If you want to allow for messages to be read from
# any network on this machine, us 0.0.0.0. If you want to constrain communication to a specific network address
# (such as a private ip on AWS, for example) then use that ip address instead.
# If you don't know what value to use here, use this machines ip address.
causal_clustering.transaction_listen_address=10.4.0.100:6000

# Network interface and port for the RAFT server to listen on. If you want to allow for messages to be read from
# any network on this machine, us 0.0.0.0. If you want to constrain communication to a specific network address
# (such as a private ip on AWS, for example) then use that ip address instead.
# If you don't know what value to use here, use this machines ip address.
causal_clustering.raft_listen_address=10.4.0.100:7000
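
As a quick sanity check on settings like the ones above, here is a minimal Python sketch that parses neo4j.conf-style text and flags two possible inconsistencies: fewer discovery members than the expected formation size, and a discovery listen address that is missing from the member list. The function names `parse_conf` and `check_core_conf` are hypothetical, and the second check is only a heuristic (it assumes the listen address is also the address other members use to reach this node).

```python
def parse_conf(text):
    """Parse neo4j.conf-style text into a dict, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def check_core_conf(settings):
    """Return a list of human-readable problems found in one member's settings."""
    problems = []
    raw_members = settings.get("causal_clustering.initial_discovery_members", "")
    members = [m.strip() for m in raw_members.split(",") if m.strip()]
    listen = settings.get("causal_clustering.discovery_listen_address", "")
    expected = int(settings.get(
        "causal_clustering.minimum_core_cluster_size_at_formation", "0"))

    # Check 1: enough discovery members are listed to form the cluster.
    if len(members) < expected:
        problems.append(
            f"only {len(members)} discovery members listed, {expected} expected")

    # Check 2 (heuristic): this node's own discovery address is in the list.
    # Skipped when the node binds to every interface with 0.0.0.0.
    if listen and listen.split(":")[0] != "0.0.0.0" and listen not in members:
        problems.append(f"{listen} is not in initial_discovery_members")

    return problems


SAMPLE = """
dbms.mode=CORE
causal_clustering.minimum_core_cluster_size_at_formation=3
causal_clustering.initial_discovery_members=10.4.0.100:5000,10.4.0.101:5000,10.4.0.102:5000
# Host and port to bind the cluster member discovery management communication.
causal_clustering.discovery_listen_address=10.4.0.100:5000
"""

if __name__ == "__main__":
    print(check_core_conf(parse_conf(SAMPLE)))
```

For the settings posted here both checks pass, so this sketch reports no problems with these particular lines.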

I always end up with the following in the logs:
Node 1

2020-10-16 14:57:26.258+0000 INFO [c.n.c.c.c.RaftMachine] [neo4j] Election timeout triggered
2020-10-16 14:57:26.258+0000 INFO [c.n.c.c.c.RaftMachine] [neo4j] Pre-election started with: PreVote.Request from MemberId{c98a8c4f} {term=1, candidate=MemberId{c98a8c4f}, lastAppended=1, lastLogTerm=1} and members: [MemberId{ed3a7be7}, MemberId{c98a8c4f}, MemberId{9195c6a9}]
2020-10-16 14:57:31.074+0000 INFO [c.n.c.c.c.RaftMachine] [neo4j] Election timeout triggered
2020-10-16 14:57:31.074+0000 INFO [c.n.c.c.c.RaftMachine] [neo4j] Pre-election started with: PreVote.Request from MemberId{c98a8c4f} {term=1, candidate=MemberId{c98a8c4f}, lastAppended=1, lastLogTerm=1} and members: [MemberId{ed3a7be7}, MemberId{c98a8c4f}, MemberId{9195c6a9}]
2020-10-16 14:57:31.939+0000 INFO [c.n.c.c.s.CommandApplicationProcess] [neo4j] Pausing due to snapshot request (count = 1)
2020-10-16 14:57:31.939+0000 INFO [c.n.c.c.s.CommandApplicationProcess] [neo4j] Resuming after snapshot request (count = 0)
2020-10-16 14:57:31.942+0000 INFO [c.n.c.c.s.CommandApplicationProcess] [neo4j] Pausing due to snapshot request (count = 1)
2020-10-16 14:57:31.942+0000 INFO [c.n.c.c.s.CommandApplicationProcess] [neo4j] Resuming after snapshot request (count = 0)
2020-10-16 14:57:36.177+0000 INFO [c.n.c.c.c.RaftMachine] [neo4j] Election timeout triggered
2020-10-16 14:57:36.177+0000 INFO [c.n.c.c.c.RaftMachine] [neo4j] Pre-election started with: PreVote.Request from MemberId{c98a8c4f} {term=1, candidate=MemberId{c98a8c4f}, lastAppended=1, lastLogTerm=1} and members: [MemberId{ed3a7be7}, MemberId{c98a8c4f}, MemberId{9195c6a9}]

Node 2

2020-10-16 14:57:08.579+0000 INFO [c.n.c.c.s.CoreSnapshotService] [neo4j] Waiting for another raft group member to publish a core state snapshot
2020-10-16 14:57:18.583+0000 INFO [c.n.c.c.s.CoreSnapshotService] [neo4j] Waiting for another raft group member to publish a core state snapshot
2020-10-16 14:57:28.586+0000 INFO [c.n.c.c.s.CoreSnapshotService] [neo4j] Waiting for another raft group member to publish a core state snapshot
2020-10-16 14:57:31.941+0000 INFO [c.n.c.c.s.s.SnapshotDownloader] [neo4j] Downloading snapshot from core server at 10.4.0.100:6000
2020-10-16 14:57:31.944+0000 ERROR [c.n.c.c.s.s.StoreDownloader] [neo4j] Store copy failed due to store ID mismatch
2020-10-16 14:57:38.589+0000 INFO [c.n.c.c.s.CoreSnapshotService] [neo4j] Waiting for another raft group member to publish a core state snapshot

Node 3

2020-10-16 14:57:01.938+0000 ERROR [c.n.c.c.s.s.StoreDownloader] [neo4j] Store copy failed due to store ID mismatch
2020-10-16 14:57:08.736+0000 INFO [c.n.c.c.s.CoreSnapshotService] [neo4j] Waiting for another raft group member to publish a core state snapshot
2020-10-16 14:57:18.739+0000 INFO [c.n.c.c.s.CoreSnapshotService] [neo4j] Waiting for another raft group member to publish a core state snapshot
2020-10-16 14:57:28.742+0000 INFO [c.n.c.c.s.CoreSnapshotService] [neo4j] Waiting for another raft group member to publish a core state snapshot
2020-10-16 14:57:31.938+0000 INFO [c.n.c.c.s.s.SnapshotDownloader] [neo4j] Downloading snapshot from core server at 10.4.0.100:6000
2020-10-16 14:57:31.940+0000 ERROR [c.n.c.c.s.s.StoreDownloader] [neo4j] Store copy failed due to store ID mismatch

What am I doing wrong here? This should be easy and straightforward.
I also tried running neo4j-admin unbind, but it didn't change anything. It shouldn't be required on a fresh install anyway, should it?

Thanks for the help.

david_allen · ‎10-18-2020

The store ID mismatch is the problem.

Stop each server, run neo4j-admin unbind on each server, and then restart and it should be fixed.

The issue is that the cluster members think they have different databases, and they won't join and communicate if they have a "split brain".
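
To see which members are reporting this condition, a small Python sketch like the one below can scan log text for the error line quoted earlier in the thread. The function name `find_store_id_mismatches` is hypothetical, and the regular expression assumes the log line layout shown in the posted excerpts (timestamp, level, logger, optional database name, message).

```python
import re

MISMATCH = "Store copy failed due to store ID mismatch"

# Layout assumed from the excerpts in this thread:
# <date> <time> <LEVEL> [<logger>] [<database>] <message>
LINE_RE = re.compile(
    r"^(\S+ \S+) (\w+)\s+\[([^\]]+)\] (?:\[([^\]]+)\] )?(.*)$")


def find_store_id_mismatches(log_text):
    """Return (timestamp, database) pairs for each store ID mismatch error."""
    hits = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        timestamp, level, _logger, database, message = match.groups()
        if level == "ERROR" and MISMATCH in message:
            hits.append((timestamp, database))
    return hits


SAMPLE_LOG = """\
2020-10-16 14:57:31.941+0000 INFO [c.n.c.c.s.s.SnapshotDownloader] [neo4j] Downloading snapshot from core server at 10.4.0.100:6000
2020-10-16 14:57:31.944+0000 ERROR [c.n.c.c.s.s.StoreDownloader] [neo4j] Store copy failed due to store ID mismatch
2020-10-16 14:57:38.589+0000 INFO [c.n.c.c.s.CoreSnapshotService] [neo4j] Waiting for another raft group member to publish a core state snapshot
"""

if __name__ == "__main__":
    for timestamp, database in find_store_id_mismatches(SAMPLE_LOG):
        print(timestamp, database)
```

Running this against the log of each member would show which ones fail the store copy and for which database. It only detects the symptom and does not change any cluster state.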

danjou_philippe · ‎10-18-2020

Yes, I did this multiple times already. It doesn't help (try it yourself).

But I found another post via Google where someone said to delete everything in the database directories, so I did, and now it seems to work.

Update your documentation! It's wrong!

harvey_nguyen · ‎08-17-2021

It seems you are right. I have the same issue, and neo4j-admin unbind doesn't work.

gigauser · ‎02-08-2022

Thank God, you saved my night!

gigauser · ‎02-08-2022

Ah, but I couldn't access the "neo4j" database. So I deleted the data folder and started again, and it succeeded.

bishnu12 · ‎02-15-2022

Not sure about others, but @david.allen,

"Stop each server, run neo4j-admin unbind on each server, and then restart and it should be fixed."

this worked for me.

Thanks

Unable to setup 3 node cluster