Neo4j

Trane9991 · 06-26-2019

Hello.
I’m facing a pretty critical issue, and I’m fairly sure it is a Neo4j bug.
There is already a GitHub issue for it, https://github.com/neo4j/neo4j/issues/12221, and I have added more findings from my setup: https://github.com/neo4j/neo4j/issues/12221#issuecomment-505883604

Long story short:
I’m trying to set up a Neo4j Causal Cluster on AWS ECS with awsvpc networking. awsvpc networking provides a separate network interface (with its own IP from the VPC) attached to the Docker container.

Here are the problems I’m facing:

Setting causal_clustering.discovery_listen_address doesn’t work properly. When it is set to 0.0.0.0:5000, the connection info in the logs shows the expected values:
```
Discovery:   listen=0.0.0.0:5000, advertised=10.10.1.128:5000
```
but in practice, running lsof -i tcp | grep neo | grep LISTEN shows that, while the other ports listen properly on *:7000 or *:6000, the Discovery port is still bound to a specific IP (as stated in the issue), for example:
```
java      6 neo4j  233u  IPv4 427591      0t0  TCP 169.254.172.28:5000 (LISTEN)
```
When the Discovery port is not bound to *:5000, it looks like it gets bound to a randomly chosen network interface. In my case the ECS containers have two interfaces (excluding loopback):
```
Interface ecs-eth0: 
     address: 169.254.172.x
```
and
```
Interface eth0: 
      address: 10.10.x.x
```
(this data comes from the debug.log file), where ecs-eth0 is some internal ECS interface I don’t care about and eth0 is the one that should handle communication. The problem is that when Neo4j binds the listen port to eth0, everything works fine, but when it binds to ecs-eth0, port 5000 is unreachable and the node can’t join the cluster. This happens completely at RANDOM; see more logs in the GitHub issue comment https://github.com/neo4j/neo4j/issues/12221#issuecomment-505940194
Setting causal_clustering.discovery_listen_address=10.10.x.x:5000 doesn’t work either. Even when I can see the correct bind info from lsof:
```
TCP ip-10-10-1-72.ec2.internal:5000 (LISTEN)
```
and in the connection info log
```
Discovery:   listen=10.10.1.72:5000, advertised=10.10.1.72:5000
```
it still doesn’t work (the nodes simply don’t discover each other, even though I can reach port 5000 with nc), and it is a complete mystery to me why.
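For anyone who wants to reproduce the nc check from every node at once, here is a rough Python sketch of the same TCP reachability test. The helper names can_connect and check_discovery_peers are hypothetical. Note that a successful connect only shows that the port accepts TCP connections; it says nothing about whether the discovery protocol itself succeeds, which matches what I am seeing.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_discovery_peers(peers, port=5000, timeout=3.0):
    """Map each peer host to whether its discovery port accepts connections."""
    return {host: can_connect(host, port, timeout) for host in peers}


# Example usage (addresses are the two node IPs from the logs above):
# check_discovery_peers(["10.10.1.128", "10.10.1.72"])
```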

Thanks!


Causal cluster Discovery port listen 0.0.0.0:5000 doesn't work