11-04-2019 01:44 PM
We are running the enterprise edition on an AWS Ubuntu EC2 server; it is a single instance, not a cluster, and has been running for months with no problems. We have recently introduced a system health check which fires every 30 seconds and, together with our normal operations, appears to be fine, except that occasionally, at different/random times, 2 of the health checks fail in succession, returning neo4j.exceptions.ServiceUnavailable: Cannot acquire connection to Address(host='[our db]', port=7687).
The next minute, everything is fine again.
I've checked the logs and there are no corresponding errors for the same time, the server remains up, and normal service is immediately resumed for the next check.
We are making the request from another AWS EC2 instance running Django/python with the neo4j driver.
11-06-2019 05:27 AM
How does your health check work?
11-06-2019 06:36 AM
@david.allen it's a health check on a load balancer that calls one of our URL endpoints which makes a call to our graph to return a piece of data. It runs every 30 seconds and is absolutely fine for most of the time, as are all the other calls we make to the graph the same way. But occasionally we are getting this error and it concerns me in case it is something that would hamper our scalability.
11-06-2019 06:39 AM
I'm not following. The connection error is on port 7687, which is usually Bolt, but the way you're describing it, it's hitting some kind of URL, which implies HTTP/HTTPS on port 7474 or 7473.
Also, where is the health check deployed? Is it possible that high latency and/or network issues are causing recoverable errors in making a one-time connection to Neo4j?
For general health checks, I'd recommend these endpoints and not a bolt/7687 based approach. Are you using any of these?
https://neo4j.com/docs/operations-manual/current/monitoring/causal-cluster/http-endpoints/
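For example, something like this (a minimal sketch using the requests library; the /db/manage/server/core/available path on that page is the cluster endpoint, so for a single instance like yours a 200 from the HTTP discovery root on port 7474 is a reasonable stand-in, and the host name below is a placeholder):

import requests

NEO4J_HTTP = "http://your-db-host:7474"  # placeholder host; HTTP port, not Bolt

def neo4j_is_available():
    # For a single instance, a 200 from the discovery root is a simple
    # liveness signal; cluster members expose the documented
    # /db/manage/server/core/available endpoint instead.
    try:
        return requests.get(NEO4J_HTTP + "/", timeout=5).status_code == 200
    except requests.RequestException:
        return False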
11-06-2019 07:28 AM
Thanks for answering, David. The health check runs on our AWS infrastructure: the ELB for our hosted application server pings a URL on the application server, which uses the same connection details as the rest of our application to make a Bolt call to the graph, as all our application servers do for Neo4j calls. My concern is why, very sporadically, 2 of these calls fail together, maybe only once every 2 days, which means the check has executed successfully over 5,000 times in between. I'm sure there are better ways to run the health check directly against the Neo4j instance, but I'm worried that this error may be caused by open connections, load, or something else that would cause issues as we scale. We are not running a cluster.
11-06-2019 07:45 AM
I'm not really sure why this would be happening, which is why I'm seeking to gather more information about how the health check operates and what your other options for a health check are. The thing is, if your health check is failing while the database is still functional and available, then this argues that the health check is not doing its job properly, which is why I was seeking to understand it better.
The next thing I would recommend is to look at the contents of debug.log, isolate in time when this is happening, and see what the machine is chattering to itself in the logs when you encounter it.
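If it helps, isolating a time window in debug.log is easy to script. A rough sketch, assuming the default debug.log line format where each entry starts with a timestamp like 2019-11-06 07:28:00.000+0000 (continuation lines such as stack traces carry no timestamp and stay with their entry):

from datetime import datetime

def log_lines_between(path, start, end):
    # Keep only lines whose leading timestamp falls inside [start, end].
    keep = False
    with open(path) as f:
        for line in f:
            try:
                ts = datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")
                keep = start <= ts <= end
            except ValueError:
                pass  # no timestamp; inherit the previous line's decision
            if keep:
                yield line

# Example: dump everything logged in a ten-minute window around a failure.
for line in log_lines_between("debug.log",
                              datetime(2019, 11, 6, 7, 25),
                              datetime(2019, 11, 6, 7, 35)):
    print(line, end="")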
11-06-2019 08:49 AM
@david.allen thanks for this, I really appreciate you taking the time to try to help. Just to be clear, the health check is an internal application-server-to-graph query, not a Neo4j function. The point is that it's a query from our application server that runs every 30 seconds and fails only around once a day. When I look in the logs near the time this has happened I see nothing within minutes of when the call was made, and I know that it fired successfully a number of times within that period.
11-07-2019 01:57 AM
@david.allen would you happen to know why I would get that error sometimes? Literally, I know that the call worked twice in the moment before and twice in the moment after.
11-07-2019 04:49 AM
As I said, I'm not entirely sure why this is happening. The details of how you've implemented your health check matter, so it's hard to go further without investigating that. If you're an enterprise customer I would be happy to connect you with a field engineer, or you can submit a support ticket to dive deeper into what's going on.
11-07-2019 05:16 AM
We are using the enterprise edition through the startup programme.
11-07-2019 06:33 AM
This is the most relevant bit you've provided so far. We have others who might be able to help on this forum, but we cannot proceed without more detail.
11-07-2019 06:40 AM
@david.allen I think I mentioned that the health check is our own health check. It is part of our application load balancer, which calls a URL endpoint on our application server (a Django server running on an AWS EC2 instance); that endpoint calls a method within our application server which makes a call over Bolt to the Neo4j instance on a separate server, the same as all our application server methods do. It is not a Neo4j health check.
I know that the application server is able to make the connections because they happen twice per minute, successfully, most of the time.
Also, I can use our application (a website accessible via a browser) which calls endpoints on our application server that in turn use the same connection settings to make a Bolt connection to the graph to return data, and this appears to be working as well.
I only mention "health check" because that is what AWS calls the load balancer's request to the URL entered: it checks every 30 seconds for a valid response from the application server, and if it doesn't get one it spins up another instance of our application server. The method called connects in the same way as all our application-server-to-graph-server connections do, using Bolt.
11-07-2019 06:49 AM
which calls a method within our application server which makes a call over Bolt to the Neo4j instance on a separate server
Makes which call via which bolt connection? Have you verified how your code is making that connection, and that your connection pooling / driver settings are correct?
I know that the application server is able to make the connections because they happen twice per minute, successfully, most of the time
Most of the time -- not all of the time? What happens the other times? When the application server makes these connections, presumably they're doing different cypher queries, possibly via a different connection pool?
I only mention "health check" because that is what AWS calls the load balancer's request to the URL entered: it checks every 30 seconds for a valid response from the application server, and if it doesn't get one it spins up another instance of our application server
Right, got this. This implies that it's the app server failure which causes the load balancer health check failure and gets your app server spun up again by ELB, I suppose. Presumably the root cause of that is the Cypher connection error.
Without the app code (which I understand you might be reluctant to share on a public forum) you might not get to the bottom of this. My best guess is that something about the way your app server is written is not using the driver appropriately, but I can't tell without diving into the source.
11-07-2019 06:59 AM
We are using the neo4j Python library; this is how we get a connection and execute a Cypher query:
from neo4j.v1 import GraphDatabase

neo = GraphDatabase.driver(NEO_URI, auth=(NEO_UID, NEO_PWD))
with neo.session() as session:
    res = session.run(
        "MATCH (x)-[co:CLIENT_OF]->(p:Persona {bjid:{pid}}) "
        "WHERE x.bjid={uid} "
        "RETURN co.avatar_base64 AS avatar",
        {
            'uid': uid,
            'pid': persona_bjid
        })
    for match in res:
        return Response({"status": "OK", "img": match["avatar"] if match["avatar"] else ""})
The problem is not related to the ELB spinning up a new instance of the app server; the health check doesn't fail enough times in a row to make that happen, which is how I know that most of the time it works. When it fails, it fails twice, which means two attempts within 1 minute (30 seconds apart), but this only happens at most once a day and sometimes it's fine for days.
I don't believe we've got any different connection pooling or method for any of the other cypher queries that get executed successfully.
My concern was that we may be hitting some limit: open connections, connections not closing properly, or long-running queries. But I see nothing else in the logs to suggest this and I'm out of ideas.
I really appreciate your help but I'm pulling my hair out.
11-07-2019 07:13 AM
Please have a look at the Python driver docs and pay close attention to the connection pooling descriptions:
https://neo4j.com/docs/api/python-driver/current/driver.html
If every time you do a check to your Neo4j database you are creating a new driver, this has the effect of creating a whole pool of connections for each check, and you're likely spamming the database with connections you're not using. At some point the server may be getting overloaded with unused connections. Best practice is to reuse driver objects and not recreate them every time.
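As a rough sketch of what I mean (the names, placeholder connection details, and trivial query below are illustrative, not your code):

from neo4j.v1 import GraphDatabase

NEO_URI = "bolt://your-db-host:7687"   # placeholder connection details
NEO_UID, NEO_PWD = "neo4j", "secret"

# Create the driver once, at module import / application startup.
# The driver owns the connection pool; recreating it per request
# builds a brand-new pool every 30 seconds.
neo = GraphDatabase.driver(NEO_URI, auth=(NEO_UID, NEO_PWD))

def health_check_query():
    # Each request borrows a pooled connection via a short-lived session
    # and returns it to the pool when the 'with' block exits.
    with neo.session() as session:
        return session.run("RETURN 1 AS ok").single()["ok"]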
11-07-2019 07:16 AM
Thank you @david.allen that sounds like the sort of thing I need. Is there any way I can see the number of connections/drivers?
11-07-2019 07:16 AM
Please read the docs. It's specified behind that link; there's a default and it's configurable.
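For reference, the pool settings are passed when constructing the driver; something like this (parameter names as documented for recent driver versions, so check the docs for the version you have installed):

neo = GraphDatabase.driver(
    NEO_URI,
    auth=(NEO_UID, NEO_PWD),
    max_connection_lifetime=3600,       # seconds a pooled connection may be reused
    max_connection_pool_size=50,        # cap on concurrently open connections
    connection_acquisition_timeout=60,  # seconds to wait for a free connection
)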