
Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Stale Queries - Neo4j transactions stalling during import

inersha
Node Clone

Hello there,

I have a script that runs a few hundred thousand individual transactions to insert data into Neo4j. Each of these Cypher queries uses the exact same format -- it's just the parameters that change from one query to the next.

However, at seemingly random intervals, an otherwise straightforward query will "hang" for a minute or two. Sometimes it still gets imported, but if the query reaches 120s it fails.

I kept an eye on debug.log to see what might be happening during this time, and I found this when I cancelled the script as a transaction was struggling to be imported:

2021-11-19 00:37:02.887+0000 INFO  [o.n.c.i.ExecutionEngine] [neo4j/c7a51d5e] Discarded stale query from the query cache after 77412 seconds. Reason: IndexSelectivity(IndexDescriptor(Node(LabelId(3)),List(PropertyKeyId(2)),Set(),org.neo4j.cypher.internal.planner.spi.IndexDescriptor$$$Lambda$3179/0x00000008411d8040@21456f65,org.neo4j.cypher.internal.planner.spi.IndexDescriptor$$$Lambda$3180/0x00000008411d9040@8e006bf,false)) changed from 3.811430018203847E-9 to 3.6296907073761936E-9, which is a divergence of 0.047682709628576375 which is greater than threshold 0.03576053374670699. Query id: 31553318

Is this "stale query" likely to be the reason behind why Neo4j stalls for a few minutes every now and then? If so, is there anything I can do to avoid this happening?

The query is completely standard, and if I were to stop and restart the script the same query will run without a problem. It just seems that Neo4j will "hang" for a few minutes every now and then during this import script, and I don't know why.

At first I thought it may be some sort of garbage collection or "behind the scenes" memory management, but this "stale query" line from debug.log is the only clue I've got as to what may be causing the Cypher queries to stall during import.

Thanks in advance for any advice.

7 REPLIES

inersha
Node Clone

I've modified the script to wait 5 minutes after a failed transaction and to try again. During this time I got these errors in debug.log:

2021-11-19 20:07:18.064+0000 INFO  [o.n.c.i.ExecutionEngine] [neo4j/c7a51d5e] Discarded stale query from the query cache after 176998 seconds. Reason: NodesAllCardinality changed from 1.340537249E9 to 1.378838326E9, which is a divergence of 0.02777778676279702 which is greater than threshold 0.01607034372525389. Query id: 38827179
2021-11-19 20:07:18.108+0000 INFO  [o.n.c.i.ExecutionEngine] [neo4j/c7a51d5e] Discarded stale query from the query cache after 176998 seconds. Reason: NodesAllCardinality changed from 1.340537225E9 to 1.378838326E9, which is a divergence of 0.027777804168753573 which is greater than threshold 0.01607032773165019. Query id: 38827181
2021-11-19 20:07:33.093+0000 INFO  [o.n.c.i.ExecutionEngine] [neo4j/c7a51d5e] Discarded stale query from the query cache after 177009 seconds. Reason: IndexSelectivity(IndexDescriptor(Node(LabelId(3)),List(PropertyKeyId(2)),Set(),org.neo4j.cypher.internal.planner.spi.IndexDescriptor$$$Lambda$3179/0x00000008411d8040@21456f65,org.neo4j.cypher.internal.planner.spi.IndexDescriptor$$$Lambda$3180/0x00000008411d9040@8e006bf,false)) changed from 3.811430018203847E-9 to 3.6296907073761936E-9, which is a divergence of 0.047682709628576375 which is greater than threshold 0.01606935697951539. Query id: 38827185
2021-11-19 20:07:33.409+0000 INFO  [o.n.c.i.ExecutionEngine] [neo4j/c7a51d5e] Discarded stale query from the query cache after 177005 seconds. Reason: NodesAllCardinality changed from 1.340539986E9 to 1.378838326E9, which is a divergence of 0.027775801758501452 which is greater than threshold 0.016069721597536217. Query id: 38827187
2021-11-19 20:07:33.410+0000 INFO  [o.n.c.i.ExecutionEngine] [neo4j/c7a51d5e] Discarded stale query from the query cache after 177005 seconds. Reason: NodesAllCardinality changed from 1.340539977E9 to 1.378838326E9, which is a divergence of 0.02777580828573516 which is greater than threshold 0.016069719376372454. Query id: 38827188
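For what it's worth, the divergence figures in these log lines are consistent with computing |old − new| / max(old, new) and comparing it against the threshold. A small sketch reproducing the logged numbers (the formula here is inferred from the log output above, not taken from Neo4j's source):

```python
def divergence(old: float, new: float) -> float:
    """Relative change between two statistics values.
    Inferred from the figures in the debug.log lines above;
    not the documented Neo4j formula."""
    return abs(old - new) / max(old, new)

# NodesAllCardinality line: 1.340537249E9 -> 1.378838326E9
d = divergence(1.340537249e9, 1.378838326e9)
print(d)  # close to the logged 0.02777778676279702

# IndexSelectivity line: 3.811430018203847E-9 -> 3.6296907073761936E-9
d2 = divergence(3.811430018203847e-9, 3.6296907073761936e-9)
print(d2)  # close to the logged 0.047682709628576375
```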

However, the Cypher query failed on every attempt, and the script got stuck retrying it.

After restarting the script the query was instantly successful. Also, I could run other queries against the database during this time without a problem.
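The wait-and-retry loop described above can be sketched generically like this (a sketch only; run_tx, the delays, and the attempt count are all illustrative placeholders, not anything from Neo4j's API):

```python
import time

def run_with_retry(run_tx, attempts=3, pause_seconds=300):
    """Call run_tx(); on failure, pause and retry up to `attempts` times.
    All names and timings here are illustrative placeholders."""
    for attempt in range(1, attempts + 1):
        try:
            return run_tx()
        except Exception:
            if attempt == attempts:
                raise  # give up after the final attempt
            time.sleep(pause_seconds)  # e.g. the 5-minute wait described above

# Usage (hypothetical driver call):
#   run_with_retry(lambda: session.run(query, params))
```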

Does anyone know why Neo4j might get stuck on a query like this, even when it would otherwise normally run without a problem?

Maybe the query needs to be reformatted so that it is written correctly and cleanly.
Thanking you
Yours faithfully
Sameer S Gijare

ameyasoft
Graph Maven

Just a few steps before the failed transaction, add a step to run this: CALL db.clearQueryCaches(). You can add a few seconds pause before and after running this query.

Thanks for the suggestion. I've been running this query after every 2000 transactions, but it hasn't stopped Neo4j from hanging for about 5 minutes every now and then.

What has helped however is increasing the maximum heap size. Whereas before a transaction would fail after about 120s, it now seems that a transaction will hang for about 360s but then import successfully.

I also seem to be seeing far fewer "discarded stale query" lines in debug.log.

Still, whilst transactions aren't hanging and then failing like they used to, I'm not sure what's happening behind the scenes to cause Neo4j to stall for around 360 seconds every now and then. Ideally I'd like to avoid this if possible. Nothing is showing up in debug.log to indicate what might be going on.

Does anyone know of any other logs or monitoring tools I can look at to try to debug why Neo4j might be stalling for 5 minutes every now and then? I'm pretty sure something is happening in the background, but I'm not sure what it is or what's causing it.

I think you need to set the configuration parameters wisely; otherwise it will run well for some time and you will always have to change them. It is better to estimate the minimum as well as maximum size of the data the DB engine is holding on to. I would say that you can use the memrec command for fine-tuning JVM garbage collector frequency. Also, if the size of the data is large, then you need to tune the transaction timeout period so that it goes well with the memory configuration.

Thanking you
Yours faithfully / sincerely
G Sameer Sudhir/ Geewani

@inersha

What version of Neo4j?
How much total RAM is on the instance, and how have you defined these conf/neo4j.conf parameters:

dbms.memory.heap.initial_size
dbms.memory.heap.max_size
dbms.memory.pagecache.size
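For reference, these are set in conf/neo4j.conf. The values below are placeholders for illustration only, not recommendations; sizing depends on your hardware (the neo4j-admin memrec tool can suggest values for your machine):

```
# conf/neo4j.conf -- example values only; size these to your hardware
dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g
dbms.memory.pagecache.size=16g
```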

The messages of Discarded stale query ..... are somewhat expected. Let's say you run a query now, and it traverses :Movies and :Actors, and there are currently 20 :Movies and 100 :Actors. You then rerun the same query 30 minutes later, when there are 200 :Movies and 3000 :Actors. The planner is going to say: wait, your data has changed significantly enough that the query plan cached 30 minutes ago may no longer be valid. A new query plan is therefore generated, and hence the logging of the message.
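Replanning behaviour of this kind can be tuned via settings such as cypher.statistics_divergence_threshold and cypher.min_replan_interval in neo4j.conf. Treat the values below as illustrative only; the defaults differ by Neo4j version, so check the operations manual for yours:

```
# conf/neo4j.conf -- illustrative values; check your version's defaults
cypher.statistics_divergence_threshold=0.75
cypher.min_replan_interval=10s
```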

You indicated you were using the same format with different parameters. Are you creating a new Cypher string with each new set of parameters (constants in the Cypher), or are you creating one Cypher string that accepts parameters and then passing the parameters separately? If you are creating new Cypher strings each time, then you may be overwhelming the query cache.
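To make that distinction concrete, here is a minimal sketch of the two styles (plain Python string handling only; the session.run call in the comment assumes the official Neo4j Python driver and a hypothetical open session):

```python
rows = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

# Anti-pattern: literals baked into each Cypher string.
# Every row produces a different query text, so each one becomes a
# separate query cache entry and must be planned on its own.
literal_queries = [
    f"CREATE (p:Person {{id: {r['id']}, name: '{r['name']}'}})" for r in rows
]

# Preferred: one fixed query text; only the parameter map changes,
# so a single cached plan can be reused for every row.
parameterized = "CREATE (p:Person {id: $id, name: $name})"
# With the official Python driver (hypothetical session object):
#   for r in rows:
#       session.run(parameterized, id=r["id"], name=r["name"])

print(len(set(literal_queries)))  # one distinct query text per row
```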