01-04-2021 08:55 AM
We're using Neo4j as a read model in an event-sourced system. Its job is basically to process a stream of events and build a database for query purposes.
The events are translated into Cypher statements and sent to neo4j using the Bolt driver, in batches of 1000-5000 statements per transaction (sequentially, single-threaded).
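For illustration, a single event might be translated into a parameterized statement along these lines (a hypothetical sketch; our actual event types, labels, and properties differ):

// hypothetical translation of one domain event (e.g. "order placed");
// the labels and parameters here are made up for illustration
MERGE (c:Customer {id: $customerId})
MERGE (o:Order {id: $orderId})
SET o.placedAt = $timestamp
MERGE (c)-[:PLACED]->(o)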
This has been working great up to neo4j 3.5.16, with neo4j processing +/- 1000 events per second when rebuilding from scratch.
Now, we just updated to neo4j 4.2.1, and I'm seeing the following strange behaviour when doing a full rebuild (i.e. starting from an empty DB and processing all events, approx. 3 million in total).
It basically keeps going at this rate from then on.
Now the strange part:
It looks like neo4j isn't able to keep up somehow, but after waiting a while, everything is back to normal.
What could be causing this?
Things I tried:
The logs do not contain any related information.
My environment:
01-09-2021 03:05 PM
Anything in your I/O behavior?
Can you turn that into a reproducible test?
Did you test it on another system? E.g. linux / windows / older OSX ?
Anything with transaction handling that changed? Or closing of sessions/transactions?
Do you use UNWIND $array for your batches, or are you sending the 1k-5k statements individually?
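The UNWIND pattern sends one parameterized statement per batch instead of thousands of individual ones. A minimal sketch (the label and properties are placeholders):

// $batch is a parameter holding a list of maps,
// e.g. [{id: 1, name: "a"}, {id: 2, name: "b"}, ...]
UNWIND $batch AS row
MERGE (n:Node {id: row.id})
SET n.name = row.name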
01-11-2021 02:22 AM
I deployed this on our debian servers last week (AWS), and the behaviour is exactly the same. So it doesn't seem related to anything environment-specific.
I don't use UNWIND to combine statements, as each event stands on its own; there is no guarantee that consecutive events are similar. UNWIND may be used within the translation of a particular type of event, but that won't involve a whole lot of data. So yes, the 1k-5k events are sent individually.
One thing I tried, however, is playing with the cypher.min_replan_interval property.
The default of 10s seemed suspicious, as that's exactly how long I need to wait before things get back to normal.
Increasing that setting also seemed to increase the time I needed to wait.
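For anyone following along, this is a neo4j.conf setting; I experimented with values like the following (30s is just an example, the default is 10s):

# minimum time between replanning attempts for a cached query (default: 10s)
cypher.min_replan_interval=30s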
One theory I came up with so far:
It's just a wild guess, as I don't know enough about neo4j internals to confirm it.
Could this make sense?
01-11-2021 03:19 AM
Yes that sounds sensible.
You should also wait for the indexes/constraints to be online with db.awaitIndexes()
btw. for keys it's better to use constraints than indexes.
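A minimal sketch of both suggestions (label and property names are placeholders; the IF NOT EXISTS clause assumes a recent 4.x server):

// a uniqueness constraint also creates the backing index on the key
CREATE CONSTRAINT node_id_unique IF NOT EXISTS
ON (n:Node) ASSERT n.id IS UNIQUE;

// block until all indexes (including constraint indexes) are online
CALL db.awaitIndexes();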
One thing you could do is to invalidate the plan cache manually with a procedure call:
call db.clearQueryCaches()
There are also some query-prefix options for forcing or skipping replanning as of 4.2.
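Something along these lines (a sketch; check the Cypher manual for the exact options in your version — the query itself is just a placeholder):

// force a fresh plan for this query, bypassing the cached one
CYPHER replan=force MATCH (n:Node {id: $id}) RETURN n;

// keep using the cached plan, skipping any replanning
CYPHER replan=skip MATCH (n:Node {id: $id}) RETURN n;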
01-11-2021 09:16 AM
Did you try PROFILE or EXPLAIN on the 3.5 and 4.2 queries?
Do the resulting plans look the same?
It would be useful to know whether the query planning changed or not.
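Both are query prefixes: EXPLAIN shows the plan without running the query, while PROFILE executes it and reports rows and db hits per operator. For example, on one of the generated statements (placeholder label and property):

// show the plan without executing the query
EXPLAIN MERGE (n:Node {id: $id});

// execute the query and report rows / db hits per operator
PROFILE MERGE (n:Node {id: $id});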
01-11-2021 10:48 AM
Hi @clem, given the changes between Neo4j 3.5 and 4.2, with various improvements to indexes and other parameters, I highly doubt the explain plans will look alike.
I would follow @michael.hunger's advice.
01-11-2021 10:57 AM
My thought is that it would be good to rule out that the plan got worse between the two releases.
If indeed the plan got worse, then it would be good to investigate.
Not likely, but possible.
01-19-2021 03:22 AM
@michael.hunger I did try out your suggestions:
So it looks like replanning doesn't work as it should...?
01-23-2021 06:05 AM
Probably best then to raise this as an issue with the core database (cypher planner).
Can you copy this thread into a GitHub issue here?
07-29-2021 09:33 AM
@christian3 Did you ever figure out what might be causing this? I've just run into the same problem, though indirectly via the JavaScript @neo4j/graphql package against neo4j 4.3.
I'm currently manually calling db.clearQueryCaches() more regularly than I'd like during a 3-hour batched insert job, but at least it mostly works. Each time I call it, it fixes the issue for a shorter time, so occasionally I have to restart the database container and pray the job doesn't fall over.
I'm only inserting nodes with 10 or so properties, and no edges during this pass, so I'm not doing anything complex, unless the first-party abstraction layers are doing something strange.
For anyone else who's interested, here's the GitHub issue that was raised by Christian but closed by the dev team: performance issue after upgrade 3.5 => 4.2 · Issue #12721 · neo4j/neo4j · GitHub
08-02-2021 07:22 AM
@josh1
We're still using a workaround by pausing the job for a few seconds or calling clearQueryCaches(). As a rebuild is pretty rare, it's not that big of an issue for us.
I haven't gotten around to creating a reproducible scenario yet, as requested in the GitHub issue. We have so many parameters that could lead to this behaviour that it would probably be difficult to isolate and reproduce.
We're hoping that the issue will eventually disappear in a future release...