Neo4j

pulkit_suhasari · ‎08-12-2020

Hello,
I have a java based microservice which uses Neo4j as the DB. Every request to the service either updates or create a new Node labelled as Device which has ~25 properties.
The DB has close to 4M Device Nodes.
I am using SDN(Spring-Data-Neo4j) save method to save my Device Object in the DB. I can see the time taken in save query spikes up to 5-7 secs in every 10-15 mins.
Neo4j Server Config:

Neo4j server version: 3.3.2 (Community Edition)
StandAlone Machine
Server Hardware Conf: 72vCPU 144GB RAM (basically using c5d.18xlarge AWS EC2 instance to host the neo4j server)
PageCache size property set to 60g.

I am also passing the depth option in the save method of SDN as 0 i.e. only the node should be updated and not its relationship still there is a regular spike in save device query.

My hunch is that the number of properties could be one of the reason but I am not sure. Please suggest as to what might cause these spikes in query time and what can be done to improve.

Thanks in advance.

anthapu · ‎08-12-2020

Can you please provide the queries running when it took a lot of time?

Also, can you please provide the schema details? You can run ":schema" in browser and paste the details.

Also, is the underlying disk SSD or HDD?

On a side node you would benefit a lot by moving to 3.5.x version, as there are lot of improvements with query performance in those releases.

pulkit_suhasari · ‎08-13-2020

Hello,

The microservice has 5-6 REST API endpoints and during save device query there are lot of other queries being executed, mostly all of them finish under subsecond but one particular query almost always spikes whenever this save device query spikes and takes almost equal or greater time than save device query.

@Query("MERGE (u:User {userId: {userId}}) set u.lat={lat}, u.lng={lng}, u.accuracy={accuracy}, "
      + " u.brandId={brandId}, u.updatedAt={updatedAt}, u.createdAt={createdAt} "
      + " return u")
  User saveUserCypher(@Param("userId") Long userId, @Param("lat") Double lat, @Param("lng") Double lng,
      @Param("accuracy") Integer accuracy, @Param("brandId") Integer brandId,
      @Param("updatedAt") String updatedAt, @Param("createdAt") String createdAt);

Schema:

Indexes
   ON :Device(Id) ONLINE 
   ON :Device(advertisingId) ONLINE 
   ON :Device(androidId) ONLINE 
   ON :Device(deviceHash) ONLINE 
   ON :Device(gkDeviceId) ONLINE 
   ON :Device(macAddress) ONLINE 
   ON :Device(updatedAt) ONLINE 
   ON :Device(uuid) ONLINE 
   ON :PaymentInstrument(updatedAt) ONLINE 
   ON :User(createdAt) ONLINE 
   ON :User(type) ONLINE 
   ON :User(updatedAt) ONLINE 
   ON :CD(linkId) ONLINE  (for uniqueness constraint)
   ON :PaymentInstrument(fingerPrint) ONLINE  (for uniqueness constraint)
   ON :User(userId) ONLINE  (for uniqueness constraint)

Constraints
   ON ( cd:CD ) ASSERT cd.linkId IS UNIQUE
   ON ( paymentinstrument:PaymentInstrument ) ASSERT paymentinstrument.fingerPrint IS UNIQUE
   ON ( user:User ) ASSERT user.userId IS UNIQUE

Provisioned IOPS SSD
Please advise.

anthapu · ‎08-16-2020

Is this query last one in a set of queries in a single transaction? If so, what you might be seeing the commit time also included in the response time.

The query itself does not seem to be a problem. Also, this seems to be pretty simple save method. Why is ".save()" pattern not being used?

Can you try profiling the query to see if query plan is seeing any issue?

pulkit_suhasari · ‎08-16-2020

Hello,
Looking for some guidance. Any help would be appreciated. @anthapu

pulkit_suhasari · ‎08-17-2020

Hello @anthapu
The merge query which I mentioned in my previous reply is itself an individual transaction and not part of any other transaction.
I moved to writing cypher from .save() function to mitigate the huge amount of time taken by .save() function itself.
I did try profiling the query in a stage setup of the prod DB and there seems to be no issue with the profile as per my knowledge.

Are you hinting that my save device query is taking time because of this time-taking merge query for saving User node? Please suggest.

Thanks for helping.

anthapu · ‎08-17-2020

Oh ok. I understand. You moved to Cypher query from save method and it works fine with the query. Is that correct?

In SDN load/save work in conjunction. If the data is not being fetched and saved in consistence manner the SDN cache can be out of sync.

Also, does the user has lot of relationships? If so, SDN may be spending time to validate if anything has changed from relationship perspective. That may be taking time.

This is a valid work around for that issue.

If you have query logs enabled, can you check the server activity in query logs when you try the save method to see what is happening?

pulkit_suhasari · ‎08-17-2020

Hi @anthapu
My initial question was for the time taken in save method of SDN for Device node. On your follow up question of queries which are being run when Device node is saved to which I mentioned a query(i.e. the cypher for saving User Node) which also runs on requests received to a different REST endpoint of the same service and takes a considerable amount of time transiently.

Yes, User Node does have lot of relationships which is precisely why we used the overloaded Save method save(S s, int depth) and passed depth value as 0 such that only the node is updated and not the relationships but still save time was not improved and we moved to cypher. We did see improvements in save time after moving to cypher but the cypher also spikes up intermittently so I mentioned if that has anything to do with the save device SDN method spikes.

So I was looking for help around SDN save method for Device Node.

Sure will check the server logs. Also, I have around ~25 properties on Device Node. Can that contribute to the time taken in save method?

pulkit_suhasari · ‎09-02-2020

@anthapu
UPDATE: Have upgrade the neo4j DB version from 3.3.2 to 3.5.21 and the migration is successful but the query still takes time.
The migration rules out any possibility of a query spiking due to a bug in any of the older versions.
One interesting thing which I noted was that my query spikes exactly matches with the start of the checkpointing process i.e. every 15 mins and on an average it takes ~9sec to complete the checkpointing as show in the below logs.

020-09-01 22:39:05.223+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by "Scheduled checkpoint for time threshold" @ txId: 1719462798 checkpoint started...
2020-09-01 22:39:07.208+0000 INFO [o.n.k.i.s.c.CountsTracker] Rotated counts store at transaction 1719462828 to [/var/lib/neo4j/data/databases/graph.db/neostore.counts.db.b], from [/var/lib/neo4j/data/databases/graph.db/neostore.counts.db.a].
2020-09-01 22:39:14.951+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by "Scheduled checkpoint for time threshold" @ txId: 1719462798 checkpoint completed in 9s 728ms
2020-09-01 22:39:14.952+0000 INFO [o.n.k.i.t.l.p.LogPruningImpl] No log version pruned, last checkpoint was made in version 2489

Can this be the reason of spikes? If yes why is the checkpointing taking so much time and how can I decrease it. Please suggest.

anthapu · ‎09-02-2020

Checkpointing the regular flushing of memory contents to disk. This won't be contributing to it.

If you have query logs enabled, when you see the issue can you capture the query log and debug log and post it here?

Would like to see what the server saw and could it be the SDN client side is taking time.

pulkit_suhasari · ‎09-02-2020

@anthapu
Unfortunately, query logging is an enterprise-only feature and I am using a community edition.
Regarding the debug logs, below is the log snippet from the latest occurrence of the issue.

2020-09-02 12:03:51.349+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by "Scheduled checkpoint for time threshold" @ txId: 1721822405 checkpoint started...
2020-09-02 12:03:55.268+0000 INFO [o.n.k.i.s.c.CountsTracker] Rotated counts store at transaction 1721822636 to [/var/lib/neo4j/data/databases/graph.db/neostore.counts.db.b], from [/var/lib/neo4j/data/databases/graph.db/neostore.counts.db.a].
2020-09-02 12:04:09.477+0000 WARN [o.n.k.i.a.t.m.KernelTransactionMonitor] Transaction KernelTransactionImplementationHandle{txReuseCount=1588, tx=KernelTransaction[38129568]} timeout.
2020-09-02 12:04:25.990+0000 WARN [o.n.k.i.a.t.m.KernelTransactionMonitor] Transaction KernelTransactionImplementationHandle{txReuseCount=24649, tx=KernelTransaction[38128844]} timeout.
2020-09-02 12:04:26.015+0000 WARN [o.n.k.i.a.t.m.KernelTransactionMonitor] Transaction KernelTransactionImplementationHandle{txReuseCount=1713, tx=KernelTransaction[38131816]} timeout.
2020-09-02 12:04:26.033+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by "Scheduled checkpoint for time threshold" @ txId: 1721822405 checkpoint completed in 34s 683ms
2020-09-02 12:04:26.035+0000 INFO [o.n.k.i.t.l.p.LogPruningImpl] No log version pruned, last checkpoint was made in version 2503

Let me know what your thoughts are?
Thanks

Neo4j

Saving a Node takes ~5-6 seconds