10-22-2019 05:07 AM
Hi Team,
I am experiencing performance issues while reading data from disk. Here are some of the details regarding dataset and environment.
I am trying to run a Cypher query on an indexed field; it is expected to return at most 1,000 records with a maximum size of 4 MB each.
I understand that since my graph is bigger than the page cache, some of the data will be read from disk, but the read operation takes more than 30 seconds in some cases.
Is that normal behaviour when Neo4j reads indexed data from disk? Any help would be appreciated.
Thanks
10-22-2019 11:57 AM
Have you prefaced the query in question with PROFILE or EXPLAIN to determine whether the index is being utilized?
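For example (a generic sketch; the label, property, and value here are placeholders, not from this thread):
PROFILE MATCH (n:MyLabel) WHERE n.myProp = 'some-value' RETURN n
If the index is used, the plan will show a NodeIndexSeek operator rather than a NodeByLabelScan.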
How have you determined your graph size is 34G? Does this include transaction logs as well, which are not included in the page cache?
10-22-2019 12:23 PM
Hello Dana,
Yes, I have used EXPLAIN with the query and it does use the index. Here is the output of EXPLAIN.
I used :sysinfo to determine the size of the graph. Here is the output. I think it includes transaction logs as well, though I'm not sure. The transaction logs are around 2G.
10-22-2019 12:30 PM
Thanks for this detail.
:sysinfo and 'Total Store Size:' of 33.93G does in fact include all graph.db/neostore.transaction* files. From your screenshot it appears your graph might be on the order of 25GB +/-.
The profile looks really good, and it is surprising this would take 30 seconds. Is there some network latency in play here? If you run the query on the Azure instance itself with bin/cypher-shell, do you encounter the same 30 seconds? Are you running this through the Neo4j Browser, where some of the time may be spent rendering the result as a graph visualization?
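For example (a sketch; the user, password, and query are placeholders to fill in):
bin/cypher-shell -u neo4j -p <password> "PROFILE <your query>;"
Running it directly on the instance takes both the network hop and Browser rendering out of the measurement.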
10-22-2019 12:43 PM
The queries are running from a different pod (a Spring Boot application) within the same cluster. Network doesn't seem to be the issue, since lots of other things run in the cluster without trouble. Initially we thought it could be a disk-throughput issue, since we were using a smaller disk, but a recent disk upgrade also didn't help.
What should the read time be when Neo4j reads indexed data from disk? Are there any benchmarking statistics? One other observation: reads get even slower while checkpointing is happening.
10-22-2019 12:46 PM
Checkpointing? How long is checkpointing taking? If you have access to logs/debug.log and are running a *nix OS, you should be able to get this detail by running:
grep -i triggered logs/debug.log | grep -i check
Are you encountering a lot of Garbage Collection events in the debug.log?
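For instance (assuming a default debug.log, where long pauses are reported by the VM pause monitor as stop-the-world pauses):
grep -i "stop-the-world" logs/debug.log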
10-23-2019 02:12 AM
Checkpointing was slow when we were using the P10 disk; it sometimes took up to 14 minutes. But after the disk upgrade, checkpointing has become very fast; it's mostly under 10 seconds now.
Garbage collection is not much of a factor; there has been just one event in the last 24 hours. Do you have a rough idea of how long the query (with the above query plan) should take if all the data is read from disk?
10-22-2019 01:44 PM
Would you be able to provide the query, and is it possible to PROFILE the query and expand all elements of the query plan? The row and db hit info from a PROFILE plan is more useful for tuning.
10-23-2019 02:19 AM
Hi Andrew,
Please find the expanded query plan below:
The query is:
MATCH (n:artifact) WHERE n.docId IN ['3747ee26-8b2e-40cf-bccc-c262be69fe67', '5cd4923c-0c22-4e79-b6da-75bd919da31f', 'e9afe2ec-3324-4027-968d-4f5839d71287', 'acc4a43c-9cb2-4bce-8ebc-9fc43bb5453a', '41579a30-809c-4cc8-bfc4-b01a114caa26', '0a37fe3a-0068-41eb-8d12-fb0795931501', 'eebd3a93-7da2-47e7-ae79-786f24ade2aa', 'fa689188-5075-4051-9515-903ec9042383', 'c791aaef-6b89-499f-974b-071eec329755', '4d350433-4874-4cec-9f6f-ad5621c1d232'] AND n.tenantId='my-tenant' AND n.language='en' RETURN n.id, n.graphVecEmbedding
docId is indexed on :artifact nodes.
10-23-2019 02:33 AM
I think a composite index would help you here. Please create an index on :artifact(docId, tenantId), then rerun the query and see if that helps.
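A minimal sketch, assuming Neo4j 3.x index syntax:
CREATE INDEX ON :artifact(docId, tenantId)
With a composite index, both equality predicates can be answered by the index seek itself, instead of fetching each candidate node and filtering on tenantId afterwards.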
10-23-2019 04:38 AM
But after filtering on the indexed column, there would be at most around 2,000 nodes, and a full scan of 2,000 records should not take 30 seconds. I could even remove the filters on tenantId and language altogether, but I am still not sure why reading an indexed column from disk is slow. Are there any benchmarking stats for Neo4j reads from disk?
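As a hypothetical diagnostic (not something suggested in this thread): comparing a run that returns only n.id against the original that also returns n.graphVecEmbedding would show whether the time goes to the index seek or to fetching the large property values from disk, e.g.:
MATCH (n:artifact) WHERE n.docId IN [...] RETURN n.id
(docId list elided here; use the same list as above).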