I'm seeing HUGE changes in performance for the same query running on a dedicated CentOS 7 system with adequate RAM, CPU, and disk. I'd like guidance on discovering the cause and fixing it.
I'm running neo4j-enterprise v4.4.3.1. The platform is a CentOS 7 system running on a dedicated AWS EC2 "t2.large" instance. I start from the same db dump (created with neo4j-admin) and add the same data using the same queries.
When I say "huge changes in performance", I mean that the same insertion job (a sequence of queries) sometimes finishes in 2903 seconds (just over 48 minutes) and sometimes takes 15792 seconds (4 hours and 23 minutes). That's a factor of FIVE!
This job has run daily on earlier versions for more than a year, always finishing in about 45-50 minutes. In almost two years of experience with Neo4j Enterprise, this is the first time I've seen anything like this.
I see no difference in the log files between fast and slow runs. I use Python scripts to run this, so it's being run with the same command line, the same database, and the same data each time.
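For context, the runner is conceptually like this (a minimal sketch, not my actual script; the URI, credentials, and query list are placeholders):

import time
from neo4j import GraphDatabase

URI = "bolt://localhost:7687"   # placeholder
AUTH = ("neo4j", "password")    # placeholder

# The same ordered sequence of Cypher statements, every run.
QUERIES = [
    # ...
]

def run_job():
    driver = GraphDatabase.driver(URI, auth=AUTH)
    try:
        with driver.session() as session:
            for query in QUERIES:
                start = time.monotonic()
                session.run(query).consume()  # block until the query finishes
                print(f"{time.monotonic() - start:8.1f}s  {query[:60]}")
    finally:
        driver.close()

if __name__ == "__main__":
    run_job()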
The RAM utilization seems pinned at 75% (according to top) while Neo4j is running. I'm accustomed to that.
On fast runs, the CPU utilization (according to top) is often between 40% and 100%. On slow runs, it is typically 3-4% and never above 10%.
At the moment, it is in the middle of another run and is already slow. I know from the console output that it is in the middle of a Neo4j-intensive query. Here is what the first lines of top show while this query is running:
# top
top - 16:57:17 up 8:01, 3 users, load average: 1.04, 1.08, 1.18
Tasks: 211 total, 1 running, 210 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.2 us, 0.3 sy, 0.0 ni, 50.3 id, 49.0 wa, 0.0 hi, 0.0 si, 0.2 st
KiB Mem : 8007412 total, 131620 free, 6417452 used, 1458340 buff/cache
KiB Swap: 2097148 total, 1488380 free, 608768 used. 1328000 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6468 root 20 0 8633976 5.6g 4944 S 2.7 73.2 21:53.56 java
38 root 20 0 0 0 0 S 0.3 0.0 2:05.73 kswapd0
1 root 20 0 125744 1600 896 S 0.0 0.0 0:02.23 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
Neo4j is the only Java app installed on the system. It's using only 2.7% of the CPU, and top shows nearly half of each interval spent waiting on I/O (49.0 wa) -- no wonder it's glacially slow!
According to SHOW TRANSACTIONS (in the Neo4j Browser), there are two active transactions:
query-2605:
"call apoc.periodic.iterate("MATCH (target:FIPSItem)<-[:FEATURE]-(dp:Datapoint {datasetID: 'nytimes.20220112.us-counties'})
WHERE
dp.dailyCaseCount IS NOT NULL AND
dp.dailyDeathCount IS NOT NULL AND
dp.cumulativeDeathCount IS NOT NULL AND
dp.cumulativeCaseCount IS NOT NULL
RETURN target, dp AS dpTarget
", "MATCH (target)-[:ADJOINS]-(neighbor:FIPSItem)<-[FEATURE]-(dpNeighbor: Datapoint {datasetID: 'nytimes.20220112.us-counties', pertainsDate: dpTarget.pertainsDate})
WHERE
dpNeighbor.dailyCaseCount IS NOT NULL AND
dpNeighbor.dailyDeathCount IS NOT NULL AND
dpNeighbor.cumulativeDeathCount IS NOT NULL AND
dpNeighbor.cumulativeCaseCount IS NOT NULL
WITH
dpTarget,
stDevP(dpNeighbor.cumulativeCaseCount - dpTarget.cumulativeCaseCount) AS cumulativeCaseCountEdgeIntensity,
stDevP(dpNeighbor.cumulativeDeathCount - dpTarget.cumulativeDeathCount) AS cumulativeDeathCountEdgeIntensity,
stDevP(dpNeighbor.dailyCaseCount - dpTarget.dailyCaseCount) AS dailyCaseCountEdgeIntensity,
stDevP(dpNeighbor.dailyDeathCount - dpTarget.dailyDeathCount) AS dailyDeathCountEdgeIntensity
MERGE (analysisResult: AnalysisResult {analysisResultID: 'zeetix.analysis_tools.edgeIntensityAnalyzer.20220201163906.' + dpTarget.datapointID })
SET
analysisResult.datapointID = dpTarget.datapointID,
analysisResult.cumulativeCaseCountEdgeIntensity = cumulativeCaseCountEdgeIntensity,
analysisResult.cumulativeDeathCountEdgeIntensity = cumulativeDeathCountEdgeIntensity,
analysisResult.dailyCaseCountEdgeIntensity = dailyCaseCountEdgeIntensity,
analysisResult.dailyDeathCountEdgeIntensity = dailyDeathCountEdgeIntensity
WITH dpTarget, analysisResult
MATCH (analysisRun: AnalysisRun {analysisRunID: 'zeetix.analysis_tools.edgeIntensityAnalyzer.20220201163906'})
MERGE (analysisResult)-[:ANALYSIS_RUN]->(analysisRun)
MERGE (dpTarget)-[:ANALYSIS_RESULT]->(analysisResult)
", {batchSize:500})"
and query-2814, the per-batch inner statement that apoc.periodic.iterate generates from the second argument:
"UNWIND $_batch AS _batch WITH _batch.target AS target,_batch.dpTarget AS dpTarget MATCH (target)-[:ADJOINS]-(neighbor:FIPSItem)<-[FEATURE]-(dpNeighbor: Datapoint {datasetID: 'nytimes.20220112.us-counties', pertainsDate: dpTarget.pertainsDate})
WHERE
dpNeighbor.dailyCaseCount IS NOT NULL AND
dpNeighbor.dailyDeathCount IS NOT NULL AND
dpNeighbor.cumulativeDeathCount IS NOT NULL AND
dpNeighbor.cumulativeCaseCount IS NOT NULL
WITH
dpTarget,
stDevP(dpNeighbor.cumulativeCaseCount - dpTarget.cumulativeCaseCount) AS cumulativeCaseCountEdgeIntensity,
stDevP(dpNeighbor.cumulativeDeathCount - dpTarget.cumulativeDeathCount) AS cumulativeDeathCountEdgeIntensity,
stDevP(dpNeighbor.dailyCaseCount - dpTarget.dailyCaseCount) AS dailyCaseCountEdgeIntensity,
stDevP(dpNeighbor.dailyDeathCount - dpTarget.dailyDeathCount) AS dailyDeathCountEdgeIntensity
MERGE (analysisResult: AnalysisResult {analysisResultID: 'zeetix.analysis_tools.edgeIntensityAnalyzer.20220201163906.' + dpTarget.datapointID })
SET
analysisResult.datapointID = dpTarget.datapointID,
analysisResult.cumulativeCaseCountEdgeIntensity = cumulativeCaseCountEdgeIntensity,
analysisResult.cumulativeDeathCountEdgeIntensity = cumulativeDeathCountEdgeIntensity,
analysisResult.dailyCaseCountEdgeIntensity = dailyCaseCountEdgeIntensity,
analysisResult.dailyDeathCountEdgeIntensity = dailyDeathCountEdgeIntensity
WITH dpTarget, analysisResult
MATCH (analysisRun: AnalysisRun {analysisRunID: 'zeetix.analysis_tools.edgeIntensityAnalyzer.20220201163906'})
MERGE (analysisResult)-[:ANALYSIS_RUN]->(analysisRun)
MERGE (dpTarget)-[:ANALYSIS_RESULT]->(analysisResult)
"
There are indexes defined on all referenced property names, and constraints defined on the various IDs (the code generates a unique ID for each instance of each labeled node); I defined the label/property indexes according to the documentation.
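For concreteness, the setup includes statements along these lines (a sketch run once from Python; the constraint/index names are illustrative, and the composite index on Datapoint is my reading of the match pattern above, not something the documentation prescribes):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))  # placeholders

# Neo4j 4.4 syntax; names are illustrative.
SETUP = [
    # Unique generated IDs back the MERGE lookups.
    "CREATE CONSTRAINT analysisResultID_unique IF NOT EXISTS "
    "FOR (n:AnalysisResult) REQUIRE n.analysisResultID IS UNIQUE",
    "CREATE CONSTRAINT analysisRunID_unique IF NOT EXISTS "
    "FOR (n:AnalysisRun) REQUIRE n.analysisRunID IS UNIQUE",
    # Composite index supporting the {datasetID, pertainsDate} match.
    "CREATE INDEX datapoint_dataset_date IF NOT EXISTS "
    "FOR (n:Datapoint) ON (n.datasetID, n.pertainsDate)",
]

with driver.session() as session:
    for statement in SETUP:
        session.run(statement).consume()
driver.close()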
I've seen references to "metrics" in the newer documentation, but I don't see how to connect those metrics to what I'm seeing. As far as I know, I haven't turned any of that on.
I dialed the batchSize of the above query back to 500 to see whether I was somehow running out of memory.
Why is the CPU utilization so low? Why do these queries sometimes take so long?
I've been chasing this since last week, and I've run out of things to look at. I invite guidance about what to instrument and how to do so.
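In the meantime, here is the kind of system sampler I can run alongside the job to correlate slow stretches with iowait, steal, or swapping (a minimal sketch; psutil is my assumption -- any /proc/stat reader would do):

import time
import psutil

INTERVAL = 5  # seconds between samples

print("time      user%  iowait%  steal%  swap_used_MB")
while True:
    # cpu_times_percent blocks for INTERVAL and returns the breakdown of
    # CPU time over that window (Linux exposes iowait and steal).
    cpu = psutil.cpu_times_percent(interval=INTERVAL)
    swap = psutil.swap_memory()
    print(f"{time.strftime('%H:%M:%S')}  {cpu.user:5.1f}  "
          f"{cpu.iowait:7.1f}  {cpu.steal:6.1f}  {swap.used / 2**20:12.0f}")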
Is there some combination of neo4j.conf settings that restricts CPU utilization? What could make Neo4j spend so little time doing anything when there is so much to be done?
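For comparing runs, I can also dump the effective settings with the stock dbms.listConfig procedure (connection details are placeholders; the search strings are just the areas I'd check first):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))  # placeholders

with driver.session() as session:
    for area in ("memory", "thread"):
        result = session.run(
            "CALL dbms.listConfig($search) YIELD name, value "
            "RETURN name, value ORDER BY name",
            search=area,
        )
        for record in result:
            print(f"{record['name']} = {record['value']}")
driver.close()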