03-29-2019 12:24 PM
Hi,
I tried running the union find algorithm on my graph, but it seems to have hit a deadlock. It's been running for 12+ hours but the server shows no CPU usage or I/O.
I suspect the issue is related to the size of my graph. It has 650 million nodes, 2.5 billion relationships, and is about 200 GB on disk.
My graph model is very simple: there is only one node label and one relationship type.
How can I resolve this deadlock? It seems most threads are stuck here:
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:403)
at org.neo4j.graphalgo.impl.ParallelUnionFindQueue.lambda$compute$1(ParallelUnionFindQueue.java:85)
at org.neo4j.graphalgo.impl.ParallelUnionFindQueue$$Lambda$450/1742932504.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:122)
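The frames above show worker threads parked inside `ArrayBlockingQueue.take()`. As a minimal sketch of why that looks like a hang with no CPU usage, the following standalone Java program blocks a thread on `take()` against a queue that nothing ever fills, then contrasts it with a timed `poll`. It is not a reproduction of the `ParallelUnionFindQueue` code; the class and method names (`BlockedTakeDemo`, `stateOfBlockedConsumer`, `pollWithTimeout`) are hypothetical and exist only for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class BlockedTakeDemo {

    /** Starts a consumer that calls take() on an empty queue and reports the thread state it ends up in. */
    static Thread.State stateOfBlockedConsumer() throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(1);
        Thread consumer = new Thread(() -> {
            try {
                queue.take(); // parks forever: no producer ever offers an element
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // lets the demo shut the thread down
            }
        }, "demo-consumer");
        consumer.setDaemon(true);
        consumer.start();

        // Give the consumer up to 5 seconds to reach the parked state.
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        while (consumer.getState() != Thread.State.WAITING && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        Thread.State state = consumer.getState();
        consumer.interrupt(); // release the parked thread
        return state;
    }

    /** A timed poll returns null on an empty queue instead of waiting indefinitely. */
    static Integer pollWithTimeout(long millis) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(1);
        return queue.poll(millis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("take() on empty queue: " + stateOfBlockedConsumer());
        System.out.println("poll(100 ms) on empty queue: " + pollWithTimeout(100));
    }
}
```

A thread parked this way consumes no CPU and does no I/O, which matches the symptoms described. Whether a missing producer is what actually happens inside the algorithm is an assumption this sketch does not verify.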
Cypher queries used:
// Load graph, this completes successfully
CALL algo.graph.load('my-graph','Label','REL_TYPE',{graph:'heavy'})
YIELD name, graph, direction, undirected, sorted, nodes, loadMillis, alreadyLoaded,
nodeWeight, relationshipWeight, nodeProperty, loadNodes, loadRelationships;
// This query hangs
CALL algo.unionFind('User', 'FRIEND', {graph:'my-graph',write:true, partitionProperty:"partition"})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis;
It seems I cannot attach text files, but I have uploaded the Java stack trace and log files here:
Stack trace
neo4j.log
debug.log
Neo4j version: 3.5.3
Neo4j graph algorithms version: 3.5.3.3
Thanks,
Steven
03-31-2019 02:41 PM
Thanks a lot for your detailed information. We're looking into it this week.
I also created this GH issue to track it:
Cheers, Michael
03-31-2019 09:28 PM
I'm seeing the same behavior. Here are the details:
Data store size: 1.92 TB
Neo4j version: 3.5.3 Enterprise
Neo4j Graph algorithms version 3.5.3.3
8.9 Billion nodes, 18 Billion rels
Despite having a pretty complex graph model, the basics are 9 node types and 1 relationship type. I don't know how to check the thread stacks, but my debug.log shows the exact same message:
2019-03-31 20:08:07.606+0000 INFO [o.n.k.i.p.Procedures] [neo4j.BoltWorker-5 [bolt] [/127.0.0.1:43070] ] LOADING 100% Memory usage: 239 GiB
2019-03-31 20:08:10.675+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=1962, gcTime=2000, gcCount=1}
2019-03-31 20:08:12.306+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=1321, gcTime=1329, gcCount=1}
2019-03-31 20:08:14.111+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=1197, gcTime=1256, gcCount=1}
2019-03-31 20:08:15.906+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=1192, gcTime=1263, gcCount=1}
2019-03-31 20:08:17.744+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=1236, gcTime=1235, gcCount=1}
2019-03-31 20:08:19.538+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=991, gcTime=1045, gcCount=1}
2019-03-31 20:10:03.847+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=103707, gcTime=0, gcCount=0}
2019-03-31 20:10:05.322+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=973, gcTime=966, gcCount=1}
2019-03-31 20:10:15.187+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=9765, gcTime=9845, gcCount=3}
2019-03-31 21:00:58.017+0000 INFO [o.n.k.i.p.Procedures] [algo-19] CC(HugeParallelUnionFindQueue) 32%
It has been at 32% for more than 7 hours now.
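On checking thread stacks: for a running server, the JDK's `jstack <pid>` tool prints a thread dump of the JVM, which is how a trace like the one in the original post is usually obtained. As a rough in-process illustration of the same idea, the sketch below uses `Thread.getAllStackTraces()` to list threads whose stack contains a given frame. The class and method names (`StackFrameFinder`, `threadsWithFrame`, `startBlockedConsumer`) are hypothetical, and the blocked consumer is only a stand-in for a stuck worker thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class StackFrameFinder {

    /** Returns the names of live threads in this JVM whose stack contains className.methodName. */
    static List<String> threadsWithFrame(String className, String methodName) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            for (StackTraceElement frame : entry.getValue()) {
                if (frame.getClassName().equals(className) && frame.getMethodName().equals(methodName)) {
                    names.add(entry.getKey().getName());
                    break; // one match per thread is enough
                }
            }
        }
        return names;
    }

    /** Starts a daemon thread that blocks in take() on an empty queue (stand-in for a stuck worker). */
    static Thread startBlockedConsumer(String name) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(1);
        Thread consumer = new Thread(() -> {
            try {
                queue.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, name);
        consumer.setDaemon(true);
        consumer.start();
        // Wait up to 5 seconds for the thread to park inside take().
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        while (consumer.getState() != Thread.State.WAITING && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        return consumer;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread stuck = startBlockedConsumer("demo-consumer");
        System.out.println(threadsWithFrame("java.util.concurrent.ArrayBlockingQueue", "take"));
        stuck.interrupt();
    }
}
```

This only inspects the JVM it runs in, so for the Neo4j server process itself an external dump via `jstack` remains the practical route.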
My Cypher query was executed in cypher-shell inside a tmux window on an EC2 instance with 488 GB RAM and 64 CPUs. Heap space was set to 380g and page cache to 60g. The command was:
CALL algo.unionFind(null, 'OBSERVED_WITH', {write:true, graph:'huge', partitionProperty:"partition"})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis;
Altogether the algorithm has been running for 25 hours now. It took about 18 hours to reach the same point in the debug.log where the OP was stuck.
04-04-2019 12:55 AM
This is resolved in the current release: