
VM stop-the-world pause for a simple LOAD CSV query when creating a graph

Hi,

My code is as follows:

from neo4j.v1 import GraphDatabase, basic_auth
import csv,os,glob
import time

print(os.getcwd())
subdir = '.\\..\\Dataset\\plasmin\\' #plasmin, acetylcholine_esterase
os.chdir(subdir)
print(os.getcwd())

fileList = [x.split(".")[0] for x in glob.glob('*.scalaprocessed_maxdist11_graph')]
#fileList = ['1A25']
print(len(fileList),fileList)

driver = GraphDatabase.driver("bolt://127.0.0.1:11005", auth=("neo4j", "123")) #7687 sandbox: neo4j, neo4j
session = driver.session()

session.run("MATCH(n) DETACH DELETE n")

for prot in fileList:
    print("\n", prot)
    if os.path.exists(prot+"_AllCCs_maxDist11.csv"):
        print("Already processed")
        continue
    # prerequisite: delete any pre-existing graph (otherwise it will get added up for each subsequent protein)
    session.run("MATCH(n) DETACH DELETE n")
    
    #1. Create graph
    start = time.time()
    session.run("USING PERIODIC COMMIT "
                 "LOAD CSV FROM 'file:///'+{prot}+'.scalaProcessed_maxdist11/part-00000' AS line "
                 "MERGE (n:MyNode {Name:line[0]}) " 
                 "MERGE (m:MyNode {Name:line[1]}) " 
                 "MERGE (n) -[:TO {dist:line[2]}] -> (m) ", prot=prot)
    print("done step 1")
    #2 find CCs
    result = session.run("CALL algo.unionFind.stream('MyNode', 'TO', {graph:'huge'}) "
                          "YIELD nodeId,setId "
                          "MATCH (n) "
                          "WHERE id(n)=nodeId "
                          "WITH setId,collect(nodeId) as nodes, collect(n.Name) as labels,count(*) as size_of_component "
                          "ORDER BY size_of_component DESC "
                          "RETURN setId as componentId,size_of_component,labels as connectedTSRkeys ")
    # 3. save result
    with open(prot+"_AllCCs_maxDist11.csv","w",newline='') as csvfile: # newline='' <- to avoid blank line between two rows
        writer = csv.writer(csvfile,delimiter = ',')
        writer.writerow(['componentId','size_of_component','connectedTSRkeys'])
        for record in result:
            record =  str(record)[:-1].replace(", ",",").replace("'","").split() 
            #print("\n",record[1],record[2],record[3])
            writer.writerow([record[1].split("=")[1], record[2].split("=")[1], record[3].split("=")[1]])
    
    # 4. delete the graph (otherwise it will get added up for each subsequent protein)
    session.run("MATCH(n) DETACH DELETE n")
    print("Total processing time = ", time.time()-start)

driver.close()
print("Neo4j end")

- I am using CentOS 7 on a remote machine with Neo4j 3.5.8, configured with:
dbms.memory.heap.initial_size=32G
dbms.memory.heap.max.size=64G
dbms.memory.pagecache.size=128G
out of 256 GB of available RAM.

- I am accessing Neo4j from Python through the neo4j-python driver.

The problem I am encountering is that a VM stop-the-world pause is reported after around 10,000 nodes have been created, and execution stalls even though the code does not stop.

The neo4j.log seems to be clean. 
Snapshot of the debug log: (image attachment)

Most of the available solutions talk about adjusting the heap size, which in my case I feel is already more than enough.

A couple of issues:

Have you narrowed this down to the LOAD CSV being what triggers the VM stop-the-world pause? If so, do you have an index on :MyNode and the Name property?

Your code includes MATCH (n) DETACH DELETE n; how many nodes/relationships does this end up deleting? If you are deleting 100k+ nodes/relationships you might want to consider batching this using apoc.periodic.iterate. See https://neo4j.com/developer/kb/large-delete-transaction-best-practices-in-neo4j/.
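
For example, a minimal sketch of a batched delete with apoc.periodic.iterate (assuming the APOC plugin is installed):

CALL apoc.periodic.iterate(
  "MATCH (n) RETURN n",
  "DETACH DELETE n",
  {batchSize: 10000, parallel: false})

Each batch of 10,000 nodes is detached and deleted in its own transaction, which keeps heap usage bounded.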

VM stop-the-world messages are indicative of Java garbage collection and are part of any Java-based app. You can't eliminate them for good; you can only attempt to minimize them.

First, yes, LOAD CSV is causing the stop-the-world pause. I have checked periodically how many nodes are getting created: around 11k nodes get created (only nodes, no relationships yet), and then the above pause is applied. The code does not move to the next step.

Second, the pre-cleanup stage deletes at most the 11k nodes created above. The operation takes less than 1 second.

Third, I have executed the same code on my laptop with 20k nodes in total to be created (min heap=2G, max heap=4G, available RAM=4G). The code does not get paused, but it takes a long time; for each protein (one input file per graph):
step 1 < 5 sec
step 2 > 24 hour
step 3 < 2 sec.
Therefore, it is not feasible to use such a low-powered machine to process more than 2000 proteins in a loop.

thanks for the update.

If so do you have indexes on :MyNode and on the Name property?

I decided not to keep an index on :MyNode(Name) as it seemed not to be useful for me. If we think in relational-database terms, I have only one table in my database with one attribute, Name, and self-pointers within the table. I assumed indexing could not speed things up to any degree, since while searching we have to look up all the keys anyway.

Here is a snapshot of one of my input file:

key_i    ,key_j    ,relationship
7148759_2,7151776_1,783_788
7151776_1,8071389_1,783_788
7047240_2,7148759_2,597_639
8394145_1,8800128_1,732_735
6400697_1,6401718_1,612_653

We can see that key '7148759_2' appears as key_i in the first line and as key_j in the third line of the relational table. We have keys at only a single level of the hierarchy; therefore, I guessed indexing on Name (which is key_i/key_j here) might not help.

Please correct me if I am wrong.

Indexing will certainly help.
Without an index on :MyNode(Name), a Cypher statement that attempts to find all nodes with label :MyNode will perform a label scan, iterating over all nodes with this label and evaluating each of them against the WHERE clause. With an index it will simply perform an index seek. In an environment that has, for example, 100m nodes with label :MyNode where each Name property is unique, with no index you effectively have to examine all 100m nodes; with an index you would examine only 1 node.

An index is the way to go. This is no different between a graph and an RDBMS; the same principles apply.
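
In Neo4j 3.x the index takes one statement, run once before the load (a sketch):

CREATE INDEX ON :MyNode(Name)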

I will apply index and get back to you with the results within 24 hours.

FYI, my last run of the code got some results, though not complete: for one input file, it took 15 hr to create 11594 nodes and the corresponding relationships. It failed after that. Can you think of any reason why creating one graph might take 15 hr, given the above settings?

Keep in mind that without those indexes, each MERGE you were performing was doing a node by label scan and property access (to determine which nodes fit your pattern)

You were performing 2 MERGEs per row, for every row, so if your CSV had 11594 rows then 23188 label scans were performed in total, along with the property accesses. While this would be cheap at the start, it would likely have grown more and more expensive as more nodes were present.

We would also advise you to do separate passes through the CSV for the nodes. Your two MERGE operations on the same label likely created an Eager operator in the plan, which disables the periodic-commit behavior of the load, leading to heap pressure and likely to issues with processing. Try creating the nodes first (one pass if all nodes appear under line[0], otherwise two passes, one per column). When all nodes are created, do a pass for the relationships that uses MATCH (instead of MERGE) for the nodes and MERGEs only the relationship. You can check the plan as shown below.
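
To see whether the Eager operator is present, prefix the statement with EXPLAIN and inspect the plan (a sketch; the file name here is a placeholder):

EXPLAIN LOAD CSV FROM 'file:///example.csv' AS line
MERGE (n:MyNode {Name:line[0]})
MERGE (m:MyNode {Name:line[1]})
MERGE (n)-[:TO {dist:line[2]}]->(m)

If Eager shows up between the MERGEs, periodic commit is effectively disabled for the load.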

Here is an update:
I have incorporated indexing before starting to load data from the CSV and it improved step 1. Here is the updated code (I have updated the deletion approach as well).

.
.
.
driver = GraphDatabase.driver("bolt://127.0.0.1:7687", auth=("neo4j", "123")) #7687 sandbox: neo4j, neo4j
session = driver.session()

print("\nConnection Established.")

def delete_all():
    # get total # of nodes
    res = session.run("MATCH (n) RETURN COUNT(*) AS n")
    total_nodes = 0
    for item in res:
        total_nodes = item["n"]
    print("\n Existing nodes in db:", total_nodes)

    # get total # of relationships
    res1 = session.run("MATCH (n)-[r]->() RETURN COUNT(r) AS r")
    total_rels = 0
    for item in res1:
        total_rels = item["r"]
    print("\n Existing relationships in db:", total_rels)

    # delete all nodes in batches (for faster deletion)
    while total_nodes > 0:
        res = session.run("MATCH (n) WITH n LIMIT 10000 DETACH DELETE n RETURN COUNT(n) AS count")
        count = 0
        for item in res:
            count = item["count"]  # update deleted-node count
        total_nodes = total_nodes - count
    print("\n #of nodes in db after deletion completed = ", total_nodes)
start = time.time()
delete_all()
print("\n Pre cleanup time (sec): ", time.time()-start)

for prot in fileList:
    print("\n\n", prot)
    if os.path.exists(prot+"_AllCCs_maxDist11.csv"):
        print("\n Already Processed.")
        continue
    start = time.time()
    delete_all()
    pre_time = time.time()-start
    print("\n Pre cleanup time (sec): ", pre_time)

    # Database preparation: create the index on :MyNode(Name).
    # Note: in Neo4j 3.5, re-running CREATE INDEX after the index exists raises an error,
    # so ideally this runs once before the loop.
    session.run("CREATE INDEX ON :MyNode(Name)")

    #1. Create graph
    start = time.time()
    session.run("USING PERIODIC COMMIT "
                 "LOAD CSV FROM 'file:///'+{prot}+'_conflict_resolved.txt' AS line "
                 "MERGE (n:MyNode {Name:line[0]}) "
                 "MERGE (m:MyNode {Name:line[1]}) "
                 "MERGE (n) -[:TO {dist:line[2]}] -> (m) ", prot=prot)

    end = time.time()
    step1_time = end - start
    print("\n Step 1 time (in sec) = ", end-start)

    #2 find CCs
    start = time.time()
    result = session.run("CALL algo.unionFind.stream('MyNode', 'TO', {graph:'huge'}) "
                          "YIELD nodeId,setId "
                          "MATCH (n) "
                          "WHERE id(n)=nodeId "
                          "WITH setId,collect(nodeId) as nodes, collect(n.Name) as labels,count(*) as size_of_component "
                          "ORDER BY size_of_component DESC "
                          "RETURN setId as componentId,size_of_component,labels as connectedTSRkeys ")
    end = time.time()
    step2_time = end - start
    print("\n Step 2 time (in sec) = ", end-start)
   #3. save result
    start = time.time()
    with open(prot+"_AllCCs_maxDist11.csv","w") as csvfile:
        writer = csv.writer(csvfile,delimiter = ',')
        writer.writerow(['componentId','size_of_component','connectedTSRkeys'])
        for record in result:
            record =  str(record)[:-1].replace(", ",",").replace("'","").split()
            print("\n",record[1],record[2],record[3])
            writer.writerow([record[1].split("=")[1], record[2].split("=")[1], record[3].split("=")[1]])
    end = time.time()
    step3_time = end - start
    print("\n Step 3 time (in sec) = ", end-start)

    #4. delete graph
    start = time.time()
    delete_all()
    end = time.time()
    post_time = end - start
    print("\n Post cleanup time (in sec) = ", end-start)

    print("\n Total time = ", pre_time+step1_time+step2_time+step3_time+post_time)

driver.close()
print ("\n Neo4j end")

The code executed for one input file and generated:

Existing nodes in db:11684
Existing relationships in db:4432666
Pre cleanup time (sec): 0.014729022979736328
Step 1 time (in sec) = 0.0004360675811767578
Step 2 time (in sec) = 1128.0729930400848
Step 3 time (in sec) = 0.19386005401611328
Post cleanup time (in sec) = 89.63982391357422
Total time = 1217.921842098236
Traceback (most recent call last):
  File "neo4j_copy.py", line 80, in <module>
    result = session.run("CALL algo.unionFind.stream('MyNode', 'TO', {graph:'huge'}) "
  File "/usr/lib64/python2.7/site-packages/neo4j/v1/api.py", line 331, in run
    self._connection.fetch()
  File "/usr/lib64/python2.7/site-packages/neo4j/bolt/connection.py", line 287, in fetch
    return self._fetch()
  File "/usr/lib64/python2.7/site-packages/neo4j/bolt/connection.py", line 327, in _fetch
    response.on_failure(summary_metadata or {})
  File "/usr/lib64/python2.7/site-packages/neo4j/v1/result.py", line 70, in on_failure
    raise CypherError.hydrate(**metadata)
neo4j.exceptions.DatabaseError: The database has encountered a critical error, and needs to be restarted. Please see database logs for more details.

Here is a snapshot of the debug.log file after trying to restart neo4j:

2019-08-23 23:01:25.580+0000 INFO [o.n.i.d.DiagnosticsManager]   Total size of mapped files: 477.17 MB
2019-08-23 23:01:25.580+0000 INFO [o.n.i.d.DiagnosticsManager] --- STARTED diagnostics for KernelDiagnostics:StoreFiles END ---
2019-08-23 23:01:25.580+0000 INFO [o.n.k.a.DatabaseAvailabilityGuard] Fulfilling of requirement 'Database available' makes database graph.db available.
2019-08-23 23:01:25.580+0000 INFO [o.n.k.a.DatabaseAvailabilityGuard] Database graph.db is ready.
2019-08-23 23:01:25.580+0000 INFO [o.n.k.i.DatabaseHealth] Database health set to OK
2019-08-23 23:01:26.351+0000 INFO [o.n.b.BoltServer] Bolt server loaded
2019-08-23 23:01:26.468+0000 INFO [o.n.b.BoltServer] Bolt enabled on 127.0.0.1:7687.
2019-08-23 23:01:27.763+0000 INFO [o.n.i.d.DiagnosticsManager] --- SERVER STARTED START ---
2019-08-23 23:01:28.753+0000 INFO [o.n.i.d.DiagnosticsManager] --- SERVER STARTED END ---
2019-08-23 23:01:35.913+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=131, gcTime=136, gcCount=1}
2019-08-23 23:01:43.286+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=100, gcTime=140, gcCount=1}
2019-08-23 23:01:55.271+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=118, gcTime=119, gcCount=1}
2019-08-23 23:03:02.050+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=111, gcTime=123, gcCount=1}
2019-08-23 23:04:09.150+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=149, gcTime=152, gcCount=1}
2019-08-23 23:04:15.430+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=173, gcTime=257, gcCount=1}
2019-08-23 23:05:31.105+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=220, gcTime=227, gcCount=1}
2019-08-23 23:06:55.569+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=326, gcTime=328, gcCount=1}
2019-08-23 23:08:46.308+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=103, gcTime=0, gcCount=0}
2019-08-23 23:12:28.911+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=115, gcTime=181, gcCount=1}
2019-08-23 23:13:25.994+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=146, gcTime=171, gcCount=1}
2019-08-23 23:14:51.262+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=198, gcTime=249, gcCount=1}
2019-08-23 23:15:19.198+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=104, gcTime=156, gcCount=1}
2019-08-23 23:17:37.486+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=104, gcTime=157, gcCount=1}
2019-08-23 23:18:06.551+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=130, gcTime=141, gcCount=1}
2019-08-23 23:18:34.201+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=115, gcTime=160, gcCount=1}
2019-08-23 23:19:02.844+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=110, gcTime=172, gcCount=1}
2019-08-23 23:19:15.222+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=241, gcTime=309, gcCount=1}
2019-08-23 23:19:19.044+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=499, gcTime=506, gcCount=1}
2019-08-23 23:19:24.462+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=613, gcTime=575, gcCount=1}
2019-08-23 23:19:28.577+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=411, gcTime=365, gcCount=1}
2019-08-23 23:19:33.258+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=276, gcTime=305, gcCount=1}
2019-08-23 23:19:33.968+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=107, gcTime=168, gcCount=1}
2019-08-23 23:19:36.453+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=181, gcTime=0, gcCount=0}
2019-08-23 23:19:37.011+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=218, gcTime=0, gcCount=0}
2019-08-23 23:19:38.382+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=269, gcTime=265, gcCount=1}
2019-08-23 23:19:40.924+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=339, gcTime=331, gcCount=1}
2019-08-23 23:19:44.498+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=352, gcTime=380, gcCount=1}
2019-08-23 23:19:49.934+0000 INFO [o.n.k.NeoStoreDataSource] Rotated to transaction log [/var/lib/neo4j/data/databases/graph.db/neostore.transaction.db.16] version=16, last transaction in previous log=7870
2019-08-23 23:19:52.906+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=1060, gcTime=1086, gcCount=1}
2019-08-23 23:19:58.107+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=1043, gcTime=1056, gcCount=1}
2019-08-23 23:20:04.868+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=3204, gcTime=3257, gcCount=1}
2019-08-23 23:20:08.301+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=2331, gcTime=2314, gcCount=1}
2019-08-23 23:20:19.771+0000 INFO [o.n.k.i.p.Procedures] Node Store Scan: Imported 11,684 records and 0 properties from 1615 KiB (1,654,380 bytes); took 0.089 s, 131,692.26 Nodes/s, 17 MiB/s (18,646,784 bytes/s) (per thread: 32,923.06 Nodes/s, 4552 KiB/s (4,661,696 bytes/s))
2019-08-23 23:20:19.915+0000 INFO [o.n.k.i.p.Procedures] [relationship-store-scan-2] LOADING 1% Memory usage: 3864 KiB
2019-08-23 23:20:20.363+0000 INFO [o.n.k.i.p.Procedures] Relationship Store Scan: Imported 4,432,666 records and 0 properties from 287 MiB (301,740,480 bytes); took 0.579 s, 7,661,493.04 Relationships/s, 497 MiB/s (521,533,223 bytes/s) (per thread: 1,915,373.26 Relationships/s, 124 MiB/s (130,383,305 bytes/s))
2019-08-23 23:20:20.364+0000 INFO [o.n.k.i.p.Procedures] [neo4j.BoltWorker-5 [bolt] [/127.0.0.1:47196] ] LOADING 100% Memory usage: 6284 KiB
2019-08-23 23:20:20.465+0000 INFO [o.n.k.i.p.Procedures] [algo-3] ParallelUnionFind 100%
2019-08-23 23:20:20.524+0000 INFO [o.n.k.i.p.Procedures] UnionFind: overall memory usage: 6375 KiB
2019-08-23 23:20:20.971+0000 INFO [o.n.c.i.ExecutionEngine] Discarded stale query from the query cache after 1129 seconds: MATCH (n)-[r]->() RETURN COUNT(r) as r
2019-08-23 23:20:27.094+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=1871, gcTime=0, gcCount=0}
2019-08-23 23:20:35.282+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=1634, gcTime=1623, gcCount=1}
2019-08-23 23:31:54.221+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=262, gcTime=258, gcCount=1}
2019-08-23 23:31:58.854+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=228, gcTime=228, gcCount=1}
2019-08-23 23:33:05.202+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=289, gcTime=301, gcCount=1}
2019-08-23 23:33:19.039+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=121, gcTime=72, gcCount=1}
2019-08-23 23:34:21.600+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=289, gcTime=286, gcCount=1}
2019-08-23 23:35:09.729+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=168, gcTime=126, gcCount=1}
2019-08-23 23:35:37.672+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=109, gcTime=93, gcCount=1}
2019-08-23 23:35:53.499+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=308, gcTime=324, gcCount=1}
2019-08-23 23:36:52.506+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=111, gcTime=78, gcCount=1}
2019-08-23 23:37:43.229+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=116, gcTime=144, gcCount=1}
2019-08-23 23:37:53.015+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=776, gcTime=429, gcCount=1}
2019-08-23 23:38:02.838+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=712, gcTime=641, gcCount=1}
2019-08-23 23:38:10.275+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=604, gcTime=634, gcCount=1}
2019-08-23 23:38:15.773+0000 ERROR [o.n.k.i.DatabaseHealth] Database panic: The database has encountered a critical error, and needs to be restarted. Please see database logs for more details. No space left on device
java.io.IOException: No space left on device
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
	at sun.nio.ch.IOUtil.write(IOUtil.java:65)
	at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
	at org.neo4j.io.fs.StoreFileChannel.write(StoreFileChannel.java:160)
	at org.neo4j.io.fs.StoreFileChannel.writeAll(StoreFileChannel.java:73)
	at org.neo4j.kernel.impl.transaction.log.PhysicalLogVersionedStoreChannel.writeAll(PhysicalLogVersionedStoreChannel.java:67)
	at org.neo4j.kernel.impl.transaction.log.PhysicalFlushableChannel.prepareForFlush(PhysicalFlushableChannel.java:72)
	at org.neo4j.kernel.impl.transaction.log.PhysicalFlushableChannel.bufferWithGuaranteedSpace(PhysicalFlushableChannel.java:158)
	at org.neo4j.kernel.impl.transaction.log.PhysicalFlushableChannel.putLong(PhysicalFlushableChannel.java:121)
	at org.neo4j.kernel.impl.transaction.log.PositionAwarePhysicalFlushableChannel.putLong(PositionAwarePhysicalFlushableChannel.java:81)
	at org.neo4j.kernel.impl.transaction.log.PositionAwarePhysicalFlushableChannel.putLong(PositionAwarePhysicalFlushableChannel.java:30)
	at org.neo4j.kernel.impl.transaction.command.Command$PropertyCommand.writePropertyRecord(Command.java:424)
	at org.neo4j.kernel.impl.transaction.command.Command$PropertyCommand.serialize(Command.java:412)
	at org.neo4j.kernel.impl.transaction.log.entry.StorageCommandSerializer.visit(StorageCommandSerializer.java:43)
	at org.neo4j.kernel.impl.transaction.log.entry.StorageCommandSerializer.visit(StorageCommandSerializer.java:30)
	at org.neo4j.kernel.impl.transaction.log.PhysicalTransactionRepresentation.accept(PhysicalTransactionRepresentation.java:70)
	at org.neo4j.kernel.impl.transaction.log.entry.LogEntryWriter.serialize(LogEntryWriter.java:84)
	at org.neo4j.kernel.impl.transaction.log.TransactionLogWriter.append(TransactionLogWriter.java:43)
	at org.neo4j.kernel.impl.transaction.log.BatchingTransactionAppender.appendToLog(BatchingTransactionAppender.java:211)
	at org.neo4j.kernel.impl.transaction.log.BatchingTransactionAppender.append(BatchingTransactionAppender.java:119)
	at org.neo4j.kernel.impl.api.TransactionRepresentationCommitProcess.appendToLog(TransactionRepresentationCommitProcess.java:64)
	at org.neo4j.kernel.impl.api.TransactionRepresentationCommitProcess.commit(TransactionRepresentationCommitProcess.java:48)
	at org.neo4j.kernel.impl.api.KernelTransactionImplementation.commit(KernelTransactionImplementation.java:716)
	at org.neo4j.kernel.impl.api.KernelTransactionImplementation.closeTransaction(KernelTransactionImplementation.java:586)
	at org.neo4j.internal.kernel.api.Transaction.close(Transaction.java:178)
	at org.neo4j.kernel.impl.coreapi.TopLevelTransaction.close(TopLevelTransaction.java:77)
	at org.neo4j.kernel.impl.query.Neo4jTransactionalContext.close(Neo4jTransactionalContext.java:130)
	at org.neo4j.cypher.internal.runtime.interpreted.TransactionalContextWrapper.close(TransactionalContextWrapper.scala:74)
	at org.neo4j.cypher.internal.runtime.interpreted.DelegatingQueryTransactionalContext.close(DelegatingQueryContext.scala:341)
	at org.neo4j.cypher.internal.compatibility.v3_5.ExceptionTranslatingQueryContext$ExceptionTranslatingTransactionalContext.org$neo4j$cypher$internal$compatibility$v3_5$ExceptionTranslatingQueryContext$ExceptionTranslatingTransactionalContext$$sup
2019-08-23 23:41:36.957+0000 WARN [o.n.b.r.DefaultBoltConnection] The client is unauthorized due to authentication failure.
2019-08-23 23:41:44.486+0000 ERROR [o.n.b.v.r.ErrorReporter] Client triggered an unexpected error [Neo.DatabaseError.Transaction.TransactionStartFailed]: The database has encountered a critical error, and needs to be restarted. Please see database logs for more details., reference d615fd77-36c4-4be9-aa16-3fc32f0cd578.
2019-08-23 23:41:44.486+0000 ERROR [o.n.b.v.r.ErrorReporter] Client triggered an unexpected error [Neo.DatabaseError.Transaction.TransactionStartFailed]: The database has encountered a critical error, and needs to be restarted. Please see database logs for more details., reference d615fd77-36c4-4be9-aa16-3fc32f0cd578. The database has encountered a critical error, and needs to be restarted. Please see database logs for more details.
org.neo4j.graphdb.TransactionFailureException: The database has encountered a critical error, and needs to be restarted. Please see database logs for more details.
	at org.neo4j.graphdb.facade.spi.ClassicCoreSPI.beginTransaction(ClassicCoreSPI.java:193)
	at org.neo4j.kernel.impl.factory.GraphDatabaseFacade.beginTransactionInternal(GraphDatabaseFacade.java:691)
	at org.neo4j.kernel.impl.factory.GraphDatabaseFacade.beginTransaction(GraphDatabaseFacade.java:375)
	at org.neo4j.bolt.v1.runtime.TransactionStateMachineV1SPI.beginTransaction(TransactionStateMachineV1SPI.java:139)
	at org.neo4j.bolt.v1.runtime.TransactionStateMachineV1SPI.beginTransaction(TransactionStateMachineV1SPI.java:98)
	at org.neo4j.bolt.v1.runtime.TransactionStateMachine$State$1.execute(TransactionStateMachine.java:293)
	at org.neo4j.bolt.v1.runtime.TransactionStateMachine$State$1.run(TransactionStateMachine.java:269)
	at org.neo4j.bolt.v1.runtime.TransactionStateMachine.run(TransactionStateMachine.java:110)
	at org.neo4j.bolt.v3.runtime.ReadyState.processRunMessage(ReadyState.java:85)
	at org.neo4j.bolt.v3.runtime.ReadyState.processUnsafe(ReadyState.java:56)
	at org.neo4j.bolt.v3.runtime.FailSafeBoltStateMachineState.process(FailSafeBoltStateMachineState.java:48)
	at org.neo4j.bolt.v1.runtime.BoltStateMachineV1.nextState(BoltStateMachineV1.java:144)
	at org.neo4j.bolt.v1.runtime.BoltStateMachineV1.process(BoltStateMachineV1.java:92)
	at org.neo4j.bolt.messaging.BoltRequestMessageReader.lambda$doRead$1(BoltRequestMessageReader.java:89)
	at org.neo4j.bolt.runtime.DefaultBoltConnection.processNextBatch(DefaultBoltConnection.java:191)
	at org.neo4j.bolt.runtime.DefaultBoltConnection.processNextBatch(DefaultBoltConnection.java:139)
	at org.neo4j.bolt.runtime.ExecutorBoltScheduler.executeBatch(ExecutorBoltScheduler.java:171)
	at org.neo4j.bolt.runtime.ExecutorBoltScheduler.lambda$scheduleBatchOrHandleError$2(ExecutorBoltScheduler.java:154)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.neo4j.internal.kernel.api.exceptions.TransactionFailureException: The database has
	at org.neo4j.cypher.internal.compatibility.ClosingExecutionResult.org$neo4j$cypher$internal$compatibility$ClosingExecutionResult$$safely(ClosingExecutionResult.scala:167)
	at org.neo4j.cypher.internal.compatibility.ClosingExecutionResult.initiate(ClosingExecutionResult.scala:65)
	at org.neo4j.cypher.internal.compatibility.ClosingExecutionResult$.wrapAndInitiate(ClosingExecutionResult.scala:214)
	at org.neo4j.cypher.internal.compatibility.CypherCurrentCompiler$CypherExecutableQuery$$anonfun$execute$3.apply(CypherCurrentCompiler.scala:217)
	at org.neo4j.cypher.internal.compatibility.CypherCurrentCompiler$CypherExecutableQuery$$anonfun$execute$3.apply(CypherCurrentCompiler.scala:190)
	at org.neo4j.cypher.exceptionHandler$runSafely$.apply(exceptionHandler.scala:89)
	at org.neo4j.cypher.internal.compatibility.CypherCurrentCompiler$CypherExecutableQuery.execute(CypherCurrentCompiler.scala:224)
	at org.neo4j.cypher.internal.ExecutionEngine.execute(ExecutionEngine.scala:101)
	at org.neo4j.cypher.internal.javacompat.ExecutionEngine.executeQuery(ExecutionEngine.java:85)
	at org.neo4j.bolt.v1.runtime.TransactionStateMachineV1SPI$BoltResultHandleV1.start(TransactionStateMachineV1SPI.java:189)
	at org.neo4j.bolt.v1.runtime.TransactionStateMachine$State.startExecution(TransactionStateMachine.java:506)
	at org.neo4j.bolt.v1.runtime.TransactionStateMachine$State$1.execute(TransactionStateMachine.java:300)
	at org.neo4j.bolt.v1.runtime.TransactionStateMachine$State$1.run(TransactionStateMachine.java:269)
	at org.neo4j.bolt.v1.runtime.TransactionStateMachine.run(TransactionStateMachine.java:110)
	at org.neo4j.bolt.v1.runtime.TransactionStateMachine.run(TransactionStateMachine.java:98)
	at org.neo4j.bolt.v1.runtime.ReadyState.processRunMessage(ReadyState.java:140)
	at org.neo4j.bolt.v1.runtime.ReadyState.processRunMessage(ReadyState.java:100)
	at org.neo4j.bolt.v1.runtime.ReadyState.process(ReadyState.java:61)
	... 11 more
2019-08-23 23:44:46.963+0000 INFO [o.n.k.a.DatabaseAvailabilityGuard] Database graph.db is unavailable.
2019-08-23 23:44:48.380+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Shutdown started
2019-08-23 23:44:50.926+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=1954, gcTime=1569, gcCount=1}
2019-08-23 23:44:51.602+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Shutting down 'graph.db' database.
2019-08-23 23:44:51.603+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Shutdown started
2019-08-23 23:44:52.752+0000 INFO [o.n.i.d.DiagnosticsManager] --- STOPPING diagnostics START ---
2019-08-23 23:44:52.752+0000 INFO [o.n.i.d.DiagnosticsManager] --- STOPPING diagnostics END ---
2019-08-23 23:45:14.563+0000 INFO [o.n.g.f.m.e.CommunityEditionModule] No locking implementation specified, defaulting to 'community'
2019-08-23 23:45:14.774+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Creating 'graph.db' database.
2019-08-23 23:45:14.807+0000 INFO [o.n.k.a.DatabaseAvailabilityGuard] Requirement `Database available` makes database graph.db unavailable.
2019-08-23 23:45:14.808+0000 INFO [o.n.k.a.DatabaseAvailabilityGuard] Database graph.db is unavailable.
2019-08-23 23:45:14.825+0000 INFO [o.n.i.d.DiagnosticsManager] --- INITIALIZED diagnostics START ---
2019-08-23 23:45:14.825+0000 INFO [o.n.i.d.DiagnosticsManager] Neo4j Kernel properties:
2019-08-23 23:45:14.825+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.connector.http.enabled=true
2019-08-23 23:45:14.825+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.security.auth_enabled=true
2019-08-23 23:45:14.826+0000 INFO [o.n.i.d.DiagnosticsManager] unsupported.dbms.directories.neo4j_home=/var/lib/neo4j
2019-08-23 23:45:14.826+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.directories.import=/var/lib/neo4j/import
2019-08-23 23:45:14.826+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.memory.heap.initial_size=32G
2019-08-23 23:45:14.826+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.directories.run=/var/run/neo4j
2019-08-23 23:45:14.826+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.directories.lib=/usr/share/neo4j/lib
2019-08-23 23:45:14.826+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.connector.https.enabled=true
2019-08-23 23:45:14.826+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.jvm.additional=-XX:+UseG1GC
2019-08-23 23:45:14.826+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.tx_log.rotation.retention_policy=1 days
2019-08-23 23:45:14.826+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.security.procedures.unrestricted=apoc.,algo.*
2019-08-23 23:45:14.827+0000 INFO [o.n.i.d.DiagnosticsManager] *****************************************************************=
2019-08-23 23:45:14.827+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.windows_service_name=neo4j
2019-08-23 23:45:14.827+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.connector.bolt.listen_address=:7687
2019-08-23 23:45:14.827+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.security.allow_csv_import_from_file_urls=true
2019-08-23 23:45:14.827+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.memory.pagecache.size=128G
2019-08-23 23:45:14.827+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.directories.data=/var/lib/neo4j/data
2019-08-23 23:45:14.827+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.logs.gc.enabled=true
2019-08-23 23:45:14.827+0000 INFO [o.n.i.d.DiagnosticsManager] unsupported.dbms.edition=community
2019-08-23 23:45:14.827+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.directories.logs=/var/log/neo4j
2019-08-23 23:45:14.827+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.connector.bolt.enabled=true
2019-08-23 23:45:14.828+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.directories.certificates=/var/lib/neo4j/certificates
2019-08-23 23:45:14.828+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.memory.heap.max.size=64G
2019-08-23 23:45:14.828+0000 INFO [o.n.i.d.DiagnosticsManager] dbms.directories.plugins=/var/lib/neo4j/plugins
2019-08-23 23:45:14.828+0000 INFO [o.n.i.d.DiagnosticsManager] Diagnostics providers:
2019-08-23 23:45:14.828+0000 INFO [o.n.i.d.DiagnosticsManager] org.neo4j.kernel.configuration.Config
2019-08-23 23:45:14.828+0000 INFO [o.n.i.d.DiagnosticsManager] org.neo4j.internal.diagnostics.DiagnosticsManager
2019-08-23 23:45:14.828+0000 INFO [o.n.i.d.DiagnosticsManager] SYSTEM_MEMORY
2019-08-23 23:45:14.828+0000 INFO [o.n.i.d.DiagnosticsManager] JAVA_MEMORY
2019-08-23 23:45:14.828+0000 INFO [o.n.i.d.DiagnosticsManager] OPERATING_SYSTEM
2019-08-23 23:45:14.828+0000 INFO [o.n.i.d.DiagnosticsManager] JAVA_VIRTUAL_MACHINE
2019-08-23 23:45:14.829+0000 INFO [o.n.i.d.DiagnosticsManager] CLASSPATH
2019-08-23 23:45:14.829+0000 INFO [o.n.i.d.DiagnosticsManager] LIBRARY_PATH
2019-08-23 23:45:14.829+0000 INFO [o.n.i.d.DiagnosticsManager] SYSTEM_PROPERTIES
2019-08-23 23:45:14.829+0000 INFO [o.n.i.d.DiagnosticsManager] TIMEZONE_DATABASE
2019-08-23 23:45:14.829+0000 INFO [o.n.i.d.DiagnosticsManager] LINUX_SCHEDULERS
2019-08-23 23:45:14.829+0000 INFO [o.n.i.d.DiagnosticsManager] NETWORK
2019-08-23 23:45:14.829+0000 INFO [o.n.i.d.DiagnosticsManager] System memory information:
2019-08-23 23:45:14.838+0000 INFO [o.n.i.d.DiagnosticsManager] Total Physical memory: 440.62 GB
2019-08-23 23:45:14.838+0000 INFO [o.n.i.d.DiagnosticsManager] Free Physical memory: 234.08 GB
2019-08-23 23:45:14.838+0000 INFO [o.n.i.d.DiagnosticsManager] Committed virtual memory: 75.27 GB
2019-08-23 23:45:14.838+0000 INFO [o.n.i.d.DiagnosticsManager] Total swap space: 4.00 GB
2019-08-23 23:45:14.839+0000 INFO [o.n.i.d.DiagnosticsManager] Free swap space: 2.51 GB
2019-08-23 23:45:14.839+0000 INFO [o.n.i.d.DiagnosticsManager] JVM memory information:
2019-08-23 23:45:14.839+0000 INFO [o.n.i.d.DiagnosticsManager] Free  memory: 31.95 GB
2019-08-23 23:45:14.839+0000 INFO [o.n.i.d.DiagnosticsManager] Total memory: 32.00 GB
2019-08-23 23:45:14.839+0000 INFO [o.n.i.d.DiagnosticsManager] Max   memory: 64.00 GB
2019-08-23 23:45:14.840+0000 INFO [o.n.i.d.DiagnosticsManager] Garbage Collector: G1 Young Generation: [G1 Eden Space, G1 Survivor Space, G1 Old Gen]
2019-08-23 23:45:14.840+0000 INFO [o.n.i.d.DiagnosticsManager] Garbage Collector: G1 Old Generation: [G1 Eden Space, G1 Survivor Space, G1 Old Gen]
2019-08-23 23:45:14.841+0000 INFO [o.n.i.d.DiagnosticsManager] Memory Pool: Code Cache (Non-heap memory): committed=8.81 MB, used=8.78 MB, max=240.00 MB, threshold=0.00 B
2019-08-23 23:45:14.841+0000 INFO [o.n.i.d.DiagnosticsManager] Memory Pool: Metaspace (Non-heap memory): committed=22.00 MB, used=20.65 MB, max=-1.00 B, threshold=0.00 B
2019-08-23 23:45:14.841+0000 INFO [o.n.i.d.DiagnosticsManager] Memory Pool: G1 Eden Space (Heap memory): committed=1.66 GB, used=16.00 MB, max=-1.00 B, threshold=?
2019-08-23 23:45:14.841+0000 INFO [o.n.i.d.DiagnosticsManager] Memory Pool: G1 Survivor Space (Heap memory): committed=32.00 MB, used=32.00 MB, max=-1.00 B, threshold=?
2019-08-23 23:45:14.842+0000 INFO [o.n.i.d.DiagnosticsManager] Memory Pool: G1 Old Gen (Heap memory): committed=30.31 GB, used=0.00 B, max=64.00 GB, threshold=0.00 B
2019-08-23 23:45:14.842+0000 INFO [o.n.i.d.DiagnosticsManager] Operating system information:
2019-08-23 23:45:14.842+0000 INFO [o.n.i.d.DiagnosticsManager] Operating System: Linux; version: 3.10.0-514.el7.x86_64; arch: amd64; cpus: 56
2019-08-23 23:45:14.842+0000 INFO [o.n.i.d.DiagnosticsManager] Max number of file descriptors: 40000
2019-08-23 23:45:14.843+0000 INFO [o.n.i.d.DiagnosticsManager] Number of open file descriptors: 133
2019-08-23 23:45:16.800+0000 WARN [o.n.k.r.LogTailScanner] Fail to read transaction log version 17. java.lang.IllegalStateException: Weird, inUse was read in as false but the record is Property[3130370,used=true,prev=-1,next=5545057041199923203,PropertyBlock[blocks=1,SHORT_STRING,key=1,value=231]]. At position LogPosition{logVersion=17, byteOffset=271044598} and entry version V3_0_10
java.io.IOException: java.lang.IllegalStateException: Weird, inUse was read in as false but the record is Property[3130370,used=true,prev=-1,next=5545057041199923203,PropertyBlock[blocks=1,SHORT_STRING,key=1,value=231]]. At position LogPosition{logVersion=17, byteOffset=271044598} and entry version V3_0_10
2019-08-23 23:45:16.802+0000 INFO [o.n.i.d.DiagnosticsManager] --- STOPPING diagnostics END ---
2019-08-23 23:45:16.802+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Shutdown started

Now I am not able to connect to cypher-shell either.

Change your code to do multiple passes; the two MERGE operations on the same label and property cause Cypher to eagerly pass all rows through the first statement and then through the second, to avoid endless loops.

You should see that in your query plan.
Just do 3 passes over the data. You can even use WITH DISTINCT line[0] AS name:

session.run("USING PERIODIC COMMIT "
             "LOAD CSV FROM 'file:///'+{prot}+'_conflict_resolved.txt' AS line "
             "WITH distinct line[0] as name "
             "MERGE (:MyNode {Name:name}) ", prot=prot)
session.run("USING PERIODIC COMMIT "
             "LOAD CSV FROM 'file:///'+{prot}+'_conflict_resolved.txt' AS line "
             "WITH distinct line[1] as name "
             "MERGE (:MyNode {Name:name}) ", prot=prot)
session.run("USING PERIODIC COMMIT "
             "LOAD CSV FROM 'file:///'+{prot}+'_conflict_resolved.txt' AS line "
             "MATCH (n:MyNode {Name:line[0]}) "
             "MATCH (m:MyNode {Name:line[1]}) "
             "MERGE (n) -[:TO {dist:line[2]}] -> (m) ", prot=prot)

Question: do you really want potentially multiple TO relationships between m and n, with different dist values, or just one?
If there are no duplicates, you will be faster just using CREATE there,
or MERGE on :TO but without the property, as sketched below.
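
A sketch of that last variant for the relationship pass (the file name is a placeholder; ON CREATE SET stamps dist only when the relationship is first created):

USING PERIODIC COMMIT
LOAD CSV FROM 'file:///example.csv' AS line
MATCH (n:MyNode {Name:line[0]})
MATCH (m:MyNode {Name:line[1]})
MERGE (n)-[r:TO]->(m)
ON CREATE SET r.dist = line[2]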

Are those directed or undirected rels? In the latter case you can leave off the arrow tip in the relationship MERGE.

Any updates here? I am seeing this stop-the-world pause message after several days of usage. Not doing an import/load of data.

Stop-the-world pause messages are expected and simply represent the JVM scanning for and removing unused data structures. These messages cannot be completely removed; they are simply part of normal JVM operations.
Are your 'pauses' long, i.e. 10+ seconds?

Hi @dana.canzano, yes, I understand they are normal and maybe I'm chasing a red herring, but they seem to cluster around the same time the Neo4j database became unresponsive. The server is up, the Docker image with Neo4j seems fine, and there are no obvious logs as to why it becomes unresponsive. After about 45 minutes of poking around I simply restarted Neo4j and it is back to working normally.

To directly answer your question: yes, there are some taking longer than 10 seconds.

How much total RAM is on the instance? How much is allocated to:

dbms.memory.heap.initial_size
dbms.memory.heap.max_size

Are you performing large updates/deletes/creates in a single txn, i.e. updating 10 million nodes in a single txn, or are you batching via apoc.periodic.iterate?

Thanks @dana.canzano. We are on AWS with an m4.4xlarge EC2 instance type with 64 GB of RAM.

Docker Compose details:

# /usr/local/bin/docker-compose -f /etc/neo4j/neo4j.yaml config
services:
  neo4j-server:
    container_name: neo4j
    environment:
      NEO4J_ACCEPT_LICENSE_AGREEMENT: "yes"
      NEO4J_AUTH: ....
      NEO4J_dbms_backup_address: 0.0.0.0:6362
      NEO4J_dbms_backup_enabled: "true"
      NEO4J_dbms_directories_certificates: /data/certificates
      NEO4J_dbms_memory_heap_initial__size: 11264m
      NEO4J_dbms_memory_heap_max__size: 24576m
      NEO4J_dbms_memory_pagecache_size: 8192m
      NEO4J_dbms_mode: SINGLE
    image: neo4j:3.4-enterprise
    ports:
    - published: 7474
      target: 7474
    - published: 7687
      target: 7687
    - published: 6362
      target: 6362
    restart: always
    ulimits:
      nofile:
        hard: 40000
        soft: 40000
    volumes:
    - /var/lib/neo4j:/data:rw
    - /var/log/neo4j:/logs:rw

We do not have APOC installed. The database is used transactionally for all sorts of things. Updates and creates tend to be narrowly focused (in my opinion). We do almost no deletes. There are lots of reads and writes happening, though.

A couple of issues:

  1. If I'm not mistaken you are using 3.4? Any reason to use such an old version and not a more current one?

  2. Memory configuration - the Operations Manual (admittedly quoted from the 4.2 documentation, but it applies here too) says:

The heap memory size is determined by the parameters dbms.memory.heap.initial_size and dbms.memory.heap.max_size. It is recommended to set these two parameters to the same value to avoid unwanted full garbage collection pauses.

and so one should set dbms.memory.heap.initial_size and dbms.memory.heap.max_size to the same value
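
In this Docker Compose setup that would look something like the following (a sketch that simply raises the initial size to match the current 24G max; the right absolute value depends on your workload):

NEO4J_dbms_memory_heap_initial__size: 24576m
NEO4J_dbms_memory_heap_max__size: 24576m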

We are still on 3.4 because we haven't been able to focus on upgrading yet; however, I'm hoping to do that soonish. I would like to go to 3.5 first perhaps, but I'm not sure which upgrade paths are best supported from our current version.

I did read that note recommending the initial and max heaps be identical. I will make this change.

Any other suggestions that I should take into consideration?

Also, dbms.memory.pagecache.size=8G, yet max heap is almost 3 times that? How did you determine that max heap should be 24G? How large is the database? i.e. du -hc *store.db* as run from the graph.db directory, looking only at the final line, which should report the total size.

I could not find a good explanation of what the page cache size should be, to be honest. We recently scaled our Neo4j hardware and have been tinkering with some of these settings to try to get better performance.

# du -hc *graph.db* | tail
28K	graph.db/schema/index/lucene/57
960K	graph.db/schema/index/lucene/35/1
964K	graph.db/schema/index/lucene/35
1.9G	graph.db/schema/index/lucene
2.3G	graph.db/schema/index
2.3G	graph.db/schema
4.0K	graph.db/index
348K	graph.db/profiles
61G	graph.db
61G	total

https://neo4j.com/docs/operations-manual/3.4/performance/memory-configuration/#memory-configuration

Page cache
The page cache is used to cache the Neo4j data and native indexes. The caching of graph data and indexes into memory will help avoid costly disk access and result in optimal performance.

The parameter for specifying how much memory Neo4j is allowed to use for the page cache is: dbms.memory.pagecache.size.

Given that the last line of du -hc reports 61G, in a perfect world you would want dbms.memory.pagecache.size=61G, which would allow the entire graph to sit in RAM. However, with your current setting of 8G, less than 1/7th of your graph is sitting in RAM, so almost all queries are presumably getting data from the filesystem.
And given you only have 64G total RAM on the instance, maybe set the pagecache to 48G and init/max heap to 12G (provided nothing else is running on the EC2 instance). This would leave 4G free for other processes and general OS health.
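
As neo4j.conf settings, that suggestion would look roughly like this (translate to the corresponding NEO4J_dbms_... environment variables for the Docker Compose setup):

dbms.memory.pagecache.size=48g
dbms.memory.heap.initial_size=12g
dbms.memory.heap.max_size=12g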

So the page cache can be larger than the initial/max heap, and the total of the two should be less than the total available RAM on the system? Is that an accurate statement?

There is nothing else running on the servers except for some small logging processes.

I can try these settings out. I've been looking for a way to better visualize/monitor Neo4J. Is there a recommended tool?

dbms.memory.pagecache.size represents the amount of RAM allocated to holding the graph in memory.

Heap, however, is more a function of query concurrency/complexity. 100 concurrent runs of match (n) return n limit 1; will consume a lot less heap than, for example, match (n) where n.status='active' with collect(n.id) as idsToMatch match (n2:People) where n2.id in idsToMatch return n2 order by n2.name, n2.age limit 100;

Yes: dbms.memory.pagecache.size + dbms.memory.heap.max_size < total RAM on the box.

This is described in the documentation links previously provided.

Thanks for the clarification. I was handed this configuration originally, where the page cache was a fraction of the heap. It was a false assumption on my part that the page cache was a subset of the heap size. I really appreciate this help. Testing these updates.

Hello everyone, is there any update on this problem? I recently saw the same error when loading millions of rows for edges. It didn't resolve even when I added an edge index, broke down the table, and set the heap size and pagecache to 16GB (I only have 32G of RAM). Can someone please help with this?

2022-02-18 16:24:58.497+0000 INFO  [o.n.k.a.p.GlobalProcedures] starting batching from `call apoc.load.jdbc($azure_url, "select  distinct m.person_id, m.measurement_id, m.measurement_concept_id,m.measurement_date, m.unit_concept_id, m.unit_concept_name, m.value_as_number from omop.measurement m where m.measurement_id between 100000000 and 200000000")
yield row
` operation using iteration `
with row
where row.unit_concept_id is not null
and row.value_as_number is not null
match (p:Person {person_id:row.person_id})
match (m:Measurement {measurement_concept_id: row.measurement_concept_id}) 
merge (p)-[r:HAS_MEASUREMENT {measurement_id:row.measurement_id, 
measurement_date:row.measurement_date, 
unit_concept_id:row.unit_concept_id, 
unit_concept_name:row.unit_concept_name, 
value_as_number:row.value_as_number}]->(m)
` in separate thread
2022-02-18 16:26:58.006+0000 WARN  [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=765, gcTime=803, gcCount=1}
2022-02-18 16:28:13.398+0000 WARN  [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=255, gcTime=307, gcCount=1}

@mengjiakang2017

  1. You cannot 100% eliminate messages like
2022-02-18 16:26:58.006+0000 WARN  [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=765, gcTime=803, gcCount=1}

They are going to happen; it's Java being Java.

  2. But this line reports pauseTime=765, gcTime=803, and whichever you use, the pause lasted less than 1 second. You said you saw the same error when loading millions of rows for edges; for millions of rows, is a 1-second pause really the problem?

  3. Regarding the heap_size and pagecache at 16GB (with only 32G of RAM):
    it's not as simple as taking the total RAM and allocating 50% to heap and 50% to pagecache.

In a world with unlimited RAM, one would configure the pagecache to be slightly larger than the size of the database. Heap, however, is more about query complexity and query concurrency.
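
If you want a starting point, neo4j-admin ships a memory recommender; a sketch of running it (flags and output vary by version and machine):

bin/neo4j-admin memrec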

Thank you @dana.canzano for helping out!

To answer 1, I'm pasting some more logs from debug.log. What I see, basically: the VM stop-the-world warning repeats, then a connection timeout, then it stops running the current query and moves on to my next Cypher query in cypher-shell.

2022-02-17 01:45:48.077+0000 INFO [o.n.k.a.p.GlobalProcedures] starting batching from `call apoc.load.jdbc($azure_url,
"select distinct m.measurement_id, m.visit_occurrence_id, m.measurement_concept_id, v.visit_concept_id, v.visit_start_date, v.visit_end_date
from omop.measurement m
join omop.visit_occurrence v
on v.visit_occurrence_id = m.visit_occurrence_id")
yield row
` operation using iteration `
match (m:Measurement {measurement_concept_id: row.measurement_concept_id})
match (v:VisitOccurrence {visit_concept_id:row.visit_concept_id})
merge (m)-[r:ASSOCIATED_DURING_VISIT {visit_occurrence_id:row.visit_occurrence_id,
visit_start_date:row.visit_start_date,
visit_end_date:row.visit_end_date,
measurement_id:row.measurement_id}]->(v)
` in separate thread
2022-02-17 01:46:44.330+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=276, gcTime=307, gcCount=1}
2022-02-17 01:47:21.996+0000 ERROR [o.n.b.t.p.HouseKeeper] Fatal error occurred when handling a client connection: [id: 0xafcb7c11, L:/172.17.0.2:7687 - R:/10.120.14.85:60148] readAddress(..) failed: Connection timed out
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out
2022-02-17 01:47:21.997+0000 ERROR [o.n.b.t.p.HouseKeeper] Fatal error occurred when handling a client connection: [id: 0x6f3b153e, L:/172.17.0.2:7687 - R:/10.120.14.85:60149] readAddress(..) failed: Connection timed out
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out
2022-02-17 01:48:46.708+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=578, gcTime=0, gcCount=0}

For 2, when I ran neo4j-admin memrec again after loading most of the nodes and edges, I got a super high pagecache recommendation (more than 450GB). Is this expected?

# Based on the above, the following memory settings are recommended:
dbms.memory.heap.initial_size=31g
dbms.memory.heap.max_size=31g
dbms.memory.pagecache.size=459300m

@mengjiakang2017

regarding

for 2, when I ran again neo4j-admin memrec after loading most of the nodes and edges I got super high pagecache recommendation - more than 450GB, is this expected?

a. How much total memory is on the instance? What does the Linux command free -g return?

b. How large is the graph on disk? At Neo4j startup your logs\debug.log will report, for each database, something similar to:

                          [system/035qstu4] Total size of mapped files:  218kb
                          [neo4j/74aue41] Total size of mapped files:  104GB
                          [test/ak73sce4] Total size of mapped files:  11GB              

indicating 3 databases, namely system, neo4j, and test and respectively on disk consuming 218kb, 104GB and 11GB.

In a perfect world, if you have sufficient RAM, dbms.memory.pagecache.size should be set to the sum of these values. If you can define it as such, the entire graph will reside in RAM. Not a requirement, but a best practice.

Thank you @dana.canzano .

For a, when I ran free -g, I got:

 total        used        free      shared  buff/cache   available
Mem:            503         216         138           0         148         285
Swap:             7           0           7

For b, I'm still running a query (it has been hours), but the graph database size is about 2G based on previous experience.

Also want to add that we are running Neo4j in Docker and I'm very unfamiliar with it - we've updated neo4j.conf for the heap and pagecache but wanted to double-check whether this is indeed applied to the container. Thanks!