Neo4j

m_hess · ‎10-19-2020

Hi,

After a lot of research and trial and error, I still can't figure out what I'm doing wrong, so I ended up here writing my first post.

Here's what I'm trying to achieve. I have a graph with ~200M nodes and ~260M relationships. I want to introduce an inferred property like this:

CALL apoc.periodic.iterate(
  'MATCH (n)-[:HAS_LOCATION]->(t) WHERE n.coordinates IS NULL RETURN n,t',
  "SET n.coordinates=t.coordinates",
  {batchSize:10000, parallel:true})

This query crashes Neo4j 4.1.1 Community Edition (with apoc-4.1.0.0-all.jar) every time after a few minutes. In the debug log I can see entries like

2020-10-18 21:35:40.856+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=123, gcTime=174, gcCount=1}

at the beginning, ramping up to

2020-10-18 21:42:40.558+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=15462, gcTime=15561, gcCount=4}

right before the crash.
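
To follow how the pauses grow over a run, entries like the two above can be pulled out of the debug log with a few lines of Python. This is only a minimal sketch: the regular expression is written against the two lines quoted in this post, so other Neo4j versions may need a different pattern, and the helper name `parse_pauses` is made up for illustration.

```python
import re

# Pattern written against the two debug.log lines quoted above (an assumption:
# other Neo4j versions may format this message differently).
PAUSE_RE = re.compile(
    r"^(?P<ts>\S+ \S+) WARN .*Detected VM stop-the-world pause: "
    r"\{pauseTime=(?P<pause>\d+), gcTime=(?P<gc>\d+), gcCount=(?P<count>\d+)\}"
)


def parse_pauses(lines):
    """Return (timestamp, pauseTime, gcTime, gcCount) for each matching line."""
    result = []
    for line in lines:
        m = PAUSE_RE.match(line.strip())
        if m:
            result.append((m["ts"], int(m["pause"]), int(m["gc"]), int(m["count"])))
    return result


# The two log lines from this post, used as sample input.
sample = [
    "2020-10-18 21:35:40.856+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] "
    "Detected VM stop-the-world pause: {pauseTime=123, gcTime=174, gcCount=1}",
    "2020-10-18 21:42:40.558+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] "
    "Detected VM stop-the-world pause: {pauseTime=15462, gcTime=15561, gcCount=4}",
]

pauses = parse_pauses(sample)
for ts, pause, gc, count in pauses:
    print(f"{ts}  pauseTime={pause}  gcTime={gc}  gcCount={count}")
```

Run against the whole log file instead of `sample`, this gives a quick picture of when the pauses start to climb relative to the start of the query.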

The machine has 64GB of RAM. We configured it like this:

dbms.memory.heap.initial_size=24100m
dbms.memory.heap.max_size=24100m
dbms.memory.pagecache.size=30100m
dbms.memory.transaction.global_max_size=10000m
dbms.memory.transaction.max_size=5000m
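
As a quick sanity check of these settings, the arithmetic can be written out. This is a rough sketch under two assumptions that are not stated in the config itself: that the `m` suffix means MiB and that "64GB" means 64 GiB.

```python
# Values copied from the config lines above; the "m" suffix is assumed to mean MiB.
heap_max_mib = 24100
pagecache_mib = 30100

# Assumption: "64GB of RAM" means 64 GiB = 65536 MiB.
total_ram_mib = 64 * 1024

reserved_mib = heap_max_mib + pagecache_mib
left_for_os_mib = total_ram_mib - reserved_mib

print(f"heap + page cache: {reserved_mib} MiB")
print(f"left for the OS and everything else: {left_for_os_mib} MiB")
```

Under these assumptions, heap and page cache together take 54200 MiB, which leaves about 11 GiB for the operating system and any other memory the JVM needs.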

I tried lowering the batch size (to 500), but it did not help at all. Is there any chance that this is data-related? Any idea what is actually consuming so much memory? Will a node with two outgoing 'HAS_LOCATION' relationships cause issues?

I also tried to count the nodes that the SET operation would apply to:

MATCH (n)-[:HAS_LOCATION]->() WHERE n.coordinates IS NULL RETURN count(*)

This also crashes Neo4j. I don't expect it to return quickly, as it's scanning all the nodes, but I don't see why it should consume so much memory.
When I rephrase this query as

MATCH (n) WHERE (n)-[:HAS_LOCATION]->() AND n.coordinates IS NULL RETURN count(n)

it completed once and returned 3724 (after 15 minutes). A second run crashed, too.

Any hints would be appreciated.
-- matt

terryfranklin82 · ‎10-19-2020

I haven't used a graph with that many nodes and relationships before, but as a starting point, have you tried including some labels in your MATCH statement so that the query doesn't check all 200 million nodes?

nghia71 · ‎10-20-2020

Hi,

I think using apoc.periodic.iterate is the right approach.
1/ Perhaps you need to specify what kind of nodes you want. Otherwise the MATCH picks up too many nodes.
2/ Many MyNNode nodes are probably linked to the same MyTNode, so they cannot be updated in parallel.

How about trying this one first:

CALL apoc.periodic.iterate(
'MATCH (n:MyNNode)-[:HAS_LOCATION]->(t:MyTNode) WHERE NOT EXISTS(n.coordinates) RETURN n, t',
"SET n.coordinates=t.coordinates",
{batchSize:100, parallel:false})

m_hess · ‎10-21-2020

I managed to update all nodes by using a batchSize of 10 and by using labels at both ends of the path. I had to pass all possible combinations of labels manually, but it worked.
In the end, I did not test these things systematically, and I did not rerun each Cypher statement several times to check whether the behavior is consistent. So it's hard to tell what actually causes these issues.
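
Writing out every label combination by hand is tedious, so the statements could also be generated. The following is only a sketch of that idea: the label names are hypothetical placeholders (the real labels are not given in this thread), the helper `build_iterate_call` is made up for illustration, and `parallel: false` is an assumption taken from the suggestion earlier in the thread.

```python
from itertools import product

# Hypothetical label names; the thread does not say which labels the graph uses.
SOURCE_LABELS = ["Person", "Company"]
TARGET_LABELS = ["City", "Country"]


def build_iterate_call(source_label, target_label, batch_size=10):
    """Build one apoc.periodic.iterate call for a single label pair."""
    outer = (
        f"MATCH (n:{source_label})-[:HAS_LOCATION]->(t:{target_label}) "
        "WHERE n.coordinates IS NULL RETURN n, t"
    )
    inner = "SET n.coordinates = t.coordinates"
    return (
        "CALL apoc.periodic.iterate(\n"
        f"  '{outer}',\n"
        f"  '{inner}',\n"
        f"  {{batchSize: {batch_size}, parallel: false}})"
    )


# One statement per (source label, target label) combination.
statements = [
    build_iterate_call(source, target)
    for source, target in product(SOURCE_LABELS, TARGET_LABELS)
]

for statement in statements:
    print(statement, end="\n\n")
```

Labels cannot be passed as query parameters in Cypher, which is why the sketch builds the query text directly. Only label names you control should be interpolated this way.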

Thanks for your input!

nghia71 · ‎10-29-2020

Hi @m.hess,

Michael Hunger wrote a wonderful article. I think it can help you https://medium.com/neo4j/5-tips-tricks-for-fast-batched-updates-of-graph-structures-with-neo4j-and-c....

From my own perspective, the problem occurs when lots of nodes and relationships have to be created or updated simultaneously. The best approach, for me, is to break the graph that needs to be persisted into connected components and then use apoc.periodic.iterate to update those disjoint components in parallel. With no nodes or relationships shared between them, the update/create operations should not conflict.

The question is how to break the graph into connected components. If you have it in Neo4j already, then some of the GDS algorithms (https://neo4j.com/docs/graph-data-science/current/algorithms/) can help. If you have only raw data, I suggest NetworkX (https://networkx.org), which can help to identify these components.
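
To make the idea concrete for raw data, here is a minimal sketch that groups nodes into connected components with a plain union-find, using only the standard library. The edge list is made-up toy data, and the function is an illustration rather than what GDS or NetworkX do internally; with NetworkX installed, `networkx.connected_components` on an undirected graph should give the same grouping.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes into connected components (edges are treated as undirected)."""
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    # Merge the two components joined by each edge.
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the nodes under each root.
    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return list(groups.values())


# Made-up toy data: (source id, target id) pairs standing in for relationships.
edges = [("n1", "t1"), ("n2", "t1"), ("n3", "t2"), ("n4", "t3"), ("n5", "t3")]

components = connected_components(edges)
for component in components:
    print(sorted(component))
```

Each resulting group shares no nodes with any other group, so the groups could in principle be handed to separate parallel batches.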

Hope that helps.

Nghia Doan

Updating many nodes in large graph consumes all memory and crashes