01-04-2021 12:14 PM
I made a mistake and ingested 160 million relationships the wrong way (on 32 million nodes). The nodes are PubMed article_ids and the relationships are citations. I have (a:Article)-[:CITES]->(b:Article) where it should be (a:Article)<-[:CITES]-(b:Article).
I have tried the following:
MATCH (a:Article)-[rel:CITES]->(b:Article)
CALL apoc.refactor.invert(rel)
YIELD input, output RETURN COUNT(rel);
but keep getting (after about half an hour or more) "Server at localhost(127.0.0.1):7687 is no longer available".
I'm not sure how to deal with this error — is my large query crashing the server? I previously increased dbms.memory.heap.max_size to deal with an out-of-memory error.
My dedicated machine has 16 GB of RAM, and the nodes contain only article_ids (from 1 to 32 million).
If the APOC call won't run, is there another way of doing this? For instance, I could create all the reverse relationships manually and then delete all the old CITES relationships.
This works on my small test graph; is it a good idea to run it on such a large graph?
MATCH (m:Article)-[c:CITES]->(n:Article)
DELETE c
CREATE (m)<-[:CITES]-(n);
Are we sure that the operation will be all-or-none (i.e. atomic)? The last thing I would want is for some unknown number of relationships to be reversed.
I guess I could do:
MATCH (m:Article)-[c:CITES]->(n:Article)
DELETE c
CREATE (m)<-[:REFERENCES]-(n);
MATCH (m:Article)-[r:REFERENCES]->(n:Article)
DELETE r
CREATE (m)-[:CITES]->(n);
Edit: I am running the above on my large dataset; the first statement has been running for over two hours. (CPU usage screenshot not shown.)
Edit 2: Three and a half hours in, it crashed with "Server at localhost(127.0.0.1):7687 is no longer available".
My next solution was to divide the problem into batches. In Python:
import neo4j
from tqdm import tqdm

batch_size = 5000
max_id = 33307598

driver = neo4j.GraphDatabase.driver(neo4j_uri, auth=neo4j_auth)
session = driver.session()
for batch in tqdm(range(max_id // batch_size)):
    query = ("MATCH (m:Article)-[c:CITES]->(n:Article) "
             "WHERE m.ArticleId >= " + str(batch * batch_size) +
             " AND m.ArticleId < " + str((batch + 1) * batch_size) + " "
             "DELETE c "
             "CREATE (m)<-[:REFERENCES]-(n);")
    print(query)
    result = session.run(query)
session.close()
driver.close()
Unfortunately, the first iteration took nearly a minute, which extrapolates to roughly 100 hours for the full run. It would be faster just to re-ingest.
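For reference, the loop above can be sketched with query parameters instead of string concatenation, so the server can reuse one cached query plan across all batches. This is an untested sketch, not a drop-in fix: the `$low`/`$high` parameters and the `batch_bounds` helper (which, unlike `range(max_id // batch_size)`, also covers the final partial batch) are my additions; `neo4j_uri` and `neo4j_auth` are assumed to be defined as in the original post.

```python
# Hypothetical sketch: one fixed query string, batch bounds passed as parameters.
BATCH_QUERY = (
    "MATCH (m:Article)-[c:CITES]->(n:Article) "
    "WHERE m.ArticleId >= $low AND m.ArticleId < $high "
    "DELETE c "
    "CREATE (m)<-[:REFERENCES]-(n)"
)

def batch_bounds(max_id, batch_size):
    """Yield half-open (low, high) ranges covering ids 0..max_id inclusive."""
    for low in range(0, max_id, batch_size):
        yield low, low + batch_size

def main(neo4j_uri, neo4j_auth, max_id=33307598, batch_size=5000):
    import neo4j  # third-party driver, as in the original post
    driver = neo4j.GraphDatabase.driver(neo4j_uri, auth=neo4j_auth)
    with driver.session() as session:
        for low, high in batch_bounds(max_id, batch_size):
            session.run(BATCH_QUERY, low=low, high=high)
    driver.close()
```

Note that without an index on `:Article(ArticleId)`, each batch still has to scan for matching `m` nodes, which would go a long way toward explaining the one-minute-per-batch timing.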
01-04-2021 09:39 PM
Sounds like a job for apoc.periodic.iterate; let the library take care of batching (and optional parallel execution) for you.
There are several examples of how to use it on that page.
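Following that suggestion, a sketch combining apoc.periodic.iterate with the apoc.refactor.invert call from the original attempt might look like this (the batchSize value is illustrative, and parallel is kept false here because concurrent deletes/creates on shared nodes can contend for locks):

```cypher
CALL apoc.periodic.iterate(
  // outer query: stream the relationships to process
  "MATCH (:Article)-[rel:CITES]->(:Article) RETURN rel",
  // inner query: invert each one, committed in batches
  "CALL apoc.refactor.invert(rel) YIELD input, output RETURN output",
  {batchSize: 10000, parallel: false}
)
YIELD batches, total, errorMessages
RETURN batches, total, errorMessages;
```

Each batch is committed in its own transaction, which keeps the heap footprint bounded; the trade-off is that a failure partway through leaves some relationships inverted, so the errorMessages output should be checked.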