Reversing every relationship in a large graph

I made a mistake and ingested 160 million relationships the wrong way (on 32 million nodes). The nodes are PubMed article_ids and the relationships are citations. I have (a:Article)-[:CITES]->(b:Article) where it should be (a:Article)<-[:CITES]-(b:Article).

I have tried the following:

MATCH (a:Article)-[rel:CITES]->(b:Article)
CALL apoc.refactor.invert(rel)
yield input, output RETURN COUNT(rel);

but keep getting (after about half an hour or more) "Server at localhost(127.0.0.1):7687 is no longer available".

I'm not sure how to deal with this error — is my large query crashing the server? I previously increased dbms.memory.heap.max_size to deal with an out-of-memory error.

My dedicated machine has 16 GB of RAM, and each node holds only an article ID (from 1 to 32 million).

If the APOC procedure won't run, is there another way of doing this? For instance, could I create all the reverse relationships manually and then delete all the old CITES relationships?

The following works on my small test graph. Is it a good idea to run it on such a large graph?

MATCH (m:Article)-[c:CITES]->(n:Article)
DELETE c
CREATE (m)<-[:CITES]-(n);

Are we sure that the operation will be all-or-none (i.e. atomic)? The last thing I would want is for some unknown number of relationships to be reversed.

I guess I could do:

MATCH (m:Article)-[c:CITES]->(n:Article)
DELETE c
CREATE (m)<-[:REFERENCES]-(n);

MATCH (m:Article)-[r:REFERENCES]->(n:Article)
DELETE r
CREATE (m)-[:CITES]->(n);

Edit: I am running the above on my large dataset, and the first statement has been running for over two hours, with the following CPU usage:

[Screenshot: CPU usage]

Edit2: Three and a half hours in, it crashed with "Server at localhost(127.0.0.1):7687 is no longer available"

My next solution was to divide the problem into batches. In Python:

import neo4j
from tqdm import tqdm

batch_size = 5000
max_id = 33307598
# neo4j_uri and neo4j_auth are defined elsewhere
driver = neo4j.GraphDatabase.driver(neo4j_uri, auth=neo4j_auth)
session = driver.session()
# + 1 so that the final, partial batch (IDs up to max_id) is included
for batch in tqdm(range(max_id // batch_size + 1)):
    query = ("MATCH (m:Article)-[c:CITES]->(n:Article) " +
             "WHERE m.ArticleId >= " + str(batch * batch_size) + " AND m.ArticleId < " + str((batch + 1) * batch_size) + " " +
             "DELETE c " +
             "CREATE (m)<-[:REFERENCES]-(n);")
    print(query)
    result = session.run(query)
session.close()
driver.close()

Unfortunately, the first iteration of this took nearly a minute, meaning the whole process extrapolates to roughly 100 hours. It would be faster just to re-ingest.
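
As a rough check on that estimate, here is the arithmetic, assuming every batch takes about 60 seconds (an assumption; only the first batch was actually timed):

```python
import math

batch_size = 5000
max_id = 33307598
seconds_per_batch = 60  # assumption: "nearly a minute", based on the first batch only

# Number of ID ranges needed to cover 0..max_id, including the last partial one
num_batches = math.ceil(max_id / batch_size)
estimated_hours = num_batches * seconds_per_batch / 3600

print(num_batches, "batches, about", round(estimated_hours), "hours")
```

So the total is on the order of 100 hours unless the per-batch time drops substantially.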

1 REPLY

Sounds like a job for apoc.periodic.iterate: let the library take care of batching (and optional parallel execution) for you.

There are several examples of how to use it on that procedure's documentation page.
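
For illustration, here is a minimal sketch of how the two-step CITES -> REFERENCES -> CITES approach from the question might be driven through apoc.periodic.iterate from Python. It has not been run against the poster's data. The helper name reverse_citations, the batch size of 10000, and the choice of parallel: false (to avoid lock contention when many relationships share a node) are assumptions. It also assumes the CITES relationships carry no properties, as in the question's own queries.

```python
# One parameterised call; the inner Cypher statements are passed as parameters.
ITERATE_CALL = (
    "CALL apoc.periodic.iterate($iterate, $action, "
    "{batchSize: $batchSize, parallel: false}) "
    "YIELD batches, total, errorMessages "
    "RETURN batches, total, errorMessages"
)

# (iterate statement, action statement) pairs, executed in order.
STEPS = [
    # Step 1: replace each CITES with a REFERENCES pointing the other way.
    ("MATCH (m:Article)-[c:CITES]->(n:Article) RETURN m, c, n",
     "CREATE (n)-[:REFERENCES]->(m) DELETE c"),
    # Step 2: rename REFERENCES back to CITES, keeping the new direction.
    ("MATCH (m:Article)-[r:REFERENCES]->(n:Article) RETURN m, r, n",
     "CREATE (m)-[:CITES]->(n) DELETE r"),
]


def reverse_citations(session, batch_size=10000):
    """Run both steps on an open neo4j session; return one summary dict per step."""
    summaries = []
    for iterate, action in STEPS:
        record = session.run(
            ITERATE_CALL, iterate=iterate, action=action, batchSize=batch_size
        ).single()
        summaries.append(record.data())
    return summaries
```

It would be called as reverse_citations(driver.session()). Each batch commits in its own transaction, so the operation as a whole is not atomic. Because step 1 only ever matches the remaining CITES relationships, re-running it after a crash should pick up where it left off rather than reversing anything twice. Checking errorMessages in each returned summary is advisable before moving on.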