03-25-2021 03:42 AM
Hello,
I have a graph with ~14m nodes and ~48m relationships. I know some proportion of the relationships are duplicates and want to delete them. I've found various queries (e.g., here) on the forum.
My challenge is that executing the query seems impossible: it more or less kills my dev machine. From the query profile, it looks like the problem is the AllNodesScan followed by the relationship expansion. I would like to batch or limit the work, but I'm finding this very difficult: no matter where I put the "LIMIT" clause (e.g., below), the "EXPLAIN" diagram still shows the whole graph being scanned at the start.
I can't use label scanning because the duplicate relationships span different node types. But surely there must be some way to limit a scan? If I can do that, I can use iterative commits to go through the graph in batches.
So the question is: how can I adjust the query below to limit the scan to just X nodes?
MATCH (a)-[r]->(b)
WITH a, b, type(r) AS tr, properties(r) AS pr, count(properties(r)) AS cpr LIMIT 10
WHERE cpr > 1
RETURN sum(cpr - 1) AS numDuplicateRelationships
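For illustration, one of the variants I've tried pulls the LIMIT onto the node scan before the expansion, but the EXPLAIN diagram still shows AllNodesScan at the start of the plan:

MATCH (a)
WITH a LIMIT 10
MATCH (a)-[r]->(b)
WITH a, b, type(r) AS tr, properties(r) AS pr, count(properties(r)) AS cpr
WHERE cpr > 1
RETURN sum(cpr - 1) AS numDuplicateRelationships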
Thanks!
Luke
03-25-2021 09:14 AM
Hi Luke,
As you've observed, it is doing full graph scans because you're not using labels with (a) and (b); I don't know of a way around that (and how could Cypher use anything other than all nodes?).
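For contrast, if you could anchor on labels (Person and Company here are just placeholder label names for illustration), the plan would start from a NodeByLabelScan instead of AllNodesScan:

MATCH (a:Person)-[r]->(b:Company)
WITH a, b, type(r) AS tr, properties(r) AS pr, count(properties(r)) AS cpr
WHERE cpr > 1
RETURN sum(cpr - 1) AS numDuplicateRelationships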
APOC can programmatically construct and run dynamic Cypher en masse; it sounds like it might come in handy here, and it has solved a number of challenges for me:
CALL apoc.cypher.run(fragment, params) YIELD value
For example, you could call db.labels() and try every possible label pair combination via programmatically constructed Cypher.
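A rough sketch of that idea, assuming APOC is installed (the inner query is constructed per label pair and counts duplicates the same way your query does; note that nodes with multiple labels will show up under more than one pair):

CALL db.labels() YIELD label
WITH collect(label) AS labels
UNWIND labels AS l1
UNWIND labels AS l2
// build and run one counting query per ordered label pair
CALL apoc.cypher.run(
  'MATCH (a:`' + l1 + '`)-[r]->(b:`' + l2 + '`) ' +
  'WITH a, b, type(r) AS tr, properties(r) AS pr, count(properties(r)) AS cpr ' +
  'WHERE cpr > 1 ' +
  'RETURN sum(cpr - 1) AS dups',
  {}) YIELD value
RETURN l1, l2, value.dups AS numDuplicateRelationships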
Other thoughts:
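Since you mentioned iterative commits: even if the scan itself can't be avoided, apoc.periodic.iterate can at least commit the pruning in batches so the transaction state stays small. A sketch, assuming APOC is installed, that groups duplicates the same way as your counting query, keeps the first relationship in each group, and deletes the rest:

// the outer statement streams duplicate groups; the inner statement
// deletes all but one relationship per group, committing every 1000 groups
CALL apoc.periodic.iterate(
  'MATCH (a)-[r]->(b)
   WITH a, b, type(r) AS tr, properties(r) AS pr, collect(r) AS rels
   WHERE size(rels) > 1
   RETURN rels',
  'FOREACH (r IN tail(rels) | DELETE r)',
  {batchSize: 1000, parallel: false})

parallel: false is the safer choice here, since batches deleting relationships on shared nodes can otherwise contend for locks.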
03-25-2021 01:32 PM
Hi Joel,
Thanks! These are all super helpful pointers. I'm going to try out apoc.cypher.run().
What happened here is that there was an error in the ETL process. It's now fixed, we think, for future runs, but given the amount already loaded (we caught this late), we're hoping to fix it by pruning rather than wiping and redoing the whole ETL. Also, more generally, I'm trying to get to grips with working on a graph of this size and complexity.
Thanks again - this is super helpful.