
Approaches to scaling very large graph queries

christian
Node Clone

It's been a while since we posted here and our graphs have grown in size ...

So much so that our data transformation queries (i.e. creating new nodes and relationships) on some of our graphs with millions of nodes and 100+ GB in size now run for 20+ hours, or don't finish at all (the server crashes). And that is for a dataset covering just 200 employees, so what will happen when we run our analysis on a customer with 1,000 employees!? It's time to optimize 🙂

Below is our current list of things to try and test, in what we think is priority order. We would love input from the community on what to try first, second, etc., plus any learnings you have with graphs of this size.

  1. Start using apoc.periodic.iterate with part 2 set to parallel:true (right now we use parallel:false) so we can use more than one CPU core (a sketch of the resulting pattern follows after this list), but to do that we need to ...

    1. Restructure the part 1 MATCH statement to avoid deadlocks, which is easier said than done, may not be possible at all in some cases, and doesn't seem to be guaranteed either, or

    2. Use a manual workaround to enforce a lock as an alternative to the above, as that might be easier to implement in some cases

  2. Make sure we are using the right index type for the WHERE clause in part 1 of apoc.periodic.iterate, just to make sure there are no inefficiencies there

  3. Play with the batchSize of apoc.periodic.iterate part 2 to find the optimal size for a given instance size; usually a lower batch size makes the query take longer and a bigger batch size improves speed, but not always

  4. Restructure the WHERE clause of apoc.periodic.iterate part 1 to reduce the size of what gets sent to part 2 via the RETURN, using the following approaches

    1. RETURN fewer rows and use an external tool such as Airflow to iterate through batches, e.g. by using a date range in the WHERE clause

    2. RETURN only the node id (instead of the entire node) and look it up again in part 2; this seems to reduce heap memory usage but not really speed up the query, so it may be more useful for cost optimisation later, i.e. running the same queries with less RAM

  5. Shift as much of the processing as possible from apoc.periodic.iterate part 1 to part 2, where it can be worked through in batches and is therefore less likely to crash the server when part 1 gets too big

  6. Play with the instance size (i.e. more and more RAM) as well as the Neo4j heap settings to see how all of that affects query times in combination with any of the above; but we feel we don't know enough yet to get started on this, and just going bigger and bigger might mask issues rather than solve them
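
To make items 1, 3 and 4.2 concrete, here is a rough sketch of the pattern we have in mind. The :User label, the needsUpdate/processed properties and the SET are placeholders for our real logic, and apoc.lock.nodes is just one way to take an explicit lock:

CALL apoc.periodic.iterate("
    // part 1: keep the driving query cheap and return only ids
    MATCH (n:User)
    WHERE n.needsUpdate = true
    RETURN id(n) AS nid
", "
    // part 2: rebind the node by id and lock it explicitly before writing
    MATCH (n) WHERE id(n) = nid
    CALL apoc.lock.nodes([n])
    SET n.processed = true
", {batchSize: 2000, parallel: true, concurrency: 8})

There is also a retries option in the apoc.periodic.iterate config that retries failed batches, which might pair well with parallel:true if the occasional deadlock still slips through.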

Please note, we did find and read 2019-best-practices-to-make-large-updates-in-neo4j, query-tuning, memory-configuration, and memory-management, which are very useful, but together they are a long list of things to do, so some guidance from the community on where to start would be great.


I think you should share an example of the statement you're running; without that it's hard to give much advice.

Also, you probably ran PROFILE already; sharing that would also help.

Did you check the memory and IO(PS) setup of your machine?

Don't return anything from your query.

I don't know if your first (driving) query returns nodes or rels; that doesn't work well anymore since Neo4j 4.x. You need to return the ids of the nodes or rels and rebind them in the update query.
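
Schematically something like this; the labels, relationship type and properties here are placeholders, and the SET stands in for whatever your update does:

CALL apoc.periodic.iterate("
    MATCH (a:LabelA)-[r:REL_TYPE]->(b:LabelB)
    WHERE r.someProp IS NOT NULL
    RETURN id(a) AS aid, id(r) AS rid, id(b) AS bid
", "
    // rebind by id in the update statement
    MATCH (a) WHERE id(a) = aid
    MATCH (b) WHERE id(b) = bid
    MATCH (a)-[r]->(b) WHERE id(r) = rid
    SET r.processed = true
", {batchSize: 2000, parallel: false})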


Yep, makes sense, here you go


CALL apoc.periodic.iterate("
    MATCH (n1:User)<-[r1]-(n2:Object)  
    WHERE (r1.Property1 IS NOT NULL OR r1.Property2 IS NOT NULL)
    AND (DATE(r1.Timestamp) > DATE('$sdate') AND DATE(r1.Timestamp) < DATE('$edate'))
    RETURN n1, r1, n2
", "
    MERGE ...
    RETURN *
", {batchSize: 2000, parallel: false})
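
One thing we might also try: if the $sdate/$edate values are being spliced into the query string on the client side, apoc.periodic.iterate accepts a params entry in its config, so they can be passed as real Cypher parameters instead. A sketch with example dates, and with the MERGE body replaced by a stand-in:

CALL apoc.periodic.iterate("
    MATCH (n1:User)<-[r1]-(n2:Object)
    WHERE (r1.Property1 IS NOT NULL OR r1.Property2 IS NOT NULL)
    AND DATE(r1.Timestamp) > DATE($sdate) AND DATE(r1.Timestamp) < DATE($edate)
    RETURN n1, r1, n2
", "
    SET r1.processed = true   // stand-in for our MERGE logic
", {batchSize: 2000, parallel: false, params: {sdate: '2022-01-01', edate: '2022-02-01'}})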

Regarding ...

  • PROFILE - need to make some changes so I can post it here, will do
  • MEMORY/IO(PS) - we did, but not sure we really get yet what to look out for here, to be honest
  • return the ids of the nodes or rels and rebind them in the update query - on it, that is obviously one issue
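
For the memory point, what we are planning to look at first is something like the following in neo4j.conf; these are illustrative values for a 4.x instance, not tuned recommendations:

# heap for query execution (too small = OOM, too large = long GC pauses)
dbms.memory.heap.initial_size=16g
dbms.memory.heap.max_size=16g
# page cache; ideally large enough to hold the hot part of the store files
dbms.memory.pagecache.size=64g

We also plan to run neo4j-admin memrec to get a baseline recommendation for our store size.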


christian
Node Clone

Actually the final solution might be way easier ... fark ...

We were using an older Neo4j version that did not yet support indexes on relationship properties!!! By upgrading to the latest version and creating indexes on the relationship properties we use, my team was able to reduce the query time for a batch from 8 hours to 69 minutes!!!
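
For anyone finding this later, the relationship property index looks roughly like this on 4.3+, where REL_TYPE stands in for our actual relationship type:

CREATE INDEX rel_timestamp_idx IF NOT EXISTS
FOR ()-[r:REL_TYPE]-() ON (r.Timestamp)

One caveat: the index is per relationship type, so the MATCH generally has to specify the type for the planner to use it.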
