Optimizing a query with a subgraph/subquery, only look at specific nodes
‎06-09-2021 10:50 AM
Hello,
I am working on a relatively complex query in my social graph db (~millions of nodes).
Part of the query determines all the nodes that are relevant to my query (typically <100 nodes).
For example,
match (u1:User {user_id: '1234'})-[:REL]-(u2:User)
with collect(u2) as all_related_users
...
all_related_users
are the only user nodes I care about this point, so later on I would like to perform something like this
with all_related_users
match (u3:User)-[:REL]->(u4:User)
where u3 in all_related_users
and u4 in all_related_users
...
However, this doesn't do exactly what I want it to do as when I look at the profiler, the MATCH
is run as an Expand(All)
incurring a huge amount of db hits and rows (50k) until the Filter
in the following is hit. This is obviously slow because it is doing a WHERE IN
for 50k rows.
Ideally what I want to do is after calculating all_related_users
I could tell Neo4j, "these are the only nodes I care about this point so only run queries on this extremely small subgraph". I've tried doing subqueries with CALL
but that doesn't seem to make any difference. Is there anyway to either fix my queries to only run queries on a very small set of nodes, or just create a temporary subgraph inside the query. I'm coming from SQL so maybe my understanding is off.
THANKS SO MUCH!
‎06-09-2021 03:34 PM
You're close, you can use UNWIND to project elements of a list into rows, so this should be a one-line change:
match (u1:User {user_id: '1234'})-[:REL]-(u2:User)
with collect(u2) as all_related_users
...
UNWIND all_related_users as u3
MATCH (u3)-[:REL]->(u4)
WHERE u4 in all_related_users
...
‎06-10-2021 10:58 AM
I've tried that but it seems like it still evaluates the match before running the filter.
Here's my code snippet
unwind all_mutual_contacts as u2
match (u2:User)<-[:CONTACT]-(cic:User) // get all incoming contacts
where cic in all_mutual_contacts
Looking at the query planner I see
Expand(All) ~42k hits
Filter ~2.4k hits
If my understanding is right this means where where cic in all_mutal_contacts
gets run 42k
times where the size of all_mutual_contacts
is 245
. Furthermore, if we already know all cic
, wouldn't this be an Expand(Into)
, thanks for all the help!
‎06-10-2021 12:29 PM
There's no way to filter before the expand. You have to find the connected nodes before filtering that the connected nodes are in the list.
Expand(Into) only works when we explicitly have both nodes as variables, so you would need to UNWIND the list a second time, generating a cross product, and that would result in far more expansions. You can try it if you like.
Also, pay attention to the rows in the query plan. The more rows there are, the more work needs to be done for the subsequent operation.
‎06-10-2021 02:36 PM
Thanks Andrew, the cross product and Expand(Into) does indeed create more expansions, I'm guessing because now the input is much higher. I will keep digging but thank you for your help!
I guess I will look into parallelizing my code in this case then. Thanks!