Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
11-15-2018 07:53 AM
Hi there,
My original use-case is fetching all mutual relationships between two groups of nodes.
One of the groups of nodes contains a supernode (i.e. a node with a very high degree).
Let's call n
the group of nodes that contains only regular nodes and sn
the group of nodes that contains a supernode.
My cypher query is the following:
MATCH (sn)-[r]-(n)
WHERE ID(sn) in [1, 999] // node 999 has high degree (1163)
AND ID(n) in [2, 3] // nodes 1, 2 and 3 have a low degree (3)
RETURN r
I'm currently using Neo4j 3.3.0 to benchmark this query, and I noticed something very odd: the performance of the query depends on the names of the variables I am using for sn
, n
and r
.
More particularly, the performances depend of the alphabetical order of the variables.
I have tried changing other things in the query:
ID(x) in []
statements(sn)-[r]-(n)
for (sn)-[r]-(n)
But after lots of testing, the only change that has a consistent impact on the performance of the query is the alphabetical order of the variable names used for sn
, n
and r
.
To test my query, I have the following process:
sn
, n
and r
to a different name (sn => v1
, n => v2
, r => v3
) and profile my query.sn => v2
, n => v3
, r => v1
) and profile againHere are the results for all possible permutations:
sn => v1
, r => v2
, n => v3
: FASTsn => v1
, n => v2
, r => v3
: FASTr => v1
, sn => v2
, n => v3
: SLOW (i.e. at least of DB hit for each relationship of supernode 999
)r => v1
, n => v2
, sn => v3
: FASTn => v1
, r => v2
, sn => v3
: SLOW (i.e. at least of DB hit for each relationship of supernode 999
)n => v1
, sn => v2
, r => v3
: SLOW (i.e. at least of DB hit for each relationship of supernode 999
)Profiling for each query:
profile match (v1)-[v2]-(v3) where id(v1) in [1, 999] and id(v3) in [2, 3] return v2
profile match (v1)-[v3]-(v2) where id(v1) in [1, 999] and id(v2) in [2, 3] return v3
profile match (v2)-[v1]-(v3) where id(v2) in [1, 999] and id(v3) in [2, 3] return v1
profile match (v3)-[v1]-(v2) where id(v3) in [1, 99] and id(v2) in [2, 3] return v1
profile match (v3)-[v2]-(v1) where id(v3) in [1, 999] and id(v1) in [2, 3] return v2
profile match (v2)-[v3]-(v1) where id(v2) in [1, 999] and id(v1) in [2, 3] return v3
Having to worry about the alphabetical ordering of variable names in a query makes it quite difficult to try and make queries fast
On the other hand, I might be on the wrong track, and I'm open to suggestions for alternative queries to get the same result in a more fast and stable way.
Best,
David
11-16-2018 07:02 AM
It's about the ordering it uses for expand into.
I think that was partially fixed in 3.4
In general it tries to go from the non-dense node side, but if both nodes have > 50 rels, both count as dense.
If you know which group is the one with the supernode then match the other group first
and then the 2nd group via expand.
It would also be good to use a direction + rel-type if possible.
MATCH (n)
WHERE ID(n) in [2, 3] // nodes 1, 2 and 3 have a low degree (3)
MATCH (sn)-[r]-(n)
WHERE ID(sn) in [1, 999] // node 999 has high degree (1163)
RETURN r
11-22-2018 07:16 AM
Thank you Michael!
We will try this workaround.
Were you able to confirm that this was fixed in Neo4j v3.4?
Best,
David
11-22-2018 04:32 PM
Best test it yourself with your specific dataset + query. And let us know if it persists.
All the sessions of the conference are now available online