Neo4j

davidrapin · ‎11-15-2018

Hi there,

My original use-case is fetching all mutual relationships between two groups of nodes.
One of the groups of nodes contains a supernode (i.e. a node with a very high degree).

Let's call n the group of nodes that contains only regular nodes and sn the group of nodes that contains a supernode.

My cypher query is the following:

MATCH (sn)-[r]-(n)
WHERE ID(sn) in [1, 999] // node 999 has high degree (1163)
AND ID(n) in [2, 3] // nodes 1, 2 and 3 have a low degree (3)
RETURN r

I'm currently using Neo4j 3.3.0 to benchmark this query, and I noticed something very odd: the performance of the query depends on the names of the variables I am using for sn, n and r.

More particularly, the performances depend of the alphabetical order of the variables.
I have tried changing other things in the query:

order of the ID(x) in [] statements
switching (sn)-[r]-(n) for (sn)-[r]-(n)

But after lots of testing, the only change that has a consistent impact on the performance of the query is the alphabetical order of the variable names used for sn, n and r.

To test my query, I have the following process:

I rename sn, n and r to a different name (sn => v1, n => v2, r => v3) and profile my query.
then, I change the alphabetical order (e.g. sn => v2, n => v3, r => v1) and profile again
etc.

Here are the results for all possible permutations:

sn => v1, r => v2, n => v3: FAST
sn => v1, n => v2, r => v3: FAST
r => v1, sn => v2, n => v3: SLOW (i.e. at least of DB hit for each relationship of supernode 999)
r => v1, n => v2, sn => v3: FAST
n => v1, r => v2, sn => v3: SLOW (i.e. at least of DB hit for each relationship of supernode 999)
n => v1, sn => v2, r => v3: SLOW (i.e. at least of DB hit for each relationship of supernode 999)

Profiling for each query:

profile match (v1)-[v2]-(v3) where id(v1) in [1, 999] and id(v3) in [2, 3] return v2
profile match (v1)-[v3]-(v2) where id(v1) in [1, 999] and id(v2) in [2, 3] return v3
upload limit
profile match (v2)-[v1]-(v3) where id(v2) in [1, 999] and id(v3) in [2, 3] return v1
profile match (v3)-[v1]-(v2) where id(v3) in [1, 99] and id(v2) in [2, 3] return v1
upload limit
profile match (v3)-[v2]-(v1) where id(v3) in [1, 999] and id(v1) in [2, 3] return v2
upload limit
profile match (v2)-[v3]-(v1) where id(v2) in [1, 999] and id(v1) in [2, 3] return v3
upload limit

Having to worry about the alphabetical ordering of variable names in a query makes it quite difficult to try and make queries fast

On the other hand, I might be on the wrong track, and I'm open to suggestions for alternative queries to get the same result in a more fast and stable way.

Best,

David

michael_hunger · ‎11-16-2018

It's about the ordering it uses for expand into.

I think that was partially fixed in 3.4

In general it tries to go from the non-dense node side, but if both nodes have > 50 rels, both count as dense.

If you know which group is the one with the supernode then match the other group first
and then the 2nd group via expand.

It would also be good to use a direction + rel-type if possible.

MATCH (n)
WHERE ID(n) in [2, 3] // nodes 1, 2 and 3 have a low degree (3)
MATCH (sn)-[r]-(n)
WHERE ID(sn) in [1, 999] // node 999 has high degree (1163)
RETURN r

davidrapin · ‎11-22-2018

Thank you Michael!
We will try this workaround.
Were you able to confirm that this was fixed in Neo4j v3.4?
Best,
David

michael_hunger · ‎11-22-2018

Best test it yourself with your specific dataset + query. And let us know if it persists.

Neo4j

Performance inconsistencies when fetching mutual relationships