cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Cartesian join in query plan after adding an index

I've run into a performance issue after adding an index to a integer field. The query plan changes from a simple scan to a merge of two index seeks but the merge operation is a cartesian join.

To recreate the problem:

CREATE CONSTRAINT ON (g:Gender) ASSERT g.Pk IS UNIQUE;
CREATE CONSTRAINT ON (p:Patient) ASSERT p.Pk IS UNIQUE;
with [{Pk:1, Name:'Male'},{Pk:2, Name:'Female'}] as genders
unwind genders as gender
merge (g:Gender{Pk:gender.Pk})
set g.Name = gender.Name;
with range(1,2) as genderkeys
unwind genderkeys as genderkey
match (g:Gender {Pk:genderkey})
with g,range(1,1000) as patientKeys
unwind patientKeys as patientKey
merge (p:Patient {Pk:patientKey})
set p.Age = toInteger(rand() * 100)
merge (p)-[:has_Gender]->(g);

Then run a simple query:

match (p:Patient)-[:has_Gender]->(g:Gender) where g.Pk = 1 and p.Age = 2 return count(p)

Then create an index on Age:
create index on :Patient(Age);

Then run the same query again:
match (p:Patient)-[:has_Gender]->(g:Gender) where g.Pk = 1 and p.Age = 2 return count(p)

The query plan changes to a cartesian join plan. I can understand part of why when the optimizer thinks that only one row will be returned by the index seek on Gender Pk, but if you profile it you'll see that more rows are returned for the gender index seek.

For this small test graph it doesn't matter but for large ones with millions of nodes the queries are much slower with the index.

1 ACCEPTED SOLUTION

The gender index is estimated to return one row, which makes sense since there are only two unique values in that index. A cartesian product of 1xn is still n (estimated rows from age index). In the profile it looks like the gender index returns 7 rows, but that is an illusion. The number of rows after the cartesian product is still 7, the same as the number of rows from the age index. The illusion is created by the fact that the left hand side of the cartesian product is displayed with the row count it actually has, while the right hand side is displayed with the cartesian product row count (in this case 7x1), because what is actually happening is for each row on the left (each age) we pull all rows from the right. It is basically a nested loop. If the right returned two rows, then for each age, we would pull those two, and that would lead to 14 rows.

View solution in original post

7 REPLIES 7

Don't use gender nodes, modeling a 2 element attribute as node is not recommended.

Put it into a property or better into a label.

How can more rows returned from the gender index seek that doesn't make any sense.
Can you share the PROFILE output?

2X_9_96bb8b44d5dbd383098cc28a643aefd0b3b0ad89.png
The plan shows 7 rows hit from both sides of the cartesian join.

I'm using gender as an example here, but I have other attributes with similar issues that have thousands of nodes in them. They query only gets worse as the number of matches increases. If I remove the age index then it's consistently filtering with patients with gender, but adding the age index makes it take many times longer. You can see this more clearly if you increase the number of patients involved.

2X_2_25e9f359d82ebace0db4a21994dceaa7eb3da43a.png
Here is the explain output showing the expected 1 gender node row from the gender node seek, but the profile one shows more rows in the actual output. This may be a red herring, just trying to understand why it's slower.

Can you show a real EXPLAIN and PROFILE outputs that you are referring to (you can redact bits if necessary)? How does the actual runtime change?
With the tiny examples it's not affecting the query performance.
Best expanded boxes.

It would important to understand the index selectivities that went into the decisions of the planner to go via Cartesian Product + Expand Into (which is the better choice if the product is small) vs. doing an Expand + Filter

The gender index is estimated to return one row, which makes sense since there are only two unique values in that index. A cartesian product of 1xn is still n (estimated rows from age index). In the profile it looks like the gender index returns 7 rows, but that is an illusion. The number of rows after the cartesian product is still 7, the same as the number of rows from the age index. The illusion is created by the fact that the left hand side of the cartesian product is displayed with the row count it actually has, while the right hand side is displayed with the cartesian product row count (in this case 7x1), because what is actually happening is for each row on the left (each age) we pull all rows from the right. It is basically a nested loop. If the right returned two rows, then for each age, we would pull those two, and that would lead to 14 rows.

Thanks, this explanation makes sense. For whatever reason adding the index helps now instead of hurts, so it's working like it should, so consider this issue closed until I can get a better case of where it goes wrong. I swear this was causing an issue before, but I just can't reproduce it.