Neo4j

cuneyttyler · ‎01-12-2023

I have the following query:

PROFILE MATCH(e:Entity {name:'Mona Lisa'})-[rel]->(e2:Entity) 
WITH *, collect( e2.name) as e2Nodes 
MATCH(e2)-[rele2ins:`https://www.wikidata.org/wiki/Property:P31`]->(e2ins:Entity) WHERE e2ins.wikidata_id in ['Q5']
WITH *, collect( e2ins.name) as e2InsNodes
return   e2Nodes,e2InsNodes

It has 1k db hits and it's plan is as below:

When I add a WHERE condition to this query as follows:

PROFILE MATCH(e:Entity {name:'Mona Lisa'})-[rel]->(e2:Entity) 
WITH *, collect( e2.name) as e2Nodes 
MATCH(e2)-[rele2ins:`https://www.wikidata.org/wiki/Property:P31`]->(e2ins:Entity) WHERE e2ins.wikidata_id in ['Q5']
WITH *, collect( e2ins.name) as e2InsNodes
return   e2Nodes,e2InsNodes

It's db hit count increases to 100k and the plan is as below:

Basically I'm returning entities(e2) have an incoming edge from 'Mona Lisa'. This is 1k db hits. Then I fetch entities with some specific relation(...P31) of these e2 entities. The total number of results is 227. However, there are 100k hits in between the steps when I filtered results by some attribute. Why is this the case?

Thanks.

NOTE: This applies to the following two queries below:

PROFILE MATCH(e:Entity {name:'Mona Lisa'})-[rel]->(e2:Entity)-[rele2ins:`https://www.wikidata.org/wiki/Property:P31`]->(e2ins:Entity)  
return   e2.name,e2ins.wikidata_id

PROFILE MATCH(e:Entity {name:'Mona Lisa'})-[rel]->(e2:Entity)-[rele2ins:`https://www.wikidata.org/wiki/Property:P31`]->(e2ins:Entity) WHERE e2ins.wikidata_id in ['Q5']
return   e2.name,e2ins.wikidata_id

cuneyttyler · ‎01-12-2023

As I've figured out, when you put a WHERE condition - no matter the start node is specified by an index lookup - it also fetches those nodes with the specified conditions in the where clause at the beginning of the query.

If the first query is collected beforehand, than it works as expected with few db hits.

PROFILE MATCH(e:Entity{name: 'Mona Lisa'})-[rel]->(e2:Entity)-[rele2ins:`https://www.wikidata.org/wiki/Property:P31`]->(e2ins:Entity) WITH e,e2,collect(e2ins) as e2InsNodes  UNWIND e2InsNodes as e2ins WITH * WHERE e2ins.wikidata_id = 'Q5' 
return  distinct e2.name,e2ins.wikidata_id

I'm not sure about the reason of this implementation and causes of it but I think this types of queries should be handled in a way that they start from the start node, expand and filter queries by WHERE clause. It would be a nice thing to have.

anthapu · ‎01-12-2023

In the new query planner it tries to optimize for speed rather than memory and db hits. If a node has an unique index it tries to get those nodes and traverse from both sides and reach the middle. This can have higher db hits, but the total time query taken can get improved.

I have talked about this aspect in my book Graph Data Processing with Cypher .

Neo4j

DB hits increase a lot when added a WHERE condition