cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Bad performance with OR operator

I am using neo4j-3.5 community version .And have built a huge graph using neo4j containing ~20 million nodes on movie data. I have also inserted genres & keywords and constructed the graph.

The following query takes > 5 seconds

MATCH 
    (p:`Program`),  
    (p:Program)-[genre:of_genre]->(g:Genre), 
    (p:Program)-[key_rel:associated_keyword]->(k:Keyword)
  WHERE   
    ((g.id IN [1010]) OR (k.id IN ['keyword_121'])) AND 
    ((p.show_type IN ['movie'])) AND 
    (p.imdb_score > 0)
  RETURN distinct p.id, p.imdb_score
  ORDER BY p.imdb_score desc
  LIMIT 50

whereas if i replace OR with AND, it takes < 100 ms

MATCH 
    (p:`Program`),  
    (p:Program)-[genre:of_genre]->(g:Genre), 
    (p:Program)-[key_rel:associated_keyword]->(k:Keyword)
  WHERE   
    ((g.id IN [1010]) AND (k.id IN ['keyword_121'])) AND 
    ((p.show_type IN ['movie'])) AND 
    (p.imdb_score > 0)
  RETURN distinct p.id, p.imdb_score
  ORDER BY p.imdb_score desc
  LIMIT 50

Indexes are there for genre id and keyword id.

The PROFILE response for 'OR' :

THE PROFILE response for 'AND'

Any better way to write the query with OR operator?

3 REPLIES 3

You can turn it into a UNION

MATCH 
    (p:`Program`),  
    (p:Program)-[genre:of_genre]->(g:Genre), 
    (p:Program)-[key_rel:associated_keyword]->(k:Keyword)
  WHERE   
    ((g.id IN [1010])) AND 
    ((p.show_type IN ['movie'])) AND 
    (p.imdb_score > 0)
  RETURN distinct p.id, p.imdb_score
  ORDER BY p.imdb_score desc
  LIMIT 50
UNION ALL
MATCH 
    (p:`Program`),  
    (p:Program)-[genre:of_genre]->(g:Genre), 
    (p:Program)-[key_rel:associated_keyword]->(k:Keyword)
  WHERE   
    ((k.id IN ['keyword_121'])) AND 
    ((p.show_type IN ['movie'])) AND 
    (p.imdb_score > 0)
  RETURN distinct p.id, p.imdb_score
  ORDER BY p.imdb_score desc
  LIMIT 50

@michael.hunger : I will not be able to use UNION ALL as I am building a conversational interface for movie data and the query is constructed in real time with different types of nodes and the expression changes for every query.
Is there any other change I could do to improve the performance?

I think something else you run into is the combinatorial explosion.
After all you just use your matches as conditions.

So usually you'd to put a with distinct p in between the branches but with the OR that doesn't work.

So instead you could use pattern comprehensions for that.

MATCH 
    (p:`Program`)
  WHERE   
    ((p.show_type IN ['movie'])) AND (p.imdb_score > 0)
    AND (
     size([  (p:Program)-[genre:of_genre]->(g:Genre) WHERE  (g.id IN [1010]) | true  ]) > 0
     OR  
    size([  (p:Program)-[key_rel:associated_keyword]->(k:Keyword)  WHERE k.id IN ['keyword_121'] ]) > 0)
    )

  RETURN distinct p.id, p.imdb_score
  ORDER BY p.imdb_score desc
  LIMIT 50

The next version of Neo4j which will be out as a milestone soon, will have existential subqueries (where

exists { MATCH () WHERE ... }
OR
exists { MATCH () WHERE ... }

which should exactly match your requirement.