cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Why can't I use a conditional on an aggregating function?

clem
Graph Steward

I thought this would work:

(get movies where the average birth year of the actors is greater than a certain value)

MATCH (m:Movie)<-[:ACTED_IN]-(p:Person)
WHERE AVG(p.born) > 1967 
RETURN m.title, AVG(p.born)

but I get the (obscure) error:

Invalid use of aggregating function avg(...) in this context (line 2, column 7 (offset: 46))
"WHERE AVG(p.born) > 1967
       ^

I can work around it with this (which is a bit uglier)

MATCH (m:Movie)<-[:ACTED_IN]-(p:Person)
WITH m, AVG(p.born) AS avgBorn
WHERE avgBorn > 1967
RETURN m.title, avgBorn
7 REPLIES 7

Hello @clem

Your second query is the way to do it.

Regards,
Cobra

Yes, I know that. What I want to understand is the logic as to why the first one doesn't work.

One thing that's not clear is the scope of the aggregation, the grouping key is not clear if we were to use it in a WHERE clause.

Would it be asking for the average of all persons who acted in all movies (so the same average would display on every result row)? Would persons be counted multiple times because they act in multiple movies, and should that duplication be captured in the average? Or is it asking for the average per movie? What if the pattern was more complicated, or there were more variables in scope?

Consider this MATCH and WHERE clause:

MATCH (m:Movie)<-[:ACTED_IN]-(p:Person)-[:ATTENDED_COLLEGE_AT]->(college:College)
WHERE AVG(p.born) > 1967 

How could we tell what the grouping key is here? Is it per movie? Per college? Neither, and you want the average of all persons who acted in a movie and attended college? What if they acted in multiple movies, or attended multiple colleges?

When we restrict aggregations to WITH clauses, the grouping key becomes much clearer, as in the workaround you cited: you're getting, per movie, the average born year of that movie's actors, and then filtering that row based on whether the average is over 1967.

Thanks.... this explanation was using and a little trickier than I had imagined.

I also found this explanation: equivalent having clause. (I always found the SQL HAVING somewhat cryptic as well.)

I guess what made this tough for me is the error msg wasn't very informative...

So, is it that the GROUP BY concept is implicit in Cypher?

Thanks so much for the explanation!

clem
Graph Steward

For fun, I made an some other interesting queries:

Find movies who's average actor's birth year is less or equal than the average from the movie 'The Matrix', (but not including 'The Matrix')

MATCH (m1:Movie{title:'The Matrix'})<-[:ACTED_IN]-(p1:Person)
WITH AVG(p1.born) AS matrixAvg, m1
MATCH (m2:Movie)<-[:ACTED_IN]-(p2:Person)
WITH  AVG(p2.born) AS bornAvg, matrixAvg, m2, m1
WHERE bornAvg <= matrixAvg AND id(m2) <> id(m1)
RETURN m2.title, bornAvg

Or yet still more interesting, is to get the actor's average age for a movie at the time the movie was released and select for the average age less than the average age for 'The Matrix".

MATCH (m1:Movie{title:'The Matrix'})<-[:ACTED_IN]-(p1:Person)
WITH m1.released - AVG(p1.born) AS matrixAgeAvg, m1
MATCH (m2:Movie)<-[:ACTED_IN]-(p2:Person)
WITH  m2.released - AVG(p2.born) AS movieAgeAvg, matrixAgeAvg, m2, m1
WHERE movieAgeAvg <= matrixAgeAvg  AND id(m2) <> id(m1)
RETURN m2.title, m2.released, movieAgeAvg

Hello @clem

Thanks @andrew.bowman for the precise answer

Unless you want to create your own graph database engine some questions cannot be entirely understand by a Cypher user, and it's the point of having a short and sweet language as Cypher.

But roughly, Cypher works on two data domains at the same time, the graph itself, let's describe it as a bunch of physical links in the memory, and the projected data from this graph who are rows with data in it, you can visualise it as a table.

Some clauses or function are build to work with the graph domain and others are build to work with the projected data. If you think about it carefully, it doesn't make any sens to use avg in the graph domain because the data doesn't exists as a list or rows of something in the graph domain.

Read and try to understand the behaviour of the clause MATCH and what's a graph and so you will be able to understand the answer of @andrew.bowman

Thanks for the insight.

After years of mainly using procedural languages, thinking in declaratively can be a bit tricky. Adding Graph thinking on top of that adds an additional dimension.

I certainly find Cypher more intuitive than SQL but every so often, I get tripped up on the finer points.