cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Using WITH clause along with an aggregating function like count()

I'm having some trouble understanding the WITH clause, when used with aggregation functions like count().

I tried using this query from the Cypher ref card as an example -

MATCH (user)-[:FRIEND]-(friend)
WHERE user.name = $name
WITH user, count(friend) AS friends
WHERE friends > 10
RETURN user

What is the purpose of count(friend) in this query? Is it just counting all the end nodes which are connected to the given user with the FRIEND relationship? If so, why is the query returning the given user node? Because we already know who the starting user is.

I tried loading the "movies" graph in Neo4j, to run similar queries. Then, I modified the above query for the movies graph, like so -

MATCH (h:Person{name:"Tom Hanks"})-[r:ACTED_IN]->(m:Movie)
WITH h, count(m) as movies
WHERE movies > 3
RETURN h, movies

This is what the query returns -

╒════════════════════════════════╤════════╕
│"h"                             │"movies"│
╞════════════════════════════════╪════════╡
│{"name":"Tom Hanks","born":1956}│12      │
└────────────────────────────────┴────────┘

In theory, I understand what this query is doing - return a count of all movies where Tom Hanks "acted in", if they are greater than 3. But in terms of the query and the aggregation function, what is count(m) doing? Is it a count of all movie nodes where Tom Hanks is the start node with "ACTED_IN" relationship connecting the two nodes?

If so, why do the following queries not return anything?

MATCH (h:Person{name:"Tom Hanks"})-[r:ACTED_IN]->(m:Movie)
WITH h, r, count(m) as movies
WHERE movies > 3
RETURN h,r, movies
MATCH (h:Person{name:"Tom Hanks"})-[r:ACTED_IN]->(m:Movie)
WITH h,r, count(m) as movies
WHERE movies > 3
RETURN movies
MATCH (h:Person{name:"Tom Hanks"})-[r:ACTED_IN]->(m:Movie)
WITH h, m,count(m) as movies
WHERE movies > 3
RETURN h, m
MATCH (h:Person)-[r:ACTED_IN]->(m:Movie)
WITH h, r, m,count(m) as movies
WHERE movies > 6
RETURN h, m
MATCH p=(h:Person{name:"Tom Hanks"})-[r:ACTED_IN]->(m:Movie)
WITH p, count(r) as movies
WHERE movies > 3
RETURN p

Queries 1 through 4 are just different variations of the main query, where I just add different variables to the WITH clause, so they 're carried over to the next stage of the query. In Query 5, I try to count the relationships, which should ideally behave the same way as counting the destination (movie) node, but doesn't work.

I also tried fiddling with the FOLLOWS relationship in the movie graph, with this query -

MATCH p=()-[r:FOLLOWS]->(b:Person)
WITH count(b) AS popular
WHERE popular > 1
RETURN popular;

This query returns 3, because there are three FOLLOWS relationships in the graph -

══════════════════════════════════════════════════════════╕
│"p"                                                       │
╞══════════════════════════════════════════════════════════╡
│[{"name":"Paul Blythe"},{},{"name":"Angela Scope"}]       │
├──────────────────────────────────────────────────────────┤
│[{"name":"James Thompson"},{},{"name":"Jessica Thompson"}]│
├──────────────────────────────────────────────────────────┤
│[{"name":"Angela Scope"},{},{"name":"Jessica Thompson"}]  │
└──────────────────────────────────────────────────────────┘

But if I add the WITH clause and count() to this query, it doesn't return anything

MATCH p=(a)-[r:FOLLOWS]->(b:Person)
WITH a,count(b) AS pop
WHERE pop > 1
RETURN a,pop;

I think I'm missing some key piece of underlying information about how WITH and count() works.

3 REPLIES 3

why does

  1. MATCH (h:Person{name:"Tom Hanks"})-[r:ACTED_IN]->(m:Movie)
    WITH h, r, count(m) as movies
    WHERE movies > 3
    RETURN h,r, movies

not work? because for each relationship (which happens to be unique) from the Tom Hanks :Person node you are counting number of movies.
So with h, r,count(m) as movies is effectively evaluating to

Tom Hanks, releationship001, 1
Tom Hanks, releationship002, 1
Tom Hanks, releationship003, 1
Tom Hanks, releationship004, 1

and for example relationship001 refers to the relationship to the :Movie 'Apollo 13', and relationship002 refers to the :Movie 'Joe Versus the Volcano. Remove the rin thewith h,r,count(m)` and you will then get number of movies that Tom Hanks acted in.

Regarding
2. MATCH (h:Person{name:"Tom Hanks"})-[r:ACTED_IN]->(m:Movie)
WITH h,r, count(m) as movies
WHERE movies > 3
RETURN movies

same as #1

Regarding
3. MATCH (h:Person{name:"Tom Hanks"})-[r:ACTED_IN]->(m:Movie)
WITH h, m,count(m) as movies
WHERE movies > 3
RETURN h, m

effectively the same as #1 . the WITH h, m,count(m) as movies effectively evaluates to

Tom Hanks, Apollo13, 1
Tom Hanks, Joe Versus the Volcano, 1

Regarding
4. MATCH (h:Person)-[r:ACTED_IN]->(m:Movie)
WITH h, r, m,count(m) as movies
WHERE movies > 6
RETURN h, m

more or less the same as #1 and #3 as this then evaluates to

Tom Hanks, relationship001, Apollo13, 1
Tom Hanks, relationship002, Joe Versus the Volcano, 1
```

Regarding
5. ```
MATCH p=(h:Person{name:"Tom Hanks"})-[r:ACTED_IN]->(m:Movie)
WITH p, count(r) as movies
WHERE movies > 3
RETURN p
```

each P represents a unique Path and thus this evaluates to

TomHanks-ACTED_IN->Apollo13, 1,
TomHanks-ACTED_IN->Joe Versus The Volcano,1

Adding a bit more context, when you use aggregation functions, the non-aggregation variables become the grouping key of the aggregation, which is the context by which the aggregation is performed.

In your examples that aren't returning data, your aggregation (the count of movies) is with respect to the non-aggregation variables forming the grouping key.

WITH h, r, count(m) as movies

"For each person node (Tom Hanks) and the specific relationship r on this row, get the count of movies"
Since a specific relationship r can only point to one movie, the count will always be 1 per row.

WITH h, m,count(m) as movies

"For each person node (Tom Hanks) and the specific movie m on this row, get the count of movies"
Since m on this row will always be one specific movie (for example the results of WITH h, m), the count of that movie per row will always be 1.

And in your last example:

MATCH p=(a)-[r:FOLLOWS]->(b:Person)
WITH a,count(b) AS pop
WHERE pop > 1
RETURN a,pop;

Per single a node per row, you're getting the count of persons that they follow, and filtering to only those that follow more than 1 person, but there are no such entries in the movies graph. For the persons that do have outgoing follows relationships, they only have a single follows relationship to another person.

clem
Graph Steward

I got tripped up with aggregating functions too.

My example is simpler, so it may be easier to understand what works or doesn't work and why by looking at it:

I believe the problem you are facing, is you need to have only ONE variable that is being aggregated in the WITH statement. Then you can have one or more aggregation functions. Adding the r broke things.

Here, I'm aggregating per Person and I can get the number of movies per person or average year.

MATCH (h:Person)-[r:ACTED_IN]->(m:Movie)
WITH h, count(m) as movies, avg(m.released) AS releaseYearAvg
RETURN  h.name, movies, releaseYearAvg