Neo4j

e_boter · 05-13-2020

I have a question about exercise 5.11: ‘Retrieve the actors who have acted in exactly five movies, returning the name of the actor, and the list of movies for that actor.’ The solution in the course is as follows:

MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WITH a, count(a) AS numMovies, collect(m.title) AS movies
WHERE numMovies = 5
RETURN a.name, movies

Why is the count on a:Person? It seems this query counts the number of Persons who have one or more :ACTED_IN relationships with a Movie.

My solution is as follows:

MATCH (p:Person)-[acted:ACTED_IN]->(m:Movie)
WITH p, count(acted) AS numMovies, collect(m.title) AS movies
WHERE numMovies = 5
RETURN p.name, movies

I count on the relationship :ACTED_IN. This seems more logical to me: for any given Person, I want to know how many times the relationship :ACTED_IN occurs. Then I select the Persons who have exactly five of those relationships.

Can anyone explain why the solution in the course has a count on a:Person?

elaine_rosenber · 05-13-2020

Both ways of doing the query are correct. You will find that with Cypher, there is typically more than one way to perform the same query. The differences become significant when query performance is a consideration. For this query, the two versions should perform identically.

Elaine
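
If you want to check the performance claim on your own database, one option is to prefix each version with PROFILE and compare the plans and db hits. This is only a sketch: it assumes the Movie graph from the course is loaded, and the exact plan and numbers depend on your Neo4j version and data. Run the two statements one at a time.

```cypher
// Version from the course: counts the node variable a.
PROFILE
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WITH a, count(a) AS numMovies, collect(m.title) AS movies
WHERE numMovies = 5
RETURN a.name, movies;

// Version from the question: counts the relationship variable acted.
PROFILE
MATCH (p:Person)-[acted:ACTED_IN]->(m:Movie)
WITH p, count(acted) AS numMovies, collect(m.title) AS movies
WHERE numMovies = 5
RETURN p.name, movies;
```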

llpree · 05-13-2020

While, like SQL, there are many ways to get answers from a graph, I think this is a fantastic question because it demonstrates why a graph is NOT SQL. Let me use this to drive a point home about the value a graph provides and why this answer is "right" or "the best".

To keep things simpler, I've pasted just a single return result, with Hugo as the "5-acted-in" actor. One of the key things a graph offers is the ability to see & calculate data differently. Notice that visually, while you could chase down all the relationships to Hugo and eventually get "5", you'd have to "ask" each relationship which nodes it connects. And, because that's ALWAYS 2 nodes, you'd then have to keep track of both. Eventually, you get to "5". And BTW, this "pain" should remind you of those sorting algorithm courses.

But, if you look at one of the most popular graph algorithms - degree centrality - which is the number of connections a node has - you'd just ask each node, get to Hugo and he'll tell ya: 5. And compared to the sort, that's O(n). So, both visually and from a calculations POV, it's just elegant.

My journey towards graphs led me to realize that once we begin to think outside the Cartesian walls, we'll begin to see why, while the number "5" is the same answer, it's really not on so many levels. Once we know the degree centrality, we begin to have access to a whole facet of mathematics, analytics and data science that Descartes was not even considering. Not saying he was not useful, just that graphs are another topic and your great question offers a clear picture into how something so "simple" represents something so powerful and elegant.

Hope this is helpful. We can always get those SQL and Excel values with Cypher, but that's not the point of a graph solution. The training solution reflects this insight.
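
For what it's worth, the "ask each node for its degree" idea above can be written directly in Cypher, without aggregating over matched rows. The following is a hedged sketch, not the course solution: it assumes the course's Movie graph and the Neo4j 5 COUNT { } subquery syntax, and it counts outgoing :ACTED_IN relationships per Person, which equals the number of movies only if an actor has at most one :ACTED_IN relationship per movie.

```cypher
// Ask each Person node how many outgoing :ACTED_IN relationships it has.
MATCH (a:Person)
WHERE COUNT { (a)-[:ACTED_IN]->(:Movie) } = 5
// A pattern comprehension gathers the titles without a separate aggregation step.
RETURN a.name, [(a)-[:ACTED_IN]->(m:Movie) | m.title] AS movies
```

On Neo4j 3.x and 4.x, where COUNT { } is not available, the same filter was typically written as WHERE size((a)-[:ACTED_IN]->(:Movie)) = 5.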

andrew_bowman · 05-15-2020

"Can anyone explain why the solution in the course has a count on a:Person?"

Personally I would argue that's a typo or error to be corrected. The query WILL give you the correct answer, but it does feel like there's a mismatch between the name of the variable, and the variable that's being counted.

As to WHY it still returns the correct answer, you have to understand how Cypher generates results.

When you do a MATCH, Cypher finds all possible paths that match the pattern, and emits a row for each, taking the variables from the path as needed. So if the same person acted in 5 movies, you would get 5 paths (rows), with the same person for all 5 rows, but a different relationship and movie for each row.

The count() aggregation here isn't counting distinct nodes. (You COULD have it count distinct nodes with count(DISTINCT a), which would be 1 for every actor, since each group contains only one distinct a.) So whether we count the person, the relationship, the movie, or anything else that is non-null on every row, the count will be 5, since there are 5 rows for the given actor.

If it helps, do a MATCH to a person in the graph with 5 movies, and return the variables:

MATCH (a:Person {name:'Hugo Weaving'})-[:ACTED_IN]->(m:Movie)
RETURN a, m

In the Table results view, you should see 5 rows, with Hugo Weaving on all 5 as a, and a different movie per row.

In the same kind of query but with aggregations:

MATCH (a:Person {name:'Hugo Weaving'})-[:ACTED_IN]->(m:Movie)
RETURN a, count(a), collect(m.title)

The count() and collect() aggregations are computed per a (the non-aggregation variables become the grouping key of the aggregation), so the count() is asking:

"per a, what is the count of a?"
Well for Hugo Weaving, there are 5 entries for Hugo Weaving, so that's 5.

and for the collect:

"per a, collect the movies for a"
Okay, we collect the 5 movies for Hugo Weaving.
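
To see the row mechanics without relying on any data in the database, here is a small sketch that uses UNWIND over a hand-written list as a stand-in for the rows a MATCH would emit (one row per path). The list itself is an assumption made for illustration; only the counting behavior is the point.

```cypher
// Hand-written stand-in for MATCH output: one row per (actor, movie) path.
UNWIND [
  {a: 'Hugo Weaving',     m: 'The Matrix'},
  {a: 'Hugo Weaving',     m: 'The Matrix Reloaded'},
  {a: 'Hugo Weaving',     m: 'The Matrix Revolutions'},
  {a: 'Hugo Weaving',     m: 'V for Vendetta'},
  {a: 'Hugo Weaving',     m: 'Cloud Atlas'},
  {a: 'Carrie-Anne Moss', m: 'The Matrix'},
  {a: 'Carrie-Anne Moss', m: 'The Matrix Reloaded'},
  {a: 'Carrie-Anne Moss', m: 'The Matrix Revolutions'}
] AS row
WITH row.a AS a, row.m AS m
// a is the only non-aggregated column, so it becomes the grouping key.
RETURN a,
       count(a)          AS countA,     // non-null a values in the group
       count(m)          AS countM,     // non-null m values in the group
       count(*)          AS countRows,  // rows in the group
       count(DISTINCT a) AS distinctA,  // distinct a values in the group
       collect(m)        AS movies
```

For the Hugo Weaving group the first three counts are all 5 and distinctA is 1; for the Carrie-Anne Moss group they are 3 and 1. Every count without DISTINCT is just counting the rows in the group, which is why the course solution and the relationship-counting version return the same result.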

Question about exercise 5.11