In Exercise 4.5, Why do I get duplicates?

Hi all,
In Exercise 4.5 (Retrieve all people that wrote movies by testing the relationship between two nodes), I tried this query:

MATCH (p:Person) -- (m:Movie) WHERE ((p)-[:WROTE]->(m)) RETURN p.name, m.title

and I get

p.name m.title
"Aaron Sorkin" "A Few Good Men"
"Aaron Sorkin" "A Few Good Men"
"Jim Cash" "Top Gun"
"Cameron Crowe" "Jerry Maguire"
"Cameron Crowe" "Jerry Maguire"
"Cameron Crowe" "Jerry Maguire"

But the exercise solution returns no duplicates:

MATCH (a)-[rel]->(m)
WHERE a:Person AND type(rel) = 'WROTE' AND m:Movie
RETURN a.name as Name, m.title as Movie

What am I doing wrong?

3 REPLIES

Hello Patrick,

The MATCH clause, MATCH (p:Person)--(m:Movie), returns one row for every relationship between a person and a movie. So there is a row where Aaron Sorkin WROTE the movie and a row where Aaron Sorkin ACTED_IN the movie. For each of these rows, the WHERE clause tests whether the p node has a WROTE relationship to the m node. The test passes for both rows, so the query returns two rows.
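
One way to see where the extra rows come from is to bind the matched relationship to a variable and return its type. This is only a sketch, assuming the course's Movie graph is loaded; the variable r and the alias matchedVia are names introduced here for illustration.

```cypher
// Same pattern as the original query, but the relationship is bound to r
// so each returned row shows which relationship produced it.
MATCH (p:Person)-[r]-(m:Movie)
WHERE (p)-[:WROTE]->(m)
RETURN p.name, m.title, type(r) AS matchedVia
```

Each row that looked like a duplicate should now differ in the matchedVia column.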

Later in the course, you will learn about DISTINCT which removes duplicate rows. For example, this query would remove the duplicates:

MATCH (p:Person) -- (m:Movie)
WHERE ((p)-[:WROTE]->(m))
RETURN DISTINCT p.name, m.title

Elaine

Hi Elaine,

Thanks for this explanation.

In terms of performance, which is the better query: the one from the exercise or the one with DISTINCT?

Patrick

Well, it depends on what you want out of the query.

The original query in the exercise:

MATCH (a)-[rel]->(m)
WHERE a:Person AND type(rel) = 'WROTE' AND m:Movie
RETURN a.name as Name, m.title as Movie

DISTINCT isn't required, because we're specifying paths with :WROTE relationships, and provided that our model only has a single :WROTE relationship between a person and movie (it does), there won't be duplicate rows.
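
The assumption that a person has at most one :WROTE relationship to a given movie can be checked directly. This is a sketch, assuming the course's Movie graph; wroteCount is an alias introduced here for illustration.

```cypher
// Count :WROTE relationships per (person, movie) pair and keep only
// the pairs that have more than one.
MATCH (p:Person)-[r:WROTE]->(m:Movie)
WITH p, m, count(r) AS wroteCount
WHERE wroteCount > 1
RETURN p.name, m.title, wroteCount
```

If the assumption holds, this query returns no rows.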

Your query is different:

MATCH (p:Person) -- (m:Movie)
WHERE ((p)-[:WROTE]->(m))
RETURN p.name, m.title

While you do filter out any paths where the person didn't write the movie, you still get one path for every relationship that exists between the person and the movie, and you don't make any use of these extra rows (no categorization by type, no counting or other aggregation). So while you could add DISTINCT here to get rid of the duplicates, the bigger issue is that your query asks for more data than you need and then has to filter out the excess at the end. It's better practice to write the query so that it matches only the data you need and nothing extra:

MATCH (p:Person)-[:WROTE]->(m:Movie)
RETURN p.name, m.title
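
On the performance question, one way to compare the queries yourself is to prefix each with PROFILE, which runs the query and reports the execution plan along with the rows and db hits for each operator. This is a sketch, assuming the course's Movie graph; run each statement separately and compare the totals.

```cypher
// Broad match, filtered afterwards and de-duplicated.
PROFILE
MATCH (p:Person)--(m:Movie)
WHERE (p)-[:WROTE]->(m)
RETURN DISTINCT p.name, m.title;

// Direct match on the :WROTE relationship only.
PROFILE
MATCH (p:Person)-[:WROTE]->(m:Movie)
RETURN p.name, m.title;
```

The exact numbers depend on the Neo4j version and the data, but the direct match would generally be expected to touch fewer relationships.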