UNWIND makes no sense to me

The following query is pointless in itself, but it illustrates the point I am confused about:
MATCH p=(tom:Person {name:"Tom Hanks"})-[:ACTED_IN*2]-(:Person)
UNWIND [n IN nodes(p) WHERE n.name = 'Bill Paxton'] AS a
RETURN p

Why does UNWIND modify the returned p?

1 ACCEPTED SOLUTION

You have summarized it well.  

In general, a match produces a set of rows, and the next match is executed once for each row from the first match. If the second match does not depend on the first, you end up with a Cartesian product of the results of the two matches. This pattern continues with each successive match.

In this query, the unwind is performed for each row produced by the first match, and each row carries one value of 'p'. The unwind can produce any number of rows for each incoming row, and every row it produces from a given 'p' has that value of 'p' attached to it. When the list for a given 'p' has zero elements, the unwind produces zero rows, so the corresponding 'p' value is lost.
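A minimal sketch of that row-dropping behaviour, which needs no graph data. The values 'a', 'b' and 'c' are made up for illustration and stand in for three rows of 'p'; the CASE expression is only there to hand the unwind an empty list for one of them.

```cypher
// Three incoming rows, one per value of p (illustrative values).
UNWIND ['a', 'b', 'c'] AS p
// For 'b' the list is empty, so the unwind emits no rows for it.
UNWIND CASE p WHEN 'b' THEN [] ELSE [1, 2] END AS x
RETURN p, x
```

This should return two rows each for 'a' and 'c' and none for 'b', which is the same effect as the paths that disappear from the original query.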

The following query may demonstrate this, using some sample data that it creates first:

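// create the sample data
create (n:User {name:'scott'})
create (n)-[:KNOWS]->(:User {name:'mary'})
create (n)-[:KNOWS]->(:User {name:'sam'})
create (n)-[:KNOWS]->(:User {name:'gregg'})
// a with is required between create and match
with n
match (n)-[:KNOWS]->(m:User)
// emit each matched row once per list element
unwind [1,2] as x
return x, n.name, m.name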

The match produces three rows, one for each user whom Scott knows. For each of these rows, the unwind creates two rows, for a total of six. The values from each matched row are attached to both of the rows the unwind produces for it. The result is as follows. As you can see, the row passed to the unwind is repeated for each value in the list. It's a Cartesian product in this case, since the unwind does not depend on the results of the match.

Screen Shot 2023-02-06 at 11.31.34 PM.png
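One way to check the six-row claim without reading a result table is to group the unwound values back per matched row. This is only a sketch: it assumes the scott/mary/sam/gregg sample data has been created exactly once, and the aliases 'friend' and 'xs' are names chosen here for illustration.

```cypher
// Assumes the sample data above has been loaded once.
MATCH (n:User {name: 'scott'})-[:KNOWS]->(m:User)
UNWIND [1, 2] AS x
// Collapse the unwound rows back to one row per friend.
RETURN m.name AS friend, collect(x) AS xs, count(*) AS rowsPerFriend
```

Under that assumption there should be one row per friend, each with both list values in 'xs' and a 'rowsPerFriend' of 2, which accounts for the six rows before grouping.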


3 REPLIES

The first match should return a row for each person who acted in a movie with Tom Hanks.
Next, for each row, you create a list of the nodes in the current p value and filter it to just those nodes where the person's name is Bill Paxton. This list should either be empty or contain one node, which would be Bill Paxton's node. The unwind results in one row for each element of the list, with the p value repeated for each element. So no p value is returned for any path where Bill Paxton wasn't the person node, since the unwind is over an empty list. The only result I suspect you get is the paths where Bill Paxton and Tom Hanks acted in the same movie.

Is that what you meant by modifying p?
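If the goal is only to keep the paths that contain Bill Paxton, the same filtering can be written as a predicate instead of relying on the unwind dropping rows. The sketch below is an alternative formulation, not the query from the question, and it assumes the same Person nodes and ACTED_IN relationships.

```cypher
MATCH p=(tom:Person {name:"Tom Hanks"})-[:ACTED_IN*2]-(:Person)
// Keep a path only if at least one of its nodes is Bill Paxton.
WHERE any(n IN nodes(p) WHERE n.name = 'Bill Paxton')
RETURN p
```

Written this way, the filtering is visible in the WHERE clause rather than being a side effect of unwinding an empty list.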

That kind of makes sense, but it is still a bit confusing.  The way I have been reading the query is as follows:

  • The first line establishes a value for p that represents all of the paths between Tom Hanks and other actors where they worked in the same movie.
  • The second line processes all of the paths in p to create a list of nodes in a where the node name equals Bill Paxton, but p still contains all of the original paths.
  • The third line returns the original paths in p.

Your explanation makes me realise that my mental model was completely broken.  What I now think is happening is as follows:

  • The first line matches paths in the database where Tom Hanks has worked in movies with other actors.  Each matching path is individually passed to subsequent query processing as the variable p.
  • Each path matched in the first line is processed by the second line and looks for nodes where name equals Bill Paxton. When there are no nodes then the query processing stops because you have a null result. When Bill Paxton is found, then the path that contained Bill Paxton is passed down to the next line of the query.
  • The third line simply returns the individual paths in p for which line 2 returned a non-null result.

The null result handling is kind of explained in https://neo4j.com/docs/cypher-manual/current/clauses/unwind/#unwind-using-unwind-with-an-empty-list - at least I think that is what the documentation means (in the context of my new "understanding").
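Strictly speaking, the filtered list here is empty rather than null, although UNWIND produces zero rows in both cases. The documentation section linked above also shows a way to keep such rows: substitute a one-element list containing null when the list is empty. Applied to the original query as a sketch (the variable name 'matches' is introduced here for illustration):

```cypher
MATCH p=(tom:Person {name:"Tom Hanks"})-[:ACTED_IN*2]-(:Person)
WITH p, [n IN nodes(p) WHERE n.name = 'Bill Paxton'] AS matches
// An empty list would drop the row, so unwind [null] instead.
UNWIND CASE WHEN size(matches) = 0 THEN [null] ELSE matches END AS a
RETURN p, a
```

With this change every path should come back, with 'a' set to null on the paths that do not contain Bill Paxton.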
