cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

What does WITH do? Graph Academy Question

gq16
Node Clone

I'm currently doing the course "Graph Data Modeling Fundamentals"(https://graphacademy.neo4j.com/courses/modeling-fundamentals/) and I'm very confused about what exactly happens to the data after a WITH is used. I've read the documentation from the Neo4j Cypher Manual, but it isn't really clicking for me. I've been working in Graph Academy with the "Movie" Database and this cypher code is given in the course to create language nodes from what was previously the .languages property.

MATCH (m:Movie)

UNWIND m.languages AS language

WITH language, collect(m) AS movies

MERGE (l:Language {name:language})

WITH l, movies

UNWIND movies AS m

WITH l,m

MERGE (m)-[:IN_LANGUAGE]->(l);

MATCH (m:Movie)

SET m.languages = null

As you can see, WITH is used multiple times in the above query, but what does it really do? What is the significance of the first variable (before the comma) and the second one (after the comma) and how do they become related after the clause is executed?

1 ACCEPTED SOLUTION

Assume the following data for the discussion. It consists of two movies nodes, each with a list of three languages.

create(n:Movie{languages:['english', 'french', 'arabic']})
create(n:Movie{languages:['english', 'german', 'spanish']})

  1. MATCH (m:Movie)

The match clause searches for all nodes with label "Movie". This will produce a result row for each Movie node. Sorry, I used 'n' in my queries instead of 'm' that is used in the question's query.

  1. UNWIND m.languages AS language

Property m.language is a list of languages. The unwind converts this list of languages into rows, one element of the language list per row. This is done for each Movie node in the result set. The other elements on each row are repeated for each language element. The result is a follows. Notice how the two result rows from above got converted into three rows each. The node 'n' corresponding to the language is repeated for each new row. I outputted the node's 'id', so you can see which rows are from the same node.

  1. WITH language, collect(m) AS movies

The 'WITH' clause is used to pass data between stages of a multi-stage query. In this case, we also have an aggregate function (collect) in the with clause, so the data is getting grouped by the properties outside the collect method's argument. In this case, the data is getting grouped by 'language', so the result will be a row for each different value of language and the collect will gather all the movie nodes that have that value of language. As you can see, the 'movies' column contains a list of movies. The list are the movie nodes that contained that language. The 'English' language has two movies nodes in its list, since each movie had 'English' as a language.

  1. MERGE (l:Language {name:language})

The 'language' and 'movies' values are now available to the 'match' statement since they were passed in the 'with' clause. This merge is creating a new 'Language' node, with the name property each to the value of language. In our example, this will create five language nodes, one for each row of the above result.

  1. WITH l, movies

The 'with' clause is passing the newly created Language node and the corresponding list of movies for this language. A 'with' clause is required between certain sets of clauses, such as a 'merge' followed by a 'match'. See how each row is one new 'language' node and has that language's corresponding list of movies that contain that language.

  1. UNWIND movies AS m

The unwind is taking the list of movie nodes on each row and unwinding them into rows. Each movie node from the list will also have its corresponding language node as part of the row. Only the English language node has multiple movies, so unwind its movies resulted in the first two rows. The other language nodes had only one element their movies lists, so only one row resulted from each language node.

  1. WITH l,m

The 'with' clause is passing the language node and movie node through to the next stage of the query. This will be the 'l' and 'm' columns from the above result set. This is now available for further processing.

  1. MERGE (m)-[:IN_LANGUAGE]->(l);

The 'merge' is creating a relationship between the 'movie' node and 'language' node for each row passed by the 'with'. Since the line is terminated with a ';', this cypher is complete.

View solution in original post

7 REPLIES 7

Assume the following data for the discussion. It consists of two movies nodes, each with a list of three languages.

create(n:Movie{languages:['english', 'french', 'arabic']})
create(n:Movie{languages:['english', 'german', 'spanish']})

  1. MATCH (m:Movie)

The match clause searches for all nodes with label "Movie". This will produce a result row for each Movie node. Sorry, I used 'n' in my queries instead of 'm' that is used in the question's query.

  1. UNWIND m.languages AS language

Property m.language is a list of languages. The unwind converts this list of languages into rows, one element of the language list per row. This is done for each Movie node in the result set. The other elements on each row are repeated for each language element. The result is a follows. Notice how the two result rows from above got converted into three rows each. The node 'n' corresponding to the language is repeated for each new row. I outputted the node's 'id', so you can see which rows are from the same node.

  1. WITH language, collect(m) AS movies

The 'WITH' clause is used to pass data between stages of a multi-stage query. In this case, we also have an aggregate function (collect) in the with clause, so the data is getting grouped by the properties outside the collect method's argument. In this case, the data is getting grouped by 'language', so the result will be a row for each different value of language and the collect will gather all the movie nodes that have that value of language. As you can see, the 'movies' column contains a list of movies. The list are the movie nodes that contained that language. The 'English' language has two movies nodes in its list, since each movie had 'English' as a language.

  1. MERGE (l:Language {name:language})

The 'language' and 'movies' values are now available to the 'match' statement since they were passed in the 'with' clause. This merge is creating a new 'Language' node, with the name property each to the value of language. In our example, this will create five language nodes, one for each row of the above result.

  1. WITH l, movies

The 'with' clause is passing the newly created Language node and the corresponding list of movies for this language. A 'with' clause is required between certain sets of clauses, such as a 'merge' followed by a 'match'. See how each row is one new 'language' node and has that language's corresponding list of movies that contain that language.

  1. UNWIND movies AS m

The unwind is taking the list of movie nodes on each row and unwinding them into rows. Each movie node from the list will also have its corresponding language node as part of the row. Only the English language node has multiple movies, so unwind its movies resulted in the first two rows. The other language nodes had only one element their movies lists, so only one row resulted from each language node.

  1. WITH l,m

The 'with' clause is passing the language node and movie node through to the next stage of the query. This will be the 'l' and 'm' columns from the above result set. This is now available for further processing.

  1. MERGE (m)-[:IN_LANGUAGE]->(l);

The 'merge' is creating a relationship between the 'movie' node and 'language' node for each row passed by the 'with'. Since the line is terminated with a ';', this cypher is complete.

Thank you for your awesome answer, I understand the function of the WITH clause now. I am now wondering though, in your last step,

  1. MERGE (m)-[:IN_LANGUAGE]->(l);

based on the graphic above it in step 6, it seems to me that 6 "m" nodes will be created and will point to whatever "l" nodes are appropriate even though we actually have only two movie nodes. Or the "m" nodes still have the unique ID's 15 and 16 from step 1 and therefore although there are 6 rows of 'm', there is really only two nodes and each node has 3 relationships?

Yes, these are still the same movie nodes matched earlier. They have been passed through the chain of 'with' clauses (either as a member of a list or individually following an unwind).

The output following the with l, m shows six rows, but with the same two movies.

The output following the MERGE (m)-[:IN_LANGUAGE]->(l)clauses is as follows: As you currently concluded, there are just two movies nodes, with three relationships each to language nodes. I included the id of the relationships so you can see there are six total relationships.

I don't know what happened. My original posting had numbered the steps, but now they are all labeled as '1'.

gq16
Node Clone

So in the

WITH language, collect(m) AS movies

step, couldn't you ignore the collect function since the Merge clause that occurs afterwards already searches for duplicates nodes? So even though there would be two "english" under "language" it wouldn't matter once you use the MERGE to create the nodes?

You are correct again. The following query gave me the same results. This is because the 'unwind' keeps the association of the language's movie, so after the unwind you have rows that contain each language and its corresponding movie. The other key is the use of 'MERGE', as you already identified. Now, there may be a performance difference between the two queries if you have lots of movies and/or languages, because you are now relying on the merges to remove duplications.

match(m:Movie) 
unwind m.languages as language
MERGE (l:Language {name:language})
MERGE (m)-[r:IN_LANGUAGE]->(l)
return id(m), m, id(l), l, id(r) as relationship_id
order by id(m)

The ids are different, since I recreated the data:

I don't understand the control flow of Cypher.
STEP 1: MATCH (m:Movie)
STEP 2: UNWIND m.languages AS language
RETURN language

If I write out the execution flow of Cypher in python logic, will it look like this?
for m in Movies:
    for language in m.language:
        print(language)

However, I do not understand how the with keyword works from here:
WITH language, collect(m) AS movies

Your pseudo code gets the point, but there is more gong one with an ‘unwind’ clause. When you unwind a collection, you are creating a result row per element in the collection. Along with the individual elements, one per line, each line will include the other variables in scope being repeated for each line resulting from an unwind. 

for example, assume there are two movie modes resulting from the match, then the output of the match will be the following:

movie1
movie2

Each movie node contains a property called languages, which is a collection. The result from the unwind of m.language per movies mode is now more than the two movie modes. The result of the ‘match’ followed by the ‘unwind’ will now be the following, assuming movie one has two languages and movie two has three:

movie1, language1

movie1, language2

movie2, language2

movie2, language3

movie2, language4

One of the purposes of ‘with’ is to group rows to apply aggregate functions over the grouped rows. The grouping is determined by the ‘with’ parameters not included in the aggregate functions. In the above example, the grouping is done by the language and they are collection the movies nodes; there, all the rows with the same language will be grouped and the collection will be done over those rows. The same if the aggregate was sum, min, max, count, etc, those calculations with be done group by group. The result from our example will be. 

language1, {movie1}
language2, {movie1, movie2}
language3, {movie2}
language4, {movie2}


does that help?