Neo4j

oren · ‎03-19-2021

what is a row?
what's the difference between a list and a collection of rows?

This post originated in online chat. It was so good we wanted to share it more broadly, with consent from the OP.

terryfranklin82 · ‎03-19-2021

terryfranklin82 mentioned...

Rows represent returned instances of a match, e.g if 5 people live in a house and you run MATCH (h:House)<-[:LIVES_IN]-(p:Person) RETURN * you'll get 5 rows, each containing (h) and (p)

terryfranklin82 · ‎03-19-2021

terryfranklin82 mentioned...

As far as I know a list == a collection of rows, e.g. in the example above if you ran COLLECT(p) then you'd get a list containing 5 p nodes ([p, p, p, p, p])

oren · ‎03-19-2021

oren214 mentioned...

thank you.
but wait...
If a list is a collection of rows, what is being returned in your example? I can't treat it like a list. I have to convert it into one with COLLECT

terryfranklin82 · ‎03-19-2021

terryfranklin82 mentioned...

Yes, you're right. A list is a collection of "things" that you choose to collect()

oren · ‎03-19-2021

oren214 mentioned...

So what does a MATCH return?

terryfranklin82 · ‎03-19-2021

terryfranklin82 mentioned...

MATCH by itself doesn't return anything, it just checks for a pattern. RETURN will return data, and if you don't perform any aggregation then you'll probably just get back plain rows.

oren · ‎03-19-2021

oren214 mentioned...

Why not just return the rows as a collection so we can iterate and work with it more easily? I mean, I'm sure there's probably multiple excellent reasons, but i guess I'm just trying to say that this feels like it could be made easier to understand and that there are things about cypher that feel harder than maybe they need to be from a developer-user standpoint.

terryfranklin82 · ‎03-19-2021

terryfranklin82 mentioned...

You can do exactly that, just use aggregating functions like collect() in your RETURN statement. As far as making MATCH responsible for collecting rows, that would severely limit your ability to perform subsequent matches on the output (e.g. chaining multiple match statements)

andrew_bowman · ‎03-19-2021

andrew.bowman mentioned...

To clarify, we have the concept of records, and you'll get one record per row. Depending on how the query is planned, rows are processed lazily and each operation is applied to each record, modifying or creating new records. It's easier to conceptualize these are "rows" though.

Rows are NOT the same as lists, and that's why it's recommended to reserve plural variable names for when you're actually working with lists, vs when you're not, because a list is referring to a collection, but a variable on its own references just a single thing for the row it's on.

oren · ‎03-19-2021

oren214 mentioned...

Why are we distinguishing between a collection of rows and a list, though? Why not just have a single collection type that allow me to both perform subsequent matches AND iterate over rows? I'm guessing that there are underlying mechanics that I'm just unaware of that answer my question, but could there not be a layer to abstract away these details so developers don't have to figure out if they are dealing with a collection of rows vs a different kind of collection of rows?

Also, a sidenote: it's weird that they are called rows since the whole thing about graphs is supposed to be that there are no tables. Makes me question if that's even true, which is silly.

andrew_bowman · ‎03-19-2021

andrew.bowman mentioned...

Some examples.

MATCH (movie:Movie {title:'The Matrix'})<-[:ACTED_IN]-(actor:Person)
RETURN movie, actor

This returns 5 records/rows.

On all 5 rows, movie is The Matrix. But actor is different on each row, because each row was generated from a path that matched the pattern...there were 5 such persons who acted in the Matrix, and as each was discovered during the MATCH operation, a row was generated for the path for that matched path, but we were only interested in the movie and actor from each path. That's what generated the rows.

Now if we try to do some additional MATCH and try to do something like exclusion, but mistakenly think that actor represents a list of actors instead of the actor-per-row, we will run into trouble. Let's also rename actor to actors to represent this misinterpretation, as this is a common mistake:

MATCH (movie:Movie {title:'The Matrix'})<-[:ACTED_IN]-(actors:Person)
WITH actors
MATCH (other:Movie) 
WHERE NOT (actors)-[:ACTED_IN]->(other)
RETURN other.title

You might think "oh, this will get me movies that doesn't include any of the Matrix cast", and you would be wrong.

For example, this will return Cloud Atlas, but Hugo Weaving acted in it. It contains Johnny Mnemonic, but Keanu Reeves acted in it! In fact you will also see the other two Matrix sequels come up in the results.

Why? Because actors here doesn't represent the list of actors. It represents one actor for that row. And operations in Cypher execute per row.

So for the row where actors is Keanu Reeves, the MATCH executes, and finds all :Movie nodes such that actors for that row (Keanu Reeves) did not act in the movie. This includes Cloud Atlas, since Keanu Reeves didn't act it in. You will get a row per matched path of a movie where Keanu Reeves was not an actor. (Johnny Mnemonic is not included, because Keanu Reeves acted in it)

But we still have the remaining rows to process, one per actor.

The next row is processed, say for Hugo Weaving who played Agent Smith. The MATCH executes, finding all paths matching the pattern where actors for that row, Hugo Weaving, did not act in the movie. This includes Johnny Mnemonic, since Hugo Weaving didn't act in it. This does NOT include Cloud Atlas, since Hugo Weaving acted in it. A row is generated per result from these matched paths.

So why are the Matrix sequels present? Because one of the actors for the Matrix in our movies graph is actually not actually supposed to be there: Emil Eifrem, the Neo4j CEO, is in the data as an actor as a joke, but of course he didn't actually appear in the movie. He was not jokingly inserted into any of the Matrix sequels in the graph, but he was in the data for the Matrix, so we have one row from the initial MATCH where actors is Emil Eifrem. Since he isn't present for any other movie in the graph, you will get a row for all other movies in the graph since he never acted in them, including the Matrix sequels. That's why they show up.

And again, this is why using plurals for a variable when it's not a list is a bad idea, because if you treat it as a list when it's not, you will get wrong results. Stick with actor and it may be a bit more apparent that you will be getting back rows for where a singular actor (for that row) didn't act in the movie.

andrew_bowman · ‎03-19-2021

andrew.bowman mentioned...

If you wanted to get what you originally wanted, movies featuring none of the actors from the Matrix, you would have to work with a list of those actors, since that represents the set of them. We also have list predicates to help out.

We can start with this:

MATCH (movie:Movie {title:'The Matrix'})<-[:ACTED_IN]-(actor:Person)
WITH collect(actor) as actors
...

Here, we collect the actor nodes across all rows into a single list. We have 1 row at this point, containing the list of the actors. Here it makes sense to use the plural actors, since it represents the list of them.

We can continue this to find movies where none of the actors acted in the movie:

MATCH (movie:Movie {title:'The Matrix'})<-[:ACTED_IN]-(actor:Person)
WITH collect(actor) as actors
MATCH (other:Movie)
WHERE none(actor IN actors WHERE (actor)-[:ACTED_IN]->(other)
RETURN other.title

There are alternate approaches too, some are more efficient.

For example, per other movie, we could collect its actors, and make sure that none of the actors in the actors list from the Matrix appear in the list of actors for the other movie:

MATCH (movie:Movie {title:'The Matrix'})<-[:ACTED_IN]-(actor:Person)
WITH collect(actor) as actors
MATCH (other:Movie)<-[:ACTED_IN]-(actor)
WITH actors, other, collect(actor) as otherActors
WHERE none(actor IN actors WHERE actor IN otherActors)
RETURN other.title

Or we could skip working with actors completely, and in our first MATCH get the movies all of the Matrix actors have acted in, and use that as the movies to exclude from the subsequent MATCH.

MATCH (matrix:Movie {title:'The Matrix'})<-[:ACTED_IN]-()-[:ACTED_IN]->(movie)
WITH collect(DISTINCT movie) + matrix as excludedMovies
MATCH (other:Movie)
WHERE NOT other IN excludedMovies
RETURN other.title

We're collecting DISTINCT movies because there may be movies where multiple Matrix actors have acted, and it's a little simpler to have our excluded movies list only have each movie once, instead of having duplicates.

Why do we add Matrix to this list? Because the MATCH pattern would not include it otherwise (we can't traverse the same relationship twice per path, and since each actor only has a single :ACTED_IN relationship to each movie they acted in, there will be no path where you double-back on that :ACTED_IN relationship back to the Matrix...so we have to add the node into the list ourselves.

andrew_bowman · ‎03-19-2021

andrew.bowman mentioned...

In any case, for all of these alternate approaches, we are dealing with multiple items in a single list, and the usage of list predicates and the IN list membership predicate make that clear.

andrew_bowman · ‎03-19-2021

andrew.bowman mentioned...

As to your remark, it is more accurate to say we're working with "streams of records" instead of rows. But some decisions were made on the language, including what we show in the PROFILE and EXPLAIN query plans.

andrew_bowman · ‎03-19-2021

andrew.bowman mentioned...

As for why we can't use them interchangeably, it's that each operation in Cypher (with a few exceptions, like aggregations) execute per record/row...and because of that, the operation isn't treating it like a collection, because it's not a collection. the node (or relationship, or path, or projected value) variables on that row (or for that record) refer to a single thing, not any kind of grouping of things.

andrew_bowman · ‎03-19-2021

andrew.bowman mentioned...

Things would get VERY muddled if you were to try to convolute the two. A simple case here (and I'm fudging the tense of the variables to show why this would be a bad idea):

MATCH (movies:Movie)<-[:ACTED_IN]->(actors:Person)-[:FRIENDS_WITH]->(friends:Person)
RETURN movies, actors, friends

How exactly would we return this data, if movies, actors, and friends were all treated like collections here?

If these are automatically treated as lists, then you would have a list of movies, a list of actors, and a list of friends...but how would we then associate which movies a person acted in? In these lists of friends, which friends belong to which actor? You've lost all association between which of the results are associated with items in the other results.

How would we determine which friends are those of Keanu Reeves vs Hugo Weaving's? How do we know which movies are those that Laurence Fishburne acted in, vs those where Keanu is starring?

And if you decide to get around this ambiguity, to somehow have one of these terms represent a singular thing, while the others are kept as lists (so a single actor, and for each actor a list of movies, and a list of the actor's friends), what would determine which of the variables is represented as a singular, and which are treated as lists? What if we added an additional dimension to the pattern, say the university each of the friends graduated from? How do we preserve the association between the friend and each friend's university, instead of having a list of friends and a list of universities with no association for which belongs to which? And this is all assuming that we don't have a more complex set of MATCHes which interrelate those results (friends who have acted in movies? universities that appear in movies? Something more complex than these?) How do we decide what to treat as lists vs what to not treat as lists to preserve associations, especially when we have to represent several sets of associations between our many variables at once?

Rows or records are necessary to preserve associations in the data during processing, and in what is returned. Lists are created explicitly, not generated automatically (with some exceptions...a single path has an ordered list of nodes and relationships, for example, because a path is a 2-d ordered set of nodes and relationships).

oren · ‎03-19-2021

oren214 mentioned...

First of all, bravo! Thank you for this wonderful explanation! Did you just write all that, or is some of taken from other places? You deserve the Graph Academy Award for Best Film (Query Explanation). You continually blow me away, @andrew.bowman with the level of support and explanation you provide us with.

oren · ‎03-19-2021

oren214 mentioned...

I think maybe part of my difficulty is/was centered around the concept of row vs record.
A record, after all, is a single "piece" of data, whereas a row is something different. A row can represent a sort of synthesis of selected data. In my mind, instead of a collection of individual records, I imagine a collection of recorded patterns, where each row can represent a single traversal that matches the requested pattern. In this way, could we not answer your questions? Could we not both iterate over the results, and additionally explore deeper and more complex facets of a particular element of the recorded pattern?

oren · ‎03-19-2021

oren214 mentioned...

I'm using the word pattern, but maybe I mean path...

andrew_bowman · ‎03-19-2021

andrew.bowman mentioned...

As mentioned, a record and a row are referring to the same thing in Cypher. We don't have tables, so we're never referring to a table structure kept as part of the durable graph data.

"Record" is more correct for the reasons you mentioned, and I'd argue that we should probably change the language in our query plans to use "records" and not "rows", that's something I can push for, and it aligns with our usage of "record" in other parts of the Neo4j Browser. It makes sense to standardize the language we use.

That said, when we return these, the most intuitive return format looks a lot like a row. And these rows as a whole appear to make up a table, so we won't completely escape the similarities. We even have a Table result view in the browser.

andrew_bowman · ‎03-19-2021

andrew.bowman mentioned...

And for clarification, a pattern is like what you provide for a MATCH, it's the "this is what you're looking for" instruction to the db.

A path is a piece of the graph data that matches the desired pattern. An instance of a pattern, if you like.

oren · ‎03-19-2021

oren214 mentioned...

Yeah, I'm not trying to complain (too much) about semantics. I get why "rows" is totally relevant and helpful, especially for folks coming from table based databases. It's familiar.

oren · ‎03-19-2021

oren214 mentioned...

so couldn't we get a collection of paths instead of a collection of records? wouldn't that open some doors?

andrew_bowman · ‎03-19-2021

andrew.bowman mentioned...

Sure...if you collect() the paths. But now you have a single list of paths. How do you refer to the elements in each path that you are interested in?

oren · ‎03-19-2021

oren214 mentioned...

well, in the browser when you look at the table view, you see what is basically a collection of hashes. We all know how to dig into a hash, and no reason we couldn't have a dot syntax for that too.

oren · ‎03-19-2021

oren214 mentioned...

Same with text view

andrew_bowman · ‎03-19-2021

andrew.bowman mentioned...

Sure...but we basically have a single column for all the paths. But people don't want to work with a list of paths, they want to use separate variables referring to the parts of the paths (usually nodes) that they are interested in. And they want to perform aggregations, and calculations, and projections, so you're usually not going to be working with pure paths at the end.

oren · ‎03-19-2021

oren214 mentioned...

I mean, now that I think about it, it could really be represented the way we represent a graphql response...

oren · ‎03-19-2021

oren214 mentioned...

I guess that's dgraph's approach, if I understand what they are doing over there (I don't).

andrew_bowman · ‎03-19-2021

andrew.bowman mentioned...

GraphQL responses aren't that much different than Cypher results...it's just a JSON representation of the results. But returning just a giant JSON structure of results isn't really the best when you're anticipating a lot of results, or if you're trying to do something like export to a CSV

oren · ‎03-19-2021

oren214 mentioned...

but I'm not saying a giant json structure of results. I'm saying a collection of json structures of paths

oren · ‎03-19-2021

oren214 mentioned...

and I would want a convenience syntax so I don't have to feel like I'm dealing with json

oren · ‎03-19-2021

oren214 mentioned...

anyhow. I've wasted enough of your time with my ponderings. I'm so grateful for your explanations and time.

andrew_bowman · ‎03-19-2021

andrew.bowman mentioned...

If you want that you can get that.

MATCH path = (:Movie)&lt;-[:ACTED_IN]-(:Person)
RETURN path

Or relationships(path), or nodes(path) if you're interested in just the nodes, or just the relationships.

oren · ‎03-19-2021

oren214 mentioned...

It's been extremely helpful

oren · ‎03-19-2021

oren214 mentioned...

I guess I was just saying I wish that was the default instead of an option because FOR ME it makes it easier to think about

andrew_bowman · ‎03-19-2021

andrew.bowman mentioned...

Got it. It does depend upon what is important to each user, for what they want returned. For some paths are the most important. And it also varies for the kinds of queries you want to run.

We're flexible, but that flexibility introduces complexity, no real way around that.

andrew_bowman · ‎03-19-2021

andrew.bowman mentioned...

Glad to help!

andrew_bowman · ‎03-19-2021

andrew.bowman mentioned...

I guess put a better way, if the result you desire are the paths, then work with the path variables in addition to the important components of the path.

But for many many queries, paths are just a means to an end, not the end itself. For example, if you only need, per movie, the movie title and the names of actors (and you have no need to visualize it graphically), then the paths don't matter in the end.

If you want to calculate the average ratings given to a movie, when returning the final results you don't care about all the tens of thousands or more :REVIEW relationships between people and the movie, you just want to output the movie, and the calculated average review score. Having to output all path results that were referenced to calculate the average isn't useful, and will make the query slower and output far more than what's needed.

Neo4j

What is a row?