cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Negate Variable Length Path Query

rcfro2
Node Clone

dataset *very similar to Movie dataset provided by Neo4j:

match (p:Person{name:'Tom Hanks'}) - [:HAS_CONTACT*2] -> (x:Person),
(x) - [:ACTED_IN] -> (m)
where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x) and p<>x
return distinct x.name
match (p:Person{name:'Tom Hanks'}) - [:HAS_CONTACT*2] -> (x:Person),
(x) - [:ACTED_IN] -> (m)
where not (p) - [:ACTED_IN] -> (m) and p<>x
return distinct x.name

These two, given the path, I would expect to be the same but the first query returns a smaller set of results. Why ?

Specifically saying WHERE NOT (p) - [:ACTED_IN] -> (m) should be equivalent to saying WHERE NOT (p)-[:ACTED_IN]->()<-[:ACTED_IN]-(x)

27 REPLIES 27

Not it isn't because the predicate is executed per row and not for all possible movies.

So in each row you have a concrete set of: p,x,m which the predicate is executed against.
While in the first case you have a general expression if they haven't ever jointly acted in any movie, not just m

But what's the difference there? The ALL possible movies are only from a set of movies in which they acted together. First one is where NOT any movie p and x acted in which could be some movie, m, right?

In the second case, why isn't that held true for movies m ? It again refers to movies that X acted in or, in other words, movies that X acted in that P acted in as well. Can you please clarify this? this seems oddly confusing

In short can you please provide a more clear explanation? Maybe a tangible example of where X acted in Movies M and then where not P acted in those movies VS where not P and X acted in movies together? those statements sound equivalent.

What's the question we would ask that would make us arrive at one implementation vs another?

Also, the two queries below are replicas of that use case and they give the same exact result:

match (p:Person{name:'Keanu Reeves'}) - [:ACTED_IN] -> (x)
where not (:Person{name:"Hugo Weaving"})-[:ACTED_IN]->(x)<-[:ACTED_IN]-(p)
return collect(distinct x.title)

vs

match (p:Person{name:'Keanu Reeves'}) - [:ACTED_IN] -> (x)
where not (:Person{name:"Hugo Weaving"})-[:ACTED_IN]->(x)
return collect(distinct x.title)

why is that ?

Because it's not a set if m is bound to a single movie. Then only that bound value will be checked against, and the predicate will be executed per line, not globally.

If you want to check against the whole movie set of a contact you have to use a list quantor like none, all, any
with a inline pattern (pattern comprehension) or size with a pattern comprehension.

match (p:Person{name:'Tom Hanks'}) - [:HAS_CONTACT*2] -> (x:Person)
where p<>p AND 
  none(m IN [(x) - [:ACTED_IN] ->(m) | m] where (p) - [:ACTED_IN] -> (m)] 
return distinct x.name

match (p:Person{name:'Tom Hanks'}) - [:HAS_CONTACT*2] -> (x:Person)
where p<>p AND 
  size( [(x) - [:ACTED_IN] ->(m) where (p) - [:ACTED_IN] -> (m) | true]  )  =  0
return distinct x.name

Why is m bound to a single movie, when the following is a list of all movies ? **not list, i mean rows. sorry.

match (p:Person{name:'Tom Hanks'})-[:ACTED_IN]->(m)
return m.title

what's the reasoning behind this? and to what extent and in what way is the bounding to some arbitrary movie determined?

And is it to say the general form of
(p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)
implies that it negates all cases of when X and P act in a movie together similar to the check against the whole set query you provided?

The query you provided does not return a list of movies. It returns a stream of records (or rows) of titles of movies that matched the pattern.

Unless you explicitly collect() something, or you use pattern comprehensions, or you otherwise invoke a function or procedure that results in a list, then you're not working with lists.

Let's go back to the queries in your original question.

First, a reference query whose results we may refer to later: Let's get all coactors of Tom Hanks:

match (p:Person{name:'Tom Hanks'})-[:ACTED_IN*2]-(coactor)
return distinct coactor.name

Again, this isn't a list...it's a stream of rows. Each row is an individual result, the coactor that was matched to in the pattern.

If you look at the results, you'll find that Charlize Theron is a coactor of Tom Hanks (if you perform some additional queries you would see that they co-acted in "That Thing You Do").

Now let's only take the first part of those queries, the MATCH but not the WHERE, and return the contact and the movie:

match (p:Person{name:'Tom Hanks'}) - [:HAS_CONTACT*2] -> (x:Person), 
(x) - [:ACTED_IN] -> (m)
return x.name, m.title

Again, there are no lists here. The return is a stream of rows, where each row refers to a separate person and movie node that matched the pattern.

Note some of the matches that may be near the top:

╒════════════════════════╤═════════════════════════════════╕
│"x.name"                │"m.title"                        │
╞════════════════════════╪═════════════════════════════════╡
│"Charlize Theron"       │"That Thing You Do"              │
├────────────────────────┼─────────────────────────────────┤
│"Charlize Theron"       │"The Devil's Advocate"           │
├────────────────────────┼─────────────────────────────────┤

Each line corresponds with a path found in the graph that fits the pattern, where each node variable refers to an individual node, not a collection.

Looking at JUST these two rows, we can see that indeed Charlize Theron is someone Tom Hanks has had contact with (2 :HAS_CONTACT hops). Charlize has acted in movies. Since this graph only has two movies she has acted in, both were matched.

So in the first row, x matched to Charlize Theron, and m, a movie she acted in, was matched to That Thing You Do. In the next row, we still have Charlize Theron for x, and m in this case is the other movie she acted in, The Devil's Advocate.

If we applied the relevant part of the WHERE clause in the first query here:
where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)
this would eliminate Charlize Theron, because this pattern exists in the graph, Tom Hanks has acted in something that Charlize Theron has acted in (as mentioned above...they are coactors through That Thing You Do). Both rows with Charlize Theron would be filtered out. The key thing to see is that m has nothing to do with that predicate (excluding anyone that has acted in something Tom Hanks has acted in).

However, if we only used the WHERE clause of the second query:
where not (p) - [:ACTED_IN] -> (m)
then only one row would be filtered out: the row with Charlize Theron and That Thing You Do, since Tom Hanks acted in That Thing You Do (but he didn't act in The Devil's Advocate). This predicate deals with m specifically. For the movie node matched by m, we're excluding the row if Tom Hanks acted in that movie.

Charlize Theron would remain, because Charlize Theron acted in a movie that Tom Hanks did not act in.

Right - I didn't mean lists in the proper sense i.e. sense it's used to mean an array or set of items, just that it returns some number of items.

Back to the original question -
where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)

filters out The Devil's Advocate. Why? The pattern of P (Tom Hanks) acting in something that X (Charlize Theron) has acted in doesnt apply to The Devil's Advocate specifically. It is arbitrairly true that Tom Hanks and Charlize Theron have acted in a movie together, but not specifically The Devil's Advocate...so why is it removed?

Remember, you're working with streams of rows. The movies are not isolated and alone. Each row at that point of the query consists of variables: p, x and m.

So when we say that rows are being filtered, we're not talking about filtering out just movies, or just people. We're talking about filtering the streams of rows. Each row is a set of one p, one x, and one m.

This predicate: where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x) is specific to the p and x variables, and has nothing to do with m. Tom Hanks is every p on every row (which isn't that interesting). For the two rows with Charlize Theron, since Charlize and Tom have coacted in something, those rows will be filtered out. That includes the m on each of those rows.

It may help to review this knowledge base article on pattern negation.

Ok but doesnt the acted_in relationship refer to the thing that the relationship points to? It's odd that in the case of
match (p) - [:ACTED_IN] -> (l) <- [:ACTED_IN] - (x)
return distinct l.title

refers to all the movies that they both acted in, regardless of the variable L being there - that empty node referring to all the nodes that they both acted in, and the L acting as the variable for that reference.

In short, how does that line read in plain english?
where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)

I would expect it means - WHERE P AND X HAVE NOT ACTED IN ANY MOVIES TOGETHER. Is that false?

In fact, my main concern is that what you wrote here seems rather arbitrary and/or unspecific:
"For the two rows with Charlize Theron, since Charlize and Tom have coacted in something, those rows will be filtered out. "

Yes they have acted in something, but if each path is evaluated individually, then this negation would not apply to the Devil's Advocate it would seem...

I think the root of the misunderstanding is the concept of rows. Rows consist of sets of variables (in this case p, x, and m). The Devil's Advocate is only here because it was a movie Charlize Theron acted in, and so is present for a single row with variables: p - Tom Hanks, x - Charlize Theron, m - The Devil's Advocate.

That is a single row in the stream. The filter applies to rows in the stream. This predicate:
where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)
means where this pattern doesn't exist. But for both of those rows with Charlize Theron as x, that pattern does exist, and while you COULD interpret it as "WHERE P AND X HAVE NOT ACTED IN ANY MOVIES TOGETHER", it's probably more correct to say "where p and x have not acted in something together", as there's no restriction for the label of the node in the middle (if this graph had plays, for example, or broadway shows, then it's possible there's a play or broadway show they acted in together).

But what it comes down to is that for each row (of p, x, and m) in the stream, such a pattern must not exist in the graph. But Charlize and Tom have coacted in something together, so those rows are filtered out. That's an important distinction to make. We're not filtering out the people (as in, separate from the movies). We're filtering out rows, because the people for the variables of those rows (Tom Hanks for p and Charlize Theron for x) do not pass the predicate in the WHERE clause. When those rows are filtered out, that will of course also include the m on each of those filtered out rows.

"where p and x have not acted in something together" - I agree. But the question is this: why is this i.e. the "something" not evaluated on a per row basis? So lets take this row:

Charlize Theron - The Devil's Advocate

It's clear that this row does not fit the condition of "something" in which Tom and Charlize have acted together in.

The something, insofar as I understand variables and references to items or things or lists, should be evaluated on a per row basis as other predicates are in which we have no aggregation.

In short, I agree there is something in the number of rows returned in which Charlize and Tom have acted in together, so that holds true. But, on a per row basis, it is not the case they acted in that movie so, then, why does it apply? That's my real confusion. If evaluated per row, the empty node i.e. the "something" should refer to the thing both Tom and Charlize acted in e.g. The Devil's Advocate. But, according to what you have said, it doesn't. Why?

Ah, okay.

You seem to be assuming that when we do:
where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)
that the blank node must be m, but that's not true.

If we did use this instead:
where not (p) - [:ACTED_IN] -> (m) <- [:ACTED_IN] - (x)
then we would get the same results as your second query.

Note the difference. In the first one we're just looking for the existence of a pattern through some node (no restrictions). In the second, that middle node must be the m node for the row.

The idea for both of these is: this pattern must not exist in the graph. So, as soon as we find a single instance of such a pattern, then we can stop looking and filter out the row.

As for the idea of the "something" (the "blank" node) not being evaluated on a per row basis, it actually is.

For each row in the stream, expansions are performed between p and x (it's up to the query planner on which one to start out, or if we expand both and node-hash-join in the middle) to see if they are connected by such a pattern, and if it finds that the there is a connection and the pattern exists, then the row is filtered out.

So there is a lot of work being done to fulfill that predicate, as the only way to ensure that such a pattern doesn't exist between the p and x nodes is to expand between those nodes for each row. (that's why I like the approach of collecting coactors first then using a WHERE NOT x in y kind of check, it avoids the need to keep expanding out to see if p and x are coactors)

So these two are not equivalent:
where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)
where not (p) - [:ACTED_IN] -> (m) <- [:ACTED_IN] - (x)

?

In the first case, removing all rows from the list if ANY row contains X or P (but in this case X, as P is in every row as we previously mentioned).

In the second case, it evaluates on a per row basis, such that it only removes rows where both X and P act in that movie, correct?

Correct, those two are not equivalent, as there is a preexisting m on each row that would add a restriction to the pattern.

Again, these ARE being evaluated on a per row basis. The difference is in what variables and restrictions apply.

In the first predicate the only variables in play are p and x. If such a pattern exists (with no restriction on the node in the middle), then the row will be filtered out. This executes per row, so it will execute for both rows where p = Tom Hanks and x = Charlize Theron (and yes m is present there too and different on each of those rows), and both of those rows will be filtered out.

In the second predicate, p, m, and x are in involved. The pattern provided must specifically have the m node in the middle. This will execute per row, so it will execute for both rows where p = Tom Hanks and x = Charlize Theron, and since the pattern is only present for the row with m = That Thing You Do, THAT row will be filtered out, and the row with p = Tom Hanks and x = Charlize Theron and m = The Devil's advocate will remain.

Operations always execute per row, that's important to remember. The difference here is just in the pattern we're looking for, whether the node in the middle can be any node, or if it has to be the m node for the row.

Maybe it will help if I show some alternatives. Here are some predicates we could use, starting with the two we've seen already and adding some other possibilities.

where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)

where not (p) - [:ACTED_IN] -> (m) <- [:ACTED_IN] - (x)

where not (p) - [:ACTED_IN] -> (:Movie {title:'Cloud Atlas'}) <- [:ACTED_IN] - (x)

where not (p) - [:ACTED_IN] -> (:BroadwayPlay) <- [:ACTED_IN] - (x)

For all of these, the goal is just to ensure that such a pattern does not exist, given the restrictions we've added to the pattern. In the last two it's easy to see the restrictions here. For the one with m, that m exists for the row, so the restriction is on a specific node. For the "blank" node we're just looking for the pattern without any restriction on the connecting node in the middle.

where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)

again, why does it remove specifically the Devil's Advocate? I am sorry for being naive or not understanding here but the relationship or path seems to be
(p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)

means that P and X if evaluated per Row, and the empty node or P refers to Tom, then the relationship between those 2 variables in the row must maintain in The Devil's Advocate. The only path I see is

Charlize (X) - [:ACTED_IN] -> The Devil's Advocate () <- [:ACTED_IN] - (p)

isnt the empty node refer to the thing Charlize acted in, which in this case the Devil's Advocate? It seems to be removed because the path of Charlize, Tom and The Thing You Do is the only thing where it applies.

Again, apologies for my misunderstanding.

It does not specifically remove The Devil's Advocate. The Devil's Advocate gets filtered out because its row got filtered out, because the other values for that row (p = Tom Hanks and x = Charlize Theron) didn't pass the predicate.

Remember, the two rows that we're zeroing in on for this are these (projecting out just the name and title):

╒════════════════════════╤═════════════════════════════════╕
│"x.name"                │"m.title"                        │
╞════════════════════════╪═════════════════════════════════╡
│"Charlize Theron"       │"That Thing You Do"              │
├────────────────────────┼─────────────────────────────────┤
│"Charlize Theron"       │"The Devil's Advocate"           │
├────────────────────────┼─────────────────────────────────┤

(there is of course p in scope as well, but that's always Tom Hanks so that's not so interesting).

For these two rows, The Devil's Advocate is only on one of them.

By filtering based on this predicate:

where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)

we find that Tom Hanks acted in something with Charlize Theron, so that pattern exists, and so each row with Charlize Theron for x will be filtered out, so both of those get filtered out (which happens to include the row with m = The Devil's Advocate). The english equivalent is "where Tom Hanks and x never acted together in something", and since Tom Hanks and Charlize Theron have acted together in something, all rows with x = Charlize Theron get filtered out.

If we used the other predicate:

where not (p) - [:ACTED_IN] -> (m) <- [:ACTED_IN] - (x)

The english equivalent is "where Tom Hanks and x didn't act together in movie m (which x HAS acted in)". This implies that the row (with x) will be kept if x acted in some movie that Tom Hanks didn't act in. Or, put even another way, it implies that for actor x, x will be filtered out completely ONLY if every movie they have acted in has been a movie where they've coacted with Tom Hanks. That gets evaluated per row. For the row where x = Charlize Theron and m = That Thing You Do, since they coacted together in That Thing You Do, that row gets filtered out. For the row where x = Charlize Theron and m = The Devil's Advocate, since Tom Hanks didn't act in that movie that row gets left in, so when you ask for the remaining actors at the end of the query Charlize Theron is one of those that returns in this query, but not the other.

What it comes down to is that by changing the predicate, it means you're completely changing what you're asking the queries to do. "filter out the contacts who have coacted with Tom Hanks in something" is completely different than "filter out the contacts who have coacted with Tom Hanks in every movie the contact has acted in".

Yes, by the Devil's Advocate gets filtered out I meant that row gets filtered out...

OK and that's my issue - "Tom Hanks acted in something with Charlize Theron, so that pattern exists, and so each row with Charlize Theron for x will be filtered out, so both of those get filtered out (which happens to include the row with m = The Devil's Advocate)."

Why ? Actually, the full pattern only holds true for "That Thing You Do". The weird part. This other line you write doesn't make a ton of sense:
"where Tom Hanks and x never acted together in something", and since Tom Hanks and Charlize Theron have acted together in something, all rows with x = Charlize Theron get filtered out."

Why do you end up filtering all rows out in which Charlize Theron is X, and not filtering out every row in which the whole path exists, as it does with the row that contains the Devil's Advocate ?

This sentence you wrote would still hold true if you kept "The Devil's Advocate" --
"where Tom Hanks and x never acted together in something"

Where Tom Hanks AND x never acted together in something. With that AND statement, I would expect that both conditions hold true with an individual row, such as "The Devil's Advocate", but it's clearly not true. Something, being the movie, only has Charlize Theron as an actor/actress, not Tom Hanks.

What am I missing here?

Again, appreciate your help.

Why ? Actually, the full pattern only holds true for "That Thing You Do".

There. That's all that's needed. Just the fact that any path exists with the restrictions given is all that's needed. The only restrictions for that path are just that p and x have such a path between them, and we aren't interested in what that node in the middle is (we don't care about any of its labels, its properties, or even if it's the same node as a previously matched variable, so it doesn't have to be m). Just that there exists any path adhering to the pattern provided (involving the end nodes of p and x) is enough to filter out the row.

By examining the graph we can determine that the middle connecting node is That Thing You Do, but the actual identity of the node, whether it's that node or The Devil's Advocate or not doesn't matter. Why doesn't it matter? Because we haven't applied any restrictions on the pattern to say that it matters in some way or is restricted in what that node can be. All that matters is that such a pattern exists in the graph, because that WHERE clause means that we don't want rows where that pattern exists: "exclude results where the nodes for x and p are related in this way" is basically what we're asking. It has nothing to do with m, because m isn't present in that pattern.

The pattern in the WHERE clause doesn't have to do with m at all. We could easily use a different predicate that also has nothing to do with p in addition to m:

WHERE NOT (x)-[:ACTED_IN]->(:Movie{title:'Cloud Atlas'})
now the restriction is on whether or not x was an actor in Cloud Atlas, and all rows where the actor acted in Cloud Atlas would be filtered out. Nothing to do with m, and nothing to do with p.

We could instead have a predicate that doesn't even deal with any of the variables at all:
WHERE NOT ()-[:ACTED_IN]->(:Movie{title:'Cloud Atlas'})
The restriction here is that there must not be any actors in Cloud Atlas (no restrictions on what that blank node is, only that there is a node that has an :ACTED_IN relationship outgoing toward Cloud Atlas). Nothing to do with p or m or x. Since Cloud Atlas is in our graph, and has actors, this pattern exists, this pattern will be tested for every row (it doesn't matter that this pattern isn't dependent on any variables at all, filter operations, like all operations, execute per row), and the pattern will still exist for every row, so all rows will be filtered out.

We could simply omit the pattern completely:
WHERE false
That would evaluate per row, and because the WHERE clause fails all nodes would be filtered out.

There is no requirement that the contents of a WHERE clause have to use any variable at all, or any pattern match. There is no requirement that a pattern match, if present, has to involve variables we've matched to previously.

Does that help?

First, thank you so much for going back on fourth on this. And I have asked other people at Neo and I think I'm getting conflicting information so I'm a bit confused. But that's neither here nor there.

The real problem is this - you find the path to be TRUE in one case, and while you say the identify of the node doesnt matter, but it is because of the identify of a specific node which identifies a single path i.e. Charlize ---- That Thing You Do ---- Tom Hanks, that if evaluated against
(p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)
it is true. So it is odd that the then when evaluated against a different path, a COMPLETELY different path such as the one where you have
Charlize --- The Devil's Advocate, and nothing for Tom which is what P refers to, that it still removes the row. The hole path includes at least Tom, Charlize and some intermediate node. In the case of Charlize and The Devils Advocate, you dont have Tom...

Again apologies and many thanks.

More than likely it's not necessarily conflicting information, but alternate approaches. Graph databases, and Cypher, allow multiple ways to do what you want to do. I've mentioned a couple approaches already. It's very important to be clear about what it is you want to do and what output you are trying to capture, because small changes to patterns (such as the presence or lack of a variable in a pattern) can completely change the meaning and behavior of the query. It's important to be precise.

So to address this, at the time that I added the results returning the name of x and the title of m, my goal was to show that m is an element of each row, and so Charlize shows up twice (once per movie that she acted in, because you matched out to movies where x acted in m). It was an example to demonstrate that, and the return explicitly only returned x's name and m's title. That was just to show a bit of what's going on under the hood, and p wasn't an interesting part of what I wanted to show.

But let's look at the original query again:

match (p:Person{name:'Tom Hanks'}) - [:HAS_CONTACT*2] -> (x:Person),
(x) - [:ACTED_IN] -> (m)
where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x) and p<>x
return distinct x.name

At the point right after the match, but before the WHERE clause filtering, p, x, and m are in scope, and are the fields per row. So p is present, (as the scope of variables per row hasn't been changed at that point), and is Tom Hanks. You can see this explicitly if I RETURN early and project out the names and titles of the nodes involved:

match (p:Person{name:'Tom Hanks'}) - [:HAS_CONTACT*2] -> (x:Person),
(x) - [:ACTED_IN] -> (m)
return p.name, x.name, m.title

Ok. So...I think I get it but it is poorly worded.

The first query (where not (p)-:ACTED_IN->(m) ) actually checks against a specific movie with respect to P, if that path exists, you remove the row.

The second query check to see with respect to the actor, if P and X, for example, have acted together - hence why you said the movie doesnt matter. So really, this is much easier to see if you return a list of actors, and collect(m.titles). AT any rate, it checks against the actor, and if that actor has ever acted with this other actor, it removes the row, hence why both movies get removed because the actor no longer exists. Right?

I think you're almost there.

On the first section, you are correct, because p and m are in the pattern you're checking for, the path must include both of those nodes.

The second query doesn't use m at all, so the only pattern looked for is one involving p and m, and whether there is some path that fits the pattern where they have acted in some thing together.

For collections, I do find it easier to work with, and when the collection is small, this can be very efficient. You're referring to this example query I provided above, right?

match (p:Person{name:'Tom Hanks'})-[:ACTED_IN*2]-(coactor)
with p, collect(distinct coactor) as coactors
match (p) - [:HAS_CONTACT*2] -> (x:Person)
where not x in coactors and p<>x and (x)-[:ACTED_IN]->()
return distinct x.name

(I added in the and (x)-[:ACTED_IN]->() to ensure the contacts we consider are actors, otherwise non-actors could be recommended)

Conceptually I think this works better. We find Tom's coactors, and since we collected it we can treat it as a collection (if we never aggregate we can't do that, that's one of the things that tripped you up before). At that point it is very easy to check if a potential recommendation was a coactor or not, we just see if they are in that coactor collection, and if they are the row gets eliminated.

This would apply even if we added in the additional part of the pattern from the coactor to movies they've acted in (it doesn't make sense to do this, of course, as this isn't needed to determine whether the two persons are coactors, but I want to make it clear that the entire row gets eliminated, no matter what m is):

match (p:Person{name:'Tom Hanks'})-[:ACTED_IN*2]-(coactor)
with p, collect(distinct coactor) as coactors
match (p) - [:HAS_CONTACT*2] -> (x:Person)-[:ACTED_IN]->(m)
where not x in coactors and p<>x
return distinct x.name

So was I missing something? Again I think this is easier if it's clear that from the beginning the second query is really asking if Tom and Charlize have ever worked with one another, if so, dont include Charlize Therone (which is why both the rows that mentioned her in previous query are removed - it's not about the movie, it's about whether these two actors have worked together). I think that was the major conceptual issue at the beginning.

Yes, I think you've got it.

The overall idea is that operations produce rows of results, and WHERE clauses operate on each row. The predicates in the WHERE clauses may involve all, some, or even none of the variables for those rows, but in any case if the WHERE clause is false, then the row is filtered out. Also (and I think this may help out for you) any path predicate is independent of the path used in the previous MATCH. Maybe you were under the impression that a path predicate could only operate on the path from the MATCH? If so that misunderstanding could easily explain your line of reasoning.

As for your follow-up question about omitting the relationship type, you're correct, it would expand out to all outgoing relationships from p regardless of type and see if there's any path that matches. You can even use a shorthand of (p)-->() to indicate a relationship when you don't care about the type or properties, and don't need to bind a variable to it.

Ah perfect. Wow. Many thanks. I know this has been a ton of back and fourth but it was supremely helpful and very much appreciated! I look forward to exploring graphs more 🙂

Also if we replace the specified relationship, with none, does it iterate through the various relationships?

I mean -
match (p:Person{name:'Tom Hanks'}) - [:HAS_CONTACT*2] -> (x:Person),
(x) - [:ACTED_IN] -> (m)
where not (p) - [] -> () <- [:ACTED_IN] - (x) and p<>x
return distinct x.name

It seems that it should if it truly considers all relationships paths to some other node where a node X has acted in it

I hope the above has been helpful.

But let's try to finish off by fixing this.

Most of this mess is caused by this part of the match:
(x) - [:ACTED_IN] -> (m)

The above introduces another variable m into our rows, and we end up with a row per p/x/m combination (as you saw, Charlize Theron appeared on multiple rows, once per movie she acted in, and the same applies to all the other actors in the rows at that point, appearing multiple times, once per movie they acted in.

So not only does this inflate the number of rows, but it's also unnecessary. We don't need to match out to m to filter out based on whether or not the contact is a coactor of Tom Hanks.

All we need is this:

match (p:Person{name:'Tom Hanks'}) - [:HAS_CONTACT*2] -> (x:Person)
where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x) and p<>x
return distinct x.name

Or, if we wanted to be more efficient and cut down on the expansions needed to figure out if they're coactors, we could do the same thing we did in the other thread and collect coactors and filter based on list membership:

match (p:Person{name:'Tom Hanks'})-[:ACTED_IN*2]-(coactor)
with p, collect(distinct coactor) as coactors
match (p) - [:HAS_CONTACT*2] -> (x:Person)
where not x in coactors and p<>x
return distinct x.name