cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Return unique list with multiple relationships

rcfro2
Node Clone

match (t:Person)-[:DIRECTED]->(m1:Movie)<-[:WROTE]-(w1),
(t)-[:HAS_CONTACT]->(t1),
(t1)-[:DIRECTED]->(m2:Movie)<-[:WROTE]-(w2)
where not (w2)-[:WROTE]->(m1) and m2<>m1 and w2<>w1
return t.name as director, collect(distinct w1.name) as UsedWriters,
collect(distinct w2.name) as SuggestedWrtiers
limit 3

Why is it that the list of w2 and w1 are the same in some cases when I explicitly say w2<>w1 ?

19 REPLIES 19

Can you provide sample data, and show us some results, such as the rows before the collection?

More than likely the rows before the collect are correct, but once you collect both lists that predicate can no longer apply since you've combined results.

Hm even in removing the UsedWriters collect list, I still get unexpected results.

See -
match (t:Person)-[:DIRECTED]->(m1:Movie)<-[:WROTE]-(w1),
(t)-[:HAS_CONTACT]->(t1),
(t1)-[:DIRECTED]->(m2:Movie)<-[:WROTE]-(w2)
where not (w2)-[:WROTE]->(m1) and m2<>m1 and w2<>w1
return t.name as director, collect(distinct w2.name) as SuggestedWrtiers
limit 3

and the :HAS_CONTACT relationship is a :KNOWS relationship between people who have worked together. Look at the second row in image attached - 2X_8_8583afe559fed8854b361fb7c756245b3c4e41dc.png

why is Lana mentioned again ?

You didn't provide any restrictions on t and w2 being the same. If you need that, you need to add that restriction to your query:

WHERE t <> w2

Ok. But even when I add that, there are writers who have written movies that T directed. Why is that the case?

See image -

rcfro2
Node Clone

Also, this seems to produce the right results, just a lot of conditions to check for -

match (t:Person)-[:DIRECTED]->(m1:Movie)<-[:WROTE]-(w1),
(t)-[:HAS_CONTACT]->(t1),
(t1)-[:DIRECTED]->(m2:Movie)<-[:WROTE]-(w2)
where not (w2)-[:WROTE]->(m1) and m2<>m1 and w2<>w1 and t<>w2 
and t<>w1 and not (t)-[:WROTE]->(m1) and not (t1)-[:WROTE]->(m1) 
and not (t)-[:WROTE]->(m2) and not (t1)-[:WROTE]-(m2)
return t.name as director, collect(distinct w2.name) as SuggestedWrtiers```


Is there a simpler way?

Right, but you're working with individual nodes, not collections of them. What you seem to be after is: "writers who wrote a movie and that movie wasn't directed by t.

But that's not what your query is doing. Your query has m2<>m1, this is comparing individual movies. Look at the row of your query before collection, you'll see that for that row where "David Mitchell" is w2, m2 may be Cloud Atlas, but m1 will be a different movie, maybe one of the Matrix movies, so m1<>m2.

To get around this, you need to collect the movies so you can treat them as a group, rather than dealing with single pairs of movies.

We may need to do the same with writers.

Let's reaffirm what you're actually trying to do here. My guess is you want to match to a director who directed movies and had writers for those movies. For that director, all the movies they directed, and all the writers who wrote those movies, you want a director they've had contact with who directed a movie that t didn't direct, and was written by a writer who never wrote for one of the movies t directed.

If that's accurate, you need to collect the movies and writers for t and ensure the subsequent matches aren't present in those collections:

MATCH (t:Person)-[:DIRECTED]->(m1:Movie)<-[:WROTE]-(w1)
WITH t, collect(DISTINCT m1) as directedMovies, collect(DISTINCT w1) as UsedWriters
MATCH (t)-[:HAS_CONTACT]->()-[:DIRECTED]->(m2:Movie)<-[:WROTE]-(w2)
WHERE t <> w2 and NOT m2 in directedMovies and NOT w2 in UsedWriters
WITH t, UsedWriters, collect(distinct w2.name) as SuggestedWriters
LIMIT 3
RETURN t.name as director, [writer in UsedWriters | writer.name] as UsedWriters, SuggestedWriters

rcfro2
Node Clone

The query is the following:

Suggest new writers to directors via their friends who are also directors
i.e. director friends they know (:HAS_CONTACT). Make sure that the directors have
never worked with these writers, that they (the directors) do not know them (the writers)
immediately, and that the director and writers are different people. Return the
directors and unique suggested writers.

So the query you wrote is what I am looking for in terms of suggesting "new writers" to a director.

Sounds about right. The key takeaway is that node variables will refer to a single specific node per row, so if you need to do something to address them as a set (like "the writer can't be one that has written for the director"), you need to collect nodes and treat them as a collection (WHERE NOT w2 in UsedWriters instead of WHERE w2<>w1). Alternately you can use a pattern predicate: (WHERE NOT (t)-[:DIRECTED]->()<-[:WROTE]-(w2)) but this can be more costly, easier to get the collection of nodes that you want to ensure are excluded, and filter with that collection.

Anything else you need in the query, or does this fulfill all you need?

It does. But I'm wondering why the query you wrote and the one I wrote at the end dont return the exact same results after I order by.

See below. It returns almost identical results, except my set is a subset of the results your query returns. Why ?

match (t:Person)-[:DIRECTED]->(m1:Movie)<-[:WROTE]-(w1),
(t)-[:HAS_CONTACT]->(t1),
(t1)-[:DIRECTED]->(m2:Movie)<-[:WROTE]-(w2)
where not (w2)-[:WROTE]->(m1) and m2<>m1 and w2<>w1 and t<>w2 
and t<>w1 and not (t)-[:WROTE]->(m1) and not (t1)-[:WROTE]->(m1) 
and not (t)-[:WROTE]->(m2) and not (t1)-[:WROTE]-(m2)
return t.name as director, collect(distinct w2.name) as SuggestedWrtiers
order by t.name desc
limit 3```

Hard to say without working with the same data set. Can you direct me to the data set you're using?

Okay, looks like you introduced one more restriction in that t2 should not have written m2. If we include that in my query then our results should be identical:

MATCH (t:Person)-[:DIRECTED]->(m1:Movie)<-[:WROTE]-(w1)
WITH t, collect(DISTINCT m1) as directedMovies, collect(DISTINCT w1) as UsedWriters
MATCH (t)-[:HAS_CONTACT]->(t2)-[:DIRECTED]->(m2:Movie)<-[:WROTE]-(w2)
WHERE t <> w2 and NOT m2 in directedMovies and NOT w2 in UsedWriters and NOT (t2)-[:WROTE]->(m2)
WITH t, UsedWriters, collect(distinct w2.name) as SuggestedWriters
ORDER BY t.name DESC
LIMIT 3
RETURN t.name as director, [writer in UsedWriters | writer.name] as UsedWriters, SuggestedWriters

Almost. If you remove the LIMIT, your query returns one additional result. See images attached -2X_b_befb93023049fbddb8124a5f34ecd797d1642563.png

And yours below
2X_4_44a2e1926b4686b5ec21cbc1766fc9cb82daf085.png

I forgot to include your other added restriction that we want to exclude directors t that wrote the movie they directed, we can add that to the query in the WHERE clause on line 2:

MATCH (t:Person)-[:DIRECTED]->(m1:Movie)<-[:WROTE]-(w1)
WHERE NOT (t)-[:WROTE]->(m1)
WITH t, collect(DISTINCT m1) as directedMovies, collect(DISTINCT w1) as UsedWriters
MATCH (t)-[:HAS_CONTACT]->(t2)-[:DIRECTED]->(m2:Movie)<-[:WROTE]-(w2)
WHERE t <> w2 and NOT m2 in directedMovies and NOT w2 in UsedWriters and NOT (t2)-[:WROTE]->(m2)
WITH t, UsedWriters, collect(distinct w2.name) as SuggestedWriters
ORDER BY t.name DESC
RETURN t.name as director, [writer in UsedWriters | writer.name] as UsedWriters, SuggestedWriters

Andrew - thank you so much. Makes sense now - I should have caught that. I guess my only question now is which way would be the more "graphy" way of thinking about this problem ? Fundamentally, it's a "recommendation" and both the one I wrote and the one you have are not trivial, so in moving forward, which kind of query would be the best way to characterize this problem ?

I'm assuming the way you wrote as it does consider everything in terms of a list, which might be more efficient.

Well, it is a still a non-trivial question. Let's try restating the requirement in full:

For all directors that have directed a movie they haven't written, find directors they know that have directed movies (that they haven't written) where none of the writers of those movies have worked with the original director. Suggest those writers.

One interesting thing here...because of the patterns that we've been using, we've only been considering directors of movies with writers. There are actually quite a few other relevant results if we also consider movies that don't have writers, as this brings quite a few directors into consideration that were previously excluded.

So to amend my previous query, we can use an optional match to the writers of a movie and collect them:

MATCH (t:Person)-[:DIRECTED]->(m1:Movie)
WHERE NOT (t)-[:WROTE]->(m1)
OPTIONAL MATCH (m1)<-[:WROTE]-(w1)
WITH t, collect(DISTINCT m1) as directedMovies, collect(DISTINCT w1) as UsedWriters
MATCH (t)-[:HAS_CONTACT]->(t2)-[:DIRECTED]->(m2:Movie)<-[:WROTE]-(w2)
WHERE t <> w2 and NOT m2 in directedMovies and NOT w2 in UsedWriters and NOT (t2)-[:WROTE]->(m2)
WITH t, UsedWriters, collect(distinct w2.name) as SuggestedWriters
ORDER BY t.name DESC
RETURN t.name as director, [writer in UsedWriters | writer.name] as UsedWriters, SuggestedWriters

Now we can make this writer-centric and do away with the movie filtering if we want. The movie filtering was mostly there so we could rule out writers early. That is, if m2 is one of the movies in the collection of m1, we already know that the director has worked with all of the writers. That's an optimization thing.

So let's see that query without doing the filtering on movies:

MATCH (t:Person)-[:DIRECTED]->(m1:Movie)
WHERE NOT (t)-[:WROTE]->(m1)
OPTIONAL MATCH (m1)<-[:WROTE]-(w1)
WITH t, collect(DISTINCT w1) as UsedWriters
MATCH (t)-[:HAS_CONTACT]->(t2)-[:DIRECTED]->(m2:Movie)<-[:WROTE]-(w2)
WHERE t <> w2 and NOT w2 in UsedWriters and NOT (t2)-[:WROTE]->(m2)
WITH t, UsedWriters, collect(distinct w2.name) as SuggestedWriters
ORDER BY t.name DESC
RETURN t.name as director, [writer in UsedWriters | writer.name] as UsedWriters, SuggestedWriters

That might work a bit better as far as trimming down the complexity.

If you wanted the graphiest query, though, we could avoid the collection and list filtering, but we'd pay a cost with additional expansions per potentially suggested writer to see if they wrote for a movie directed by the original director. For such a small query though the performance difference should be negligible.

MATCH (t:Person)-[:DIRECTED]->(m1:Movie)
WHERE NOT (t)-[:WROTE]->(m1)
WITH DISTINCT t
MATCH (t)-[:HAS_CONTACT]->(t2)-[:DIRECTED]->(m2:Movie)<-[:WROTE]-(w2)
WHERE t <> w2 and NOT (t2)-[:WROTE]->(m2) and NOT (t)-[:DIRECTED]->()<-[:WROTE]-(w2)
WITH t, collect(distinct w2.name) as SuggestedWriters
ORDER BY t.name DESC
RETURN t.name as director, SuggestedWriters

Wouldn't
(t2)-[:WROTE]->(m2)

only mean with respect to one movie ? Wouldnt we need the general form of:
(t2)-[:WROTE]->()<-[:DIRECTED]-(t2)

to ensure it's the result contains NO movies that he wrote and directed ?

I believe it works correctly as-is.

The only way we get from t2 to m2 is if t2 directed that movie, and we have a condition in the WHERE clause where we don't want any result where t2 wrote m2 (in addition to directing it).

While you're correct that this is with respect to one movie (or rather, each movie m2), we've already established that t2 directed m2, and the WHERE condition applies to m2 (as opposed to some other movie), so the query will exclude all movies where t2 directed and wrote the movie.

I realize this is an old thread, but I was wondering if I could rewrite this query as the following:

match (d1:Person)-[:DIRECTED]->(m1:Movie)<-[:WROTE]-(w1),
(d1)-[:HAS_CONTACT]->(d2),
(d2)-[:DIRECTED]->(m2:Movie)<-[:WROTE]-(w2)
where not (w2)-[:WROTE]->()<-[:DIRECTED]-(d1:Person) 
return d1.name as director, collect(distinct w1.name) as `worked with writers`, 
collect(distinct w2.name) as `suggested writers`
limit 4

or with the optional match

match (t:Person)-[:DIRECTED]->(m1:Movie),
(t)-[:HAS_CONTACT]->(t1),
(t1)-[:DIRECTED]->(m2:Movie)<-[:WROTE]-(w2)
optional match (m1)<-[:WROTE]-(w1)
where not (w2)-[:WROTE]->()<-[:DIRECTED]-(t:Person) 
return t.name as director, collect(distinct w1.name) as `worked with writers`, 
collect(distinct w2.name) as `suggested writers`
limit 3

I'm wondering if you fundamentally need the WITH or other restriction constrictions