Neo4j

fanka_bacheva · ‎11-30-2019

Hello everyone!
I have a database with users, books, and user catalogs that contain books. Users can follow each other.
I want to create a recommendation algorithm that shows the top books added to other catalogs.
Here are the two queries. The only difference is that the second one also returns "bkIds", yet I am getting totally different results.
Any ideas why?
Thanks!
Here are the queries:

match (b:Book)-[r:BOOK_ADDED_TO_CATALOG]->(c:Catalog)
where r.userId = 9
with collect(b.id) as booksIds

match (bk:Book)-[r:BOOK_ADDED_TO_CATALOG]->(c:Catalog)
where not (ID(bk) in booksIds) and bk.title <> "" and bk.title <> ""
with bk.title as BookTitle, collect(c.name) as BookCategories, count(r) as relCount, booksIds as bids

UNWIND BookCategories as Categories

return BookTitle, count(Categories) as CategoriesCount, bids
order by CategoriesCount desc
limit 20

match (b:Book)-[r:BOOK_ADDED_TO_CATALOG]->(c:Catalog)
where r.userId = 9
with collect(b.id) as booksIds

match (bk:Book)-[r:BOOK_ADDED_TO_CATALOG]->(c:Catalog)
where not (ID(bk) in booksIds) and bk.title <> "" and bk.title <> ""
with bk.title as BookTitle, bk.id as bkIds, collect(c.name) as BookCategories, count(r) as relCount, booksIds as bids

UNWIND BookCategories as Categories

return BookTitle, bkIds, count(Categories) as CategoriesCount, bids
order by CategoriesCount desc
limit 20

ameyasoft · ‎11-30-2019

You should UNWIND booksIds and use the values in the second match. You can also do it without the collection. I am assuming id is a property of the Book node.

Here is my query (without the booksIds collection):

match (b:Book)-[r:BOOK_ADDED_TO_CATALOG]->(:Catalog)
where r.userId = 9

// r2 and c2 are fresh variables; reusing r and c would pin this match to the relationship and catalog already bound above
match (bk:Book)-[r2:BOOK_ADDED_TO_CATALOG]->(c2:Catalog)
where bk.id <> b.id and bk.title <> ""
return bk.title as BookTitle, collect(c2.name) as BookCategories, count(r2) as relCount, b.id as bids

andrew_bowman · ‎12-03-2019

What you're missing is how aggregations work: when you perform an aggregation, the non-aggregated variables in the same WITH or RETURN become the grouping key, which provides the context for the aggregation.

There are actually two points where you aggregate with notable differences between the two queries.

Here's the first part...in your first query you have:

with bk.title as BookTitle, collect(c.name) as BookCategories, count(r) as relCount, booksIds as bids

Your aggregations (the collection of catalog names and the count of relationships) are performed against the grouping key of bk.title and booksIds (as bids). We can disregard bids, since it was already aggregated earlier and is the same for all rows, so in effect the catalog names are collected and the relationships counted per distinct book title. At that point in the query, each row holds a distinct BookTitle, the bids list, the list of catalog names associated with that BookTitle, and the count of relationships for books with that BookTitle.
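To see the grouping key in isolation, here is a minimal sketch that runs without any stored data. The rows are hypothetical inline values (not from the original database), chosen so that two different books share a title:

```cypher
// Hypothetical sample rows: books 1 and 2 are different books with the same title
UNWIND [
  {id: 1, title: "Dune", catalog: "Sci-Fi"},
  {id: 2, title: "Dune", catalog: "Classics"},
  {id: 3, title: "Emma", catalog: "Classics"}
] AS row
// title is the only non-aggregated expression, so it alone is the grouping key
RETURN row.title AS BookTitle,
       collect(row.catalog) AS BookCategories,
       count(*) AS relCount
```

Under these assumptions the two "Dune" rows fall into one group with a relCount of 2, and "Emma" forms its own group with a relCount of 1.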

By contrast, the second query has:

with bk.title as BookTitle, bk.id as bkIds, collect(c.name) as BookCategories, count(r) as relCount, booksIds as bids

With this one, bk.id as bkIds is new, and it changes the grouping key. Now the collection and the count are done per book title and book id. If both book title and book id are essentially unique, the aggregation results will be identical to the previous query at this point, apart from the new variable. If book id is essentially unique but book title is not, your results will start to differ here.
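A small sketch of the same effect, again on hypothetical inline rows rather than the real data: once the id is part of the grouping key, two books that share a title no longer land in the same group.

```cypher
// Hypothetical sample rows: books 1 and 2 are different books with the same title
UNWIND [
  {id: 1, title: "Dune", catalog: "Sci-Fi"},
  {id: 2, title: "Dune", catalog: "Classics"},
  {id: 3, title: "Emma", catalog: "Classics"}
] AS row
// title AND id are both non-aggregated, so together they form the grouping key
RETURN row.title AS BookTitle,
       row.id AS bkIds,
       collect(row.catalog) AS BookCategories,
       count(*) AS relCount
```

Here every (title, id) pair is distinct, so this yields three groups, each with a relCount of 1, instead of the single two-row group a title-only key would produce for "Dune".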

In the next aggregation in the queries, for the first query you have:

return BookTitle, count(Categories) as CategoriesCount, bids

Per row you will have a BookTitle, the bids list, and the count of categories (though you could have taken the count of book categories in your first aggregation, instead of collecting their names, avoiding the need to do the UNWIND and count()).
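The shortcut mentioned above could look roughly like this. It is a sketch only: the exclusion of user 9's books is left out to keep the focus on the aggregation, and the labels and properties are the ones from the original queries.

```cypher
MATCH (bk:Book)-[:BOOK_ADDED_TO_CATALOG]->(c:Catalog)
WHERE bk.title <> ""
// count(c.name) skips nulls just as collect(c.name) does, so it matches
// what collect + UNWIND + count would have produced, in a single step
WITH bk.title AS BookTitle, count(c.name) AS CategoriesCount
RETURN BookTitle, CategoriesCount
ORDER BY CategoriesCount DESC
LIMIT 20
```

Counting directly avoids building a list per title only to expand it again.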

In the second query you have:

return BookTitle, bkIds, count(Categories) as CategoriesCount, bids

Your count() aggregation is now per book title and book id rather than per book title alone, so the context of what you're counting is different.

Another thing to note is that this part will likely not work as expected:

where not (ID(bk) in booksIds)

Unless you're setting the id() of the node to its id property, you're not comparing the same things. Either make sure you're using the graph id() in both places, or that you're using the id property in both places, but don't mix and match them, as these aren't the same things.
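One way to keep the comparison consistent is to use the id property on both sides. This is a sketch that assumes id is a property on Book nodes, as the original queries suggest, and that grouping per individual book is what is wanted:

```cypher
MATCH (b:Book)-[r:BOOK_ADDED_TO_CATALOG]->(:Catalog)
WHERE r.userId = 9
WITH collect(b.id) AS booksIds        // list of id *property* values

MATCH (bk:Book)-[:BOOK_ADDED_TO_CATALOG]->(c:Catalog)
WHERE NOT (bk.id IN booksIds)         // property compared with property, not id(bk)
  AND bk.title <> ""
// bk.id and bk.title together form the grouping key, so the count is per book
RETURN bk.id AS bookId, bk.title AS BookTitle, count(c.name) AS CategoriesCount
ORDER BY CategoriesCount DESC
LIMIT 20
```

The alternative is to collect id(b) in the first part and keep id(bk) in the filter; either works as long as both sides refer to the same kind of identifier.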

Query showing different results on almost the same query