10-11-2019 03:32 PM
Neo4j Version: 3.5.0
I am trying to get the sum of a node property:
MATCH (a:Author)-[p:Published]->(b:Book)
WITH a.author_id AS author_id, SUM(b.read_count) AS total_read_count
WHERE total_read_count < 500
RETURN author_id, total_read_count
I have indexes in place.
Number of Author Nodes: 7696051
Number of Book Nodes: 1099438
The query takes 30 seconds to return the complete result.
Is there any way to get it in less time by optimizing the query?
10-11-2019 04:20 PM
If I understand correctly, a single Author publishes books, and we are looking for all Authors whose books have fewer than 500 reads in total across all of that author's books. Is this correct?
If that's the case, then the first thing we should do is throw out any book that has 500 or more reads.
For example if
Author: Stephen King publishes Book: The Green Mile read_count: 4,029,232
Author: Stephen King publishes Book: The Green Mile Sequel read_count: 372
then this author would not appear in the result set since the sum of all book_reads for this author is
4,029,604.
So we should be able to do:
MATCH (b:Book) WHERE b.read_count < 500
WITH b
MATCH (b)<-[p:Published]-(a:Author)
WITH a.author_id AS author_id, SUM(b.read_count) AS total_read_count
WHERE total_read_count < 500
RETURN author_id, total_read_count
10-12-2019 04:24 AM
@dana.canzano I'm looking for Authors whose total_read_count, including all their books, is still less than 500 reads.
10-12-2019 06:06 AM
Yes, and
MATCH (b:Book) WHERE b.read_count < 500
WITH b
MATCH (b)<-[p:Published]-(a:Author)
WITH a.author_id AS author_id, SUM(b.read_count) AS total_read_count
WHERE total_read_count < 500
RETURN author_id, total_read_count
should satisfy your question, no?
10-13-2019 03:43 AM
@dana.canzano Yes, my bad, this should satisfy my question, but the response time is still above 30 seconds.
10-11-2019 04:51 PM
What @dana.canzano said, and create an index on :Book(read_count).
Also, don't group by a.author_id but by the node a, and access the author_id later.
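That grouping change would look something like this (aggregate on the node, then project the property):
MATCH (a:Author)-[:Published]->(b:Book)
// group by the node itself; the property access happens after aggregation
WITH a, SUM(b.read_count) AS total_read_count
WHERE total_read_count < 500
RETURN a.author_id AS author_id, total_read_count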
Also what is your memory config?
10-12-2019 04:33 AM
@michael.hunger
Index is already there
INDEX ON :Book(read_count)│[Book]│[read_count]│ONLINE│node_label_property │100.0 │{"version":"1.0","key":"lucene"}
Grouping by a instead of a.author_id didn't improve the response time.
Memory Config
Pagecache: 50g
heap: 40g
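These presumably correspond to the standard settings in neo4j.conf:
dbms.memory.pagecache.size=50g
dbms.memory.heap.initial_size=40g
dbms.memory.heap.max_size=40g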
10-12-2019 05:09 AM
Which Neo4j version? Please upgrade to the latest.
Can you drop and recreate the index so you have a native index?
And show the PROFILE of Dana's suggestion?
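In 3.5 syntax the recreate step would be something like the following; the new index should pick up the default native provider instead of lucene:
DROP INDEX ON :Book(read_count)
CREATE INDEX ON :Book(read_count)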
10-13-2019 03:41 AM
@michael.hunger neo4j version 3.5.0
Will try creating a native index
Attaching the profile of @dana.canzano's suggestion.
The query took this long: "Started streaming 110570 records after 35529 ms and completed after 35659 ms, displaying first 1000 rows."
10-14-2019 05:44 AM
Although I can't say whether an upgrade would help, you are running 3.5.0, which is not the most recent 3.5.x release. Have you tried a newer release, for example 3.5.11?
Also, I'm interested to know how you arrived at the configuration settings of
Memory Config
Pagecache: 50g
heap: 40g
typically we do not see customers with such a large heap. Can you provide more details on how these values were determined?
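If these were never sized deliberately, one way to get a baseline recommendation is neo4j-admin memrec, e.g. (with 128g standing in for whatever RAM your machine actually has):
neo4j-admin memrec --memory=128g
It prints suggested heap and page cache sizes; very large heaps tend to make GC pauses worse, not better.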
10-14-2019 07:35 AM
Assuming author_id is unique on :Author nodes, do your aggregation on the node itself, not its property:
...
WITH a, SUM(b.read_count) AS total_read_count
...
After the aggregation you can project out a.author_id AS author_id. That should help, as you'll only perform the property access once authors are distinct in your query.
As for the approach of filtering out books with a read count of 500 or more, I don't think that's the way to go; or rather, if you want to do this kind of filtering, it's incomplete. Although it does get us authors who have published books with a read count less than 500, it only performs the aggregation over those books. So if an author has 2 books, one with a read count of 1000 and one with a read count of 200, the book with the higher read count is thrown out; it's as if the author only had the 200-read-count book, so the author will be incorrectly returned by the query.
The approach I think was intended here is to match authors with books of 500 or more reads and filter out those authors, not the books, so that when we continue the query to find authors of books, we don't count the authors who have already been filtered out:
MATCH (b:Book)<-[:Published]-(a:Author)
WHERE b.read_count >= 500
WITH collect(DISTINCT a) AS filteredOutAuthors
MATCH (b:Book)<-[:Published]-(a:Author)
WHERE NOT a IN filteredOutAuthors
WITH a, SUM(b.read_count) AS total_read_count
WHERE total_read_count < 500
RETURN a.author_id AS author_id, total_read_count
That said, the resulting list of filteredOutAuthors may be too large for the heap, and may be inefficient for filtering, so this may not be faster. Give it a try though and see how it fares.
There's an eager aggregation here for the collect as well.
You might see if a slightly different approach, using pattern comprehensions and an APOC function (for summing values across a list), performs any better. This should allow streaming of the query, as it avoids the eager aggregation from sum():
MATCH (a:Author)
WITH a, apoc.coll.sum([(a)-[:Published]->(b:Book) | b.read_count]) as total_read_count
WHERE total_read_count < 500
RETURN a.author_id AS author_id, total_read_count
10-14-2019 11:23 AM
This query took more time compared to the previous ones, 58 seconds to be precise.
I'll go ahead with the first one, since that takes almost 28-30 seconds, and will cache the response.
If I create a total_read_count property on the Author node, will it help in some way instead of doing the aggregation?
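Something like this is what I have in mind, assuming APOC is available (apoc.periodic.iterate to batch the writes, given the node counts):
// compute the per-author sum once, in batches of 10k authors
CALL apoc.periodic.iterate(
  "MATCH (a:Author) RETURN a",
  "OPTIONAL MATCH (a)-[:Published]->(b:Book)
   WITH a, SUM(b.read_count) AS total
   SET a.total_read_count = total",
  {batchSize: 10000, parallel: false})
The read side would then be an indexed range scan:
CREATE INDEX ON :Author(total_read_count)
MATCH (a:Author) WHERE a.total_read_count < 500
RETURN a.author_id AS author_id, a.total_read_count AS total_read_count
though the property would need refreshing whenever read_counts change.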
Thanks for all the help @dana.canzano @michael.hunger @andrew.bowman
10-14-2019 11:26 AM
@dana.canzano Sure, we'll see if we can update.
I don't know why we have such a large heap size, since it was already in place. I'll let you know when I find out.
11-14-2019 02:56 PM
Yes, try to use the latest version, and Neo4j Enterprise.
It should use the index for accessing and loading read_count.
And please share a PROFILE with all blocks expanded (there is a little expand box in the bottom right).
11-14-2019 03:15 PM
I recreated your dataset, 1M books and 1M authors with random read_counts, and ran the query that @dana.canzano suggested:
PROFILE MATCH (b:Book) WHERE b.read_count < 500
WITH b
MATCH (b)<-[p:Published]-(a:Author)
WITH a, SUM(b.read_count) AS total_read_count WHERE total_read_count < 500
WITH a.author_id AS author_id, total_read_count
RETURN count(*)
It finished in: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 2500776 total db hits in 2532 ms.
I think your issue is not the summing but returning hundreds of thousands of authors to the browser.
If you don't do that, it works fine.
In your real app you should of course stream-process the results.