Neo4j

chris3 · ‎03-05-2020

Do we really need to perform 2 queries when doing pagination:

to get the total count
to get the actual result with ORDER BY and SKIP / LIMIT

For example: MATCH (u:User) WHERE ... RETURN u ORDER BY u.username SKIP 100 LIMIT 25
already has to get all the (filtered) nodes, sort them, then return just the nodes of the "page" (SKIP/LIMIT) as shown by a PROFILE query:

I - and possibly millions of other users out there - ALWAYS need to show the total count of matching nodes to the user ... which is the number of rows after the (last) Filter step (4099 in this case).

Wouldn't it be possible for the DB to return this count as part of the query-stats?
Or is this possible with some secret parameter already?
Or is there a performant workaround that I am unaware of?

Cheers, Chris

Thomas_Silkjaer · ‎03-05-2020

Do two completely separate queries and don't do ordering on the count query.
Make sure you use indexes. Depending on the data size, you should add an index to the properties you are filtering on OR to the property you are ordering by.

chris3 · ‎03-05-2020

I am - of course - already executing 2 (optimized) queries BUT it seams so stupid since the result of the count query is already part of the "real" query - that's why I am asking and proposing that the number should be returned in the query-stats.

Thomas_Silkjaer · ‎03-05-2020

But if you’re only returning 25, it doesn’t know the complete count

I was only commenting on the indexes, since the profile output does not look very optimized…

chris3 · ‎03-06-2020

"But if you’re only returning 25, it doesn’t know the complete count"... it HAS to know the total count in the query plan (see 4099 rows in the screenshot of my original post) in order to sort and then skip/limit and THAT number is what we need

peter_haiduk · ‎08-10-2020

Ah, I totally see what you're up to.
I'm completely new to Neo4j, but I might have found a solution for you.

The key problem we're trying to solve here is "dragging" along a mid-query value to the final RETURN statement. The obvious solution would be to have multiple RETURNs, which, unfortunately, is not possible.
So the unsatisfying solution to still have two RETURNs is two do two queries xD

Unless.. we use aggregation / lists.
I have to admit this is not a general solution to the "multiple RETURNs problem", there are many many more cases where this would help and aggregation can't help out, but in your case it does the job.

MATCH (u:User)
WHERE u.age > 21
WITH u.username as username
ORDER BY username
WITH collect(username) AS usernames		// <== usernames is now a list :p
RETURN size(usernames) AS userCount, usernames[0..25] as usernames

I am not familiar with the implementation, but maybe maybe this query isn't even that unperformant 😄 (despite dealing with huge arrays instead of natural streams of records xDD).
Would be nice to hear what your benchmarks are saying @chris3

peter_haiduk · ‎08-13-2020

@chris3 @Thomas_Silkjaer
I found another way 😄

My new query should be really really efficient, because in contrast to my first suggestion it does not aggregate a huge list at any point of time, but instead uses ordinary records/rows that are truncated with LIMIT, just as @chris3 initially requested.

MATCH (u:User)
WHERE u.age > 21
WITH count(u) AS userCount

// now second match (cross product with the single userCount number record)
MATCH (u:User)
WITH u.username as username, userCount
// each username row/record now contains the same global userCount number as a second column

ORDER BY username
LIMIT 25

// from now on we only deal with 25 records (definitely computationally cheap)
RETURN
  collect(DISTINCT userCount) AS userCount,
  collect(DISTINCT username) AS usernames

Notice that userCount still is a list here, but this shouldn't be a problem 😉
If you wanted you could unwrap it 😛

// ...

LIMIT 25

WITH
  collect(DISTINCT userCount) AS userCountArr,
  collect(DISTINCT username) AS usernames
UNWIND userCountArr AS userCount
RETURN userCount, usernames

Hope this helps. I have a lot of fun with Cypher Queries ;D

chris3 · ‎08-18-2020

thanks for your answers!

I am using spring-data-neo4j (OGM) and kinda need a result with full nodes (1 per row => List)

I still think that the query plan could/should return this "count before skip/limit" 😉

peter_haiduk · ‎08-18-2020

I am trying to figure out why my proposal queries don't work in your setup. Unfortunately I'm not familiar with SpringData Neo4j.

Do you mean that in your setup the users MUST be returned as individual rows and not as a list (single row)?

Then really the only way is to use my second proposal and strip the collect() aggregation part at the end.
Then each username row would come with the same total count. You then could simply read this number from any row received. Straight-Forward would be the first one.

MATCH (u:User)
WHERE u.age > 21
WITH count(u) AS userCount

MATCH (u:User)

RETURN u.username as username, userCount    // <== every row stores this same number
ORDER BY username
LIMIT 25

I know it would be nice to have multiple RETURN statements that commit to a Response Data Object. But I can assure you that this is currently only possible by running two separate queries.

It don't know if this is applicable to your stack, but for me there would be two possible ways to run these two queries in one database roundtrip:

APOC's .runMany() procedure
- https://neo4j.com/labs/apoc/4.1/cypher-execution/cypher-multiple-statements/
- (I have not tested it yet)
Use a Transaction API
- eg. of the HTTP Endpoint: https://neo4j.com/docs/http-api/current/actions/
- eg. of a Driver Library: https://github.com/neo4j/neo4j-javascript-driver#explicit-transactions
- (I have not tested them yet)

chris3 · ‎10-20-2021

Hi @peter_haiduk

I know it's over 1 year later but I just wanted to THANK YOU for your reply!
In the meantime I have been working on optimizing some queries in our production systems and I just wanted to let you know that your solution works and also works for OGM.

The only thing I don't like is that every row of the result has the count column (which requires extra domain objects for OGM). "Facets" (see MongoDB) would be nice.

Neo4j

Get COUNT before LIMIT