Neo4j

leelandclay · 11-25-2019

I'm trying to find a way to randomize a dataset (which I can do), but then paginate it so that I don't have to send the entire list down to the client.

Currently, I have an API that simply takes in the filter criteria, creates a Cypher query and returns the dataset. The client passes in a SKIP and LIMIT to keep track of where they are. The dataset is User Profiles. I've been asked whether it's possible to randomize the list. I did some testing, and adding rand() to the query works well if I want to pull the entire dataset...but when I try to use SKIP and LIMIT I start to get duplicates.

MATCH (p:Person)
WITH p, rand() AS r
ORDER BY r
RETURN p.userId, p.firstName
SKIP 1 LIMIT 25

One option that I had thought about was randomizing the entries on the client side, so that the server would send the results in exactly the same order but the user would see each LIMIT-sized batch shuffled. The problem is that right now, I have mobile clients only downloading 4 profiles at a time (which would make it pretty obvious that the randomizing only happens within each LIMIT-sized batch).

I was about to just change the number of profiles downloaded on mobile devices to something higher so it wouldn't be so obvious...but I wanted to see if there was something already built in that could do something like this for me. Especially considering that if that's possible, I can implement it on the server side and not require an app update.

Leeland

david_allen · 11-25-2019

Random ordering and pagination don't go together, because you'll always end up with the problem of having duplicates in pages depending on the random ordering.

This in turn happens because you recompute the random numbers every time you run the query.
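To make the failure mode concrete, here is a minimal Python sketch that simulates it outside the database. It is not Neo4j code: `random_page` is a hypothetical stand-in for one execution of the `rand()` query, and the 20 integer IDs and page size of 4 are made up for illustration.

```python
import random

def random_page(ids, skip, limit, rng):
    """Simulate one request: draw fresh random sort keys, sort, then SKIP/LIMIT."""
    keyed = [(rng.random(), node_id) for node_id in ids]
    keyed.sort()
    return [node_id for _, node_id in keyed[skip:skip + limit]]

ids = list(range(20))        # stand-ins for 20 Person nodes
rng = random.Random(42)      # fixed seed only so the simulation is repeatable

# The client walks through five pages of four; each request reshuffles everything.
seen = []
for skip in range(0, 20, 4):
    seen.extend(random_page(ids, skip, 4, rng))

duplicates = len(seen) - len(set(seen))
missing = len(set(ids) - set(seen))
print(f"duplicates: {duplicates}, never shown: {missing}")
```

Because every request sorts by a brand new set of random keys, the pages are slices of five unrelated orderings. Any profile that shows up twice takes the slot of one that never shows up at all, so the two counts printed are always equal.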

So you need something pseudo-random that is strongly bound to, or computed from, something on the node that doesn't change. If you can get a pseudo-random number that is deterministically computed from the node ID, for example, then the order will appear random, but it will always come out in "the same random order" on recomputation.

Check out APOC's hashing and md5 functions (call apoc.help("md5")). If I were in your shoes I would probably try something like ordering by the MD5 hash of the node ID. This will never change (because the node ID doesn't change) but will appear random because the md5 hash itself is chaotically distributed across the hash space.
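As a rough illustration of the idea, here is the same kind of simulation with the sort key swapped for a hash of the ID. This is a Python stand-in using `hashlib.md5`, not the APOC function; the helper names `order_key` and `hashed_page` are hypothetical, and the digests are not meant to match what APOC would return for the same IDs.

```python
import hashlib

def order_key(node_id):
    """MD5 hex digest of the node ID; always the same for a given ID."""
    return hashlib.md5(str(node_id).encode("utf-8")).hexdigest()

def hashed_page(ids, skip, limit):
    """Sort by the deterministic hash key, then apply SKIP/LIMIT."""
    ordered = sorted(ids, key=order_key)
    return ordered[skip:skip + limit]

ids = list(range(20))  # stand-ins for 20 node IDs

# Each request recomputes the ordering, but it comes out identical every time.
pages = [hashed_page(ids, skip, 4) for skip in range(0, 20, 4)]
shown = [node_id for page in pages for node_id in page]

print(pages[0])
print("every ID shown exactly once:", sorted(shown) == ids)
```

Since the key depends only on the ID, every page is a slice of one fixed ordering, so walking SKIP forward covers each item once with no repeats.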

leelandclay · 11-25-2019

Thanks. I will run some tests in the morning to see if that will give the results I need.

leelandclay · 11-25-2019

I couldn't sleep so I did some quick testing and while the md5 definitely changes the order, it's not quite what I need. The issue is still that each time the user refreshes the page, they will see the list in the same order.

However, that did lead to an idea. If I get a timestamp when the user first loads the page and pass that up to the server, I can add that into the md5 call to make it reorder each time they refresh the page.

The query I came up with is:

MATCH (p:Person)
RETURN p.userId, id(p) as nodeId, apoc.util.md5([id(p), 1574739489620]) as hash
ORDER BY hash
SKIP 0 LIMIT 4

The second entry in the md5 call is the timestamp I added in. It will require an update on the client side...but I'm not seeing a way to do it without making changes on the client.
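The effect of mixing a per-session value into the hash can be sketched in Python as follows. This only simulates the ordering logic and is not the APOC call: `seeded_key` and `seeded_page` are hypothetical helpers, the way the ID and seed are joined into one string is an assumption, and the second timestamp is invented to represent a later page refresh.

```python
import hashlib

def seeded_key(node_id, seed):
    """MD5 hex digest of the node ID combined with a per-session seed."""
    payload = f"{node_id}:{seed}".encode("utf-8")
    return hashlib.md5(payload).hexdigest()

def seeded_page(ids, seed, skip, limit):
    """Sort by the seeded hash key, then apply SKIP/LIMIT."""
    ordered = sorted(ids, key=lambda node_id: seeded_key(node_id, seed))
    return ordered[skip:skip + limit]

ids = list(range(20))        # stand-ins for 20 node IDs
first_load = 1574739489620   # timestamp from the query above
refresh = 1574739489621      # hypothetical timestamp of a later refresh

# Same seed on every page request: the pages tile the list without repeats.
session = [n for skip in range(0, 20, 4)
           for n in seeded_page(ids, first_load, skip, 4)]
print("no repeats within a session:", sorted(session) == ids)

# New seed after a refresh: a different, but again stable, ordering.
print(seeded_page(ids, first_load, 0, 4))
print(seeded_page(ids, refresh, 0, 4))
```

The important detail is that the client has to send the same seed with every page request for a given session and only pick a new one on refresh; if the seed changed between pages, the duplicates would come back.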

Thanks for cluing me into the md5 call. That's exactly what I needed!

david_allen · 11-26-2019

Yep, that's a great adaptation! The principle is still the same: a pseudo-random ordering token that is deterministically computed. If adding a session-specific timestamp (or any other piece of information) helps, all the better. Glad it worked.

Randomize and Paginate a dataset