06-20-2019 11:36 AM
Hello,
I'm trying to get a random subset of nodes returned (I'm downsampling my data here) and I would like it to be repeatable. Is there a way to set a seed in the rand() function?
My code is:
MATCH (doc:Document)
RETURN doc.title, rand() AS rand
ORDER BY rand ASC LIMIT 10
Also, is there a more efficient way, with fewer db hits, to get 10 random documents? This query visits every Document node only to pick 10 at the end.
06-20-2019 11:57 AM
Try this:
MATCH (:Document) WITH count(*) AS docCount
MATCH (doc:Document)
WHERE rand() < 10.0/docCount
RETURN doc.title
Note that this does not always give you exactly 10 docs back, so you might use a larger threshold and add a LIMIT 10 at the end.
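As a rough sketch of that adjustment: the factor of 3 below is an arbitrary assumption, chosen only so that fewer than 10 matches becomes unlikely, and it is not a tuned value. Keep in mind that LIMIT stops at the first 10 rows that pass the filter, so the result may lean toward nodes that are scanned early.

```cypher
// Count the documents once, then keep each one with probability ~30/docCount.
MATCH (:Document) WITH count(*) AS docCount
MATCH (doc:Document)
// 30.0 = 3 x the wanted sample size; the factor 3 is an assumed safety margin.
WHERE rand() < 30.0/docCount
RETURN doc.title
// Cut the oversampled result down to at most 10 rows.
LIMIT 10
```

This can still return fewer than 10 rows on an unlucky run, just much less often than with a threshold of 10.0/docCount.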
That statement does not require large intermediate data structures, but it still iterates over all the documents.
Another approach, which avoids a full label scan, is below. First we get the highest ID in use, then try to find random nodes by id that have the right label:
CALL dbms.queryJmx("org.neo4j:instance=kernel#0,name=Primitive count") YIELD attributes
WITH attributes.NumberOfNodeIdsInUse.value AS maxId
UNWIND range(0,100000) AS x
MATCH (d:Document) WHERE id(d) = toInteger(rand()*maxId)
RETURN d LIMIT 10
06-21-2019 10:20 PM
Thank you for the response 🙂 With PROFILE I found that the first two are basically equal, at about 38k db hits. The last one has fewer db hits only for very low limits (fewer than ~15 in my case, with 38k documents); with a higher limit its db hits get much higher.
If I use one of the first two versions, is there a way to make the results repeatable with a seed of some sort, or where is the right place to request that as a feature?
06-22-2019 04:36 PM
I don't think that would be possible, since ids can be reused as nodes are deleted and new nodes added. That alone would defeat any ability to have a seed that can repeat results based on graph id lookup.
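One possible workaround, sketched here under assumptions rather than offered as a built-in seed: skip rand() and internal ids entirely, and order by a hash of a stable property combined with a seed string. This assumes the APOC library is installed (apoc.util.md5 is an APOC function, not core Cypher) and that doc.title is stable and unique enough to act as a key. The seed value 'seed-42' is just a placeholder.

```cypher
// Assumption: APOC is installed and doc.title is a stable, unique property.
MATCH (doc:Document)
// Hash title + seed: the same seed gives the same pseudo-random order every run.
WITH doc, apoc.util.md5([doc.title, 'seed-42']) AS sortKey
RETURN doc.title
ORDER BY sortKey ASC
LIMIT 10
```

Changing the seed string gives a different, but again repeatable, sample. The result only stays the same while the set of documents and their titles stay the same, and the query still visits every Document node, so it does not reduce db hits.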
06-23-2019 09:08 PM
Ah, yah, that does make sense, thanks for the reply 🙂