

Returning a random subset of nodes (ideally repeatable)

oleg_neo4j
Graph Buddy

Hello,
I'm trying to return a random subset of nodes (I'm downsampling my data here), and I would like the result to be repeatable. Is there a way to set a seed for the rand() function?
My code is:

MATCH (doc:Document)
RETURN doc.title, rand() AS rand
ORDER BY rand ASC LIMIT 10

Also, is there a more efficient way, with fewer db hits, to get 10 random documents? This query visits every Document node only to pick 10 at the end.


Try this:

MATCH (:Document) WITH count(*) AS docCount
MATCH (doc:Document)
WHERE rand() < 10.0/docCount
RETURN doc.title

Note that this does not always give you exactly 10 docs back, so you might want a larger threshold and a LIMIT 10 at the end.
That statement does not require large intermediate data structures, but it still iterates over all the documents.
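
The larger threshold plus limit could look roughly like the sketch below. The oversampling factor of 2 (20.0 instead of 10.0) is an arbitrary assumption on my part, not something tuned against real data:

```cypher
// Count the documents once, then keep each one with probability 20/docCount.
// On average this matches about 20 documents; LIMIT trims the result to 10.
MATCH (:Document) WITH count(*) AS docCount
MATCH (doc:Document)
WHERE rand() < 20.0 / docCount
RETURN doc.title
LIMIT 10
```

This can still occasionally return fewer than 10 rows. Also, LIMIT keeps the first 10 matches in scan order, so the sample probably leans toward nodes that are scanned earlier.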

Another approach, which avoids a full label scan, is below. First we need the highest ID in use, then we try to find a random node by ID that has the right label:

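// Read the number of node IDs in use via JMX and use it as the upper bound for random IDs
CALL dbms.queryJmx("org.neo4j:instance=kernel#0,name=Primitive count") YIELD attributes
WITH attributes.NumberOfNodeIdsInUse.value AS maxId
// Generate many candidate rows; each one looks up a random ID and keeps it only if it is a Document
UNWIND range(0,100000) AS x
MATCH (d:Document) WHERE id(d) = toInteger(rand()*maxId)
RETURN d LIMIT 10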

oleg_neo4j
Graph Buddy

Thank you for the response 🙂 With PROFILE I found that the first two queries are basically equal, at about 38k db hits. The last one has fewer db hits only for very low limits (below ~15 in my case, with 38k documents); with a higher limit its db hits get much higher.
If I use one of the first two versions, is there a way to make the results repeatable with a seed of some sort? If not, where is the right place to request that as a feature?

I don't think that would be possible, since IDs can be reused as nodes are deleted and new nodes are added. That alone would defeat any seed meant to repeat results based on graph ID lookup.
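
One possible workaround that avoids IDs altogether is to order by a hash of a stable property combined with a seed value. This is only a sketch under assumptions: it needs the APOC library for apoc.util.md5, it assumes doc.title does not change and is reasonably unique, and $seed is a hypothetical query parameter introduced here for illustration:

```cypher
// Assumes APOC is installed. $seed is a hypothetical parameter supplied by the caller.
// The same seed and the same titles give the same 10 documents on every run.
MATCH (doc:Document)
WITH doc, apoc.util.md5([doc.title, toString($seed)]) AS sortKey
ORDER BY sortKey
LIMIT 10
RETURN doc.title
```

This still touches every Document node, so it does nothing for the db hits. The sample also shifts whenever documents are added or removed or titles are edited, so it is repeatable only for unchanged data.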

oleg_neo4j
Graph Buddy

Ah, yah, that does make sense, thanks for the reply 🙂