Neo4j

shwang1 · ‎01-06-2023

I have a query that counts User nodes that have paths to certain segment (a.k.a. category) nodes and do not have paths to other segment nodes. My production data set will have billions of users and thousands of segments, with each user sometimes having paths to hundreds of segments. I'm testing this on a small dataset (about 50k users and 50k segments, with exactly one path from each user to the segment with the same id; i.e. segment_1 has a single incoming path, user_1 -> segment_1) and looking at the profiling information for the query below:

MATCH (n:User)
where ((n)-[:IS_MEMBER]->({segmentId: 1}) OR (n)-[:IS_MEMBER]->({segmentId: 2}))
return count(DISTINCT n);

From the query plan I see that the above query reads all the paths that I currently have in the database (50k paths)! I would have thought that the query planner would detect that the only nodes of interest are segments 1 and 2 and read just the incoming edges of those nodes to do the count, but it doesn't seem to be doing that. Is there something I can do to avoid scanning all the paths? That would help with more complex WHERE conditions and with scaling to production-sized data. I'm also trying to get as close to real-time speeds as possible for some of the smaller segments.

dana_canzano · ‎01-06-2023

@shwang1

Possibly, if you had an index on segmentId. But to create an index you need to specify both a label and a property, e.g.

create index for (n:Person) on (n.segmentId);

Note: this is no different from an RDBMS and SQL, where an index is created on a table and a column, not on a column alone.

You cannot create an index on just a property. For the index to be used, the query also needs to specify the label, for example:

MATCH (n:User)
where ((n)-[:IS_MEMBER]->(:Person {segmentId: 1}) OR (n)-[:IS_MEMBER]->(:Person {segmentId: 2}))
return count(DISTINCT n);
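
One way to check whether the planner picks up the index is to prefix the query with PROFILE and inspect the resulting plan. This is a minimal sketch, assuming the segment nodes carry the Person label used above (substitute your own label) and that the index already exists. If the index is used, the plan would be expected to contain a NodeIndexSeek operator, although the planner may still choose to start from the User nodes for this query shape:

```cypher
// PROFILE runs the query and reports the operators and db hits per step.
// The label Person is taken from the example above and is an assumption
// about the data model; replace it with the real label of the segment nodes.
PROFILE
MATCH (n:User)
WHERE (n)-[:IS_MEMBER]->(:Person {segmentId: 1})
   OR (n)-[:IS_MEMBER]->(:Person {segmentId: 2})
RETURN count(DISTINCT n);
```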

glilienfield · ‎01-06-2023

I see your point. That is going to happen the way the query is written. The query will match every User node and then expand to find the related segment nodes in order to filter the User nodes.

You can approach it from the other direction, so that the segment nodes are found first, then expanded to get the nodes to count. Try the following to see if it is more efficient. I assumed the related nodes have a 'Segment' label. Change this to the correct label.

Also, as @dana_canzano recommended, you should add an index to speed up finding the Segment nodes.

match (s:Segment{segmentId: 1})
match (s)<-[:IS_MEMBER]-(n:User)
with collect(n) as segment_1_users
match (s:Segment{segmentId: 2})
match (s)<-[:IS_MEMBER]-(n:User)
with segment_1_users, collect(n) as segment_2_users
return size(apoc.coll.toSet(segment_1_users+segment_2_users))

create index segment_segment_id if not exists for (n:Segment) on (n.segmentId)

glilienfield · ‎01-07-2023

Just realized you need to use 'optional match' on lines 2 and 5 of the query in case no users are members of those segments. Otherwise the query stops and returns no result. With 'optional match', the corresponding collection will be empty, as it should be, and the result will be correct.
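
For reference, the query with that change applied might look like the sketch below. It assumes the same 'Segment' label and the APOC library as the earlier query; the variable names s1, s2 and m and the unique_users alias are only illustrative. Note that the plain 'match' on each Segment node still returns nothing if the segment node itself does not exist.

```cypher
// Find segment 1 via the index, then optionally expand to its members.
MATCH (s1:Segment {segmentId: 1})
OPTIONAL MATCH (s1)<-[:IS_MEMBER]-(n:User)
// collect() skips nulls, so a segment with no members yields an empty list.
WITH collect(n) AS segment_1_users
MATCH (s2:Segment {segmentId: 2})
OPTIONAL MATCH (s2)<-[:IS_MEMBER]-(m:User)
WITH segment_1_users, collect(m) AS segment_2_users
// apoc.coll.toSet removes users that appear in both lists.
RETURN size(apoc.coll.toSet(segment_1_users + segment_2_users)) AS unique_users;
```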

shwang1 · ‎01-07-2023

Thank you all! I'll try this out next week!

glilienfield · ‎01-07-2023

You're welcome. By the way, the combined data is added to a set to remove duplicates, i.e. users who are in both segments. The count is the number of unique users in either segment; a user who is in both segments is not counted twice. If you want the total number of users in segment 1 and segment 2, just remove the apoc.coll.toSet call and return size(segment_1_users)+size(segment_2_users) instead.
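
If APOC is not available, the same unique count can probably be obtained with plain Cypher by starting from the Segment nodes and using count(DISTINCT ...). This is a sketch of a different formulation, not the query above; it again assumes the 'Segment' label, and the excluded segmentId 3 is a made-up value to illustrate the "not a member of another segment" part of the original question.

```cypher
// Start from the segments of interest; with an index on :Segment(segmentId)
// the IN predicate can be answered by index lookups.
MATCH (s:Segment)<-[:IS_MEMBER]-(n:User)
WHERE s.segmentId IN [1, 2]
  // Hypothetical exclusion: drop users who are also members of segment 3.
  AND NOT (n)-[:IS_MEMBER]->(:Segment {segmentId: 3})
RETURN count(DISTINCT n) AS unique_users;
```

Running it with PROFILE should show whether the plan starts from the two Segment nodes rather than from all User nodes.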

Is there a way to only reference the incoming paths into a few nodes instead of scanning all paths?