cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

How to get n nodes per label without expanding all

Imran
Node Clone

I am trying to limit the number of related nodes in my result per label. Let's say I have a node type A that is related to nodes with different labels (X, Y, Z, etc.). (a:A)-[:RELATED_TO]->(b), here b can have either one of the (X, Y, Z, etc.) labels. For each A I want to fetch maximum 5 related nodes of each label. Sometimes one A node can be RELATED_TO thousands of (X, Y, Z, etc.) nodes.

I tried the below query which gives the desired result, but is not optimal because it expands all of a's relationships:

MATCH (a:A)-[:RELATED_TO]->(b) WHERE a.someCondition
WITH labels(b)[0] as label, collect(b)[0..5] as nodes
UNWIND nodes as n
RETURN n

Another way I have tried is matching separately per label with limit and using union:

MATCH (a:A)-[:RELATED_TO]->(b:X) WHERE a.someCondition
RETURN b limit 5
UNION
MATCH (a:A)-[:RELATED_TO]->(b:Y) WHERE a.someCondition
RETURN b limit 5
(... and so on for other labels I want)

The problem with the second approach is that I am performing the same match multiple times for each part of the union.
Is there a better way to get the desired results without having to expand all of A's relationships?

1 ACCEPTED SOLUTION

Note that with Neo4j 4.1, you can use subqueries for this, within which you can use LIMIT that is scoped only to the subquery (and not applying it to all rows of the outer query).

For example:

UNWIND ['lbl1', 'lbl2'] as label
CALL { 
  WITH label
  MATCH (a:A)-[:RELATED_TO]->(b) WHERE label in labels(b)
  RETURN b
  LIMIT 10
}
RETURN b

View solution in original post

16 REPLIES 16

ameyasoft
Graph Maven
Try this:

MATCH (a:A)-[:RELATED_TO]->(b) WHERE a.someCondition
with a, count(labels(b)) as cnt
with a, cnt where cnt = 5
return distinct labels(b) as lbl, cnt

asperlinga
Graph Buddy

or think about CASE statement

(DB Northwind)

MATCH (a:Customer)-[:PURCHASED]->(b:Order)
where a.companyName starts with "A"
RETURN
CASE WHEN count(labels(b))>0 THEN
"Country "+a.country+" Order "+b.orderID END as result

Imran
Node Clone

Thanks guys. I am not interested in getting the nodes with a certain number of labels, rather a certain number of nodes per label.
Eg. for a path match (a:A)-[:RELATED_TO]->(b) return b, there can be thousands of potential b's. Each b has exactly one label. Lets say the above query returns 90k nodes -> 30k with label X, 30k with label Y, 30k with label Z. I would like to get 5 of each, without having to incur the cost of expandAll on those 90k nodes

Since you have to do collect there is no way to do without doing full traversal in Cypher.

Best way is to use a store procedure or if you know label distribution of b you can try

In that case you can try this

UNWIND ['lbl1', 'lbl2'] as label
MATCH (a:A)-[:RELATED_TO]->(b) WHERE label in labels(b)
WITH a, b
LIMIT 10
RETURN b

I think this will return 10 results of just 'lbl1' and won't return 5 of each

I guess you are limited to a union query or store procedure here.

Imran
Node Clone

I was able to come up with this using apoc that seems promising. I would love to get opinions on whether you guys see any issues with it:

MATCH (a:A) where a.someCondition
CALL apoc.cypher.run('
WITH {a} as a MATCH (a)-[:RELATED_TO]->(b:X) RETURN b limit 5
UNION
WITH {a} as a MATCH (a)-[:RELATED_TO]->(b:Y) RETURN b limit 5
UNION
WITH {a} as a MATCH (a)-[:RELATED_TO]->(b:Z) RETURN b limit 5
', {a:a})
YIELD value RETURN value.b

It seems to return 5 nodes per label without doing full traversal. However, I think it will do full traversal if there are less than 5 nodes for a certain label.

With apoc you can try this

UNWIND ['lbl1', 'lbl2'] as label
CALL apoc.cypher.run('MATCH (a:A)-[:RELATED_TO]->(b:'+label+') RETURN b limit 5', {}) yield value
return value

It returns at the most 5 of each label values

This seems perfect. Thanks! Wondering if there is a way to avoid full traversal when less than 5 nodes are present of a specific label. I guess the only way would be to store counts of related nodes on A for each label type, and do a conditional. Might be a bit of an overkill to avoid an occasional full traversal

Note that with Neo4j 4.1, you can use subqueries for this, within which you can use LIMIT that is scoped only to the subquery (and not applying it to all rows of the outer query).

For example:

UNWIND ['lbl1', 'lbl2'] as label
CALL { 
  WITH label
  MATCH (a:A)-[:RELATED_TO]->(b) WHERE label in labels(b)
  RETURN b
  LIMIT 10
}
RETURN b

Thanks. Haven't tested it on Neo4j 4.1, but just curios about the expected behavior of this subquery. If there are 50,000 nodes of 'lbl1' and 5 nodes of 'lbl2', will it traverse all 50,005 nodes to get 10 of each? If yes, any suggestions of possible optimizations?

No, I wouldn't think so. The WHERE label in labels(b) part shouldn't be considered by the planner for where to start looking. It should start with a label scan of :A nodes, expanding :RELATED_TO relationships to try to find a connected node of lbl1 or lbl2. Once 10 matches are found it will stop looking.

Oh okay. In that case, it's possible that it might return 10 nodes of 'lbl1' and 0 nodes of 'lbl2'. My goal is to get at most 5 nodes for each label.

Ah, I'm a bit mistaken. Because of the UNWIND, the MATCH will find LIMIT 10 per label. If you want it 5 each, make that LIMIT 5.

Oh yeah, got it. Because of unwind the subquery will run once for each label.
My question was regarding the label scan where there are less than 10 nodes for the label (using limit 10). Let's say it's running the subquery for 'lbl2' with limit set to 10. Let's assume that :A has 50,000 neighbors none of which has 'lbl2'. Will the query end up traversing all of the 50,000 neighbors to know that there are no 'lbl2' nodes (that's how it works in the 3.x versions I think), or does Neo4j maintain some sort of count of neighbors per label internally, and will avoid a full traversal?

The behavior is the same. Nodes don't know anything about their neighbors, excepting the relationship types and directions. It would have to expand to and filter all its neighbors in its attempt to find the limited number of results.

If you used more specific relationship types, then you could make this more efficient. If the relationships were :RELATED_TO_LBL1 and :RELATED_TO_LBL2, then it would only select those relationships to traverse (when a node is dense its structure for holding relationships changes, making selection of relationships efficient and avoiding the need to filter all relationships). Since there would be no :RELATED_TO_LBL2 relationships, it wouldn't need to expand or filter anything, just immediately return no results for that subquery invocation.