Neo4j

sebastien-ext · ‎06-01-2021

Hi everyone,
I'm facing difficulties with the cardinality as I try to improve my skills using Neo4j with more complex queries.
Of course I already read this topic but I still don't realy understand how I can get what I want.

 MATCH (n:AP), (m:Customer), (o:Customer)
 MATCH pl=(o)<-[:PRESIDENT]-(n)-[:SHAREHOLDER]->(m)

My nodes labelled AP have many relationship types with the Customer nodes but I only want to get the AP nodes that have at least 1 PRESIDENT and 1 SHAREHOLDER relationship with the Customer nodes. I only want the AP nodes which have both but unfortunately some have both, some have only 1 of the two types.
I don't know if I did a mistake using two Customer nodes because m and o nodes can be the same node.

After these two lines I tried many queries using DISTINCT, collect() but it seems I miss the point.

Also I have a bonus question : how to get some nodes from half of the query (i.e. the first 3 nodes n) with all the m and o nodes that match the path with the relationship type ?

Thank you !

andrew_bowman · ‎06-01-2021

Hello,

Yes there's potentially a cardinality issue here. If I were to guess, you may have performed the first MATCH almost as if it was a declaration, as in other programming languages. That isn't needed here, and in some circumstances can be problematic.

If you did not have the second MATCH, then the first MATCH would have led to a cartesian product between every :AP, :Customer, and :Customer node, resulting in a number of rows equal to the product of the number of nodes in each of these 3 cases. For example, if there were 1000 :Customer nodes and 500 :AP nodes, then you would have 1000 * 1000 * 500 rows as a result due to the cartesian product, all combinations of these 3.

Your second MATCH, thankfully, provided a pattern for which these nodes are supposed to relate. The planner is able to understand that and plan based upon fulfillment of the path, avoiding the cartesian product. However, if there were any elements in the Cypher restricting the ordering (such as a WITH between them introducing a new variable), then the planner would have had no choice but to perform the cartesian product operation from the first MATCH before it could consider the second MATCH.

Going forward, use the labels inline in your pattern, as here is no need to declare your variables:

MATCH pl=(o:Customer)<-[:PRESIDENT]-(n:AP)-[:SHAREHOLDER]->(m:Customer)
...

Now regarding your use case, if o and m should be the same customer (:AP is president and shareholder of the same customer), then only one variable is needed, and you can move part of your pattern into the WHERE clause:

MATCH pl=(o:Customer)<-[:PRESIDENT]-(n:AP)
WHERE (n)-[:SHAREHOLDER]->(o)
...

If it doesn't need to be the same company (:AP is the president of any customer and the shareholder for any customer) then we can move both into the WHERE clause:

MATCH (n:AP)
WHERE (:Customer)<-[:PRESIDENT]-(n) AND (n)-[:SHAREHOLDER]->(:Customer)
...

If :PRESIDENT and :SHAREHOLDER relationships only point to :Customer nodes, then you can omit the :Customer label from those patterns and the query will become even more efficient (since we can tell from the n node which relationships it has, and if we don't need to filter on the node at the other end, then there's no need to expand to the other node).

For how you could get 3 n nodes and its connecting nodes, assuming there is no ordering, we can build off of the last query: find an :AP node with a :PRESIDENT and a :SHAREHOLDER relationship, then apply the LIMIT, then with that limited result set either expand the pattern and collect, or (more efficient) use a pattern comprehension to get the results of an expansion into lists:

MATCH (n:AP)
WHERE (:Customer)<-[:PRESIDENT]-(n) AND (n)-[:SHAREHOLDER]->(:Customer)
WITH n
LIMIT 3
RETURN n, [(n)-[:PRESIDENT]->(c:Customer) | c] as presidentForCustomers, [(n)-[:SHAREHOLDER]->(c:Customer) | c] as shareholderForCustomers

View solution in original post

clem · ‎06-01-2021

I'm not sure what you want exactly.... do you just want the AP nodes?

Something like this?

MATCH someexpression-(n:AP)-someotherexpression
RETURN n

You don't need the first part of the query. (In actual fact, it will make the system perform poorly...). Neo4J will figure out the right thing to do based on the second MATCH.

MATCH (o:Customer)<-[:PRESIDENT]-(n:AP)-[:SHAREHOLDER]->(m:Customer)

The conceptual hurdle that you need to make, is that Neo4J's data lay out, is very efficient, so starts with whatever node makes sense (maybe the AP, or maybe the Customer), finds the relationships and the nodes (with the right label) and does the matches. It then throws out whatever doesn't fit. And it does it very fast.

If you're do want o and m not to be the same node, then you can use a WHERE clause to specify that.

MATCH (o:Customer)<-[:PRESIDENT]-(n:AP)-[:SHAREHOLDER]->(m:Customer)
WHERE id(o) <> id(m). // this uses the internal Neo4J id.
RETURN n // or you can also return o and m

If you have your own unique id, you can use that attribute instead:

WHERE o.myID <> m.myID

I hope that helps.

sebastien-ext · ‎06-02-2021

Thank you for your answer.
Sorry for not beeing very clear.
I want the whole pattern. As my AP nodes can have many kind of relationship types with the Customers, I insisted on the matching with only the nodes that have both relationship types.

Thank you for the tips with the Neo4j ids !

andrew_bowman · ‎06-01-2021

Hello,

Yes there's potentially a cardinality issue here. If I were to guess, you may have performed the first MATCH almost as if it was a declaration, as in other programming languages. That isn't needed here, and in some circumstances can be problematic.

If you did not have the second MATCH, then the first MATCH would have led to a cartesian product between every :AP, :Customer, and :Customer node, resulting in a number of rows equal to the product of the number of nodes in each of these 3 cases. For example, if there were 1000 :Customer nodes and 500 :AP nodes, then you would have 1000 * 1000 * 500 rows as a result due to the cartesian product, all combinations of these 3.

Your second MATCH, thankfully, provided a pattern for which these nodes are supposed to relate. The planner is able to understand that and plan based upon fulfillment of the path, avoiding the cartesian product. However, if there were any elements in the Cypher restricting the ordering (such as a WITH between them introducing a new variable), then the planner would have had no choice but to perform the cartesian product operation from the first MATCH before it could consider the second MATCH.

Going forward, use the labels inline in your pattern, as here is no need to declare your variables:

MATCH pl=(o:Customer)<-[:PRESIDENT]-(n:AP)-[:SHAREHOLDER]->(m:Customer)
...

Now regarding your use case, if o and m should be the same customer (:AP is president and shareholder of the same customer), then only one variable is needed, and you can move part of your pattern into the WHERE clause:

MATCH pl=(o:Customer)<-[:PRESIDENT]-(n:AP)
WHERE (n)-[:SHAREHOLDER]->(o)
...

If it doesn't need to be the same company (:AP is the president of any customer and the shareholder for any customer) then we can move both into the WHERE clause:

MATCH (n:AP)
WHERE (:Customer)<-[:PRESIDENT]-(n) AND (n)-[:SHAREHOLDER]->(:Customer)
...

If :PRESIDENT and :SHAREHOLDER relationships only point to :Customer nodes, then you can omit the :Customer label from those patterns and the query will become even more efficient (since we can tell from the n node which relationships it has, and if we don't need to filter on the node at the other end, then there's no need to expand to the other node).

For how you could get 3 n nodes and its connecting nodes, assuming there is no ordering, we can build off of the last query: find an :AP node with a :PRESIDENT and a :SHAREHOLDER relationship, then apply the LIMIT, then with that limited result set either expand the pattern and collect, or (more efficient) use a pattern comprehension to get the results of an expansion into lists:

MATCH (n:AP)
WHERE (:Customer)<-[:PRESIDENT]-(n) AND (n)-[:SHAREHOLDER]->(:Customer)
WITH n
LIMIT 3
RETURN n, [(n)-[:PRESIDENT]->(c:Customer) | c] as presidentForCustomers, [(n)-[:SHAREHOLDER]->(c:Customer) | c] as shareholderForCustomers

sebastien-ext · ‎06-02-2021

Thank you for your answer.

Yes you are right, I wrote the first line as a variable declaration and I understand now why it is not necessary.

It's getting clearer now.

What are the keypoints to avoid cardinality ?

andrew_bowman · ‎06-02-2021

Testing your queries with PROFILE can be helpful, as that lets you observe the number of rows flowing between the operations. Since most operations execute per row, paying attention to the number of rows, especially when they spike, can help you figure out where tuning might be helpful.

Some useful things to keep cardinality low are to use index lookups when possible, filter and LIMIT early (where it makes sense to do so), and defer property projections (when possible) until after aggregations, filters, and LIMITs.

Neo4j

Issues with cardinality