Neo4j

jose · ‎04-22-2019

Hello,
I am using Neo4j 3.5.4, macOS/Unix version, in the cypher-shell.

I would like to get some help in understanding why these queries create list results that are formatted differently. I ask because I’ve written many queries that return list results in Format 1 with queries written like Code 2. Given the similarity between Code 1 & 2, I don’t understand why one returns one list format and the other another list format.

// ========================= Code 1 ============
// INPUT
MATCH (a1:SOURCE)<-[r1:A_IS|B_IS {OUTPUT:"UP"}]-(b:TARGET{ID:"303"})
optional match (b:TARGET)-[r2]->(a2: SOURCE)

   // BODY
   with a1, a2, r1 
   
   // OUTPUT
   return a1.NODE_NAME as NODE1, collect(distinct a2.NODE_NAME) as NODE2List ORDER BY NODE1;

// ———————— Output Format 1 —————————

+-------------------------------+
| NODE1 | NODE2List             |
+-------------------------------+
| "AAA" | ["BBB", "CCC", "DDD"] |
+-------------------------------+

// ========================= Code 2 ============
// INPUT
MATCH (a1:SOURCE)<-[r1:A_IS|B_IS {OUTPUT:"UP"}]-(b:TARGET{ID:"79"})
optional match (b:TARGET)-[r2]->(a2: SOURCE)

   // BODY
   with a1, a2, r1, collect(distinct a2.NODE_NAME) as NODE2List
   
   // OUTPUT
   return a1.NODE_NAME as NODE1, NODE2List ORDER BY NODE1;

// ———————— Output Format 2 —————————

+-------------------+
| NODE1 | NODE2List |
+-------------------+
| "AAA" | ["BBB"]   |
| "AAA" | ["CCC"]   |
| "AAA" | ["DDD"]   |
+-------------------+

andrew_bowman · ‎04-22-2019

Hello,

You see a difference because the two collect() calls are performed with respect to different grouping keys.

When you aggregate, the combination of non-aggregation variables becomes the grouping key, the thing that you are collecting with respect to.
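
To see the grouping key in action without touching your graph, here is a minimal sketch that should run against any database, even an empty one. The UNWIND list is hypothetical stand-in data that mirrors the names in your output; it is not your actual nodes.

```cypher
// Hypothetical stand-in data: three (a1, a2) name pairs, no graph required.
UNWIND [
  {a1: "AAA", a2: "BBB"},
  {a1: "AAA", a2: "CCC"},
  {a1: "AAA", a2: "DDD"}
] AS row
// Only NODE1 sits next to the aggregation, so NODE1 is the grouping key.
RETURN row.a1 AS NODE1, collect(DISTINCT row.a2) AS NODE2List
ORDER BY NODE1;

// Same data, but a2 is now carried alongside the aggregation.
UNWIND [
  {a1: "AAA", a2: "BBB"},
  {a1: "AAA", a2: "CCC"},
  {a1: "AAA", a2: "DDD"}
] AS row
// The grouping key becomes (NODE1, a2), so each a2 gets its own group.
WITH row.a1 AS NODE1, row.a2 AS a2, collect(DISTINCT row.a2) AS NODE2List
RETURN NODE1, NODE2List
ORDER BY NODE1;
```

The first statement should return a single row with a three-element list, and the second should return three rows with a single-element list each, which matches your Format 1 and Format 2.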

In code 1, your aggregation in the return is:
return a1.NODE_NAME as NODE1, collect(distinct a2.NODE_NAME) as NODE2List

You're collecting with respect to NODE1, the projection of a1.NODE_NAME. Since there is only a single distinct value of a1.NODE_NAME, all of the a2 names are collected into one list for that single value. (If NODE_NAME is meant to be unique on :SOURCE nodes, it would be more efficient to collect with respect to a1 and delay the property access until later.)

In code 2, your aggregation is:
with a1, a2, r1, collect(distinct a2.NODE_NAME) as NODE2List
Your grouping key is the distinct combination of a1, a2, and r1. Because a2 itself is part of the grouping key, you are guaranteed a separate row per a2, and each row's list holds only that a2's NODE_NAME. It helps to look at what is returned at this point to see how the aggregation lines up with its grouping key. Look at the result for this:

MATCH (a1:SOURCE)<-[r1:A_IS|B_IS {OUTPUT:"UP"}]-(b:TARGET{ID:"79"})
optional match (b:TARGET)-[r2]->(a2: SOURCE)
RETURN a1, a2, r1, collect(distinct a2.NODE_NAME) as NODE2List

On each row you will see a distinct combination of a1, a2, and r1, and NODE2List will always be a single-element list holding the NODE_NAME of that row's a2 node, since there is only ever one a2 node per row (the list is empty if the optional match found no a2, because collect() skips nulls). To collect over multiple values, you have to get the grouping key right: the nodes you want to aggregate over must either be out of scope when you aggregate, or be aggregated themselves in some additional aggregation.

The easiest way to go about this is to think about what you're really trying to get at the end: the list of node names of all the a2 nodes for each a1 node. Whatever is on the far side of the "for each" should be your grouping key, in this case a1.
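
Applied to your Code 2, that could look like the sketch below. It assumes the same labels, relationship types, and properties as your query, and that you don't need a2 or r1 after the aggregation; I haven't run it against your data.

```cypher
MATCH (a1:SOURCE)<-[r1:A_IS|B_IS {OUTPUT:"UP"}]-(b:TARGET {ID:"79"})
OPTIONAL MATCH (b)-[r2]->(a2:SOURCE)
// Only a1 stays in scope next to the aggregation, so a1 is the grouping key.
// a2 and r1 are dropped here, which lets all a2 names fall into one list per a1.
WITH a1, collect(DISTINCT a2.NODE_NAME) AS NODE2List
// Property access on a1 is delayed until after the aggregation.
RETURN a1.NODE_NAME AS NODE1, NODE2List
ORDER BY NODE1;
```

One difference from Code 1 to keep in mind: this groups by the a1 node rather than by its name, so two distinct :SOURCE nodes that share a NODE_NAME would come back as two rows instead of one.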

jose · ‎04-23-2019

Andrew buddy, I get it! Thanks so much for your response.

jose · ‎04-23-2019

What I am shooting for actually is the most compact way of seeing the a2 information. I was hoping that Output Format 1 would be the most compact display of information. In some cases it is. However, if the list is big enough, then it doesn't seem to matter much. Super long horizontal list vs. super long vertical list.

Why Do Similar List Queries lead to different List Output?