Neo4j

ukirik · ‎02-04-2021

If I have a query along the lines of MATCH (g:Gene)-[r]-() RETURN DISTINCT type(r), count(r) I can get a breakdown of the connections a gene node has, what type of relationships and how many of these.

Say I have a list of genes with which I want to do the same query, to get a pivot table, relationship types in rows, genes in columns, each individual cell in the table being the count of one type of relationship that gene has. How could I do this? I tried to connect two queries with a UNION ALL:

MATCH (g:Gene {name: "G1"})-[r]-(n)
RETURN DISTINCT type(r),  count(r) as g1
UNION ALL MATCH (g:Gene {name: "G2"})-[r]-(n)
RETURN DISTINCT type(r),  count(r) as g2

this returns an error about column names not being the same. If I make sure both returns have the same columns:

MATCH (g:Gene {name: "G1"})-[r]-(n)
RETURN DISTINCT type(r), null as g2, count(r) as g1
UNION ALL MATCH (g:Gene {name: "G2"})-[r]-(n)
RETURN DISTINCT type(r), null as g1, count(r) as g2

then I get duplicate rows, values being null for one gene and then the other. I also tried using UNWIND with a list then I get the two tables returned concatenated vertically instead (i.e. rows duplicated).

What I am trying to do should not be that difficult, what I am doing wrong?

dana_canzano · ‎02-04-2021

Though it may not be entirely critical to the issue at hand, what version of Neo4j is in play here?

Regarding the error with

MATCH (g:Gene {name: "G1"})-[r]-(n)
RETURN DISTINCT type(r),  count(r) as g1
UNION ALL MATCH (g:Gene {name: "G2"})-[r]-(n)
RETURN DISTINCT type(r),  count(r) as g2

this is expected but can be avoided if you run

RETURN DISTINCT type(r),  count(r) as g1
UNION ALL MATCH (g:Gene {name: "G2"})-[r]-(n)
RETURN DISTINCT type(r),  count(r) as g1

admittedly your output may not allow you to differentiate data. Maybe use

MATCH (g:Gene {name: "G1"})-[r]-(n)
RETURN 'G1' as gene, DISTINCT type(r),  count(r) as g1
UNION ALL MATCH (g:Gene {name: "G2"})-[r]-(n)
RETURN 'G2' as gene, DISTINCT type(r),  count(r) as g1

but further I think we dont need the DISTINCT on type(r) but instead can simply use type(r). This is so because count(r) is going to result in aggregation. For example with test data

i// create 3 Gene nodes
create (n:Gene) set n.name='G1';
create (n:Gene) set n.name='G2';
create (n:Gene) set n.name='G3';

// Gene with name G1 will have 4 relationships to Gene named G2.  The 
// relationship are as follows
// 2 named R1
// 1 named R2
// 1 named R3
match (n:Gene {name:'G1'}), (n2:Gene {name:'G2'}) create (n)-[:R1]->(n2);
match (n:Gene {name:'G1'}), (n2:Gene {name:'G2'}) create (n)-[:R1]->(n2);
match (n:Gene {name:'G1'}), (n2:Gene {name:'G2'}) create (n)-[:R2]->(n2);
match (n:Gene {name:'G1'}), (n2:Gene {name:'G2'}) create (n)-[:R3]->(n2);

// Gene with name G2 will have 3 relationships to Gene named G3.  The 
// relationship are as follows
// 3 named R1
match (n:Gene {name:'G2'}), (n2:Gene {name:'G3'}) create (n)-[:R1]->(n2);
match (n:Gene {name:'G2'}), (n2:Gene {name:'G3'}) create (n)-[:R1]->(n2);
match (n:Gene {name:'G2'}), (n2:Gene {name:'G3'}) create (n)-[:R1]->(n2);

and now

match (n:Gene)-[r]->() return distinct type(r), count(r);
+--------------------+
| type(r) | count(r) |
+--------------------+
| "R3"    | 1        |
| "R2"    | 1        |
| "R1"    | 5        |
+--------------------+

3 rows available after 15 ms, consumed after another 5 ms

and if I remove DISTINCT I get the same output.

and maybe what you are really after is

 match (n:Gene)-[r]->() return n.name,type(r), count(r);
+-----------------------------+
| n.name | type(r) | count(r) |
+-----------------------------+
| "G1"   | "R3"    | 1        |
| "G1"   | "R2"    | 1        |
| "G1"   | "R1"    | 2        |
| "G2"   | "R1"    | 3        |
+-----------------------------+

ukirik · ‎02-04-2021

Hi Dana, this is Neo4j 4.0.5 I believe.. Does it matter in this context?

If I understand your reply correctly, it is not possible to get "wide" data rather than "long" data, borrowing the terminology from the tidyverse? Ideally I was hoping to get a table that contains three columns, type(r), counts for G1 and counts for G2

dana_canzano · ‎02-04-2021

version.. in this case doesnt not matter except as you initially reported that an error was encountered I wanted to make sure this was / wasnt reproducible on said version

regarding your desired output are you expecting output similar to

RelationshpName     G1         G2  
R1                                 2            3
R2                                 1            0
R3                                 1            0

but where RelationshipName can be any value on the Y axis and the list on Genes on the X axis can be N number of Genes ??

ukirik · ‎02-05-2021

Yes precisely, that was the idea. Maybe not the approach?

Neo4j

How to match multiple nodes and return them each as a column (UNION doesn't work as expected)