Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
08-08-2020 01:25 AM
Hi,
I’m trying to get a ratio of relationship count of each node, to the highest number of relationships for any more. I’ve tried a few things, but the closest I’ve got is this:
MATCH (n:Allele)
WITH n, size((n)<-[:HETEROZYGOUS]-()) AS hetCount, 2*(size((n)<-[:HOMOZYGOUS]-())) AS homCount,
(size((n)<-[:HETEROZYGOUS]-()) + 2*(size((n)<-[:HOMOZYGOUS]-()))) as totCount
where totCount > 0
RETURN n, hetCount, homCount, totCount, max(totCount) as outOf, totCount / max(totCount) as ratio ORDER BY totCount ASC LIMIT 100
This is giving me everything I want, except the maximum, and therefore the correct ratio (I imagine it could be optimised as well!). My plan is to use this to set the ratio as a property on :Allele which will then show the rarity of the allele, by number of relationships and can be rerun and indexed everytime new Allele nodes are added.
I have ~1.5m nodes and ~800m relationships that need to be counted so this is quite a large query, so am not sure if a gds algorithm would be a suitable alternative. Any help would be much appreciated.
Thanks
Solved! Go to Solution.
08-19-2020 09:57 AM
MATCH (n:Allele)
# HERE is your trouble. Let me explain...
# WITH n, max(n.hetCount + n.homCount as totCountMax
WITH max(n.hetCount + n.homCount as totCountMax
MATCH (n:Allele)
WITH n, totCountMax, n.hetCount + n.homCount AS totCount
WHERE totCount > 0
RETURN n, n.hetCount, n.homCount, totCount, totCountMax as outOf, totCount / totCountMax AS ratio
ORDER BY totCount ASC
If you think about Cypher MATCH results as a spreadsheet (you should check the sheet results in these cases), you'll get one row for every MATCH. Aggregate functions will have an inherent GROUP BY on any variables you're including.
I've built a little NeoConsole to play with: http://console.neo4j.org/r/3bsdhd
x.val = 1
y.val = 2
z.val = 4
Now, what happens below, is that Cypher is trying to keep the b
contents in the result, so the SUM
aggregate function won't collapse the values of l.val beyond that. Essentially turning whatever you KEEP IN CONTEXT (b
here) as an implicit GROUP BY.
MATCH (b:Box)-[]-(l:Label)
RETURN b, SUM(l.val) as sum;
However, if you omit that variable to get an aggregate of all the things...
MATCH (b:Box)-[]-(l:Label)
RETURN SUM(l.val) as sum;
You get the aggregate you need, the trick is then in using it across all the b
nodes here...
MATCH (b:Box)-[]-(l:Label)
WITH SUM(l.val) as sum
MATCH (b:Box)-[]-(l:Label)
# now you have SUM and b, and l in context...
RETURN b, l, sum
08-13-2020 03:37 PM
One of these will probably do it:
...but first, I need to reformat your Cypher so I can make better sense of it.
MATCH (n:Allele)
WITH
n,
size((n)<-[:HETEROZYGOUS]-()) AS hetCount,
2*(size((n)<-[:HOMOZYGOUS]-())) AS homCount,
(size((n)<-[:HETEROZYGOUS]-()) + 2*(size((n)<-[:HOMOZYGOUS]-()))) AS totCount
WHERE
totCount > 0
RETURN
n, hetCount, homCount, totCount,
max(totCount) AS outOf,
totCount / max(totCount) AS ratio
ORDER BY totCount ASC
LIMIT 100
Your root issue here is the max
in the return clause has an implicit GROUP BY
each row. To fix that, I'm sad to say, you're gonna have to run the query twice. Once to get the real totals, and a second time to get the values you're looking for, that needs the total and max to do the math.
MATCH (n:Allele)
WITH max((size((n)<-[:HETEROZYGOUS]-()) + 2*(size((n)<-[:HOMOZYGOUS]-())))) AS totCountMax
MATCH (n:Allele)
WITH
n, totCountMax,
size((n)<-[:HETEROZYGOUS]-()) AS hetCount,
2*(size((n)<-[:HOMOZYGOUS]-())) AS homCount,
hetCount + homCount AS totCount
WHERE
totCount > 0
RETURN
n, hetCount, homCount, totCount,
totCountMax AS outOf,
totCount / totCountMax AS ratio
ORDER BY totCount ASC
LIMIT 100
Rather than doing all the path matching like this, it might be worth mutating the data a little, to make it all run more efficiently.
MATCH (n:Allele)
SET n.hetCount = size((n)<-[:HETEROZYGOUS]-())
SET n.homCount= 2*size((n)<-[:HOMOZYGOUS]-())
;
# Then, this should run better
MATCH (n:Allele)
WITH n, max(n.hetCount + n.homCount) as totCountMax
. . .
https://neo4j.com/developer/cypher/aggregation-returns-functions/
08-14-2020 02:09 AM
Thanks, @tony.chiboucas. I'll try that later once my current query is done - using a desktop so having to use swap for my queries on such a large dataset is seriously impacting performance. Might need to take some lessons on APOC soon as well!
08-14-2020 05:49 AM
Hi @tony.chiboucas. Just to clarify, the data transformation would set properties on the nodes, right? This is fine, but I was initially worried about the increased storage required across the large number of nodes (although I realise there aren't actually that many relative to relationships, which I had issues with properties on previously). I assume if this is the case, I could set a clause in another query, just to specifically remove these if I want to just set the ratio and not the counts?
08-14-2020 10:12 AM
Yes, that would add properties to :Allele
nodes. Total increase in size-on-disk will be less than 5MB.
Absolutely. I would caution that it might be worth keeping the data, just to make recalculation much faster and easier.
One additional note: pulling all the relationships and counting them, will take a lot of time with such a large number of rels. A little apoc.periodic.iterate
could go a long way here. Once you've counted the relationships, subsequent queries should be much faster.
08-19-2020 02:02 AM
Hi @tony.chiboucas, just about to try this now. Is it worth indexing the hetCount and homCount properties for this?
08-19-2020 05:19 AM
Hi @tony.chiboucas. I've set the hetCount and homCount now, and thanks, even without indexing this made the second query much quicker. However, I've used the following query and the outOf column is just showing the max for n, not the max for all of n and making ratio
always 1:
MATCH (n:Allele)
WITH n, max(n.hetCount + n.homCount as totCountMax
MATCH (n:Allele)
WITH n, totCountMax, n.hetCount + n.homCount AS totCount
WHERE totCount > 0
RETURN n, n.hetCount, n.homCount, totCount, totCountMax as outOf, totCount / totCountMax AS ratio
ORDER BY totCount ASC
What am I doing wrong?
08-19-2020 09:57 AM
MATCH (n:Allele)
# HERE is your trouble. Let me explain...
# WITH n, max(n.hetCount + n.homCount as totCountMax
WITH max(n.hetCount + n.homCount as totCountMax
MATCH (n:Allele)
WITH n, totCountMax, n.hetCount + n.homCount AS totCount
WHERE totCount > 0
RETURN n, n.hetCount, n.homCount, totCount, totCountMax as outOf, totCount / totCountMax AS ratio
ORDER BY totCount ASC
If you think about Cypher MATCH results as a spreadsheet (you should check the sheet results in these cases), you'll get one row for every MATCH. Aggregate functions will have an inherent GROUP BY on any variables you're including.
I've built a little NeoConsole to play with: http://console.neo4j.org/r/3bsdhd
x.val = 1
y.val = 2
z.val = 4
Now, what happens below, is that Cypher is trying to keep the b
contents in the result, so the SUM
aggregate function won't collapse the values of l.val beyond that. Essentially turning whatever you KEEP IN CONTEXT (b
here) as an implicit GROUP BY.
MATCH (b:Box)-[]-(l:Label)
RETURN b, SUM(l.val) as sum;
However, if you omit that variable to get an aggregate of all the things...
MATCH (b:Box)-[]-(l:Label)
RETURN SUM(l.val) as sum;
You get the aggregate you need, the trick is then in using it across all the b
nodes here...
MATCH (b:Box)-[]-(l:Label)
WITH SUM(l.val) as sum
MATCH (b:Box)-[]-(l:Label)
# now you have SUM and b, and l in context...
RETURN b, l, sum
08-23-2020 07:29 AM
Thanks. I went for this in the end, although don't think I necessarily needed the return:
MATCH (n:Allele)
WITH MAX(n.hetCount + n.homCount) as totCountMax
MATCH (n:Allele)
WITH n, totCountMax, n.hetCount + n.homCount AS totCount
SET n.frequency = ((totCount * 100.0000) / totCountMax)
RETURN n, n.hetCount, n.homCount, totCount, totCountMax n.frequency AS frequency
ORDER BY totCount ASC
All the sessions of the conference are now available online