08-17-2020 11:47 AM
Hi,
I've been working on a project with Neo4j involving genetics. My current data model is set up as follows:
(:Allele)<-[:HETEROZYGOUS|:HOMOZYGOUS]-(:Subject)-[:had_result]->(:Phenotype)
Where each :HOMOZYGOUS relationship represents two relationships with the same :Allele, whereas a :HETEROZYGOUS is a single relationship to each of two different :Allele nodes.
I've created additional labels on :Allele to identify those nodes which are shared across less than 1% of the population as :Rare. I've also created :High and :Low labels on :Phenotype nodes to identify where :Phenotype {value} is more than two standard deviations above or below the mean for each different :Phenotype (each different :Phenotype also has its own specific label, e.g. :BMI, :Insulin, etc).
I'm now trying to use Cypher to evaluate probabilities using Fisher's Exact Test, a standard statistical test of association in a 2x2 contingency table. What I'm looking to do is identify the following:
| | Subject has High/Low Phenotype | Subject has Normal Phenotype |
|---|---|---|
| Subject has the Allele | b | a |
| Subject doesn't have the Allele | d | c |
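For reference, once a, b, c and d are counted, the test itself doesn't have to run inside Neo4j. Here's a minimal sketch of the two-sided test in Python (standard library only; the function name is just illustrative):

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for a 2x2 table with rows
    (a, b) and (c, d). Sums the hypergeometric probabilities of
    every table with the same margins that is no more likely than
    the observed one."""
    r1, r2 = a + b, c + d          # row totals
    c1, n = a + c, a + b + c + d   # first-column total, grand total

    def p_table(x):                # P(top-left cell == x), margins fixed
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = p_table(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)   # feasible range for cell a
    return sum(p for x in range(lo, hi + 1)
               if (p := p_table(x)) <= p_obs * (1 + 1e-9))
```

The tolerance factor guards against floating-point ties when comparing table probabilities.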
This is the Cypher that I have pulled together:
MATCH (n)<--(b:Subject)-[r]->(v:Rare)
MATCH (v:Rare)<--(a:Subject)-->(p:Phenotype)
WHERE a<>b AND (n:High OR n:Low)
MATCH (n)<--(d:Subject)-[r2]->(u:Rare)
MATCH (u:Rare)<--(c:Subject)-->(p:Phenotype)
WHERE v<>u AND c<>d AND (n:High OR n:Low)
RETURN labels(v), labels(u), v.pos, v.bp, v.SNPid, u.pos, u.bp, u.SNPid,
CASE WHEN type(r)='HOMOZYGOUS' THEN (2*count(r))
WHEN type(r)='HETEROZYGOUS' THEN count(r) END AS alleleFreq,
a, b, c, d, labels(p)
ORDER BY alleleFreq LIMIT 1000
There are somewhere around 1.5m :Allele nodes, ~1800 :Subject nodes, with ~800m relationships connecting them all together. This is an intensive query, which assesses each node pairing individually, so appreciate there's a lot to work through.
I'm only working on a desktop with 8GB RAM, so have had to use considerable swap space and so far this has been running for about a week!... I was hoping to get some advice on the following:
Any help would be much appreciated!
Dave
08-17-2020 12:12 PM
Please post your question(s). Thanks.
08-17-2020 12:15 PM
Sorry @ameyasoft, I'm too used to pressing Shift + Enter when typing these days! The question is ready now!
08-17-2020 02:35 PM
Thanks for your reply. Few questions:
Are Rare, High, and Low node labels or node properties? Also, you mentioned High and Low as labels for Phenotype. Please help me understand your data model.
Thanks
08-18-2020 12:44 AM
Sure, I have a lot of multi-labelled entities. Rare is an additional label for Allele nodes, and High and Low are additional labels for Phenotype nodes. I made this decision because I thought it would be easier to manage with such a large dataset, but perhaps this isn't an ideal way of managing the data and it would have been better represented with relationship taxonomies.
In summary, Allele nodes have the following properties:
pos - represents the position of the allele on each chromosome
bp - represents the base pair of that specific allele/SNP (Single Nucleotide Polymorphism)
SNPid - if this is a known SNP, it gives the reference for it
Each Allele node can have the following additional labels:
chromosome_n - where n represents the number of the chromosome the allele is a part of. Ideally, I would have created a taxonomy through relationships, but for this, I thought it would be easier to use labels.
Reference or Variant - this represents whether the allele is part of the genome reference, or is a variant from this (perhaps it is a less common allele than most of the population). There is also an alternative_of relationship between each Reference and Variant at the same position.
Rare - if the allele has a relationship with fewer than 1% of the population (of Subjects) then it will have this label.
Each subject has two alleles at each position. Therefore Allele nodes are connected to Subject nodes by two different types of relationship:
HETEROZYGOUS - a single relationship to each of two different alleles, used when the subject's two alleles at a position on a chromosome differ.
HOMOZYGOUS - a single relationship between a Subject and one Allele at a single position on a chromosome, but representing two copies of the same allele.
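As an aside, this encoding means a subject's allele count at a locus can be recovered from the relationship types alone. A tiny illustrative Python sketch (hypothetical helper, not part of the database):

```python
def allele_count(rel_types):
    """Total allele copies implied by a list of relationship types:
    each HOMOZYGOUS relationship carries two copies of the allele,
    each HETEROZYGOUS relationship carries one."""
    return sum(2 if t == "HOMOZYGOUS" else 1 for t in rel_types)

# e.g. one homozygous locus plus two heterozygous relationships
copies = allele_count(["HOMOZYGOUS", "HETEROZYGOUS", "HETEROZYGOUS"])  # 4
```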
Subject nodes have one property:
subject_id - represents the anonymised unique identifier for each sample subject.
Subject nodes also have additional labels:
cohort - To enable stratification of different cohorts, each cohort will have its own label.
sex - signifies the sex of the subject
Subject nodes are connected to Phenotype nodes by only one relationship type:
had_result - represents the results of phenotype tests for each Subject (where available)
This had_result relationship also has an at_age property, which shows the age at which the measurement/value was taken, where available.
Phenotype nodes have a single property:
value - this is a single value figure, representing whichever measure is taken for that phenotype.
Phenotypes have a large variety of additional labels which specify the phenotype taxonomy:
Phenotype Group - This might be Anthropometric Measurement or Haematology and represents the grouping of the phenotype.
Phenotype Type - This might be Total Lean Mass or Haemoglobin and is the specific phenotype that the value property represents. There are approximately 50 different phenotype types in this dataset
In addition, there is also a High or Low label on Phenotypes with extreme values outside of 2 standard deviations of the mean for each Phenotype Type, across 4 different age ranges. Those higher than 2 SDs are given the label High, and those lower than 2 SDs are given the label Low.
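The labelling rule itself is simple to state in code. A rough Python sketch of the classification (illustrative only, ignoring the per-age-range stratification):

```python
from statistics import mean, stdev

def extreme_labels(values):
    """Map each value more than two standard deviations above or
    below the mean to 'High' or 'Low'; values within the band get
    no extra label."""
    m, sd = mean(values), stdev(values)
    out = {}
    for v in values:
        if v > m + 2 * sd:
            out[v] = "High"
        elif v < m - 2 * sd:
            out[v] = "Low"
    return out
```

In practice this would be run once per Phenotype Type and age range, with the resulting labels written back via SET.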
Hopefully, that gives you a good idea of the data model I'm working with. Please feel free to ask any questions.
08-18-2020 08:38 AM
Thanks for the detailed explanation. This is very helpful.
08-18-2020 09:23 AM
Please let me know which version of Neo4j you are using.
08-18-2020 09:50 AM
Try this:
OPTIONAL MATCH (n1:High)
WITH collect(distinct n1) as h1
OPTIONAL MATCH (n2:Low)
WITH collect(distinct n2) + h1 as h2
UNWIND h2 as n
MATCH (n)<--(b:Subject)-[r]->(v:Rare)
MATCH (v)<--(a:Subject)-->(p:Phenotype)
WHERE id(b) <> id(a)
WITH p, n, a, b, v, r
MATCH (n)<--(d:Subject)-[r2]->(u:Rare)
MATCH (u)<--(c:Subject)-->(p)
WHERE id(d) <> id(c)
WITH p, n, a, b, v, u, d, c, r
RETURN labels(v), labels(u), v.pos, v.bp, v.SNPid, u.pos, u.bp, u.SNPid,
CASE WHEN type(r)='HOMOZYGOUS' THEN (2*count(r))
WHEN type(r)='HETEROZYGOUS' THEN count(r) END AS alleleFreq, labels(p)
ORDER BY alleleFreq LIMIT 100
a, b, c, d represent nodes and your return is a table with node properties. RETURN a, b, c, d will result in a graph with those nodes.
Also try each portion of the code to see if you are getting the expected results.
For example, after the OPTIONAL MATCH blocks and UNWIND h2 as n, add RETURN n and check the output.
08-18-2020 10:13 AM
Thanks @ameyasoft. I’m using Neo4j 3.5.14.
Will this be beneficial on memory usage or performance, or both? I only ask so I know whether to reduce the heap size to something nearer my actual memory size, or if it will still require a similar amount of memory?
08-18-2020 10:19 AM
Both memory and performance. "AND n:High OR n:Low" - I think this does an all-nodes scan, which is why I used two OPTIONAL MATCH clauses, each with a different label, and collected the nodes. In my opinion this should improve performance.
08-18-2020 10:23 AM
Great, I’ll stop the previous query (it’s been going for about 10 days now), reset the heap to something smaller and let you know how I get on!
08-25-2020 12:54 AM
Hi @ameyasoft, this is still taking quite a bit of time. Is there a feasible way to use APOC with this to make it a lot quicker, or doesn't this query lend itself to that?
08-25-2020 09:18 AM
OPTIONAL MATCH (n1:High)
WITH collect(distinct n1) as h1
OPTIONAL MATCH (n2:Low)
WITH collect(distinct n2) + h1 as h2
UNWIND h2 as n
//RETURN COUNT(n)
//time to get the count..
MATCH (n)<--(b:Subject)-[r]->(v:Rare)
MATCH (v)<--(a:Subject)-->(p:Phenotype)
WHERE id(b) <> id(a)
WITH p, n, a, b, v, r
//RETURN p, n, a, b, v, r LIMIT 40
Run these checks and let me know. Thanks.
08-26-2020 03:42 AM
Hi @ameyasoft,
The first count came back as 2,093 and returned within 108ms.
The second query returned 40 records within 1ms and completed after 1813ms.
I guess it's the sheer volume of records being queried that is making the query hang (that and the restricted amount of memory my computer has). I guess the issue is that there are around 200k :Rare nodes, so it's doing ~400,000,000 possible queries?
I guess what I need is the counts of a, b, c and d for each :Rare allele, but I'm not sure how to get those instead of just returning the specific nodes, which is probably a much bigger query!
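If the subject IDs for each group could be exported, the four cells for a single allele might be computed outside Neo4j with plain set arithmetic. A hypothetical Python sketch (the names are mine, matching the table in my first post):

```python
def contingency_cells(carriers, extreme, all_subjects):
    """2x2 cells for one allele, given sets of subject IDs:
    carriers = subjects with the rare allele,
    extreme  = subjects with a High/Low phenotype.
    Returns (a, b, c, d) as laid out in the contingency table."""
    b = len(carriers & extreme)                 # allele & extreme
    a = len(carriers - extreme)                 # allele & normal
    d = len(extreme - carriers)                 # no allele & extreme
    c = len(all_subjects - carriers - extreme)  # neither
    return a, b, c, d
```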
08-26-2020 07:51 PM
Change this return: //RETURN p, n, a, b, v, r LIMIT 40
to RETURN count(distinct a) as Cnta, count(distinct b) as Cntb.
Check if this helps.
08-30-2020 08:07 AM
This is still running after 4 days, @ameyasoft. Perhaps there’s not a good way to do this in Cypher and it just needs to be done more statistically outside of Neo4j?
08-30-2020 09:15 AM
Sorry to hear that. Is it possible to send me some sample data so that I can run it on my machine? You can send the file to ameyasoft@gmail.com.
08-31-2020 09:41 AM
Interesting read, thank you for posting the question, and information on the dataset. I agree with ameyasoft, the query challenge is complicated enough that having access to the dataset (or a subset) may be required to give meaningful help.
Regarding query construction: This may be apparent now, but I didn't see it actually stated.
I recommend always (carefully) using LIMIT statements to return just a few and verify the results first before running queries wide open on all data.
I feel this is good advice for new neo4j users, but also for everyone with non-trivial data sets.
Also, if a query takes more than a few minutes to run, I'd likely kill the query, look at the query PROFILE, and redesign the query. Once you're sure the query does what you want, you could estimate the time to complete by extrapolating from the timings of smaller result returns (e.g. LIMIT 100 and LIMIT 1000). Note: this may be an exponential curve.
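A back-of-the-envelope extrapolation from two timed runs might look like this (illustrative Python; a power-law fit will badly underestimate if the growth is truly exponential):

```python
from math import log

def extrapolate_runtime(n1, t1, n2, t2, n_target):
    """Fit t = c * n**k through two (LIMIT, seconds) measurements
    and extrapolate to a larger limit. Treat the result as a lower
    bound: exponential growth would blow past any power-law fit."""
    k = log(t2 / t1) / log(n2 / n1)   # growth exponent
    c = t1 / n1 ** k                  # scale constant
    return c * n_target ** k
```

For example, 1 s at LIMIT 100 and 100 s at LIMIT 1000 implies quadratic growth, so LIMIT 10000 would take roughly 10,000 s.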
09-01-2020 05:47 AM
Thanks @Joel, I haven't had time to prepare a sample of the dataset, and I'll have to review it first as it is genomic and therefore sensitive data. It's still running 6 days in, so I guess I should kill it. I'll run an EXPLAIN on it afterwards.
It is likely to be exponential, as it was OK with a limit of ~40. I'll try running it again with a limit of 40 subjects and profile it, to see how that fares.
09-01-2020 05:58 AM
The plan is below; I imagine the 50 trillion rows is the issue!...