03-25-2020 02:27 PM
I have a sequence string: 'TTCTTGAAGACGAAAGGGCCTCGTGATACGCCTATTTTTATAGGTTAATGTCATGATAATAATGGTTTCT'
I have nodes with the label Sequence and a property seqFull that contains a large DNA string.
I want to return the nodes and their similarity scores where the score is greater than 0.75 (75%), that is, where the input string is similar to a substring of the larger string stored on a node in Neo4j.
I am not looking for an exact match with CONTAINS, but for something like CONTAINS that matches at 75% similarity or greater.
03-25-2020 06:38 PM
You can use apoc.text.jaroWinklerDistance to compute the similarity between two strings; I find it gives much better similarity scores.
I use it in a production database for a different purpose. It requires the APOC library.
Here is an example with two sequence strings that I got from the internet:
with "gatcctccatatacaacggtatctccacctcaggtttagatctcaacaacggaaccattg" as seq1 ,
"gaaccgccaatagacaacatatgtaacatatttaggatatacctcgaaaataataaaccg" as seq2
return toInteger(apoc.text.jaroWinklerDistance(seq1, seq2) * 100) as similarity
Result:
similarity: 78
The same two sequences with a space inserted after every ten bases:
with "gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg" as seq1,
"gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg" as seq2
return toInteger(apoc.text.jaroWinklerDistance(seq1, seq2) * 100) as similarity
Result:
similarity: 80
09-30-2020 02:36 PM
Thank you. Sorry it has taken me so long to respond; I got moved onto a new project. This is exactly what I am looking for.
09-30-2020 09:29 PM
Thanks for your appreciation. In a previous era I worked on biomembranes and surfactant-oil miscibility. Through those studies I developed a lot of environmentally friendly solutions. THOSE WERE THE DAYS!! LIFE GOES ON..!
Now I am purely into Neo4j!
Let me know if you need any help; I am very happy to help.
Thanks