07-23-2020 01:44 AM
MODULE_1 CD55 FILIP1L WFS1 SYNE1 CNN3
MODULE_2 CNN3 IGHM TNFAIP3 CFD RGS2 MYL9 GPC3 PLSCR1
I have the data set above, where the first column of each row is the geneset name (MODULE_1, MODULE_2) and the remaining columns contain the genes in that geneset,
i.e. "the MODULE_1 geneset contains the genes CD55, FILIP1L, WFS1, SYNE1 and CNN3".
The data set is in a CSV file with 431 rows and a varying number of columns per row (the maximum is 833).
How should I create nodes and relationships for this data set?
07-23-2020 10:17 AM
Try this:
merge (g:Geneset {name: "Module_1"})
merge (gn1:Gene {gene: "CD55"})
merge (gn2:Gene {gene: "FILIP1L"})
merge (gn3:Gene {gene: "WFS1"})
merge (gn4:Gene {gene: "SYNE1"})
merge (gn5:Gene {gene: "CNN3"})
merge (g)-[:GENE]->(gn1)
merge (gn1)-[:GENE]->(gn2)
merge (gn2)-[:GENE]->(gn3)
merge (gn3)-[:GENE]->(gn4)
merge (gn4)-[:GENE]->(gn5)
07-23-2020 10:50 AM
Thank you, ameyasoft. I believe your model is chained: each gene is linked to the next one. I wanted to create a "Geneset contains Genes" relationship and have been working on it as below.
I transposed the given data so that it has a fixed number of columns (433) and at most 833 rows.
The structure would then be:
MODULE_1 MODULE_2
CD55 CNN3
FILIP1L IGHM
.. ....
CNN3 PLSCR1
Then I tried the query below for a single column, the geneset MODULE_1:
LOAD CSV WITH HEADERS FROM "file:///genefinal.csv" AS csvLine
with csvLine where not csvLine.MODULE_1 is NULL
MERGE (p:gene {gene_name: csvLine.MODULE_1})
MERGE(a:GeneSet{GeneSet:'MODULE_1'})
WITH a, COLLECT(p) as gen
foreach(q in gen | CREATE (a)-[r:CONTAINS]->(q))
RETURN *
Now I am trying to extend this query to all 431 columns. Any input on how to proceed?
Thank you once again.
07-23-2020 02:57 PM
With the transposed dataset:
LOAD CSV WITH HEADERS FROM "file:///genefinal.csv" AS csvLine
with csvLine where not csvLine.MODULE_1 is NULL
MERGE(a1:GeneSet{GeneSet:'MODULE_1'})
MERGE(a2:GeneSet{GeneSet:'MODULE_2'})
........
MERGE(an:GeneSet{GeneSet:'MODULE_n'})
MERGE (p1:gene {gene_name: csvLine.MODULE_1})
MERGE (p2:gene {gene_name: csvLine.MODULE_2})
........
MERGE (pn:gene {gene_name: csvLine.MODULE_n})
MERGE (a1)-[:CONTAINS]->(p1)
MERGE (a2)-[:CONTAINS]->(p2)
......
MERGE (an)-[:CONTAINS]->(pn)
;
With the first dataset:
MERGE (g:Geneset {name: csvLine.Col1})
MERGE (gn1:Gene {gene: csvLine.Col2})
MERGE (gn2:Gene {gene: csvLine.Col3})
MERGE (gn3:Gene {gene: csvLine.Col4})
MERGE (gn4:Gene {gene: csvLine.Col5})
MERGE (gn5:Gene {gene: csvLine.Col6})
MERGE (g)-[:CONTAINS]->(gn1)
MERGE (g)-[:CONTAINS]->(gn2)
MERGE (g)-[:CONTAINS]->(gn3)
MERGE (g)-[:CONTAINS]->(gn4)
MERGE (g)-[:CONTAINS]->(gn5)
;
Both produce the same results, but my preference is to use the first dataset rather than the transposed one.
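One possible way to avoid writing a MERGE per column with the first (untransposed) dataset is to read each row as a list and UNWIND the gene columns. This is only a sketch under assumptions: the file name genesets.csv is hypothetical, the file is assumed to have no header row, and it is assumed to be comma-separated (a FIELDTERMINATOR clause would be needed for tabs or other separators).

```cypher
// Assumption: each row is "geneset name, gene, gene, ..." with no header row.
// genesets.csv is a hypothetical file name.
LOAD CSV FROM "file:///genesets.csv" AS row
// The first element is the geneset name, the rest are its genes.
WITH row[0] AS genesetName, row[1..] AS genes
MERGE (g:Geneset {name: genesetName})
WITH g, genes
// One output row per gene, however many columns the CSV row has.
UNWIND genes AS geneName
WITH g, geneName
// Skip empty cells so MERGE never sees a null property value.
WHERE geneName IS NOT NULL AND trim(geneName) <> ''
MERGE (gn:Gene {gene: geneName})
MERGE (g)-[:CONTAINS]->(gn)
;
```

Because the genes are handled as a list, the same query should cover rows of any length, and a gene that appears in several genesets (such as CNN3 in MODULE_1 and MODULE_2) is merged into a single Gene node.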
07-24-2020 04:22 AM
I got this error when I ran that code: "Cannot merge node using null property value for gene_name".
Also, is there a better way to write this, such as a loop, instead of 433 MERGE statements for 433 columns?
07-24-2020 09:25 AM
Use the COALESCE function, like this:
MERGE (gn5:Gene {gene: COALESCE(csvLine.Col6, 'NA')})
This replaces null values with 'NA'; you can use any placeholder value. Note that all empty cells then merge into a single Gene node with the value 'NA'.
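If placeholder nodes are not wanted, another option is to skip empty cells instead of substituting a value. The sketch below is for the transposed file and assumes its header row holds the geneset names (as in the earlier query that reads csvLine.MODULE_1). It loops over the columns with keys() rather than listing them one by one, and reuses the GeneSet and gene labels from that query.

```cypher
LOAD CSV WITH HEADERS FROM "file:///genefinal.csv" AS csvLine
// keys(csvLine) returns the header names, i.e. the geneset names.
UNWIND keys(csvLine) AS genesetName
// Look up the cell for this column in the current row.
WITH genesetName, csvLine[genesetName] AS geneName
// Shorter columns leave null cells; drop them before MERGE.
WHERE geneName IS NOT NULL
MERGE (a:GeneSet {GeneSet: genesetName})
MERGE (p:gene {gene_name: geneName})
MERGE (a)-[:CONTAINS]->(p)
;
```

Filtering on the cell value itself, rather than only on csvLine.MODULE_1, is what should prevent the "Cannot merge node using null property value" error for the shorter columns.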