Graph Data Model for Gene dataset

MODULE_1 CD55 FILIP1L WFS1 SYNE1 CNN3
MODULE_2 CNN3 IGHM TNFAIP3 CFD RGS2 MYL9 GPC3 PLSCR1

In the dataset above, the first column of each row is the gene set name (MODULE_1, MODULE_2) and the remaining columns hold the genes in that gene set.
For example, the gene set MODULE_1 contains the genes CD55, FILIP1L, WFS1, SYNE1 and CNN3.

The dataset is a CSV file with 431 rows and a varying number of columns per row (at most 833).

How should I create nodes and relationships for this dataset?


ameyasoft
Graph Maven
Try this:

merge (g:Geneset {name: "Module_1"})
merge (gn1:Gene {gene: "CD55"})
merge (gn2:Gene {gene: "FILIP1L"})
merge (gn3:Gene {gene: "WFS1"})
merge (gn4:Gene {gene: "SYNE1"})
merge (gn5:Gene {gene: "CNN3"})

merge (g)-[:GENE]->(gn1)
merge (gn1)-[:GENE]->(gn2)
merge (gn2)-[:GENE]->(gn3)
merge (gn3)-[:GENE]->(gn4)
merge (gn4)-[:GENE]->(gn5)

Thank you, ameyasoft. I believe your model is chained: each gene links to the next one. I want a "Geneset contains Genes" relationship instead, and worked it out as below.
I transposed the given data so that it has a fixed 433 columns and at most 833 rows.
The structure is then:
MODULE_1 MODULE_2
CD55 CNN3
FILIP1L IGHM
.. ....
CNN3 PLSCR1

I then tried the query below for a single column, the gene set MODULE_1:

LOAD CSV WITH HEADERS FROM "file:///genefinal.csv" AS csvLine
with csvLine where not csvLine.MODULE_1 is NULL
MERGE (p:gene {gene_name: csvLine.MODULE_1})

MERGE(a:GeneSet{GeneSet:'MODULE_1'})

WITH a, COLLECT(p) as gen
foreach(q in gen | CREATE (a)-[r:CONTAINS]->(q))
RETURN *

Now I am trying to extend this query to 431 columns. Any input on how to proceed?
Thank you once again.
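
One possible way to cover every column of the transposed file without naming each one is to iterate over the header names with keys() and UNWIND. This is only a sketch: it assumes the header row of genefinal.csv holds the gene set names (MODULE_1, MODULE_2, and so on) and reuses the labels and property names from the single-column query above.

```cypher
// Sketch only: assumes each header of genefinal.csv is a gene set name.
LOAD CSV WITH HEADERS FROM "file:///genefinal.csv" AS csvLine
// Turn the columns of this row into one (setName, geneName) pair per column.
UNWIND keys(csvLine) AS setName
WITH setName, csvLine[setName] AS geneName
// Shorter gene sets leave empty cells, which load as null; skip them.
WHERE geneName IS NOT NULL
MERGE (a:GeneSet {GeneSet: setName})
MERGE (p:gene {gene_name: geneName})
// MERGE rather than CREATE, so re-running does not duplicate relationships.
MERGE (a)-[:CONTAINS]->(p)
```

Because the column name is read from the data, the same query should work for 2 columns or 431.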

With transposed dataset:

LOAD CSV WITH HEADERS FROM "file:///genefinal.csv" AS csvLine
with csvLine where not csvLine.MODULE_1 is NULL


MERGE(a1:GeneSet{GeneSet:'MODULE_1'})
MERGE(a2:GeneSet{GeneSet:'MODULE_2'})
........
MERGE(an:GeneSet{GeneSet:'MODULE_n'})

MERGE (p1:gene {gene_name: csvLine.MODULE_1})
MERGE (p2:gene {gene_name: csvLine.MODULE_2})
........
MERGE (pn:gene {gene_name: csvLine.MODULE_n})

MERGE (a1)-[:CONTAINS]->(p1)
MERGE (a2)-[:CONTAINS]->(p2)
......
MERGE (an)-[:CONTAINS]->(pn)
;

With the first dataset:

MERGE (g:Geneset {name: csvLine.Col1})
MERGE (gn1:Gene {gene: csvLine.Col2})
MERGE (gn2:Gene {gene: csvLine.Col3})
MERGE (gn3:Gene {gene: csvLine.Col4})
MERGE (gn4:Gene {gene: csvLine.Col5})
MERGE (gn5:Gene {gene: csvLine.Col6})

MERGE (g)-[:CONTAINS]->(gn1)
MERGE (g)-[:CONTAINS]->(gn2)
MERGE (g)-[:CONTAINS]->(gn3)
MERGE (g)-[:CONTAINS]->(gn4)
MERGE (g)-[:CONTAINS]->(gn5)

;

Both produce the same results, but my preference is to use the first dataset rather than the transposed one.

I got this error when I ran that code: "Cannot merge node using null property value for gene_name"
Also, is there a better way to write this, such as a loop, instead of giving 433 MERGE statements for 433 columns?

Use the COALESCE function, like this:

MERGE (gn5:Gene {gene: COALESCE(csvLine.Col6, 'NA')})

This replaces null values with 'NA'. You can use any placeholder value you like.
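
Note that a placeholder such as 'NA' is itself merged as a Gene node, so every gene set with an empty column ends up linked to it. An alternative is to skip the empty cells and loop over the row instead of writing one MERGE per column. The sketch below assumes the original (untransposed) file is comma-delimited with no header row, so each row loads as a list; the file name genes.csv is hypothetical, and a different delimiter would need a FIELDTERMINATOR clause.

```cypher
// Sketch only: genes.csv is a hypothetical name for the original file
// (one gene set per row, no header row, comma-delimited).
LOAD CSV FROM "file:///genes.csv" AS row
// row[0] is the gene set name; the rest of the list holds the genes.
// Drop null or blank cells so MERGE never sees a null property value.
WITH row[0] AS setName,
     [x IN row[1..] WHERE x IS NOT NULL AND trim(x) <> ''] AS geneNames
WHERE setName IS NOT NULL
MERGE (g:Geneset {name: setName})
WITH g, geneNames
// One pass per gene, however many columns the row has.
UNWIND geneNames AS geneName
MERGE (gn:Gene {gene: geneName})
MERGE (g)-[:CONTAINS]->(gn)
```

A gene that appears in several gene sets, such as CNN3 in the sample rows, is merged once and receives one CONTAINS relationship from each gene set.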