cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

How to Split a Row While importing CSV with Cypher

mochalla
Node Clone

Hi,

I want to create a database from a CSV file using LOAD CSV. A column called Problem contains a unique value in each cell. So I'm going to generate a node from each one of the values, belonging to the p: product label.
Two other columns -Patient and Medication- may have a unique value or multiple values in each of their cells (and, sometimes, are empty).
So, I want to split cells containing multiple values in Problem Column in order to generate separated nodes. Nodes from the column patient will belong to the label p.patient, and nodes from the Medication column will belong to the label m.medication.

Test Data: Drive Link

I am successful with this cypher query to load and transform data except splitting the row which has multi value.

CREATE CONSTRAINT FOR (p:Patient) REQUIRE p.id IS UNIQUE;
CREATE CONSTRAINT FOR (p:Problem) REQUIRE p.name IS UNIQUE;
CREATE CONSTRAINT FOR (t:Test) REQUIRE t.name IS UNIQUE;
CREATE CONSTRAINT FOR (m:Medication) REQUIRE m.name IS UNIQUE;

// Patient id, Problem, Test, Medication.
LOAD CSV WITH HEADERS FROM
'file:///tathasthu1.csv' AS row
WITH row

CREATE (patient:Patient {id: row.Patient})
MERGE (problem:Problem {name: row.Problem})
MERGE (test:Test {name: row.Test})
MERGE (medication:Medication {name: row.Medication})


CREATE (patient)-[:HAS]->(problem)
CREATE (patient)-[:PERFORMED]->(test)
CREATE (test)-[:TAKEN_FOR]->(problem)
CREATE (patient)-[:TAKES]->(medication)
CREATE (medication)-[:CAN_TREAT]->(problem)
;

--
CREATE CONSTRAINT FOR (s:Subproblem) REQUIRE s.name is UNIQUE;
LOAD CSV WITH HEADERS FROM
'file:///tathasthu1.csv' AS row
WITH row WHERE NOT row.Subproblem IS null
MATCH (problem:Problem {name: row.Problem})
MATCH (medication:Medication {name: row.Medication})
MERGE(subproblem:Subproblem {name: row.Subproblem})
CREATE (subproblem)-[:IN_CATEGORY]->(problem)
CREATE (medication)-[:SIDE_EFFECT]->(subproblem)
SET subproblem.name = row.Subproblem

mochalla_0-1659682905670.png

Now I have to split Glaucoma and Chronic Pain value row into two nodes, because as per test data, vision loss is due to chronic pain but not to Glaucoma as mentioned in the Details column.

Help is appreciated. Thank you.

1 ACCEPTED SOLUTION

These queries should do what you want:

// Create constraints
CREATE CONSTRAINT FOR (p:Patient) REQUIRE p.id IS UNIQUE;
CREATE CONSTRAINT FOR (p:Problem) REQUIRE p.name IS UNIQUE;
CREATE CONSTRAINT FOR (t:Test) REQUIRE t.name IS UNIQUE;
CREATE CONSTRAINT FOR (m:Medication) REQUIRE m.name IS UNIQUE;
CREATE CONSTRAINT FOR (s:Subproblem) REQUIRE s.name is UNIQUE;

// Create nodes
LOAD CSV WITH HEADERS FROM 'file:///tathasthu1.csv' AS row
WITH row
MERGE (:Patient {id: row.Patient})
MERGE (:Test {name: row.Test})
MERGE (:Medication {name: row.Medication})
WITH row
CALL apoc.do.when(row.Subproblem IS NOT NULL,
    'MERGE (:Subproblem {name: row.Subproblem}) RETURN row', 
    'RETURN row', 
    {row: row}
)
YIELD value
WITH split(row.Problem, ", ") AS problems
UNWIND problems AS problem
MERGE (:Problem {name: problem});

// Create relationships
LOAD CSV WITH HEADERS FROM 'file:///tathasthu1.csv' AS row
WITH row
MATCH (p:Patient {id: row.Patient})
MATCH (t:Test {name: row.Test})
MATCH (m:Medication {name: row.Medication})
CREATE (p)-[:PERFORMED]->(t)
CREATE (p)-[:TAKES]->(m)
WITH row, p, t, m
CALL apoc.do.when(row.Subproblem IS NOT NULL,
    'MATCH (s:Subproblem {name: row.Subproblem}) CREATE (m)-[:SIDE_EFFECT]->(s) RETURN s', 
    'RETURN null AS s', 
    {row: row, m: m}
)
YIELD value
WITH split(row.Problem, ", ") AS problems, value.s AS subproblem, p, t, m
UNWIND problems AS problem
MATCH (prob:Problem {name: problem})
CREATE (p)-[:HAS]->(prob)
CREATE (t)-[:TAKEN_FOR]->(prob)
CREATE (m)-[:CAN_TREAT]->(prob)
WITH prob, subproblem
WHERE subproblem IS NOT NULL
CREATE (subproblem)-[:IN_CATEGORY]->(prob);

Regards,
Cobra

View solution in original post

23 REPLIES 23

Hello 😄

In your example, which cell in your CSV has multiple values?

Regards,
Cobra

Patient B0021 with Problem. Cell number: B2

All columns except Patient one can have multiple values?

These queries should do what you want:

// Create constraints
CREATE CONSTRAINT FOR (p:Patient) REQUIRE p.id IS UNIQUE;
CREATE CONSTRAINT FOR (p:Problem) REQUIRE p.name IS UNIQUE;
CREATE CONSTRAINT FOR (t:Test) REQUIRE t.name IS UNIQUE;
CREATE CONSTRAINT FOR (m:Medication) REQUIRE m.name IS UNIQUE;
CREATE CONSTRAINT FOR (s:Subproblem) REQUIRE s.name is UNIQUE;

// Create nodes
LOAD CSV WITH HEADERS FROM 'file:///tathasthu1.csv' AS row
WITH row
MERGE (:Patient {id: row.Patient})
MERGE (:Test {name: row.Test})
MERGE (:Medication {name: row.Medication})
WITH row
CALL apoc.do.when(row.Subproblem IS NOT NULL,
    'MERGE (:Subproblem {name: row.Subproblem}) RETURN row', 
    'RETURN row', 
    {row: row}
)
YIELD value
WITH split(row.Problem, ", ") AS problems
UNWIND problems AS problem
MERGE (:Problem {name: problem});

// Create relationships
LOAD CSV WITH HEADERS FROM 'file:///tathasthu1.csv' AS row
WITH row
MATCH (p:Patient {id: row.Patient})
MATCH (t:Test {name: row.Test})
MATCH (m:Medication {name: row.Medication})
CREATE (p)-[:PERFORMED]->(t)
CREATE (p)-[:TAKES]->(m)
WITH row, p, t, m
CALL apoc.do.when(row.Subproblem IS NOT NULL,
    'MATCH (s:Subproblem {name: row.Subproblem}) CREATE (m)-[:SIDE_EFFECT]->(s) RETURN s', 
    'RETURN null AS s', 
    {row: row, m: m}
)
YIELD value
WITH split(row.Problem, ", ") AS problems, value.s AS subproblem, p, t, m
UNWIND problems AS problem
MATCH (prob:Problem {name: problem})
CREATE (p)-[:HAS]->(prob)
CREATE (t)-[:TAKEN_FOR]->(prob)
CREATE (m)-[:CAN_TREAT]->(prob)
WITH prob, subproblem
WHERE subproblem IS NOT NULL
CREATE (subproblem)-[:IN_CATEGORY]->(prob);

Regards,
Cobra

I am getting this error!

mochalla_0-1659688833711.png

 

Sorry, you must install APOC plugin on your database.

Thank you so much. This worked.

Could you help me to understand the following:
1. Why is yield used?
2. How cypher is reading prop as problem because not in any clause you mentioned the same.

MATCH (prob:Problem {name: problem})

3. If i want the subproblem to be connected to one problem, but to both split values, how to do that!
Thanks.

  1. YIELD is used to get result from procedure, in your case it's apoc.do.when(). The value is a dictionary composed of what is in the return clause of apoc.do.when().
  2. prob and problem are both variables, prob is the variable that contains the node and problem is the variable that contains the name that I get from the UNWIND clause.

Thank you.

3. If i want the subproblem to be connected to one problem, but not to both split values, then may I know how to do that?

Do you have a rule that works all the time? For example always connect to the first problem.

Otherwise, I will need more details about your use case.

No there no such rule. It depends on the problem and subproblem!
Kindly look at the following image, here you see Glaucoma doesn't have a relationship to sub problem, but chronic pain does have a relationship to sub problem.
So, now with the Split, the values in the problem row are separated. But, what if I want to connect one value from the same cell to the sub problem but not the other one!

Well I think you will have create 2 CSVs, one for nodes and the other one for relationships. Since it's specific, you will have to define them before to load them in the database.

mochalla
Node Clone

Instead of 2 CSV's I have created a separate column for each and added more rows. Now in this file (FILE), you can see there are one or two null values in each column.

// Create constraints
CREATE CONSTRAINT FOR (p:Patient) REQUIRE p.id IS UNIQUE;
CREATE CONSTRAINT FOR (d:Disease) REQUIRE d.name IS UNIQUE;
CREATE CONSTRAINT FOR (t:Test) REQUIRE t.name IS UNIQUE;
CREATE CONSTRAINT FOR (m:Medication) REQUIRE m.name IS UNIQUE;
CREATE CONSTRAINT FOR (s:Sub_disease) REQUIRE s.name is UNIQUE;
CREATE CONSTRAINT FOR (dm:Medication_details) REQUIRE dm.name is UNIQUE;
CREATE CONSTRAINT FOR (td:Test_details) REQUIRE td.name is UNIQUE;

// Create nodes
LOAD CSV WITH HEADERS FROM 'file:///healthcare.csv' AS row
WITH row
MERGE (:Patient {id: row.Patient})
MERGE (:Test {name: row.Test})
MERGE (:Medication {name: row.Medication})
MERGE (:Disease {name: row.Disease});
WITH row
CALL apoc.do.when(row.Sub_disease IS NOT NULL,
'MERGE (:Sub_disease {name: row.Sub_disease}) RETURN row',
'RETURN row',
{row: row}
CALL apoc.do.when(row.Detailmed is NOT NULL,
'MERGE (:Detailmed {name: row.Medical_details}) RETURN row',
'RETURN row',
{row: row}
CALL apoc.do.when(row.Test_details is NOT NULL,
'MERGE (:Test_details {name: row.Test_details}) RETURN row',
'RETURN row',
{row: row}
)
YIELD value

// Create relationships
LOAD CSV WITH HEADERS FROM 'file:///healthcare.csv' AS row
WITH row
MATCH (p:Patient {id: row.Patient})
MATCH (t:Test {name: row.Test})
MATCH (m:Medication {name: row.Medication})
MATCH (dm:Medical_details {name: row.Medica_details})
MATCH (tm:Test_detials {name: row.Test_details})

CREATE (p)-[:HAS]->(d)
CREATE (p)-[:PERFORMED]->(t)
CREATE (p)-[:TAKES]->(m)
WITH row, p, t, m,
CALL apoc.do.when(row.Sub_disease IS NOT NULL,
'MATCH (s:Sub_disease {name: row.Sub_diease}) CREATE (s)-[:CAN_BE_ASSOCIATED_WITH]->(d) RETURN s',
'RETURN null AS s',
{row: row, d: d}
CALL apoc.do.when(row.Medical_detials is NOT NULL,
'MATCH (:Medical_details {name: row.Medical_detials}) CREATE (m)-[:IS_A]->(dm) RETURN dm',
'RETURN null AS dm',
{row: row, m: m}
CALL apoc.do.when(row.Test_details is NOT NULL,
'MATCH (:Testdetails {name: row.Test_details}) CREATE (td)-[:IS_A_TEST_OF]->(t) RETURN td',
'RETURN null AS td',
{row: row, td: td}
)
YIELD value
CREATE (t)-[:TAKEN_FOR]->(d)
CREATE (m)-[:CAN_TREAT]->(d)
;
I have shared example of one patient about how I want to connect.

Untitled graph.png


Following is the error I am getting, 

mochalla_0-1659959663374.png

Thanks in advance.

I guess you have a problem of copy-paste. The beginning of the query is missing.

I have the one I pasted. I think I am missing something in my query!

I will see but I cannot access your new file

Apologies. Here is the File

There are a lot of typos in your queries and column names are not corresponding to what you write in the queries... You could at least have a look at the documentation...

I changed the column name while sharing it with you. Sometimes I am getting errors when there is a gap between two words. And wanted to know how avoid that error.

// Create constraints
CREATE CONSTRAINT FOR (p:Patient) REQUIRE p.id IS UNIQUE;
CREATE CONSTRAINT FOR (d:Disease) REQUIRE d.name IS UNIQUE;
CREATE CONSTRAINT FOR (t:Test) REQUIRE t.name IS UNIQUE;
CREATE CONSTRAINT FOR (m:Medication) REQUIRE m.name IS UNIQUE;
CREATE CONSTRAINT FOR (s:SubDisease) REQUIRE s.name is UNIQUE;
CREATE CONSTRAINT FOR (dm:MedicationDetails) REQUIRE dm.name is UNIQUE;
CREATE CONSTRAINT FOR (td:TestDetails) REQUIRE td.name is UNIQUE;

// Create nodes
LOAD CSV WITH HEADERS FROM 'file:///healthcare.csv' AS row 
WITH row 
MERGE (:Patient {id: row.Patient}) 
MERGE (:Test {name: row.Test}) 
MERGE (:Medication {name: row.Medication}) 
MERGE (:Disease {name: row.Disease}) 
WITH row
CALL apoc.do.when(
	row.Subdisease IS NOT NULL, 
	'MERGE (:SubDisease {name: row.Subdisease}) RETURN row', 
	'RETURN row', 
	{row: row}
) 
YIELD value
WITH row
CALL apoc.do.when(
	row.Detailmed IS NOT NULL, 
	'MERGE (:MedicationDetails {name: row.Detailmed}) RETURN row', 
	'RETURN row', 
	{row: row}
)
YIELD value
WITH row
CALL apoc.do.when(
	row.Testdetails IS NOT NULL, 
	'MERGE (:TestDetails {name: row.Testdetails}) RETURN row', 
	'RETURN row', 
	{row: row}
)
YIELD value
RETURN row;

// Create relationships
LOAD CSV WITH HEADERS FROM 'file:///healthcare.csv' AS row 
WITH row 
MATCH (p:Patient {id: row.Patient}) 
MATCH (t:Test {name: row.Test}) 
MATCH (m:Medication {name: row.Medication}) 
MATCH (d:Disease {name: row.Disease}) 
MERGE (p)-[:PERFORMED]->(t) 
MERGE (p)-[:TAKES]->(m) 
MERGE (p)-[:HAS]->(d) 
MERGE (t)-[:TAKEN_FOR]->(d) 
MERGE (m)-[:CAN_TREAT]->(d) 
WITH row, p, t, m, d 
CALL apoc.do.when(
	row.Detailmed IS NOT NULL, 
	'MATCH (n:MedicationDetails {name: row.Detailmed}) MERGE (m)-[:IS_A]->(n) RETURN row', 
	'RETURN row', 
	{row: row, m: m}
)
YIELD value
WITH row, p, t, m, d
CALL apoc.do.when(
	row.Testdetails IS NOT NULL, 
	'MATCH (n:TestDetails {name: row.Testdetails}) MERGE (n)-[:IS_A_TEST_OF]->(t) RETURN row', 
	'RETURN row', 
	{row: row, t: t}
)
YIELD value
WITH row, p, t, m, d
CALL apoc.do.when(
	row.Subdisease IS NOT NULL, 
	'MATCH (s:SubDisease {name: row.Subdisease}) MERGE (s)-[:CAN_BE_ASSOCIATED_WITH]->(d) RETURN row', 
	'RETURN row',
	{row: row, d: d}
)
YIELD value
RETURN row;

Regards,
Cobra

Thank you so much.
So, here apart from the column name,
My mistake was not adding WITH clause before every Apoc.do.when? And 

There was a ; in the middle of the query, WITH clauses were missing, end brackets of apoc.do.when() were missing, column names were different...

Thank you Cobra.