Neo4j

dbranco · ‎01-30-2020

Hello everyone,

I have a question about loading data from a file that is plained and has some rules:This is my sample data

00001 A00 0 Cholera Cholera
00002 A000 1 Cholera due to Vibrio cholerae 01, biovar cholerae Cholera due to Vibrio cholerae 01, biovar cholerae
00003 A001 1 Cholera due to Vibrio cholerae 01, biovar eltor Cholera due to Vibrio cholerae 01, biovar eltor
00004 A009 1 Cholera, unspecified Cholera, unspecified
00005 A01 0 Typhoid and paratyphoid fevers Typhoid and paratyphoid fevers

The rules are the following

So the question is how can I tell neo4j to follow that space format?
also I would like to ignore maybe the first column (the id should be autogenerated

dbranco · ‎01-30-2020

After a while studying CYPHER i came to a conclusion. I needed to do it in several steps. I am not sure if this is the best one, can you give me an opinion?

CREATE CONSTRAINT ON (disease:PathologicalDisease) ASSERT disease.code IS UNIQUE;

USING PERIODIC COMMIT 500
LOAD CSV FROM "file:////icd10cm_order_2020.txt" AS line FIELDTERMINATOR '$'
WITH line[0] as contentLine
WITH contentLine, trim(substring(contentLine, 6, 7)) as code
WITH contentLine, code, trim(substring(contentLine, 14, 1)) as isGroup
WITH contentLine, code, isGroup, trim(substring(contentLine, 16, 60)) as shortDescription
WITH contentLine, code, isGroup, shortDescription, trim(substring(contentLine, 77)) as longDescription
MERGE (node:PathologicalDisease {
  code: code,
    parent: (CASE isGroup WHEN "0" THEN true ELSE false END),
    shortDescription:
    shortDescription,
    longDescription: longDescription
});

MATCH(n)
WHERE size(n.code) > 3
WITH n, reverse(range(3, size(n.code)-1)) as rangeIdx
UNWIND rangeIdx as idx
MATCH(n2) WHERE n2.code = substring(n.code, 0, idx)
WITH n, collect(n2)[0] as _1stNode
MERGE (_1stNode)-[:GROUPS]->(n)

I must add that in step 3 I needed to add the label (n:PathologicalDisease) in order to speed the creation of the relations

View solution in original post

dbranco · ‎01-30-2020

After a while studying CYPHER i came to a conclusion. I needed to do it in several steps. I am not sure if this is the best one, can you give me an opinion?

CREATE CONSTRAINT ON (disease:PathologicalDisease) ASSERT disease.code IS UNIQUE;

USING PERIODIC COMMIT 500
LOAD CSV FROM "file:////icd10cm_order_2020.txt" AS line FIELDTERMINATOR '$'
WITH line[0] as contentLine
WITH contentLine, trim(substring(contentLine, 6, 7)) as code
WITH contentLine, code, trim(substring(contentLine, 14, 1)) as isGroup
WITH contentLine, code, isGroup, trim(substring(contentLine, 16, 60)) as shortDescription
WITH contentLine, code, isGroup, shortDescription, trim(substring(contentLine, 77)) as longDescription
MERGE (node:PathologicalDisease {
  code: code,
    parent: (CASE isGroup WHEN "0" THEN true ELSE false END),
    shortDescription:
    shortDescription,
    longDescription: longDescription
});

MATCH(n)
WHERE size(n.code) > 3
WITH n, reverse(range(3, size(n.code)-1)) as rangeIdx
UNWIND rangeIdx as idx
MATCH(n2) WHERE n2.code = substring(n.code, 0, idx)
WITH n, collect(n2)[0] as _1stNode
MERGE (_1stNode)-[:GROUPS]->(n)

I must add that in step 3 I needed to add the label (n:PathologicalDisease) in order to speed the creation of the relations

ameyasoft · ‎01-30-2020

It's a brilliant idea!!.I really like this.

dbranco · ‎01-31-2020

Thanks, as I said I am newbie and I am still looking for the best solution when importing data; any new idea is very welcome.

Also, I have changed my import in order to separate the node from the description in order to create the translation node and get more flexibility when adding other languages. The problem is that the import took 2 hours and 6 minutes, so I probably need to create an index somewhere for the translation but haven't think about where.

This is my new modification:

// Load all PDs into CVS
USING PERIODIC COMMIT 500
LOAD CSV FROM "file:////icd10cm_order_2020.txt" AS line FIELDTERMINATOR '$'
WITH line[0] as contentLine
WITH contentLine, trim(substring(contentLine, 6, 7)) as code
WITH contentLine, code, trim(substring(contentLine, 14, 1)) as isGroup
WITH contentLine, code, isGroup, trim(substring(contentLine, 16, 60)) as shortDescription
WITH contentLine, code, isGroup, shortDescription, trim(substring(contentLine, 77)) as longDescription
MERGE (pd:PathologicalDisease {
  code: code, 
  parent: (CASE isGroup WHEN "0" THEN true ELSE false END)
})
MERGE (pd_translation:PDTranslation{
  shortDescription: shortDescription, 
  longDescription: longDescription
})

// Link PD to its translation
MERGE (pd)-[:TRANSLATE_TO {language: "en"}]->(pd_translation);

I believe that the problem is related to the MERGE of the translation. Since there are previous matches between some nodes (690 nodes are reused for the ICD-10 revision 2020) so that might do that it takes more time for the importing process.

If someone has an opinion of indexing the translation nodes or the relation please help me.

dbranco · ‎01-31-2020

Don't forget to like the post that helps with the visibility I guess.

Neo4j

Import plain file with a format