Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
01-30-2020 02:44 PM
Hello everyone,
I have a question about loading data from a file that is plained and has some rules:This is my sample data
00001 A00 0 Cholera Cholera
00002 A000 1 Cholera due to Vibrio cholerae 01, biovar cholerae Cholera due to Vibrio cholerae 01, biovar cholerae
00003 A001 1 Cholera due to Vibrio cholerae 01, biovar eltor Cholera due to Vibrio cholerae 01, biovar eltor
00004 A009 1 Cholera, unspecified Cholera, unspecified
00005 A01 0 Typhoid and paratyphoid fevers Typhoid and paratyphoid fevers
The rules are the following
So the question is how can I tell neo4j to follow that space format?
also I would like to ignore maybe the first column (the id should be autogenerated
Solved! Go to Solution.
01-30-2020 02:47 PM
After a while studying CYPHER i came to a conclusion. I needed to do it in several steps. I am not sure if this is the best one, can you give me an opinion?
CREATE CONSTRAINT ON (disease:PathologicalDisease) ASSERT disease.code IS UNIQUE;
USING PERIODIC COMMIT 500
LOAD CSV FROM "file:////icd10cm_order_2020.txt" AS line FIELDTERMINATOR '$'
WITH line[0] as contentLine
WITH contentLine, trim(substring(contentLine, 6, 7)) as code
WITH contentLine, code, trim(substring(contentLine, 14, 1)) as isGroup
WITH contentLine, code, isGroup, trim(substring(contentLine, 16, 60)) as shortDescription
WITH contentLine, code, isGroup, shortDescription, trim(substring(contentLine, 77)) as longDescription
MERGE (node:PathologicalDisease {
code: code,
parent: (CASE isGroup WHEN "0" THEN true ELSE false END),
shortDescription:
shortDescription,
longDescription: longDescription
});
MATCH(n)
WHERE size(n.code) > 3
WITH n, reverse(range(3, size(n.code)-1)) as rangeIdx
UNWIND rangeIdx as idx
MATCH(n2) WHERE n2.code = substring(n.code, 0, idx)
WITH n, collect(n2)[0] as _1stNode
MERGE (_1stNode)-[:GROUPS]->(n)
01-30-2020 02:47 PM
After a while studying CYPHER i came to a conclusion. I needed to do it in several steps. I am not sure if this is the best one, can you give me an opinion?
CREATE CONSTRAINT ON (disease:PathologicalDisease) ASSERT disease.code IS UNIQUE;
USING PERIODIC COMMIT 500
LOAD CSV FROM "file:////icd10cm_order_2020.txt" AS line FIELDTERMINATOR '$'
WITH line[0] as contentLine
WITH contentLine, trim(substring(contentLine, 6, 7)) as code
WITH contentLine, code, trim(substring(contentLine, 14, 1)) as isGroup
WITH contentLine, code, isGroup, trim(substring(contentLine, 16, 60)) as shortDescription
WITH contentLine, code, isGroup, shortDescription, trim(substring(contentLine, 77)) as longDescription
MERGE (node:PathologicalDisease {
code: code,
parent: (CASE isGroup WHEN "0" THEN true ELSE false END),
shortDescription:
shortDescription,
longDescription: longDescription
});
MATCH(n)
WHERE size(n.code) > 3
WITH n, reverse(range(3, size(n.code)-1)) as rangeIdx
UNWIND rangeIdx as idx
MATCH(n2) WHERE n2.code = substring(n.code, 0, idx)
WITH n, collect(n2)[0] as _1stNode
MERGE (_1stNode)-[:GROUPS]->(n)
01-30-2020 11:37 PM
It's a brilliant idea!!.I really like this.
01-31-2020 01:58 AM
Thanks, as I said I am newbie and I am still looking for the best solution when importing data; any new idea is very welcome.
Also, I have changed my import in order to separate the node from the description in order to create the translation node and get more flexibility when adding other languages. The problem is that the import took 2 hours and 6 minutes, so I probably need to create an index somewhere for the translation but haven't think about where.
This is my new modification:
// Load all PDs into CVS
USING PERIODIC COMMIT 500
LOAD CSV FROM "file:////icd10cm_order_2020.txt" AS line FIELDTERMINATOR '$'
WITH line[0] as contentLine
WITH contentLine, trim(substring(contentLine, 6, 7)) as code
WITH contentLine, code, trim(substring(contentLine, 14, 1)) as isGroup
WITH contentLine, code, isGroup, trim(substring(contentLine, 16, 60)) as shortDescription
WITH contentLine, code, isGroup, shortDescription, trim(substring(contentLine, 77)) as longDescription
MERGE (pd:PathologicalDisease {
code: code,
parent: (CASE isGroup WHEN "0" THEN true ELSE false END)
})
MERGE (pd_translation:PDTranslation{
shortDescription: shortDescription,
longDescription: longDescription
})
// Link PD to its translation
MERGE (pd)-[:TRANSLATE_TO {language: "en"}]->(pd_translation);
I believe that the problem is related to the MERGE of the translation. Since there are previous matches between some nodes (690 nodes are reused for the ICD-10 revision 2020) so that might do that it takes more time for the importing process.
If someone has an opinion of indexing the translation nodes or the relation please help me.
01-31-2020 01:58 AM
Don't forget to like the post that helps with the visibility I guess.
All the sessions of the conference are now available online