Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
05-23-2021 02:51 PM
I am a novice. I have tried to create a small model with some test data.
The next step was to use a much larger set of input, about 300,000 rows.
I tried using the USING PERIODIC COMMIT but received an error, stating
"Executing queries that use periodic commit in an open transaction is not possible"
I am stumped concerning how to address my problem. I will assume I am doing something wrong or should use some other approach.
Using Windows 10 and Neo4j 4.1.3
Solved! Go to Solution.
05-23-2021 03:13 PM
Through what client are you submitting the cypher statement/load csv?
If using the Neo4j Browser, can you preface the statement with :auto
similar to what is described at Importing CSV Data into Neo4j - Developer Guides and for example
:auto USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
and does this address your issue
05-23-2021 03:13 PM
Through what client are you submitting the cypher statement/load csv?
If using the Neo4j Browser, can you preface the statement with :auto
similar to what is described at Importing CSV Data into Neo4j - Developer Guides and for example
:auto USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
and does this address your issue
05-23-2021 07:22 PM
Fantastic. This worked. Thank you very much.
05-23-2021 03:25 PM
Hi @fredwat3
I think loading 300,000 records is still a small number.
I have loaded 30 million records in one case.
If processing speed is an issue for you, show us your Cypher.
And I can make some good comments.
02-21-2022 09:21 AM
Hi @koji , thank you for sharing your use case of 30 million rows. I have a similar large dataset and I kept seeing some fatal error or connection time out issue (also asked here VM stop-the-world pause for simple LOAD csv query for creating a graph - #26 by mengjiakang2017). it would be great if you could share your experience in loading large datasets, I'm posting my cypher here. Thanks!
CALL apoc.periodic.iterate('call apoc.load.jdbc($azure_url,
"select distinct m.measurement_id, m.visit_occurrence_id, m.measurement_concept_id, v.visit_concept_id, v.visit_start_date, v.visit_end_date
from omop.measurement m
join omop.visit_occurrence v
on v.visit_occurrence_id = m.visit_occurrence_id
where measurement_id < 30000000")
yield row
','
match (m:Measurement {measurement_concept_id: row.measurement_concept_id})
match (v:VisitOccurrence {visit_concept_id:row.visit_concept_id})
merge (m)-[r:ASSOCIATED_DURING_VISIT {visit_occurrence_id:row.visit_occurrence_id,
visit_start_date:row.visit_start_date,
visit_end_date:row.visit_end_date,
measurement_id:row.measurement_id}]->(v)
',{batchSize:1000000, iterateList:True, parallel:False, params:{azure_url:$azure_url}, retries:10});
02-21-2022 09:55 AM
when you use apoc.load.jdbc you need to make sure your batch size of data can be processed without having timeouts.
Also, would it be possible to post your schema statements. most of the time when you have so many GC stop the world statements, then the indexes may not exist.
Also this statement
merge (m)-[r:ASSOCIATED_DURING_VISIT {visit_occurrence_id:row.visit_occurrence_id,
visit_start_date:row.visit_start_date,
visit_end_date:row.visit_end_date,
measurement_id:row.measurement_id}]->(v)
is going to be very expensive as more and more relationships are created between m and v. It has to check all the relationships before it can decide it can create the relationship.
Do you need to add the properties there?
02-21-2022 11:06 AM
Thank you @anthapu ! looks like I might want to decrease the batch size and give it a try
below is the schema and index I created (I didn't add all the attributes in the whiteboard so you might see fewer properties in the schema statements)
to answer your question -
Do you need to add the properties there?
Yes, I need these edge attributes to store the details of a healthcare visit. but I think updating to
merge (m)-[r:ASSOCIATED_DURING_VISIT {visit_occurrence_id:row.visit_occurrence_id,}
on create set the rest edge attributes might be better
CREATE
(`0` :ObservationPeriod {period_type_concept_id:'Integer',period_type_concept_name:'String'}),
(`1` :Gender {gender_concept_id:'Integer',gender_concept_name:'String'}),
(`2` :Measurement {measurement_concept_id:'Integer',measurement_concept_name:'String'}),
(`3` :Race {race_concept_id:'Integer',race_concept_name:'String'}),
(`4` :Person {person_id:'Integer',year_of_birth:'Integer',SCRIPT_case_number:'Integer'}),
(`5` :Ethnicity {ethnicity_concept_id:'Integer',ethnicity_concept_name:'String'}),
(`6` :VisitOccurrence {visit_concept_id:'Integer',visit_concept_name:'String'}),
(`7` :ProcedureOccurrence {procedure_concept_id:'Integer',procedure_concept_name:'String'}),
(`8` :ConditionOccurrence {condition_concept_id:'Integer',condition_concept_name:'String',condition_type_concept_id:'Integer',condition_type_concept_name:'String'}),
(`9` :Observation {observation_concept_id:'Integer',observation_concept_name:'String'}),
(`10` :DrugExposure {drug_concept_id:'Integer',drug_concept_name:'String',drug_type_concept_id:'Integer',drug_type_concept_name:'String',route_concept_id:'Integer',route_concept_name:'String'}),
(`4`)-[:HAS_OBSERVATION_PERIOD {observation_period_start_date:'Date',observation_period_end_date:'Date',observation_period_id:'Integer'}]->(`0`),
(`4`)-[:HAS_ETHNICITY ]->(`5`),
(`4`)-[:HAS_RACE ]->(`1`),
(`4`)-[:HAS_MEASUREMENT {measurement_date:'Date',measurement_id:'Integer'}]->(`2`),
(`4`)-[:HAS_RACE ]->(`3`),
(`4`)-[:HAS_VISIT_OCCURRENCE {visit_occurrence_id:'Integer',visit_start_date:'String',visit_end_date:'String'}]->(`6`),
(`4`)-[:HAS_PROCEDURE_OCCURRENCE {procedure_occurrence_id:'Integer',procedure_date:'Date'}]->(`7`),
(`4`)-[:HAS_CONDITION_OCCURRENCE {condition_occurrence_id:'Integer',condition_start_date:'Date',condition_end_date:'Date'}]->(`8`),
(`4`)-[:HAS_OBSERVATION {observation_id:'Integer',observation_date:'String'}]->(`9`),
(`4`)-[:HAS_DRUG_EXPOSURE {drug_exposure_id:'Integer',drug_exposure_start_date:'String',drug_exposure_end_date:'String'}]->(`10`),
(`0`)-[:ASSOCIATED_DURING_VISIT {visit_occurrence_id:'Integer'}]->(`6`),
(`2`)-[:ASSOCIATED_DURING_VISIT {visit_occurrence_id:'Integer'}]->(`6`),
(`8`)-[:ASSOCIATED_DURING_VISIT {visit_occurrence_id:'Integer'}]->(`6`),
(`10`)-[:ASSOCIATED_DURING_VISIT {visit_occurrence_id:'Integer'}]->(`6`),
(`9`)-[:ASSOCIATED_DURING_VISIT {visit_occurrence_id:'Integer'}]->(`6`),
(`7`)-[:ASSOCIATED_DURING_VISIT {visit_occurrence_id:'Integer'}]->(`6`);
// -----------------------create constraints for import -----------------------//
CREATE CONSTRAINT ON (p:Person) ASSERT p.person_id IS UNIQUE;
CREATE CONSTRAINT ON (o:Observation) ASSERT o.observation_concept_id IS UNIQUE;
CREATE CONSTRAINT ON (o1:ObservationPeriod) ASSERT o1.period_type_concept_id IS UNIQUE;
CREATE CONSTRAINT ON (d1:DrugExposure) ASSERT d1.drug_concept_id IS UNIQUE;
CREATE CONSTRAINT ON (v:VisitOccurrence) ASSERT v.visit_concept_id IS UNIQUE;
CREATE CONSTRAINT ON (c:ConditionOccurrence) ASSERT c.condition_concept_id IS UNIQUE;
CREATE CONSTRAINT ON (m:Measurement) ASSERT m.measurement_concept_id IS UNIQUE;
CREATE CONSTRAINT ON (p:ProcedureOccurrence) ASSERT p.procedure_concept_id IS UNIQUE;
CREATE INDEX obs_visit FOR ()-[r:HAS_MEASUREMENT]->() ON (r.measurement_id);
CREATE INDEX cond_visit FOR ()-[r:HAS_CONDITION_OCCURRENCE]->() ON (r.condition_occurrence_id);```
02-21-2022 12:34 PM
@mengjiakang2017 If visit_occurrence_id is unique and you want to use that on relation then you are better off doing
merge (m)-[r:ASSOCIATED_DURING_VISIT {visit_occurrence_id:row.visit_occurrence_id,}
and create index/constraint on it.
If you do not have index it will do db scans for all relationships between those nodes and it might be costlier as the number of relationships grow.
02-21-2022 12:43 PM
Thanks @anthapu , actually I made a mistake in the example, measurement_id is unique so I'd better try index and merge on it.
CREATE INDEX meas_visit FOR (m:Measurement)-[r:ASSOCIATED_DURING_VISIT]->(v:VisitOccurrence) ON (r.measurement_id);
UPDATE
Looks like I cannot use (m:Measurement) label match in creating edge indexes, I got error of
Thanks!
02-21-2022 02:32 PM
You won't be able specify the start and end nodes for relationship index. It is on a given relationship type. That's the error you are seeing.
CREATE INDEX meas_visit FOR ()-[r:ASSOCIATED_DURING_VISIT]->() ON (r.measurement_id);
Also, we should be careful about how much work we are doing when we are ingesting work loads.
Cypher makes it easy to express traversal patterns. Sometimes that simplicity can hide the amount of work you have to do when you are ingesting data.
For example say you didn't have index on relationship
merge (m)-[r:ASSOCIATED_DURING_VISIT {visit_occurrence_id:row.visit_occurrence_id,}
when there were no relationships then it is quick.
When you have 10 relationships, then it has to traverse all the 10 relationships between those nodes and retrieve the property and compare and find out if it can create the relationship or not.
So, progressively you are doing more and more work as the data keeps increasing.
This is where relying on index makes life bit easier.
All the sessions of the conference are now available online