02-21-2022 03:09 PM
Hello all,
I'm loading data from Azure SQL Server into Neo4j hosted on Docker on a Linux server. One of the tables, which contains ~7 million rows, cannot be loaded in one run and keeps causing connection timeouts.
Can someone help with configuring the timeout for the JDBC connection?
Below are my connection string and import query:
:param azure_url => 'jdbc:sqlserver://SERVER;databaseName=DATABASE;user=User;password=PW';
CALL apoc.periodic.iterate(
  'CALL apoc.load.jdbc($azure_url, "select distinct m.person_id, m.measurement_id, m.measurement_concept_id, m.measurement_date, m.unit_concept_id, m.unit_concept_name, m.value_as_number from omop.measurement m where m.measurement_id > 200000000") YIELD row',
  'WITH row
   WHERE row.unit_concept_id IS NOT NULL
     AND row.value_as_number IS NOT NULL
   MATCH (p:Person {person_id: row.person_id})
   MATCH (m:Measurement {measurement_concept_id: row.measurement_concept_id})
   MERGE (p)-[r:HAS_MEASUREMENT {measurement_id: row.measurement_id}]->(m)
   ON CREATE SET
     r.measurement_date = row.measurement_date,
     r.unit_concept_id = row.unit_concept_id,
     r.unit_concept_name = row.unit_concept_name,
     r.value_as_number = row.value_as_number',
  {batchSize: 100000, iterateList: true, parallel: false, params: {azure_url: $azure_url}, retries: 5}
);
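From the driver docs it looks like the Microsoft JDBC driver for SQL Server accepts timeout settings directly in the connection URL, so maybe extending the :param line as sketched below is the right place to tune it? (loginTimeout, queryTimeout and socketTimeout are standard mssql-jdbc connection properties; the values here are just placeholders to experiment with.)

:param azure_url => 'jdbc:sqlserver://SERVER;databaseName=DATABASE;user=User;password=PW;loginTimeout=60;queryTimeout=600;socketTimeout=900000';
// loginTimeout is in seconds (time to establish the connection),
// queryTimeout is in seconds per statement (-1 means no limit),
// socketTimeout is in milliseconds (0 means no timeout).

Lowering batchSize (e.g. to 10000) would also keep each inner transaction shorter, so a mid-run failure costs less.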
02-21-2022 03:43 PM
apoc.load.jdbc is a bit difficult to control and resume in case of failure.
There are two options: use an ETL tool such as Apache Hop, or export the data from SQL Server in chunks (e.g. CSV files) and load the chunks one at a time.
The advantage of this approach is that you can start from the point of failure instead of restarting the whole process; you can fix the Cypher and continue, as sketched below.
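A rough sketch of the chunked-CSV route (the file name and paths are placeholders, and it assumes person_id and the concept ids are stored as integers on the existing nodes): each exported chunk is loaded on its own, so only the failing file needs to be re-run.

CALL apoc.periodic.iterate(
  'LOAD CSV WITH HEADERS FROM "file:///measurement_part_01.csv" AS row RETURN row',
  'WITH row
   WHERE row.unit_concept_id IS NOT NULL AND row.value_as_number IS NOT NULL
   MATCH (p:Person {person_id: toInteger(row.person_id)})
   MATCH (m:Measurement {measurement_concept_id: toInteger(row.measurement_concept_id)})
   MERGE (p)-[r:HAS_MEASUREMENT {measurement_id: toInteger(row.measurement_id)}]->(m)
   ON CREATE SET r.measurement_date = row.measurement_date,
                 r.unit_concept_id = toInteger(row.unit_concept_id),
                 r.unit_concept_name = row.unit_concept_name,
                 r.value_as_number = toFloat(row.value_as_number)',
  {batchSize: 10000, parallel: false}
);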
02-22-2022 09:08 AM
Thanks a lot @anthapu for the suggestions.
I haven't tried Apache Hop yet but will look into it. As for CSV, I don't think it can store more than ~1 million rows, so I might need to use TXT (not sure whether Neo4j supports that) or JSON.
02-22-2022 04:39 PM
You could load JSON into Neo4j.
The Python ingest utility pyingest (GitHub: neo4j-field/pyingest) can also ingest JSON into Neo4j.
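apoc.load.json can drive the same kind of batched write as well. A minimal sketch, assuming the rows are exported to a hypothetical JSON file with the same field names as the JDBC query:

CALL apoc.periodic.iterate(
  'CALL apoc.load.json("file:///measurement_part_01.json") YIELD value RETURN value AS row',
  'MATCH (p:Person {person_id: row.person_id})
   MATCH (m:Measurement {measurement_concept_id: row.measurement_concept_id})
   MERGE (p)-[r:HAS_MEASUREMENT {measurement_id: row.measurement_id}]->(m)
   ON CREATE SET r.value_as_number = row.value_as_number',
  {batchSize: 10000, parallel: false}
);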
02-23-2022 08:31 AM
I think one can export CSV from SQL Server in manageable volumes and write a custom ETL pipeline to ingest the data into Neo4j, rather than relying on ready-made open-source libraries until you are sure of their design goals.
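Whatever tool drives such a pipeline, the write side usually comes down to sending a list of rows as a parameter and unwinding it. A sketch of the Cypher the pipeline would run per batch ($rows being whatever chunk it has just read from SQL Server):

UNWIND $rows AS row
MATCH (p:Person {person_id: row.person_id})
MATCH (m:Measurement {measurement_concept_id: row.measurement_concept_id})
MERGE (p)-[r:HAS_MEASUREMENT {measurement_id: row.measurement_id}]->(m)
ON CREATE SET r.measurement_date = row.measurement_date,
              r.unit_concept_id = row.unit_concept_id,
              r.unit_concept_name = row.unit_concept_name,
              r.value_as_number = row.value_as_number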
03-02-2022 02:28 PM
Thank you all for the nice suggestions!
Just wanted to update that this was solved by using neo4j-admin import (Import - Operations Manual).
It loaded super fast compared to apoc.load.jdbc or apoc.load.csv, as shown below:
IMPORT DONE in 29s 560ms.
Imported:
1944 nodes
12987721 relationships
45588552 properties
Peak memory usage: 76.06MiB
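For anyone finding this later, the general shape of a neo4j-admin import call for this kind of data looks roughly like the sketch below (Neo4j 4.x syntax; the file names and header columns are simplified placeholders, not my exact files):

# header files (one line each):
# persons_header.csv:       person_id:ID(Person)
# measurements_header.csv:  measurement_concept_id:ID(Measurement)
# rels_header.csv:          :START_ID(Person),:END_ID(Measurement),measurement_id:long,measurement_date,unit_concept_id:long,unit_concept_name,value_as_number:double

bin/neo4j-admin import \
    --database=neo4j \
    --nodes=Person=import/persons_header.csv,import/persons.csv \
    --nodes=Measurement=import/measurements_header.csv,import/measurements.csv \
    --relationships=HAS_MEASUREMENT=import/rels_header.csv,import/rels.csv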
03-03-2022 01:02 PM
@mengjiakang2017
neo4j-admin import will be faster than apoc.load.jdbc, but neo4j-admin import can only operate on a new/empty database. You cannot run neo4j-admin import against an existing/populated database, and it is run against an offline database. apoc.load.jdbc behaves in the exact opposite manner.
03-09-2022 11:32 AM
Yes, I agree that APOC is great for database updates while neo4j-admin import works well for building a database from the ground up. I haven't checked Apache Hop yet, but it might be a good candidate for large data imports, based on anthapu's suggestion.