02-10-2020 10:21 PM
Hi all,
I have some questions about importing data into Neo4j.
I have a large volume of data: 100k JSON files, each containing 200k records.
What is the best way to import this data?
I am currently using PySpark and neo4j-admin import. Is there an alternative method, or can I import this much data using PySpark alone?
02-11-2020 12:44 AM
Hi @12kunal34
Maybe this blog post is helpful.
If you can describe a more specific issue you're having with your current method, the community may be able to give you more ideas.
02-11-2020 12:56 AM
Using Apache Spark alone will most likely result in deadlock situations for large graphs, because concurrent writers contend for locks on the same nodes. Creating CSV files and loading them with neo4j-admin import is, I believe, currently the best option.
One possibility might be to run a clustering algorithm on your graph in Spark and import the clusters separately, so you avoid the deadlocks. At the end you would of course need to create the cross-cluster relationships again.
This is in no way an easy solution, though.
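
For illustration, here is a minimal PySpark sketch of that route. The JSON field names (id, name, friend_id) and the Person/KNOWS model are made-up placeholders; the CSV headers follow the neo4j-admin import conventions (:ID, :LABEL, :START_ID, :END_ID, :TYPE):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("neo4j-csv-export").getOrCreate()

# Read all the JSON files at once; Spark parallelizes across files.
df = spark.read.json("/data/json/*.json")

# Node file: one row per unique id, columns named for neo4j-admin import.
nodes = (df.dropDuplicates(["id"])
           .select(F.col("id").alias("id:ID"),
                   "name",
                   F.lit("Person").alias(":LABEL")))
# coalesce(1) keeps the sketch simple; for real volumes you would write
# many header-less part files plus a separate one-line header file.
nodes.coalesce(1).write.mode("overwrite").option("header", True).csv("/data/csv/people")

# Relationship file: start/end ids plus a relationship type.
rels = df.select(F.col("id").alias(":START_ID"),
                 F.col("friend_id").alias(":END_ID"),
                 F.lit("KNOWS").alias(":TYPE"))
rels.coalesce(1).write.mode("overwrite").option("header", True).csv("/data/csv/knows")

# Then, against a stopped, empty database, something like:
#   neo4j-admin import --nodes=/data/csv/people/part-00000-*.csv \
#                      --relationships=/data/csv/knows/part-00000-*.csv

Deadlocks are not a concern on this path, since neo4j-admin import writes the store files directly, offline, without taking transaction locks.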
02-11-2020 04:37 AM
You could try this utility written in Python.
It has a YAML config file where you can specify the file URL and the corresponding Cypher to ingest the data. It imports each file in sequence.
If you want to parallelize the import, you can create multiple YAML config files and run them in parallel.
As others mentioned, running in parallel opens up the possibility of deadlocks, since creating a relationship takes locks on the nodes at both ends.
neo4j-admin import is still the fastest way to bulk-load a huge amount of initial data.
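
If you do go the parallel-Cypher route, one way to soften the deadlock problem is to batch relationship creation with UNWIND and let the driver's transaction functions retry transient errors (deadlocks are reported as transient). A rough sketch with the official Neo4j Python driver; the URI, credentials, Person/KNOWS model, and batch shape are all placeholders:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

def create_knows_batch(tx, rows):
    # One UNWIND per transaction keeps lock hold times short and
    # amortizes round trips; rows look like {"start": 1, "end": 2}.
    tx.run(
        "UNWIND $rows AS row "
        "MATCH (a:Person {id: row.start}) "
        "MATCH (b:Person {id: row.end}) "
        "MERGE (a)-[:KNOWS]->(b)",
        rows=rows,
    )

def load_relationships(batches):
    with driver.session() as session:
        for batch in batches:
            # write_transaction retries transient failures
            # (including deadlocks) with backoff.
            session.write_transaction(create_knows_batch, batch)

Partitioning the batches so that concurrent workers touch disjoint node ranges reduces how often those retries fire in the first place.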