

Fastest way to load data in neo4j using python

Hi, I have CSV files with a fairly large number of rows (<10M for now). I have to read each CSV and create nodes and relationships among them in Neo4j. I'm using pandas for the CSV data manipulation and py2neo for node and relationship creation. The problem is that even for a dataset as small as 500,000 rows it takes hours (>10 hours) to read the data and create the nodes and relationships in the graph DB. Is there any solution to this?

Thanks

1 ACCEPTED SOLUTION

apoc.load.csv is your new friend if it's a new database or one that needs to be updated.
It's built explicitly for your use case, and it won't flinch at your tiny 500,000 lines.

Be aware that you must always create constraints before importing any data with a MERGE or MATCH clause, or your children will have grown up before it finishes.

I used Python before for my data injection; now I use apoc.load.csv to standardize and speed up the process.
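
As a minimal sketch of that approach, assuming the Links.csv layout described later in this thread (the file name, labels, and batch size are placeholders, and reading local files with apoc.load.csv requires apoc.import.file.enabled=true):

// Constraints first, so the MERGE lookups below are index-backed
CREATE CONSTRAINT ON (n:Node) ASSERT n.Identifier IS UNIQUE;
CREATE CONSTRAINT ON (f:`Fiber Cable`) ASSERT f.Identifier IS UNIQUE;

// apoc.load.csv streams the rows; apoc.periodic.iterate commits them in batches
CALL apoc.periodic.iterate(
  "CALL apoc.load.csv('file:///Links.csv', {header:true}) YIELD map AS row RETURN row",
  "MERGE (f:`Fiber Cable` {Identifier: row.Identifier})
   SET f.ObjectID = row.ObjectID, f.Status = row.Status, f.RouteType = row.Type",
  {batchSize: 10000, parallel: false});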


5 REPLIES

the absolute fastest way (by far) to load large datasets into neo4j is to use the bulk loader

neo4j-admin import

it is orders of magnitude faster, for one reason: it only builds a database from the ground up, so transaction tracking can be (and is) turned off during the load.

caveat: this only works for new databases; it can't be used to add new data to an existing database
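
For reference, a bare-bones sketch of such an invocation (the file names, header columns, and layout are assumptions, and exact flags differ slightly between Neo4j versions):

# nodes.csv header:         id:ID,Identifier,:LABEL
# relationships.csv header: :START_ID,:END_ID,:TYPE
neo4j-admin import \
  --nodes=import/nodes.csv \
  --relationships=import/relationships.csv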


I'm facing one problem while loading CSV data using Cypher. The script I am using works fine for both node and relationship creation from the CSV, but only for a limited number of rows (400-500). When I run the same script on the original dataset with a large number of rows, it runs indefinitely and finally throws an error:

"ServiceUnavailable: WebSocket connection failure. Due to security constraints in your web browser, the reason for the failure is not available to this Neo4j Driver. Please use your browser's development console to determine the root cause of the failure. Common reasons include the database being unavailable, using the wrong connection URL or temporary network problems. If you have enabled encryption, ensure your browser is configured to trust the certificate Neo4j is configured to use. WebSocket readyState is: 3"

I'm not able to find any working solution for this problem. Can you guide me through it?

Following is the Cypher script I'm using:

LOAD CSV WITH HEADERS FROM 'file:///Links.csv' AS row
WITH row WHERE row.ObjectID IS NOT NULL
MERGE (f:`Fiber Cable` {ObjectID: row.ObjectID, Identifier: row.Identifier, Status: row.Status, RouteType: row.Type})
WITH f, row
UNWIND split(row.Segments, ' ') AS node
MERGE (n:Node {Identifier: node})
MERGE (f)-[r:ATTACHED_TO]->(n)
RETURN count(f)

  1. Create two constraints, one on :Node(Identifier) and one on :`Fiber Cable`(Identifier), so that the nodes are looked up quickly (or add a Node label to the first node you're creating)
  2. Only MERGE on the single identifier property and SET the others
  3. For large files (50k+ rows) you probably want to use USING PERIODIC COMMIT and three separate passes: two for the nodes and one for the relationships (see the sketch after the query below)
LOAD CSV WITH HEADERS FROM 'file:///Links.csv' AS row
WITH row WHERE row.ObjectID IS NOT NULL
MERGE (f:Node {Identifier: row.Identifier})
ON CREATE SET f:`Fiber Cable`, f.ObjectID = row.ObjectID, f.Status = row.Status, f.RouteType = row.Type
WITH f, row
UNWIND split(row.Segments, ' ') AS node
MERGE (n:Node {Identifier: node})
MERGE (f)-[r:ATTACHED_TO]->(n)
RETURN count(f)
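
For completeness, here is a minimal sketch of the three-pass variant from point 3, assuming the same Links.csv layout (the batch size and the exact split into passes are only illustrative):

// One constraint is enough here because the query above also labels the cable nodes :Node
CREATE CONSTRAINT ON (n:Node) ASSERT n.Identifier IS UNIQUE;

// Pass 1: cable nodes
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///Links.csv' AS row
WITH row WHERE row.ObjectID IS NOT NULL
MERGE (f:Node {Identifier: row.Identifier})
ON CREATE SET f:`Fiber Cable`, f.ObjectID = row.ObjectID, f.Status = row.Status, f.RouteType = row.Type;

// Pass 2: segment nodes
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///Links.csv' AS row
WITH row WHERE row.ObjectID IS NOT NULL
UNWIND split(row.Segments, ' ') AS segment
MERGE (:Node {Identifier: segment});

// Pass 3: relationships, matching both endpoints via the indexed property
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///Links.csv' AS row
WITH row WHERE row.ObjectID IS NOT NULL
UNWIND split(row.Segments, ' ') AS segment
MATCH (f:Node {Identifier: row.Identifier})
MATCH (n:Node {Identifier: segment})
MERGE (f)-[:ATTACHED_TO]->(n);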