
Neo4j on Ubuntu

How is the write performance of Neo4j on Ubuntu? I'm trying to load the Yelp dataset into Neo4j using a Cypher query, and it took me 7 hours to load a 300 MB JSON file with 200k nodes. I'm using an Ubuntu 19.04 laptop with 32 GB RAM, a 4-core i7 CPU, and an NVMe drive. I tried changing the I/O scheduler from deadline to none and increased the heap size and memory settings in the config files, but with little improvement. Is this due to the Linux file system, or is load performance this poor in all environments?

If anyone has faced a similar problem and has figured a way out, please let me know.
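
(For reference, the config changes I mentioned were to the memory settings in neo4j.conf; the values below are only illustrative for a 32 GB machine, not necessarily what I used.)

# neo4j.conf -- illustrative memory settings (values are assumptions)
dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g
# the page cache holds the graph store files; leave headroom for heap and OS
dbms.memory.pagecache.size=12g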

1 ACCEPTED SOLUTION

Common issues with large imports are:

  1. lack of indexes (in case your statements use MATCH or MERGE)
  2. too large transactions

To get a better understanding, please share what exactly you do for importing.
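
As a rough sketch of both points (the label, property, and file names below are placeholders):

// an index lets MERGE/MATCH find existing nodes without scanning everything
CREATE INDEX ON :Business(id);

// batching keeps each transaction small instead of one huge commit,
// e.g. with apoc.periodic.iterate and a batchSize in the thousands
CALL apoc.periodic.iterate(
  "CALL apoc.load.json('file:///business.json') YIELD value RETURN value",
  "MERGE (b:Business {id: value.business_id})",
  {batchSize: 10000});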


3 REPLIES

Hi Stephan,

This is the Cypher query I'm using to import.

CALL apoc.periodic.iterate(
  // outer statement: stream one map per JSON record from the file
  "CALL apoc.load.json('file:///business.json')
   YIELD value
   RETURN value",
  // inner statement: upsert a Business node and copy over the remaining properties
  "MERGE (b:Business {id: value.business_id})
   SET b += apoc.map.clean(value,
     ['attributes','hours','business_id','categories',
      'address','postal_code'],
     [])",
  {iterateList: true, batchSize: 10000, parallel: true});

and this is a sample json record in the file
{"business_id":"1SWheh84yJXfytovILXOAQ","name":"Arizona Biltmore Golf Club","address":"2818 E Camino Acequia Drive","city":"Phoenix","state":"AZ","postal_code":"85016","latitude":33.5221425,"longitude":-112.0184807,"stars":3.0,"review_count":5,"is_open":0,"attributes":{"GoodForKids":"False"},"categories":"Golf, Active Life","hours":null}

The Cypher statement looks good to me.

Do you have an index created upfront, prior to running the import: create index on :Business(id)? Note that if you use a unique constraint instead, it requires a global lock upon writes, so parallel:true will not work in that case.
For maximum performance, use a regular index.
You can also play with the batchSize value; try e.g. 1000, 10000 and maybe 100000 to see what is fastest in your case.
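
For reference, a rough sketch of the two options (Neo4j 3.x syntax, matching the create index on :Business(id) form above):

// regular index: fast MERGE lookups and compatible with parallel:true
CREATE INDEX ON :Business(id);

// unique constraint: also index-backed, but the locking described above
// means parallel:true will not work with it
// CREATE CONSTRAINT ON (b:Business) ASSERT b.id IS UNIQUE;

// then re-run the import with batchSize 1000, 10000, 100000 and compare timings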