04-22-2020 08:27 AM
I am wondering if anybody can point me to any benchmarking figures for Spark DataFrame writes to a Neo4j database using the neo4j-spark-connector.
I am currently using the following versions on a 60-core / 60-executor cluster:
Neo4j 3.5
neo4j-java-driver-1.7.2.jar
Spark 2.4.0
Using Neo4jDataFrame.mergeEdgeList() (roughly as sketched below), I have tried batch sizes of 10k, 20k, and 40k.
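For reference, the call looks roughly like this. This is a minimal sketch of how mergeEdgeList is invoked in the old neo4j-contrib connector; the labels, relationship type, and column names are placeholders for my real schema, and the exact signature may vary between connector versions:

import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame
import org.neo4j.spark.Neo4jDataFrame

// Merge each DataFrame row as a (:Person)-[:KNOWS]->(:Person) edge.
// The (label, columns) tuples give the node label / relationship type
// and the DataFrame columns used as keys or properties.
def loadEdges(sc: SparkContext, edgesDf: DataFrame): Unit = {
  Neo4jDataFrame.mergeEdgeList(
    sc,                       // SparkContext
    edgesDf,                  // DataFrame with columns src, dst, since (placeholders)
    ("Person", Seq("src")),   // source node: label + key column
    ("KNOWS", Seq("since")),  // relationship type + property columns
    ("Person", Seq("dst"))    // target node: label + key column
  )
}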
However, it seems to take an unreasonable amount of time.
100k records take about 35 minutes. For a million records, it appeared to hang for more than 14 hours: there was no progress in the Spark UI, and all tasks showed 0/100.
What are the expected write rates to a Neo4j database using the Spark connector, and what is the best way to optimize larger DataFrames (containing millions of records) to ensure faster loads?
Thanks
Shiva
11-12-2020 04:31 AM
Neo4j has a new approach to the Spark connector, which can be found here; it includes architectural guidance for getting the best performance.
It's hard to say exactly what performance each user will get, because it depends heavily on your data model and setup. But we have seen tens of thousands of node writes per second on moderate hardware, for nodes with around 10 properties, when written using the "normalized loading" approach documented on that page.
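As an illustration, a node write with the new connector's DataSource API looks roughly like the sketch below. The URL, credentials, label, and key column are placeholders for your own environment, and batch.size is just an example value to tune against your hardware:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("neo4j-load").getOrCreate()

// Example DataFrame; in practice this is your prepared node data.
val nodesDf = spark.createDataFrame(Seq(
  (1L, "Alice"),
  (2L, "Bob")
)).toDF("id", "name")

nodesDf.write
  .format("org.neo4j.spark.DataSource")
  .mode(SaveMode.Overwrite)                          // MERGE on the node keys
  .option("url", "bolt://localhost:7687")            // placeholder connection URL
  .option("authentication.basic.username", "neo4j")  // placeholder credentials
  .option("authentication.basic.password", "password")
  .option("labels", ":Person")                       // placeholder node label
  .option("node.keys", "id")                         // column(s) to merge on
  .option("batch.size", "5000")                      // rows per transaction, tune as needed
  .save()

Writing nodes first and relationships in a separate pass (rather than merging everything edge-by-edge) is the essence of the normalized loading approach, and is where most of the throughput gain tends to come from.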