

Neo4j Spark connector slow ingestion

  • Databricks notebook, 500 GB RAM machine, Spark 3
  • Neo4j Connector for Apache Spark 4.1.2
  • Neo4j on an 8 vCPU, 32 GiB memory VM
  • Data as Delta files (Parquet)

I tried to ingest my edges and nodes from Delta files into the Neo4j database using the Spark connector, but it gets slower and slower: the first 4 million edges took 1 hour, and the rate keeps dropping.

I ingested 130 million nodes in 6 hours, while I see other people ingest billions of nodes and edges in something like 1-2 hours. What did I do wrong here?
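
For reference, a plain node load with the 4.x connector looks roughly like the sketch below; the Delta path, label, key column, and connection details are placeholders rather than the actual job:

    # Sketch of a node write with the Neo4j Connector for Apache Spark 4.x.
    # Label, key column, Delta path, URL and credentials are placeholders.
    nodes_df = spark.read.format("delta").load("/mnt/graph/nodes")

    (nodes_df.write
        .format("org.neo4j.spark.DataSource")
        .mode("Overwrite")
        .option("url", "neo4j://<host>:7687")
        .option("authentication.basic.username", "neo4j")
        .option("authentication.basic.password", "<password>")
        .option("labels", ":Person")
        .option("node.keys", "id")                                # merge key
        .option("schema.optimization.type", "NODE_CONSTRAINTS")
        .option("batch.size", 10000)                              # rows per transaction
        .save())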

5 REPLIES

I think you'll need a bit more memory on the Neo4j machine.
Did you create the constraints so that the database can look up nodes efficiently during the ingest when creating the relationships?
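
For example, a uniqueness constraint on the merge key, created up front so that MERGE can use an index lookup instead of a label scan. A sketch with the official Python driver; the label, property, and connection details are placeholders, and the syntax shown is for Neo4j 4.4+ (older 4.x releases use CREATE CONSTRAINT ON (p:Person) ASSERT p.id IS UNIQUE):

    from neo4j import GraphDatabase

    # Placeholder URI, credentials, label and key property.
    driver = GraphDatabase.driver("neo4j://<host>:7687", auth=("neo4j", "<password>"))

    with driver.session() as session:
        # Lets the database find existing nodes via the index during the ingest
        # instead of scanning every node with the label.
        session.run(
            "CREATE CONSTRAINT person_id IF NOT EXISTS "
            "FOR (p:Person) REQUIRE p.id IS UNIQUE"
        )

    driver.close()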

What do you mean by creating constraints? I used

"schema.optimization.type": "NODE_CONSTRAINTS"

 

There are a ton of reasons that can contribute to slowing down the process:

  • Neo4j hardware issues:
    • Is the disk fast enough?
    • Is there enough RAM?
  • If you reuse the same Spark DataFrame over time and don't cache it, Spark has to recompute it each time; the ingestion then looks slow, but only because the same data is being recomputed over and over (see the sketch below)
  • The batch size is too small or too big
  • The number of DataFrame partitions is too low or too high
  • If you're using your own Cypher query to ingest the data, is it optimized?

The first thing to check is the query.log, in order to understand which queries are slow.
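
On the caching point, a sketch of what is meant; the path, partition count, and numbers are placeholders, not recommendations:

    # Read the edges once, pick an explicit partition count, and cache, so the
    # Delta scan is not recomputed for every action during the Neo4j write.
    edges_df = spark.read.format("delta").load("/mnt/graph/edges")   # placeholder path

    edges_df = edges_df.repartition(32).cache()
    edges_df.count()   # materialize the cache before the write starts

The batch.size write option (5000 by default, if I remember the docs correctly) is then the knob for how many rows go into each transaction, and it has to fit the heap of your 32 GiB instance.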

@santand84 called it out. We have a graph of ~32M nodes / 1.7B edges that we load from Apache Spark. We've had to work our way through quite a number of performance issues on the loading side, mostly by tuning the batch size and the partitioning/executor count.

The bigger issue we run into with large loads, where there is significant overlap in relationship/node coverage, is lock contention on nodes from parallel/concurrent transactions.
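
One mitigation to sketch for the lock contention is letting the connector retry batches that fail on deadlocks. The transaction.retries option name is recalled from the 4.x connector docs, so verify it before relying on it; the relationship type, labels, and key columns are placeholders:

    # Relationship write that retries batches which fail, e.g. on deadlocks.
    # Authentication options omitted; see the node sketch above.
    (edges_df.write
        .format("org.neo4j.spark.DataSource")
        .mode("Append")
        .option("url", "neo4j://<host>:7687")
        .option("relationship", "KNOWS")                      # placeholder type
        .option("relationship.save.strategy", "keys")
        .option("relationship.source.labels", ":Person")
        .option("relationship.source.node.keys", "src:id")    # DataFrame column : node property
        .option("relationship.target.labels", ":Person")
        .option("relationship.target.node.keys", "dst:id")
        .option("batch.size", 5000)
        .option("transaction.retries", 5)                     # recalled option name -- verify
        .save())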

@brianmartin the best practice for batch importing the data with Spark is (a sketch follows the list):

  • insert the nodes in parallel, partitioning the data by the node key column (otherwise you will hit locking issues and cannot leverage the parallelism); keep in mind that a high number of partitions can overwhelm the database, so it's not just about throwing enormous parallelism at the ingestion
  • insert all the relationships sequentially, as there is no way to truly avoid deadlocks at the moment
  • as you said, the batch size is also important, and it depends on the amount of RAM that your Neo4j instance has
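
A sketch of that pattern, reusing the nodes_df and edges_df DataFrames from the sketches above (key column, labels, relationship type, and partition count are placeholders; authentication options omitted):

    # Nodes: repartition by the node key so no two tasks merge the same node,
    # then write in parallel; keep the partition count moderate for an 8 vCPU box.
    (nodes_df.repartition(16, "id").write
        .format("org.neo4j.spark.DataSource")
        .mode("Overwrite")
        .option("url", "neo4j://<host>:7687")
        .option("labels", ":Person")
        .option("node.keys", "id")
        .option("batch.size", 5000)
        .save())

    # Relationships: coalesce to a single partition, i.e. a sequential write,
    # so concurrent transactions cannot deadlock on the same nodes.
    (edges_df.coalesce(1).write
        .format("org.neo4j.spark.DataSource")
        .mode("Append")
        .option("url", "neo4j://<host>:7687")
        .option("relationship", "KNOWS")
        .option("relationship.save.strategy", "keys")
        .option("relationship.source.labels", ":Person")
        .option("relationship.source.node.keys", "src:id")
        .option("relationship.target.labels", ":Person")
        .option("relationship.target.node.keys", "dst:id")
        .option("batch.size", 5000)
        .save())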