Neo4j

jeg · ‎05-11-2021

Looking for some advice on speeding up database creation. It’s being created from thousands of little JSON statements. Right now I have each node and arc being created as a separate transaction, and unsurprisingly, it’s about as fast as pouring molasses. I’m using C#/MONO and the Bolt driver.

Any suggestions on how to speed up this process? I did check out PERIODIC COMMIT, but this seems to be limited to the IMPORT statement.

tard_gabriel · ‎05-11-2021

I don't know the language you are using, it might effect a lot the performance depending on how you are using the driver.

But whatever the language you use, you should always create constraints in your graph before importing your data. Do you have any of theses?

jeg · ‎05-12-2021

Haven't done any constraints because the input data is high quality. All integrity is enforced on the source side, so it seemed to me that constraints would be redundant.

andrew_bowman · ‎05-12-2021

Can you provide an example snippet of your code? As far as I know there isn't an IMPORT keyword in Cypher, though there is LOAD CSV. If you can, take your import query and execute an EXPLAIN of it and add the expanded query plan here.

jeg · ‎05-12-2021

Oops, yes, I did mean LOAD CSV.
It isn't the query, it's that there are thousands of transactions. I'm looking for ways to process those more efficiently.

andrew_bowman · ‎05-13-2021

As Gabriel mentioned, you will need these.

Specifically this is related to MATCH and MERGE operations in the query. Without indexes or unique constraints, it's possible that each of these (per operation and per row, not just once!) will perform an entire label scan to find the node. So if you had a 1000 line file consisting of two MERGEs, then that's 2000 label scans happening for the query, which will be horrible performance. Adding indexes changes those to index lookups instead, which is where you'll speed things up significantly.

There's also a question of Eager operators in the plan, as that can cause heap issues if you're not careful. So getting details about the query, and supporting your import with indexes, is essential for speed.

tard_gabriel · ‎05-13-2021

Unique constraints or indexes are almost mandatory if you are using merge during your data importation process. It's not about the quality of your data but about speed.

[Importing Data with Neo4j 4.x - Neo4j Graph Database Platform](https://Importing data)

Do you have an example of your importation query?

jeg · ‎05-16-2021

Sorry for delay, been sick.

The database gets created with thousands of create node statements, followed by thousands of arc creations:
MATCH (n1…) MATCH (n2…) MERGE (n1)-[:whatever]-(n2)

So if I understand correctly then with constraints those MATCH statements are going to execute much more quickly.

I hadn’t thought about doing INDEX anywhere because I thought that the scale of the database was small enough that it wouldn’t make a difference, but from the numbers you’re stating it sounds like I’m already way past that point.

I’ll try this as soon as I can get back to work.

tard_gabriel · ‎05-17-2021

You can even try to create unique constraint instead of simple index, it's slightly better.

Careful, even with unique constraints, double MATCH clauses are often deadly.
We could help you better if you can share a part of your importation query.

jeg · ‎05-18-2021

There's no import query, the database is created by a C# application with huge lists of data. First there's a loop
CREATE (:Thought) { (some attributes including string ID) }
for a few thousand elements, followed by a loop of a few thousand
MATCH (thought1 by ID) MATCH (thought2 by ID) MERGE (thought1)-[:whatever]-(thought2)
where whatever comes from the source data. The source application is trusted to not duplicate an ID or a link, and it's pretty mature software, so it's well trusted.

andrew_bowman · ‎05-18-2021

If by "loop" you mean a separate query per loop, and one update per query, then that's not optimized. You should try to batch many creations (10k-100k, depending on your heap) at a time per each query.

jeg · ‎05-20-2021

Yes, that's why I was asking for advice, and this page you referenced has a ton of good stuff. Thanks! I'll be going through it when I'm back at work next week.

Neo4j

Speed up database creation?