cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Improving very slow MERGE on relationship

tinqnit
Node Link

I have a MERGE query (on relationship) of the below form, and about 2000 queries are invoked around the same time, its taking ~5 minutes to complete all of them.

How can I refactor the query or application logic so that this can be drastically faster.

MERGE (a: category { name: {name} }) 
MERGE (b: item { hash: {hash}, timestamp: {timestamp} } )
MERGE (b)-[r:PRESENT_IN{count: {count}}]->(a)

Conditions:

  • unique constraint exists on category.name and item.hash
  • the PRESENT_IN relationships created should be unique , no duplicates
  • either protocol or item already exists
  • both protocol and item already exists
  • neither exists
11 REPLIES 11

Hi @tinqnit

Can You add timestamp index to this DB to make it faster?

CREATE INDEX item_timestamp FOR (n:item) ON (n.timestamp)

I already have a unique constraint placed on item.hash, assuming this implicitly behaves like an index and suffices?

Can you provide the code where this request is used?

You might use the EXPLAIN and PROFILE clause before your MERGE clauses to see what is happening there?

I wonder if we have a cartesian product here between each MERGE, I have to double check that.
I don't thing MERGE does cartesian product, must be in your code.

Not sure what you mean by Cartesian product in my code.
I dont think it matters what the data actually is, we have ~2000 relationships that needs to be inserted by processing an input file at once, and I feel ~5 minutes is too long.

Five minutes is huge for a tiny 2000 relations indeed

Yes, and the query structure is very simple, which is confusing me.
The actual query we are running in our code has a lot more properties in the nodes and relationships. But the basic essence is the same.
Any other clues on this please?

The way the session and transactions are coded in you language have a huge influence on the speed of your data creation process.

If, as an example, you create a session for each query in a loop the time will increase drastically.
( Don't do that, I did, it was funny )

Your MERGE must include the strict necessary properties to check if it's already exist only.
The other properties on creation must be set with the ON CREATE subclause.

Matching or merging with the clause MERGE with too much properties inside {} can slow down the process significantly too. Unless using a really big composite index. Do not hesitate to use the EXPLAIN or PROFILE clause. It's the neo4j magic debugger.

Hi,
Why are you saying "If, as an example, you create a session for each query in a loop the time will increase drastically." ? (see this link : Should the session be closed each query? · Issue #384 · neo4j/neo4j-javascript-driver · GitHub where neo4j is saying that is it not a bad practice.)
That's exactly what i am doing lol. Got many microservices pushing data, and everytime a microservices found new data, he's opening new session, making his transaction thought it and then closing it.
You think it's a better idea to create one session peer microservices, and then make all transaction thought the same session, even if the session is alive for many days ?
Thanks,
Gautier

Excuse me gautier

It's not what I mean, I didn't want to put you in the wrong way.
Yes indeed the session are extrêmely lightweight and disposable, but I only mean that often your code or the way you code your transactions / query / sessions in the language for your choisce might have a huge impact on the performance.

It's not necessary your query, but in this case yes your query and some catesian products in it which can be deadly for huge dataset.

tinqnit
Node Link

I am reusing the session. The same query is invoked concurrently in batches of 50 at a time, until all the 2000 requests are completed.

I did try the PROFILE on my query, and it seems to show (786,579 db hits) and (455,803 db hits) for category and items. I believe the existing unique constraint should be sufficient.

However, I will make changes to include the ON CREATE subclause you pointed you! Not sure if it will make a big change in execution time though.

tinqnit
Node Link

This is pretty lame, but just realised on this test machine the constraints were not setup correctly.

Thanks for the help!

Nodes 2022
Nodes
NODES 2022, Neo4j Online Education Summit

All the sessions of the conference are now available online