

Very slow merges neo4j

alex3
Node Clone

Hi, I have been experiencing extremely slow relationship merges in Neo4j. When rerunning a merge of data that has already been inserted, the query runs 10x faster (as there are no writes to perform), but when none of the nodes / relationships exist yet, the query runs very slowly.

I have the following code:

UNWIND $author.organizations AS orgs
# fulltext lookup to find organization (variable orgResult)
MERGE (p: Person {{ name: $author.name }})
MERGE (organization: Organization {{ name: orgResult }})
MERGE (p)-[:AFFILIATED_WITH]-(organization)

I have verified with PROFILE that index lookups on name are being used for both Persons and Organizations. I have also tried using a directional relationship merge. I have tried batch sizes from 2,000 to 20,000 relationships and the write throughput is always the same (13 relationships / second). With tens of millions of relationships this would take well over a day to insert.


Query performance can often be greatly improved with proper indexing, so please review your indexes and make sure they are being used appropriately.
Many thanks
Mr Sameer Sudhir G

@alex3

given your initial comment of

I have verified with PROFILE that index lookups on name are being used for both Persons and Organizations

I would think these indexes should be sufficient, i.e. I can't think of how/why adding more indexes would help.

What version of Neo4j?
Also, your Cypher doesn't appear to run as-is; for example

MERGE (p: Person {{ name: $author.name }})

even if we replace, for example, $author.name with 'Alex' and thus run

 MERGE (p: Person {{ name: 'Alex' }});

we get the error

Invalid input '{': expected an identifier (line 1, column 19 (offset: 18))
"MERGE (p: Person {{ name: 'Alex' }});"

So, back to the initial post: is the Cypher included there correct?

Can you run an EXPLAIN/PROFILE on said query? See Query tuning - Neo4j Cypher Manual.

Dana, I apologize for not clarifying in my post. The notation in my Cypher comes from the Python driver code (the doubled braces are escaped braces in a Python format string). In Python, the data would look something like:

[ { "author": { "organizations": ["Org 1", "Org 2"], "name": "Author Name" } }, ...]

The actual query is rather complicated, so I did my best to cut it down to the meat. In reality there are two conceptual loops (roughly sketched below):

  1. Loop over all authors
  2. For each author, loop over their associated organizations and create an affiliation relationship between them
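
The overall shape is something like this (a simplified sketch, not the real query; I'm assuming here that each batch is passed as a $rows parameter of author maps, and the fulltext lookup is elided):

UNWIND $rows AS row
WITH row.author AS author
MERGE (p:Person {name: author.name})
WITH p, author
UNWIND author.organizations AS org
// the fulltext lookup would go here; this sketch just uses the raw name
MERGE (organization:Organization {name: org})
MERGE (p)-[:AFFILIATED_WITH]-(organization)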

Hi @alex3 ,

Can you define # fulltext lookup to find organization (variable orgResult)?

You can also do the merge on p once before the UNWIND.
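
For example, something along these lines (just a rough sketch in plain Cypher, with your fulltext lookup left as a placeholder comment and orgResult coming from it):

MERGE (p:Person {name: $author.name})
WITH p
UNWIND $author.organizations AS org
// fulltext lookup to find organization (yielding orgResult)
MERGE (organization:Organization {name: orgResult})
MERGE (p)-[:AFFILIATED_WITH]-(organization)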

Bennu


Bennu, sorry for being vague. Basically I have a fulltext index on :Organization nodes and invoke db.index.fulltext.queryNodes, plus some extra logic to find the first two potential matches based on the organization name for an author.

Since the query runs much more quickly AFTER the data has been inserted once, my assumption is that this part of the query is not the bottleneck.

@alex3 and which Neo4j version?

We might get some benefit from the CALL { } IN TRANSACTIONS Cypher syntax as described at CALL {} (subquery) - Neo4j Cypher Manual, but this is only available from 4.4 onward, so it only applies if you are running 4.4.
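
For illustration, the shape would be something like this (just a sketch; the $rows parameter name and the batch size are placeholders, and it has to run in an implicit/auto-commit transaction):

UNWIND $rows AS row
CALL {
    WITH row
    MERGE (p:Person {name: row.author.name})
    // ... fulltext lookup and the relationship MERGEs for each organization ...
} IN TRANSACTIONS OF 10000 ROWS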

Dana, I am using 4.4 so will give this a shot, thanks!

alex3
Node Clone

After revisiting the fulltext lookup, it actually seems that it may be interacting strangely with the rest of the query. The goal was to return the top two fulltext matches if there was a tie, and the top match otherwise; if there were no matches at all, return null so that I could coalesce later in the query. It looks like this:

CALL {
    WITH org
    CALL db.index.fulltext.queryNodes("OrganizationNameFulltextIndex", org)
    YIELD node, score
    WITH collect(node) AS nodes, collect(score) AS scores
    UNWIND (
        CASE
        WHEN size(nodes) >= 2 AND scores[0] = scores[1] THEN nodes[0..2]
        ELSE [head(nodes)]
        END
    ) AS orgResult
    RETURN orgResult
}

I have no idea if this will be impactful; I assume it only matters if a large number of results is returned from your queryNodes call. Anyway, you could insert 'WITH node, score LIMIT 2' between your 'CALL db...' and 'WITH collect...' clauses. This would avoid collecting a large list of irrelevant results, and you could then replace 'nodes[0..2]' with 'nodes', avoiding the list slicing too.
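
That is, roughly (just a sketch of the idea, assuming the subquery imports the search term as org, as in your snippet):

CALL {
    WITH org
    CALL db.index.fulltext.queryNodes("OrganizationNameFulltextIndex", org)
    YIELD node, score
    WITH node, score LIMIT 2
    WITH collect(node) AS nodes, collect(score) AS scores
    UNWIND (
        CASE
        WHEN size(nodes) >= 2 AND scores[0] = scores[1] THEN nodes
        ELSE [head(nodes)]
        END
    ) AS orgResult
    RETURN orgResult
}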

That is a good thought, and indeed I had tried that, as well as introducing a WITH * and LIMIT 2 before the collect, but nothing changed.

Indeed the query also runs just as slowly on an empty DB (no Organization nodes yet) as it does with a full one.

The whole point of the subquery really is so that I can merge using a fulltext lookup instead of a traditional merge with the BTREE lookup. Without the subquery, if there were no fulltext index matches, the zero rows would short-circuit the query due to the cartesian product with the existing rows.

Sorry if my comments are naive, but maybe just throwing out comments will stir up some ideas on your end.

You stated the performance is significantly worse with no data in the db. The big difference in that case would be that the data needs to be created. I doubt the issue is with creating the person and organization nodes. How about the updating of the indexes? Do you have lots of indexes on these nodes? How about the time to update the full-text indexes? Could you test by dropping the full-text index and having the full-text subquery just return a simple test node, so the query runs through?
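
For example, something trivial like this in place of the subquery body (just a placeholder value; adjust it to whatever shape orgResult needs to have downstream):

CALL {
    // stand-in for the fulltext lookup, just so the rest of the query still runs
    RETURN "Test Organization" AS orgResult
}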

Thanks for the response Gary.

Yes, that's what really confuses me, and it's initially why I disregarded the index lookup as a source of inefficiency. I figured that if the query ran faster the second time (and the second time it would have to do the fulltext index lookups on more data) then that couldn't be the cause. Seems like a bad assumption though.

There is only one fulltext index on Organization nodes. There is a BTREE index for another field on Organizations as well.

As for your question about dropping the index lookup, I tested replacing the contents of the subquery with a simple return of a test node and the query was indeed fast.

Is there some common approach people have (perhaps an apoc function) that approximates what I am trying to do? Basically just merging / upserting, but allowing the lookup to be done with a fulltext index.