Merge and Nested Unwind: How to write an efficient query wrt. indexing

I have yet to understand how to profile Neo4j correctly, but I am attempting to apply best practices in writing queries. I am running into some performance problems.

In my use case, I need to add ties between nodes that, for now, we can identify simply by a "token_id" property. The network is quite dense, so the queries, grouped by originating node, each contain about 100 ties to other nodes.

In my data model, ties have weights and a timestamp. Given this, I decided on a layout where the ties are represented as nodes (so that I can index timestamps on the ties), connected by directed, anonymous relationships. That is:

(a:token {token_id: x})-[:onto]->(r:edge {weight: ..., timestamp: ...})-[:onto]->(b:token)

I set a uniqueness constraint on token_id and an index on timestamp.
Performance for retrieval is then quite good, but I need to do a lot of merging, and that is terribly slow.
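For reference, the constraint and index described above could be created roughly as follows (syntax for recent Neo4j versions; older versions use CREATE CONSTRAINT ON ... ASSERT and CREATE INDEX ON :edge(timestamp) instead, and the constraint/index names here are arbitrary). Note also that the model sketch above calls the edge property "timestamp", while the query below stores it as "time"; adjust accordingly.

CREATE CONSTRAINT token_id_unique FOR (t:token) REQUIRE t.token_id IS UNIQUE;
CREATE INDEX edge_timestamp FOR (e:edge) ON (e.timestamp);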

After some research, I found out that I should do a double UNWIND. Suppose my data is in a list called "sets". Each element has two fields: "ego" for the originating token, and "ties" for all ties that originate there. My program supplies this as a JSON parameter.

Here is an example (to run it, you first need to create token nodes with the corresponding token_ids).
(p1 and p2 are unindexed parameters)

:param sets=>[{ego: 12, ties: [{alter: 11, time: 20000101, weight: 0.5, p1: 15, p2: 0},{alter: 13, time: 20000101, weight: 0.5, p1: 15, p2: 0}]},{ego: 12, ties: [{alter: 14, time: 20000101, weight: 0.5, p1: 15, p2: 0},{alter: 11, time: 20000101, weight: 0.5, p1: 15, p2: 0}]}]

The corresponding query is then:

UNWIND $sets AS set
MATCH (a:token {token_id: set.ego})
WITH a, set
UNWIND set.ties AS tie
MATCH (b:token {token_id: tie.alter})
MERGE (b)<-[:onto]-(r:edge {weight: tie.weight, time: tie.time, p1: tie.p1, p2: tie.p2})<-[:onto]-(a)

This could be faster. If I run PROFILE, I get a curiously huge plan which I don't understand.

My main intent was to let Neo4j match an indexed node first, THEN unroll the other parameters and add ties. That way, it would go row by row, where each row is an element of "sets" and corresponds to an originating node.

This plan suggests Neo4j does a lot of global index lookups besides this - but there is a good chance I misunderstand what is going on.

I can vary the size of "sets" per query programmatically. Currently I pass about 100-200 such sets per query over HTTP from Python, where each set has a "ties" list of about 100 ties. It is always the same query, only with different parameters, so I was hoping to get some more performance out of Neo4j.

Is this the best I can do? In that case, I would try to squeeze out more by posting these queries concurrently with my analysis (I already found out that parallel writes at these volumes are a no-no). Or can I improve the query?

In my situation, what do you think would be an efficient query?

1 REPLY

I think your query and plan look good.

As for the curiously large plan: MERGE operations tend to be a little expensive, given that they need to check whether the pattern already exists and, if not, create it.

If the pattern already exists, then matching on the pattern is trivial, and doesn't require more complex operations.

If the pattern doesn't exist, then there are steps that need to be taken to ensure correct behavior.

  1. Locking. The existing nodes need to be locked to guarantee mutual exclusion and prevent race conditions.

  2. Double-check of the pattern. In between the pattern existence check and the locking of the nodes, there's a potential race condition where a separate transaction may have created the pattern. To ensure correctness, we need to double-check if the pattern still doesn't exist and that creation of the pattern is still necessary. This is why you see what look like duplicate operations but with locking operations mixed in.

  3. The MergeCreate operations are present in the branch at the end for when the pattern actually does need to be created.
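To make those two branches concrete, here is a small, generic illustration with made-up values (not a rewrite of the query above): an ON CREATE SET clause only runs when the MergeCreate branch fires, and an ON MATCH SET clause only runs when the whole pattern was found.

MATCH (a:token {token_id: 12}), (b:token {token_id: 11})
MERGE (a)-[:onto]->(r:edge {time: 20000101})-[:onto]->(b)
ON CREATE SET r.weight = 0.5, r.p1 = 15, r.p2 = 0  // runs only in the MergeCreate branch
ON MATCH SET r.weight = 0.5                        // runs only when the whole pattern already existed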