cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Comparing Mutual Connections Performance Issue

I am having performance trouble working with a simple data model but with complex queries. I have person nodes (p) and item nodes (i) and a person can have an unlimited number of item nodes, but the most we really see is 12 per person. The first part of the data model is simply (p)-[conn:HAS]->(i) relationships, where we store data about the person's relationship to the item on the conn relationship. Item nodes can have as many has 1 million people pointing to it, but usually in the range of 100 to 10,000.

The problematic queries in question come up I want to find information from one person (x) and the "most similar" people to that person. I find this using a longer version of the following query (There are a few more FOREACH loops in my current version to get more features on the SIMILARITY relationship):

MATCH (x:Person { id: "PERSON_ID" })-[xi:HAS]->(i:Item)<-[ix:HAS]-(m:Person)
MERGE (x)<-[conn:SIMILARITY]->(m)
  ON CREATE SET
    conn += { sameItem: 0, sameTitle: 0 }
FOREACH(_ IN CASE WHEN i:Consumable THEN [true] ELSE [] END |
  SET conn += { sameItem: 1, itemElement: i.name, itemId: i.id, xTitle: xi.title, mTitle: ix.title }
)
FOREACH(_ IN CASE WHEN i:Consumable AND ix.title IS NOT NULL AND xi.title IS NOT NULL AND ix.title = xi.title THEN [true] ELSE [] END |
  SET conn.sameTitle = 1
);

When this query is executed from a person with ~1000 SIMILARITY relationships it takes under a second to execute, but this scales quadratically as ~10000 SIMILARITY relationships take around thirty seconds and ~180,000 SIMILARITY relationships took 3 hours.

Is there anything obvious I am missing here to optimize the performance of the longer-running queries?

2 REPLIES 2

I don't quite get the point of the query.

You're doing a MERGE of a single relationship between these two persons, but then you keep overwriting the properties of that relationship per item (for :Consumable items), so you're never actually incrementing or adding to information on the node, simply replacing property values.

This is assuming that there are multiple items in common between the same two persons.

You may want to reevaluate the query.

Also we'd recommend aggregating data such that you have one row per pair of persons with collected info, then process that and determine what to write to the relationship, instead of writing over and over.

Also you may want to batch your updates with apoc.periodic.iterate(), that should help with the execution time.

Thanks @andrew.bowman for taking a look at this. I will give the apoc method a try. I didn't add the second query in which we do some scoring based on the relationship properties at a 0 or 1 and weight properties for feature, to obtain a score:

MATCH (w:Weight { id: "WEIGHT_ID" }), (p:Person { id: "PERSON_ID" })<-[conn:SIMILARITY]->(p2:Person)
WITH p, conn, p2,
w.sameItem * conn.sameItem + w.sameTitle * conn.sameTitle as score
RETURN p, conn, p2, score
ORDER BY score DESC
LIMIT 100;

This query has fast execution time, and allows different weights to be interchanged. This is why the SIMILARITY connections are organized the way they are, and also means that the slower executing first query only needs to be run once, but 3 hours for the first query is still too slow. Does this make sense as to what I'm trying to accomplish?