cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Recommendation Query on Large Data Set

I have built a large DB of :customer, :product, :use and am trying to build a real-time product recommendation tool but the query is taking several mins to run. Any help optimizing this query would be appreciated.

//Recommendation Query
MATCH (c:customer {customer_id:'0c4c518e5d1eaf3fc39f93463c2406ad8b659d6c22c9107179e3992f647b12aa'})
MATCH (c)-[:PURCHASE]->(p:product)-[:HAS]->(u:use)
MATCH (p)<-[:PURCHASE]-(oc:customer)
MATCH (oc)-[:PURCHASE]->(np:product)
WHERE NOT(p.product_code = np.product_code) AND (np)-[:HAS]->(u)
WITH COUNT(*) AS score, np, u
ORDER BY score DESC
WITH u.name AS use, COLLECT(np.article_id) as products
RETURN use, products[0..6]
ORDER BY use

17 REPLIES 17

I really don't know if this is any more efficient. I am just throwing it out there. I also wasn't able to test it.

I did have one observation. You are ordering by score, then collecting the products by use. I believe this eliminates the usefulness of the score ordering, because the score represent the number of duplicate rows there are for the combination of use and product. The data would get rearranged when you aggregate the products by use, and the score is no longer associated with a use/product combination.

EXPLAIN
MATCH (c:customer {customer_id:'0c4c518e5d1eaf3fc39f93463c2406ad8b659d6c22c9107179e3992f647b12aa'})
MATCH (c)-[:PURCHASE]->(p:product)-[:HAS]->(u:use)
MATCH (oc:customer)-[:PURCHASE]->(p)
WHERE c <> oc
MATCH (oc)-[:PURCHASE]->(np:product)-[:HAS]->(u)
WHERE p <> np
RETURN u.name AS use, COLLECT(np.article_id)[0..6] as products
ORDER BY use

Thank you. That is interesting about the score. I was hoping to get the scores within the use... that way getting the relative highest scored product per-use. Your query is faster but I seem to be getting different products. IF I was to add the score back any idea where to put it? I want to make sure I get the most popular items per-use.

Original Query:
Started streaming 25 records after 3 ms and completed after 68936 ms.

Your Query:
Started streaming 25 records after 40 ms and completed after 55329 ms.

I did a little prototyping to understand neo4j's behavior when sorting, followed by collecting. It turns out that the elements of the lists resulting from collecting into groups retains the order from the larger list. As such, your approach of sorting the entire list by score, then collecting into lists by use should work. You will get lists for each use that are ranked by score (highest to lowest).

I ended up using the same logic you did for the sorting, collecting, and truncating, so my query is not that different. Give it a try to see if there is any benefit. The earlier benefit may been the result of one less 'order by' clause.

You may get slightly different results, because I eliminated the products from the result that the customer already purchased.

EXPLAIN
MATCH (c:customer {customer_id:'0c4c518e5d1eaf3fc39f93463c2406ad8b659d6c22c9107179e3992f647b12aa'})
MATCH (c)-[:PURCHASE]->(p:product)-[:HAS]->(u:use)
MATCH (oc:customer)-[:PURCHASE]->(p)
WHERE c <> oc
MATCH (oc)-[:PURCHASE]->(np:product)-[:HAS]->(u)
WHERE p <> np
WITH u.name as use, np.article_id as product, count(*) as score
ORDER BY score DESC
RETURN use, collect(product)[0..6] as products
ORDER BY use

Thank you for checking on the Score, that helps.
Your new query is: Started streaming 25 records after 48 ms and completed after 60276 ms.
so a bit faster than my original but the score must increase it.

How many rows are returned from your data? If it is a lot, maybe limiting the result to the top X records will improve performance by reducing the collect and reduce operations following the ‘ORDER BY score’ operation. Plus, it would give back the most relevant recommendations back to your customer.

Also, maybe just presenting them with the list of the top X products with their use and score is good enough for the user. This would eliminate the need for the collect, reduce, and sort by use operations entirely.

I believe "use" is key to a "good" recommendation as "use" is a proxy for a customer's need. To jump to other "needs" may be recommending things a customer will not use.

Hello @aaron.damiano ,

Can I join the party?

As @glilienfield suggested, can you share a PROFILE execution of the queries so we check db hits as well?

Are your :Purchase and :Has relationships just used between customer->Product and Product-> Use?, if that's the case we may save some filtering as well by removing the explicit Label.

Bennu

Oh, y’all wanted a twist, ey?

I Tried PROFILE at the top and didn't return anything "new" what was I supposed to see?
The :Purchase Relaitonship is ONLY between customer->product BUT I have :HAS on other nodes.
Do you think I could INDEX the relationship? Would that help?

Hi all!

Profile should give you additional info with an Orange color. It's going to show to actual number of hit and records retrieved during a particular phase.

This is kinda important because it will give us an idea about the cardinality of your label stores. i.e. If I get the number of products bought by client how got the same products as your customer, is the cardinality bigger/equal/less than the products with same usage from our sample user. Usually the planner will take this into consideration in order to pivot the best way possible but it's always important to check when you aim for the best performance possible.

About relationships, I don't think the index on the relationship will give us any improvement right now. Having specific relationships for particular label interactions increase the complexity of the model, but it help us by removing the :LABEL in some queries so we avoid one filter. Trying :PURCHASE without using :Product should give us an improvement. A lot? Not sure but for sure better than the actual result.

If you share the profile it may help. If you have any shareable connection to the DB I'll be glad to help you directly on the data.

Bennu!

Oh, y’all wanted a twist, ey?

Thank you all for your help!

Removing the unnecessary Labels increased the speed by 2000 ms to 15241 ms.
THEN adding an age range WHERE clause reduced it by 9000 ms to 6851 ms!!!

I would like to get the profile to output something but nothing changes....?
I am starting my query with:
PROFILE
MATCH (c:customer.....
IS there something else I need to do?

Sorry found it... I get it a different way and was confused..

Also I found something about apoc mapParallel2 could that help?
CALL apoc.cypher.mapParallel2(...

Hi @bennu.neo; it would be great learning some optimization principles from an insider.

I just noticed the first post contains a profile. It looks like the cohort size is approximately 5.6k customers. The result size blows out when finding the cohort population's purchased products, increasing the record set to approximately 115k. Each cohort customer is averaging around 21 purchases. Are there additional customer characteristics you could leverage to reduce the size of the cohort to find a more targeted group based on additional similarities, instead of just purchasing the same products. This would reduce the number of cohort products to process.

How about a different metric for the score? The current score ranks by product and use, requiring a two-level grouping. Since the metric is to recommend a product, how about just counting the number of product instead? Something like the following:

MATCH (c:customer {customer_id:'0c4c518e5d1eaf3fc39f93463c2406ad8b659d6c22c9107179e3992f647b12aa'})
MATCH (c)-[:PURCHASE]->(p:product)-[:HAS]->(u:use)
MATCH (oc:customer)-[:PURCHASE]->(p)
WHERE c <> oc
MATCH (oc)-[:PURCHASE]->(np:product)-[:HAS]->(u)
WHERE p <> np
RETURN np.article_id as product, count(*) as score, collect(u.name) as uses
ORDER BY score DESC
LIMIT 10

Unfortunately, I don't have any "valuable" data to sort customers by (not even sex) I have age and could try that to see what I get.

I like what you are saying about the score as I may not need to do that as I am grouping by use at the end. I will try.

I agree. The use of “use” makes a lot of sense. If you have age, maybe consider filtering the customers by age, as a person may have interests more aligned to their contemporaries. You could filter customers to those that are within the range of plus/minus x years of your customer.

Here is my NEW QUERY

MATCH (c:customer {customer_id:'0c4c518e5d1eaf3fc39f93463c2406ad8b659d6c22c9107179e3992f647b12aa'})
MATCH (c)-[:PURCHASE]->(p)-[:HAS]->(u:use)
MATCH (p)-[:HAS]->(h:color)-[:IS_A]->(ph)
MATCH (oc)-[:PURCHASE]->(p)
WHERE c <> oc AND oc.age > (c.age-10) AND oc.age < (c.age+10)
MATCH (oc)-[:PURCHASE]->(np)-[:HAS]->(u)
WITH p, np, ph, u, COLLECT(DISTINCT p.gra_id) as styles
WHERE p.product_code <> np.product_code AND np.gra_id in styles AND (np)-[:HAS]->(:color)-[:IS_A]->(ph)
WITH u.name as use, np.article_id as product, count(*) as score
ORDER BY score DESC
RETURN use, collect(product)[0..6] as products
ORDER BY use

Hi @aaron.damiano !

What's exactly color here? And ph? Are those values that can be transformed into labels? I feel like those values can be handle differently.

There're different way to approach this according to the cardinalities of the elements. Can you creare an specific HAS_COLOR, HAS_USE relationship?
Also an index on :Product gra_id and :Customer age may be useful (depending our approach.
In that case, try something like next query and share the profile.

MATCH (c:customer {customer_id:'0c4c518e5d1eaf3fc39f93463c2406ad8b659d6c22c9107179e3992f647b12aa'})-[:PURCHASE]->(p)
WITH p, COLLECT(DISTINCT p.gra_id) as styles
MATCH (p)-[:HAS_USE]->(u)<-[:HAS_USE]-(np)
WHERE np.gra_id in styles 
WITH u, p, np,c
WHERE (p)-[:HAS]->(:color)-[:IS_A]->()<-[:IS_A]-(:color)-(np)
MATCH (np)<-[:PURCHASE]-(oc)-[:PURCHASE]->(p)
WHERE c <> oc AND oc.age > (c.age-10) AND oc.age < (c.age+10)
WITH u.name as use, np.article_id as product, count(*) as score
ORDER BY score DESC
RETURN use, collect(product)[0..6] as products
ORDER BY use

Bennu

Oh, y’all wanted a twist, ey?

Hello Bennu,
Pardon the late reply.
Color is the hue of the product, ph = parent hue or the parent color. I want my recommendations to be in the historic color pallet of the user.

I like your idea on unique relationships but will have to rebuild my graph (I will try). I altered my original query to MATCH more patterns upfront and made some gains (see below). I tried yours (not including new relations) and got a similar result to my new query.

Thanks for the help.

//Recommendation Query 
MATCH (c:customer {customer_id:'0c4c518e5d1eaf3fc39f93463c2406ad8b659d6c22c9107179e3992f647b12aa'})-[:PURCHASE]->(p)<-[:PURCHASE]-(oc)
MATCH (u:use)<-[:HAS]-(p)-[:HAS]->(h:color)-[:IS_A]->(ph)     
WHERE c <> oc  AND oc.age > (c.age-10) AND oc.age < (c.age+10) 
MATCH (oc)-[:PURCHASE]->(np)-[:HAS]->(u)  
WITH p, np, ph, u, COLLECT(DISTINCT p.gra_id) as styles 
WHERE p.product_code <> np.product_code AND np.gra_id in styles AND (np)-[:HAS]->(:color)-[:IS_A]->(ph) 
WITH u.name as use, np.article_id as product, count(*) as score 
ORDER BY score DESC 
RETURN use, collect(product)[0..6] as products 
ORDER BY use