Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
11-25-2019 11:35 PM
Hey everyone,
I am using neo4j community version 3.5.12 as of now.
Its running on a CentOs server machine version 7.7 with 29 GB of ram.
I currently have a graph that comprises of around 46k user nodes and 7k movie nodes along with 21 Genre nodes. Relationships count is at around 2.1 million
I have altered the following parameters in neo4j.config file :
dbms.memory.heap.initial_size=8024m
dbms.memory.heap.max_size=8024m
I have also created the indexes on movie_id, user_id and genre_name
Apart from the obvious relation between users and movies and movies and genre, i have also related each user with two genre nodes depending on their top genres.
So overall relationships count is around 2.1 million here
So, going with the above set-up, i plan to get the number of similar customers (or the actual nodes for similar customers) for each customer. I am using the below query for that.
Its taking forever to run (even when i am providing limit at the end).
Currently i am running it from browser but have tried the same using cypher-shell
match (u1:User) - [r1:top_2_genre_is*2] - (u2:User)
where u1.movies_count_85 >= 20
and u2.movies_count_85 >= u1.movies_count_85*0.5
and u1 <> u2
with u1.user_id as tc,u1.movies_count_85 as tc_count,u1.movies_rated_85 as movie_set,u2.user_id as sc,u2.movies_count_85 as sc_count,
algo.similarity.overlap(u1.movies_rated_85,u2.movies_rated_85) as similarity,
apoc.coll.intersection(u1.movies_rated_85,u2.movies_rated_85) as matching_movies,
count(1) as repeated
where similarity >= 0.75 and repeated = 2
return tc,count(sc) order by count(sc) desc limit 100
I am attaching the explain plan for the same :
I tried modifying my query for checking the credibility of it, and also to see the DB hits it was going for when doing only for single user (being u1 in the above query) .
profile match (u1:User{user_id : 1}) - [r1:top_2_genre_is*2] - (u2:User)
where u2.movies_count_85 >= u1.movies_count_85*0.5
and u1 <> u2
with u1.user_id as tc,u1.movies_count_85 as tc_count,u1.movies_rated_85 as movie_set,u2.user_id as sc,u2.movies_count_85 as sc_count,
algo.similarity.overlap(u1.movies_rated_85,u2.movies_rated_85) as similarity,
apoc.coll.intersection(u1.movies_rated_85,u2.movies_rated_85) as matching_movies, count(1) as repeated
where similarity >= 0.75 and repeated = 2
return tc,count(sc) order by count(sc) desc
I am attaching the profile plan for this one :
Its clear that the DB hits are quite high even for a single user calculation.
But, i need to run this whole thing for all the users.
I can't seem to find any alternative approach to this method,
or is there anything i need to configure from my DB side to make this current query faster.
Thanks for any insight.
11-26-2019 11:19 AM
So I am guessing there is a node between the two users? Doubt it would change much but you could do
profile match (u1:User{user_id : 1})-[r1:top_2_genre_is]->()<-[r1:top_2_genre_is]-(u2:User)
so that the relationships are directed and not a variable length expansion.
If you have the option, I'd say set something up to run the calculations and create a rel between User
nodes so you can just query that relationship. Basically take your current query, and instead of returning, do a MERGE to create a rel between the matching users. Then you have a simple match to find related users. More work to maintain (since you have to do something to ensure the related users remains up to date if :top_2_genre_is
relationships change) but makes for a much faster READ to get the related users.
11-26-2019 09:30 PM
Hey @jsmccrumb, thanks for your inputs.
You were right about the first suggestion. It didn't improve anything.
The issue with creating relationships between user nodes are :
So, my judgement suggests that we can possibly create relationships by picking up nodes in batches. But, as far as i know, batch size control feature is available only while importing data in graph.
Please correct me if i am wrong somewhere, as i am quite a newbie here, and any type of suggestions for improvement are welcome.
11-26-2019 11:02 PM
I am currently trying to build the relation between different nodes by using apoc.periodic.iterate function. with a batch size of 1000.
It's already been 24 minutes since i started that query. Will update if it works, or how long it takes.
11-27-2019 05:55 AM
An update on this. This query to create relations takes forever to complete. I stopped it.
11-27-2019 06:24 AM
That's exactly the function I would have recommended using, how did you split up the two queries? Did it create any relationships?
11-27-2019 06:39 AM
In your original query, not sure what the count(1)
is for but left it in, here is how I would periodic iterate (batch size of 1 to send one user at a time, parallel: false).
call apoc.periodic.iterate(
"MATCH (u1:User) RETURN u1",
"MATCH (u1) - [r1:top_2_genre_is*2] - (u2:User) // or use directed relationships to the node
where u1.movies_count_85 >= 20
and u2.movies_count_85 >= u1.movies_count_85*0.5
and u1 <> u2
with u1.user_id as tc, u1.movies_count_85 as tc_count, u1.movies_rated_85 as movie_set, u2.user_id as sc, u2.movies_count_85 as sc_count,
algo.similarity.overlap(u1.movies_rated_85, u2.movies_rated_85) as similarity,
apoc.coll.intersection(u1.movies_rated_85, u2.movies_rated_85) as matching_movies,
count(1) as repeated
CREATE (u1)-[r:SIMILAR_TASTES]->(u2)
SET r.similarity = similarity,
r.matching_movies = matching_movies,
r.repeated = repeated,
r.created_on = datetime()
// i find it useful to track created on sometimes, could help when you are running updates later
", {batchSize:1, parallel:false})
11-27-2019 06:53 AM
I split it differently, and now it appears to me that my method would end up consuming too much memory to go with. This is what i tried with.
PROFILE CALL apoc.periodic.iterate(
"match (u1:User) - [r1:top_2_genre_is*2] - (u2:User)
where u1.movies_count_85 >= 20 and u2.movies_count_85 >= u1.movies_count_85*0.5 and u1 <> u2
with u1.user_id as tc,u1.movies_count_85 as tc_count,u2.user_id as sc,u2.movies_count_85 as sc_count,
algo.similarity.overlap(u1.movies_rated_85,u2.movies_rated_85) as similarity,
apoc.coll.intersection(u1.movies_rated_85,u2.movies_rated_85) as matching_movies,
count(1) as repeated where similarity >= 0.75 and repeated = 2
unwind matching_movies as matched_movies with tc,tc_count,sc,sc_count,matched_movies
match (u1:User{user_id:tc}) - [r1:rated] -> (m:Movie{movie_id:matched_movies}) <- [r2:rated] - (u2:User{user_id:sc})
with tc,r1.rating as tc_rating,tc_count,sc,sc_count,r2.rating as sc_rating,matched_movies
where sc_rating >= tc_rating - 1 and sc_rating <= tc_rating + 1
with tc,tc_count,sc,sc_count, case when tc_count > sc_count then sc_count else tc_count end as smaller_base, count(1) as movies_matched_real
with tc,sc,1.0*movies_matched_real/smaller_base as real_similarity
where real_similarity >= 0.60
match (u1:User{user_id: tc}) , (u2:User{user_id: sc})
return u1,u2,real_similarity",
"merge (u1) - [r:similar_to{similarity:real_similarity}] - (u2)",{batchSize:1000, iterateList:True, parallel:true}) ;
This does not seems to work.
And for your question, Count(1) was to count the matching top genres for 2 customers and then picking up those who have both of their top genres matched up.
11-27-2019 11:15 PM
I went with the above line of thought, and split the query in the first line itself.
PROFILE CALL apoc.periodic.iterate(
"MATCH (u1:User) where u1.movies_count_85 >= 20 RETURN u1",
"match (u1) - [r1:top_2_genre_is*2] - (u2:User)
where u2.movies_count_85 >= u1.movies_count_85*0.5 and u1 <> u2
with u1.user_id as tc,u1.movies_count_85 as tc_count,u2.user_id as sc,u2.movies_count_85 as sc_count,
algo.similarity.overlap(u1.movies_rated_85,u2.movies_rated_85) as similarity,
apoc.coll.intersection(u1.movies_rated_85,u2.movies_rated_85) as matching_movies,
count(1) as repeated where similarity >= 0.75 and repeated = 2
unwind matching_movies as matched_movies with tc,tc_count,sc,sc_count,matched_movies
match (u1:User{user_id:tc}) - [r1:rated] -> (m:Movie{movie_id:matched_movies}) <- [r2:rated] - (u2:User{user_id:sc})
with tc,r1.rating as tc_rating,tc_count,sc,sc_count,r2.rating as sc_rating,matched_movies
where sc_rating >= tc_rating - 1 and sc_rating <= tc_rating + 1
with tc,tc_count,sc,sc_count, case when tc_count > sc_count then sc_count else tc_count end as smaller_base, count(1) as movies_matched_real
with tc,sc,1.0*movies_matched_real/smaller_base as real_similarity
where real_similarity >= 0.60
match (u1:User{user_id: tc}) , (u2:User{user_id: sc})
merge (u1) - [r:similar_to{similarity:real_similarity}] - (u2)",
{batchSize:10,parallel:false}) ;
Tried first with batchsize of 1 and then with that of 10.
In both the cases, it seems like that the relations are being created, but still of no practical use, as even after 2 hours only 2k-3k users had relations on them.
I am still confused regarding what i am missing here.
All the suggestions are welcome.
11-28-2019 05:29 AM
A minor thing (you aren;t likely to see a boost in performance cause it probably won't make that big of a difference...) is you can keep u1
and u2
aliases at each WITH so you don't have to re-match them:
PROFILE CALL apoc.periodic.iterate(
"MATCH (u1:User) where u1.movies_count_85 >= 20 RETURN u1",
"match (u1) - [r1:top_2_genre_is*2] - (u2:User)
where u2.movies_count_85 >= u1.movies_count_85*0.5 and u1 <> u2
with u1, u2, u1.user_id as tc,u1.movies_count_85 as tc_count,u2.user_id as sc,u2.movies_count_85 as sc_count,
algo.similarity.overlap(u1.movies_rated_85,u2.movies_rated_85) as similarity,
apoc.coll.intersection(u1.movies_rated_85,u2.movies_rated_85) as matching_movies,
count(1) as repeated where similarity >= 0.75 and repeated = 2
unwind matching_movies as matched_movies with u1, u2, tc,tc_count,sc,sc_count,matched_movies
match (u1) - [r1:rated] -> (m:Movie{movie_id:matched_movies}) <- [r2:rated] - (u2)
with u1, u2, tc,r1.rating as tc_rating,tc_count,sc,sc_count,r2.rating as sc_rating,matched_movies
where sc_rating >= tc_rating - 1 and sc_rating <= tc_rating + 1
with u1, u2, tc,tc_count,sc,sc_count, case when tc_count > sc_count then sc_count else tc_count end as smaller_base, count(1) as movies_matched_real
with u1, u2, tc,sc,1.0*movies_matched_real/smaller_base as real_similarity
where real_similarity >= 0.60
merge (u1) - [r:similar_to{similarity:real_similarity}] - (u2)",
{batchSize:10,parallel:false}) ;
Hard to be sure what is slowing it down (could just be that it is just that complex) You could remove the apoc.periodic.iterate and instead do a WITH u1 LIMIT 10
so the profile comes back and post the entire profile in case something shows up there...
Might have better luck first creating the non-standardized "similiarity" then building the "real_similiarity" relationship, creates are faster than merges since it isn't checking if the rel already exists:
PROFILE CALL apoc.periodic.iterate(
"MATCH (u1:User) where u1.movies_count_85 >= 20 RETURN u1",
"match (u1) - [r1:top_2_genre_is*2] - (u2:User)
where u2.movies_count_85 >= u1.movies_count_85*0.5 and u1 <> u2
with u1, u2, u1.user_id as tc,u1.movies_count_85 as tc_count,u2.user_id as sc,u2.movies_count_85 as sc_count,
algo.similarity.overlap(u1.movies_rated_85,u2.movies_rated_85) as similarity,
apoc.coll.intersection(u1.movies_rated_85,u2.movies_rated_85) as matching_movies,
count(1) as repeated where similarity >= 0.75 and repeated = 2
CREATE (u1)-[:SIMILAR_TO {similiarity: similarity, matching_movies: matching_movies}]->(u2)
",
{batchSize:10,parallel:false}) ;
note: since its a create if you stop it you need to remove those relationships before re-running (but should be noticeably faster). Once all the creates are done you could run a second query to handle the "real similiarity" calculations
All the sessions of the conference are now available online