01-24-2019 05:22 AM
Hi Michael,
I am trying to calculate jaccard similarity but it takes more than 3 minutes.
MATCH (u:User)-[:HAS]->(f:Feature) RETURN count(*);
count(*) : 106384
MATCH (u:User) RETURN count(*);
count(*) : 15198
:schema
Indexes
ON :Feature(id) ONLINE (for uniqueness constraint)
ON :User(id) ONLINE (for uniqueness constraint)
Constraints
ON ( feature:Feature ) ASSERT feature.id IS UNIQUE
ON ( user:User ) ASSERT user.id IS UNIQUE
dbms.memory.heap.initial_size=1g
dbms.memory.heap.max_size=8g
Following the documentation:
MATCH (p:User)-[:HAS]->(f:Feature)
WITH {item:id(p), categories: collect(id(f))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard.stream(data, {similarityCutoff: 0.01})
YIELD item1, item2, similarity
RETURN algo.getNodeById(item1).id AS startNode, algo.getNodeById(item2).id AS endNode, similarity
ORDER BY similarity DESC limit 10
Started streaming 10 records after 225411 ms and completed after 225414 ms.
Is there any way to improve the performance? This is not even close to the real amount of data.
01-31-2019 02:56 AM
How long does the query run to compute the data?
MATCH (p:User)-[:HAS]->(f:Feature)
WITH {item:id(p), categories: collect(id(f))} as userData
WITH collect(userData) as data
RETURN size(data)
You could increase your cut-off, add a degree filter, or add a top-k (like top-5 pairs).
How many rows/pairs are returned in total, and how long does this run?
MATCH (p:User)-[:HAS]->(f:Feature)
WITH {item:id(p), categories: collect(id(f))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard.stream(data, {similarityCutoff: 0.01})
YIELD item1, item2, similarity
RETURN count(*)
How many CPUs do you have?
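The cutoff / top-k pruning suggested above can be sketched in plain Python to show why it trims the result set (illustrative only - the `jaccard` and `top_k_pairs` helpers are mine, not the library's code):

```python
import heapq
from itertools import combinations

def jaccard(a, b):
    # Jaccard index: |A ∩ B| / |A ∪ B|
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter) if inter else 0.0

def top_k_pairs(items, k, cutoff=0.0):
    # For each item keep only its k most similar partners above the cutoff,
    # which bounds the output at k rows per item instead of n rows.
    best = {name: [] for name in items}
    for (n1, s1), (n2, s2) in combinations(items.items(), 2):
        sim = jaccard(s1, s2)
        if sim > cutoff:
            best[n1].append((sim, n2))
            best[n2].append((sim, n1))
    return {name: heapq.nlargest(k, pairs) for name, pairs in best.items()}

users = {"a": {1, 2, 3}, "b": {2, 3, 4}, "c": {3, 4, 5}}
print(top_k_pairs(users, k=1, cutoff=0.1)["a"])  # → [(0.5, 'b')]
```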
02-04-2019 07:54 AM
Hi Michael,
Thank you for the response. I needed to use a cut-off, a degree filter, and top-k to speed up the query.
But I also need to find similarity between nodes with different labels.
MATCH (m:User{id:"34er796-23sa2963-4235534hj34"})
MATCH (m)-[:HAS]->(g:Feature)<-[:HAS]-(other:Post)
WITH m, other, COUNT(g) AS interscn
MATCH (m)-[userHas:HAS]->(mg:Feature)
WITH m, other, interscn, {item:id(m), categories: collect(id(mg))} as userData
MATCH (other)-[postHas:HAS]->(og:Feature)
WITH m, other, interscn, userData, {item:id(other), categories: collect(id(og))} as aData
WITH m, other, interscn, collect(userData) as udata, collect(aData) as adata
WITH udata + adata as data
CALL algo.similarity.jaccard.stream(data, {concurrency:8, similarityCutoff: 0.5})
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.getNodeById(item1) AS startNode, algo.getNodeById(item2) AS endNode, similarity
ORDER BY similarity DESC limit 100;
Is there any better approach? Should I add a common label to all node types that I need to compute similarity metrics between?
02-04-2019 09:52 AM
Would you mind answering my other questions too? So we know more about your setup?
02-05-2019 02:49 AM
Hi Michael,
The setup is currently not the same, because the Neo4j data is managed by other services.
MATCH (u:User) RETURN count(*);
count(*) : 11804
MATCH (u:User)-[:HAS]->(f:Feature) RETURN count(*);
count(*) : 64897
ID Allocation
Node ID 21148
Property ID 1273369
Relationship ID 834959
Relationship Type ID 5
MATCH (p:User)-[:HAS]->(f:Feature)
WITH {item:id(p), categories: collect(id(f))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard.stream(data, {similarityCutoff: 0.01})
YIELD item1, item2, similarity
RETURN count(*)
count(*) : 28255867
MATCH (p:User)-[:HAS]->(f:Feature)
WITH {item:id(p), categories: collect(id(f))} as userData
WITH collect(userData) as data
RETURN size(data)
size(data) : 11804
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
02-05-2019 06:44 AM
Hey,
So this algorithm does a pairwise comparison of all the {item:id(p), categories: collect(id(f))}
values that are passed in. So in this example it's doing:
11804^2 / 2 = 69,667,208
comparisons
And then it only returns the ones with a similarity > 0.01, so that's how we get down to 28m records.
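That arithmetic can be checked in a few lines of Python (an illustrative sketch; the `jaccard` helper is mine, not the procedure's internals):

```python
def jaccard(a, b):
    # Jaccard index: |A ∩ B| / |A ∪ B|
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter) if inter else 0.0

n = 11804                 # number of User nodes reported above
comparisons = n * n // 2  # rough estimate of pairwise comparisons
print(comparisons)        # → 69667208

# The cutoff is applied per pair, after each comparison is done,
# so it trims the rows returned but not the work performed:
print(jaccard({1, 2, 3}, {2, 3, 4}))  # → 0.5
```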
What do you want to do with the similarities once they're computed? I think the answer to that will make it easier to figure out what to suggest.
Cheers, Mark
02-05-2019 12:10 PM
Hi Mark,
Thank you for the details. In the main service I don't need similarity with a cutoff of 0.01; if I set it to 0.7, the query takes 1.5 seconds. But I also need to run a scheduled process in which I compute similarity with an arbitrary cutoff, in order to create certain paths and do some processing afterwards. In more detail, I have 2 cases:
I need to compute similarity for a list of nodes against all other nodes, so I don't need to compute similarity all versus all. Any batch-processing approach, like periodic commit or iterate, may help here.
I need to find similarity between nodes with different labels. I could add a common label, but I want to avoid putting extra labels on my nodes.
Best,
Alex
02-06-2019 01:38 PM
At the moment the only way to do this would be to use the function rather than the procedure, but then it's all running single threaded unless you parallelise with APOC, and even then I guess it wouldn't be as fast as doing it in a procedure.
Could you show me an example of how you'd like it to work and then maybe we can update the procedure to do it. Right now it assumes you want to compare all values in the list against each other - how would you specify the list of nodes that you want to compare with everybody else? I guess we could have the procedure take in a collection of ids, kind of like what we do for Personalized PageRank with the sourceNodes? https://neo4j.com/docs/graph-algorithms/current/algorithms/page-rank/#_personalized_pagerank
Can you explain this a bit more? It should be possible to do this - the procedure doesn't care what labels you have, it only cares about lists of numbers...
02-06-2019 03:03 PM
Thank you Mark, your example points out what I am looking for in the first case. It would be great to have this addition. A collection of ids, like Personalized PageRank has, would make it very easy.
For the second case:
we have users with relationships to tags and posts with relationships to tags.
I need to compute similarity between users and posts with common tags (user and post being the labels of the nodes), but not compute similarity user to user or post to post.
If the similarity measures could take as a parameter two collections of ids, or two match query results (any approach that can distinguish two sets of nodes), with similarity computed only between those collections, then it would only use:
For each a of collection1
For each b of collection2
Like a bipartite graph as a result. I know it makes things more complex, though. For now I can compute the results with the query written above, or calculate the Jaccard index with plain Cypher only between nodes with different labels.
The second case also covers the first one, where we have a collection of ids versus all (which is again a collection of ids).
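That nested loop amounts to something like the following Python sketch (illustrative only - the names and helpers are mine, not the library's): comparing only across the two collections costs len(sources) * len(targets) comparisons instead of the quadratic all-versus-all.

```python
def jaccard(a, b):
    # Jaccard index: |A ∩ B| / |A ∪ B|
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter) if inter else 0.0

def bipartite_similarity(sources, targets, cutoff=0.0):
    # Compare every source against every target; never source vs source
    # or target vs target, like a bipartite graph.
    pairs = []
    for s_id, s_set in sources.items():
        for t_id, t_set in targets.items():
            sim = jaccard(s_set, t_set)
            if sim >= cutoff:
                pairs.append((s_id, t_id, sim))
    return pairs

users = {"u1": {1, 2}, "u2": {3}}
posts = {"p1": {2, 3}, "p2": {4}}
# Only User -> Post pairs ever appear in the output:
print(len(bipartite_similarity(users, posts, cutoff=0.1)))  # → 2
```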
02-13-2019 11:57 PM
Hey,
Yeh that makes sense. Sorry for the delayed reply - I've had this tab open and have been thinking of the best way to do this. If we did something like this I think it'd work?
MATCH (user:User)-[:HAS_TAG]->(tag)
WITH {item:id(user), categories: collect(id(tag))} as data
WITH collect(data) AS userTags
MATCH (post:Post)-[:HAS_TAG]->(tag)
WITH userTags, {item:id(post), categories: collect(id(tag))} as data
WITH userTags, collect(data) AS postTags
WITH userTags, postTags,
[userTag in userTags | userTag.item] AS sourceIds,
[postTag in postTags | postTag.item] AS targetIds
CALL algo.similarity.jaccard.stream(userTags + postTags, {topK: 1, similarityCutoff: 0.0, sourceIds: sourceIds, targetIds: targetIds})
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.getNodeById(item1).name AS from, algo.getNodeById(item2).name AS to, similarity
ORDER BY from
And then as you suggested - if you don't specify sourceIds or targetIds, it'll assume you want to use all ids.
For the weight based similarity procedures we have support for Cypher statements. e.g. for Cosine similarity - https://neo4j.com/docs/graph-algorithms/current/algorithms/similarity-cosine/#algorithms-similarity-...
WITH "MATCH (person:Person)-[likes:LIKES]->(c)
RETURN id(person) AS item, id(c) AS category, likes.score AS weight" AS query
CALL algo.similarity.cosine(query, {
graph: 'cypher', topK: 1, similarityCutoff: 0.1, write:true
})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p95
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95
In terms of keeping the functions reasonably uniform I guess we could make a version that takes in a single query + target and source ids like this:
WITH "MATCH (user:User)-[:HAS_TAG]->(tag)
RETURN id(user) AS item, id(tag) AS category
UNION
MATCH (post:Post)-[:HAS_TAG]->(tag)
RETURN id(post) AS item, id(tag) AS category" AS query
MATCH (p:Post)
WITH query, collect(id(p)) as targetIds
MATCH (u:User)
WITH query, targetIds, collect(id(u)) AS sourceIds
CALL algo.similarity.cosine(query, {
graph: 'cypher', topK: 1, similarityCutoff: 0.1, write:true, sourceIds: sourceIds, targetIds: targetIds
})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p95
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95
That doesn't actually look very nice though to be honest! Perhaps instead we could have the sourceIds and targetIds treated as Cypher queries too if you have graph: "cypher"?
WITH "MATCH (user:User)-[:HAS_TAG]->(tag)
RETURN id(user) AS item, id(tag) AS category
UNION
MATCH (post:Post)-[:HAS_TAG]->(tag)
RETURN id(post) AS item, id(tag) AS category" AS query,
"MATCH (u:User) RETURN id(u) AS item" as sourceQuery,
"MATCH (p:Post) RETURN id(p) AS item" as targetQuery
CALL algo.similarity.cosine(query, {
graph: 'cypher', topK: 1, similarityCutoff: 0.1, write:true, sourceIds: sourceQuery, targetIds: targetQuery
})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p95
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95
I guess to start we can just do the no Cypher query version though...
02-15-2019 03:02 AM
Hi Mark,
The no Cypher way looks great and easy to understand.
Thank you for all this effort...Keep up with the great support!
Best,
Alex
02-18-2019 08:14 AM
Hi Alex,
I've been working on this today and I've made some progress. The code around this area splits based on the concurrency value provided and whether you want to compute topK or not. So there are 4 code paths:
concurrency: 1, topK: null
concurrency: 1, topK: provided
concurrency: > 1, topK: null
concurrency: > 1, topK: provided
I've got it working for the first option at the moment, and I'm going to work on the other 3 now. Those 3 all share one bit of code, so in theory once I've got that function updated it should just work (TM).
I gave it a try on a recipes dataset that I've been playing with. You can load that data by running:
:play recipes
in the Neo4j browser, if you want to.
This is what the query looks like:
MATCH (recipe:Recipe)-[:COLLECTION]->(collection)
WITH {item:id(collection), categories: collect(id(recipe))} as data
WITH collect(data) AS collectionRecipes
MATCH (recipe:Recipe)-[:CONTAINS_INGREDIENT]->(ingredient)
WITH collectionRecipes, {item:id(ingredient), categories: collect(id(recipe))} as data
WITH collectionRecipes, collect(data) AS ingredientRecipes
WITH collectionRecipes, ingredientRecipes,
[value in collectionRecipes | value.item] AS sourceIds,
[value in ingredientRecipes | value.item] AS targetIds
CALL algo.similarity.jaccard.stream(collectionRecipes + ingredientRecipes, {similarityCutoff: 0.0, concurrency:1, sourceIds: sourceIds, targetIds: targetIds})
YIELD item1, item2, count1, count2, intersection, similarity
WITH algo.getNodeById(item1) AS from, algo.getNodeById(item2) AS to, similarity
RETURN labels(from) AS fromLabels, from.name AS from, labels(to) AS toLabels, to.name AS to, similarity
ORDER BY similarity DESC
LIMIT 100
So we're computing similarities from collections -> ingredients based on their common recipes, but we shouldn't ever see collection -> collection or recipe -> recipe.
I can send you a jar with the code in as it is now or you can play with it once I've got the other options working too. I'll probably not merge it into the library until next week as I want @michael.hunger to review it and he's on vacation this week.
Cheers, Mark
02-19-2019 01:38 AM
I'm working on it over here in case you wanted to follow - https://github.com/neo4j-contrib/neo4j-graph-algorithms/pull/820
I have a version which seems to work for all the variants that I described above. I've only implemented it for Jaccard at the moment, but will make it work for all the similarity algorithms.
Here's a fun version that computes the most similar ingredients to the Gooseberry collection:
MATCH (recipe:Recipe)-[:COLLECTION]->(collection)
WITH {item:id(collection), name: collection.name, categories: collect(id(recipe))} as data
WITH collect(data) AS collectionRecipes
MATCH (recipe:Recipe)-[:CONTAINS_INGREDIENT]->(ingredient)
WITH collectionRecipes, {item:id(ingredient), categories: collect(id(recipe))} as data
WITH collectionRecipes, collect(data) AS ingredientRecipes
WITH collectionRecipes, ingredientRecipes,
[value in collectionRecipes WHERE value.name = "Gooseberry" | value.item] AS sourceIds,
[value in ingredientRecipes | value.item] AS targetIds
CALL algo.similarity.jaccard.stream(collectionRecipes + ingredientRecipes, {similarityCutoff: 0.0, topK:10, sourceIds: sourceIds, targetIds: targetIds})
YIELD item1, item2, count1, count2, intersection, similarity
WITH algo.getNodeById(item1) AS from, algo.getNodeById(item2) AS to, similarity
RETURN labels(from) AS fromLabels, from.name AS from, labels(to) AS toLabels, to.name AS to, similarity
ORDER BY similarity DESC
LIMIT 100
It'd be cool if you have a chance to test it out to make sure it does what you want. You can grab the jar from - https://github.com/neo4j-contrib/neo4j-graph-algorithms/files/2879346/graph-algorithms-algo-3.4.12.3...
I had to zip it so that GitHub was happy, but just unzip that and replace your existing jar in the plugins folder.
02-19-2019 07:09 AM
Hi Mark, that was really fast! Thank you once again.
Just tested it and it works as expected, with really good performance.
Query profile:
Old implementation : 190405 total db hits
Plain cypher: 190632 total db hits
New implementation: 7079 total db hits
Best,
Alex