03-30-2020 03:05 AM
I have a node label Pats with 12M nodes. Each node's title has been annotated using ga.nlp.annotate, and a direct IS_RELATED_TO relationship has been created from each Pats node to its Tag nodes.
The task is to identify similar Pats nodes based on this IS_RELATED_TO relationship, so that the result can be used to cluster the data.
I tried algo.nodeSimilarity as shown below, but the call did not finish even after 48 hours:
CALL algo.nodeSimilarity('Pats|Tag', 'IS_RELATED_TO', {
direction: 'OUTGOING',
write: true,
topK: 5,
similarityCutoff: 0.8,
concurrency: 4,
writeRelationshipType: 'IS_SIMILAR_WITH_TITLE'
})
YIELD nodesCompared, relationships, write, writeRelationshipType, writeProperty, p1, p50, p99, p100
Later, I wrote the query below to compare pairs one by one and compute the Jaccard similarity:
match(sp)-[:IS_RELATED_TO]->(t:Tag)
set sp.simProcessed=True
with sp,sp.pat_id as s_pat_id,collect(id(t)) as sourceTags,count(t) as sourceTagsCount
match (t1)<-[:IS_RELATED_TO]-(dp:Pats)
where dp.pat_id>s_pat_id and id(t1) in sourceTags
with sp,dp,sourceTagsCount,sourceTags,count(distinct id(t1)) as overlapTagCount,(count(distinct id(t1))/toFloat(sourceTagsCount)) as overlapSimilarity
with sp,dp,sourceTags,overlapSimilarity where overlapSimilarity>0.5
match (dp)-[:IS_RELATED_TO]-(dpt)
with sp,dp,sourceTags,overlapSimilarity,collect(id(dpt)) as destTags
with sp,dp,overlapSimilarity,algo.similarity.jaccard(sourceTags,destTags) as jaccardSimilarity where jaccardSimilarity>0.5
with *
order by jaccardSimilarity desc
limit 10
create (sp)-[:HAS_SIMILAR_TITLE {overlapSimilarity:overlapSimilarity,jaccardSimilarity:jaccardSimilarity}]->(dp) return count(*)
The above query works well for my use case and the results look promising, but it does not scale to my 12M records: it processes only about 100 records per minute.
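For reference, the two measures used in the query above can be checked on small literal lists. This is only a sketch on made-up tag ids (the lists are toy data, not taken from the real graph), and it assumes the lists contain no duplicates. The overlap measure divides the shared tags by the source's tag count, while Jaccard divides by the size of the union. Since the union is never smaller than the source set, Jaccard can never exceed the overlap value, so filtering on overlapSimilarity>0.5 first should not drop any pair that would pass jaccardSimilarity>0.5.

```cypher
// Toy data: hypothetical tag ids for a source node and a candidate node.
WITH [1, 2, 3] AS sourceTags, [2, 3, 4] AS destTags
// Count the tags of the source that also occur on the candidate.
WITH sourceTags, destTags,
     size([x IN sourceTags WHERE x IN destTags]) AS overlapTagCount
RETURN overlapTagCount / toFloat(size(sourceTags)) AS overlapSimilarity,   // 2 shared / 3 source tags
       algo.similarity.jaccard(sourceTags, destTags) AS jaccardSimilarity; // 2 shared / 4 in the union = 0.5
```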
I use apoc.periodic.iterate to run the query as shown below.
CALL apoc.periodic.iterate(
"MATCH (sp:Pats)
WHERE not exists(sp.simProcessed)
RETURN sp",
"match(sp)-[:IS_RELATED_TO]->(t:Tag)
set sp.simProcessed=True
with sp,sp.pat_id as s_pat_id,collect(id(t)) as sourceTags,count(t) as sourceTagsCount
match (t1)<-[:IS_RELATED_TO]-(dp:Pats)
where dp.pat_id>s_pat_id and id(t1) in sourceTags
with sp,dp,sourceTagsCount,sourceTags,count(distinct id(t1)) as overlapTagCount,(count(distinct id(t1))/toFloat(sourceTagsCount)) as overlapSimilarity
with sp,dp,sourceTags,overlapSimilarity where overlapSimilarity>0.5
match (dp)-[:IS_RELATED_TO]-(dpt)
with sp,dp,sourceTags,overlapSimilarity,collect(id(dpt)) as destTags
with sp,dp,overlapSimilarity,algo.similarity.jaccard(sourceTags,destTags) as jaccardSimilarity where jaccardSimilarity>0.5
with *
order by jaccardSimilarity desc
limit 10
create (sp)-[:HAS_SIMILAR_TITLE {overlapSimilarity:overlapSimilarity,jaccardSimilarity:jaccardSimilarity}]->(dp) return count(*)",
{batchSize:1, iterateList:true,parallel:true,concurrency:3});
I have created indexes on :Pats(pat_id) and :Pats(simProcessed).
I tried to understand the PROFILE output, but it was not much help to me.
System Configuration: [attachment not preserved]
PROFILE: [attachment not preserved]
Any help on this would be appreciated.
04-02-2020 04:41 PM
Can you try the new Graph Data Science library? Its node similarity algorithm has been reimplemented to be much faster.
see:
https://neo4j.com/docs/graph-data-science/current/algorithms/node-similarity/
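As a rough sketch of what that could look like for the graph described in the question: the calls below use the Graph Data Science 2.x procedure names (in the 1.x releases the projection procedure was named gds.graph.create instead of gds.graph.project). The graph name 'patsTags' and the write property 'score' are placeholders chosen for illustration; the other settings are carried over from the original algo.nodeSimilarity call. This has not been run against the poster's data, so treat it as a starting point rather than a tuned configuration.

```cypher
// Project the Pats-Tag bipartite graph into the in-memory graph catalog.
// 'patsTags' is a hypothetical name for the projected graph.
CALL gds.graph.project(
  'patsTags',
  ['Pats', 'Tag'],
  'IS_RELATED_TO'
)
YIELD graphName, nodeCount, relationshipCount;

// Compute Jaccard-based node similarity and write the top 5 matches per node.
// 'score' is an assumed name for the similarity property on the new relationships.
CALL gds.nodeSimilarity.write('patsTags', {
  topK: 5,
  similarityCutoff: 0.8,
  concurrency: 4,
  writeRelationshipType: 'IS_SIMILAR_WITH_TITLE',
  writeProperty: 'score'
})
YIELD nodesCompared, relationshipsWritten;

// Release the memory held by the projection once the results are written.
CALL gds.graph.drop('patsTags') YIELD graphName;
```

If very common tags dominate the comparison work, the algorithm's degreeCutoff setting may be worth a look, since it skips nodes with fewer relationships than the given threshold.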