Neo4j

aneeshmonn · ‎03-30-2020

I have a Pats label with 12M nodes. Each node's title has been annotated using ga.nlp.annotate, and a direct IS_RELATED_TO relationship has been created from each Pats node to its Tag nodes.

The task is to identify similar Pats nodes based on this IS_RELATED_TO relationship, so that the result can be used to cluster the data.

I tried using algo.nodeSimilarity as shown below, but the call did not finish even after 48 hours:

CALL algo.nodeSimilarity('Pats|Tag', 'IS_RELATED_TO', {
direction: 'OUTGOING',
write: true,
topK: 5,
similarityCutoff: 0.8,
concurrency: 4,
writeRelationshipType: 'IS_SIMILAR_WITH_TITLE'
})
YIELD nodesCompared, relationships, write, writeRelationshipType, writeProperty, p1, p50, p99, p100

Later, I wrote the query below to compare pairs one by one and compute the Jaccard similarity:

match(sp)-[:IS_RELATED_TO]->(t:Tag)
	set sp.simProcessed=True

	with sp,sp.pat_id as s_pat_id,collect(id(t)) as sourceTags,count(t) as sourceTagsCount

	match (t1)<-[:IS_RELATED_TO]-(dp:Pats)
	where dp.pat_id>s_pat_id and id(t1) in sourceTags

	with sp,dp,sourceTagsCount,sourceTags,count(distinct id(t1)) as overlapTagCount,(count(distinct id(t1))/toFloat(sourceTagsCount)) as overlapSimilarity

	with sp,dp,sourceTags,overlapSimilarity where overlapSimilarity>0.5

	match (dp)-[:IS_RELATED_TO]-(dpt)
	with sp,dp,sourceTags,overlapSimilarity,collect(id(dpt)) as destTags

	with sp,dp,overlapSimilarity,algo.similarity.jaccard(sourceTags,destTags) as jaccardSimilarity where jaccardSimilarity>0.5

	with * 
	order by jaccardSimilarity desc 
	limit 10

	create (sp)-[:HAS_SIMILAR_TITLE {overlapSimilarity:overlapSimilarity,jaccardSimilarity:jaccardSimilarity}]->(dp) return count(*)

The query above works perfectly for my use case and the results look promising, but it does not scale to my 12M records: it processes only about 100 records per minute.

I use apoc.periodic.iterate to run the query as shown below.

CALL apoc.periodic.iterate(
	"MATCH (sp:Pats) 
	WHERE not exists(sp.simProcessed)
	RETURN sp",
	"match(sp)-[:IS_RELATED_TO]->(t:Tag)
	set sp.simProcessed=True

	with sp,sp.pat_id as s_pat_id,collect(id(t)) as sourceTags,count(t) as sourceTagsCount

	match (t1)<-[:IS_RELATED_TO]-(dp:Pats)
	where dp.pat_id>s_pat_id and id(t1) in sourceTags

	with sp,dp,sourceTagsCount,sourceTags,count(distinct id(t1)) as overlapTagCount,(count(distinct id(t1))/toFloat(sourceTagsCount)) as overlapSimilarity

	with sp,dp,sourceTags,overlapSimilarity where overlapSimilarity>0.5

	match (dp)-[:IS_RELATED_TO]-(dpt)
	with sp,dp,sourceTags,overlapSimilarity,collect(id(dpt)) as destTags

	with sp,dp,overlapSimilarity,algo.similarity.jaccard(sourceTags,destTags) as jaccardSimilarity where jaccardSimilarity>0.5

	with * 
	order by jaccardSimilarity desc 
	limit 10

	create (sp)-[:HAS_SIMILAR_TITLE {overlapSimilarity:overlapSimilarity,jaccardSimilarity:jaccardSimilarity}]->(dp) return count(*)",
	{batchSize:1, iterateList:true,parallel:true,concurrency:3});

I have created indexes on :Pats(pat_id) and :Pats(simProcessed).

I tried to understand the PROFILE output, but it was not much help to me.

System Configuration

OS: Linux v4.19.0-041900-generic (amd64 architecture) with 8 cores
Cores: 8
RAM: 61GB
dbms.memory.heap.initial_size: 20000m
dbms.memory.heap.max_size: 20000m
dbms.memory.pagecache.size: 30000m

Neo4j version: Cypher version: CYPHER 3.5, planner: COST, runtime: INTERPRETED.
which plugins / extensions / procedures: apoc, algo, apoc.periodic.iterate

Any help on this would be appreciated.

michael_hunger · ‎04-02-2020

Can you try the new Graph Data Science library? Its node similarity algorithm has been reimplemented to be much faster.

see:

https://neo4j.com/docs/graph-data-science/current/algorithms/node-similarity/
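As a rough sketch of what that could look like for the Pats/Tag graph in the question: the call below uses the named-graph syntax of recent Graph Data Science releases, so treat it as an assumption to check against the version you actually install (early 1.x releases used gds.graph.create instead of gds.graph.project, and the library needs a compatible Neo4j version). The graph name 'patsTags' and the write property 'score' are hypothetical names chosen for illustration; the labels, relationship type, topK, similarityCutoff and concurrency are taken from the original algo.nodeSimilarity call.

```cypher
// Project the Pats and Tag nodes plus their IS_RELATED_TO relationships
// into an in-memory graph. 'patsTags' is a hypothetical graph name.
CALL gds.graph.project(
  'patsTags',
  ['Pats', 'Tag'],
  'IS_RELATED_TO'
)
YIELD graphName, nodeCount, relationshipCount;

// Compute Jaccard-based node similarity between Pats nodes that share Tags
// and write the top 5 matches per node back as relationships.
// 'score' is a hypothetical property name for the similarity value.
CALL gds.nodeSimilarity.write('patsTags', {
  topK: 5,
  similarityCutoff: 0.8,
  concurrency: 4,
  writeRelationshipType: 'IS_SIMILAR_WITH_TITLE',
  writeProperty: 'score'
})
YIELD nodesCompared, relationshipsWritten;
```

Node similarity compares nodes by the targets of their outgoing relationships, which matches the (:Pats)-[:IS_RELATED_TO]->(:Tag) direction here, so only Pats nodes should be compared with each other.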
