Neo4j

abhik1368 · 11-15-2022

I have a file with node IDs and a property called MACCS made up of 0s and 1s. I want to calculate Jaccard similarity. What is an efficient way to do it? I have attached the file link here. To load the file I am using the query below, where gph_conn is the connection. Any

gph_conn.query("""
// USING PERIODIC COMMIT 100
LOAD CSV WITH HEADERS FROM 'file:///D:/Github/dbtest.csv' AS row
UNWIND SPLIT(row.MACCS, ',') AS i
CREATE (m:Mol {DrugBank_ID: row.DrugBank_ID,
MACCS:toInteger(i)
}
)
""")

Then I want to call gds.similarity.jaccard to compute the similarity between one node and all of the other nodes. The query below doesn't work because of the format of the

MATCH (n1:Mol {DrugBank_ID: 'DB00146'})
WITH n1, collect(n1:MACCS) AS fp1
MATCH (n2:Mol)
WITH n2, collect(n2:MACCS) as fp2
RETURN n1,n2,
gds.similarity.jaccard(toIntegerList(n1.ECFP4), toIntegerList(n2.ECFP4)) AS jaccard;

The query above should return similarity values. Is there a way to calculate similarity faster with indexes? I want to do this for 10 million rows.
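
For reference, the Jaccard similarity of two binary fingerprints is the number of positions set to 1 in both, divided by the number of positions set to 1 in either. Below is a minimal Python sketch of that definition, which may help when checking a few returned values by hand. It assumes the MACCS column holds comma-separated 0/1 values, as the SPLIT(row.MACCS, ',') call suggests; the function names on_bits and jaccard are hypothetical and are not part of Neo4j or GDS.

```python
def on_bits(maccs: str) -> set[int]:
    """Return the positions of the 1-bits in a comma-separated 0/1 string."""
    return {i for i, bit in enumerate(maccs.split(",")) if bit.strip() == "1"}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A & B| / |A | B| over the on-bit positions."""
    set_a, set_b = on_bits(a), on_bits(b)
    union = set_a | set_b
    if not union:
        # Both fingerprints are all zeros; this sketch defines that case as 0.0.
        return 0.0
    return len(set_a & set_b) / len(union)


# Bits {0, 2, 3} vs. bits {0, 1, 3}: 2 shared, 4 set in either.
print(jaccard("1,0,1,1,0", "1,1,0,1,0"))  # 0.5
```

Working on sets of on-bit positions rather than on the raw 0/1 values keeps the result tied to which bits are set, not to how many zeros and ones each fingerprint contains.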

abhik1368 · 11-18-2022

The correct query is below:
MATCH (n1:Mol {DrugBank_ID: 'DB00146'})
WITH n1, collect(n1:MACCS) AS fp1
MATCH (n2:Mol)
WITH n2, collect(n2:MACCS) as fp2
RETURN n1,n2,
gds.similarity.jaccard(toIntegerList(n1.MACCS), toIntegerList(n2.MACCS)) AS jaccard;

glilienfield · 11-18-2022

There doesn't seem to be a need to collect fp1 and fp2, since they are not used and they should be empty.
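
Dropping the two collect steps would leave something along these lines. This is only a sketch: it keeps the toIntegerList calls and the MACCS property from the query above, assumes MACCS is stored as a list on each Mol node, and assumes gph_conn.query accepts a Cypher string as in the first post. The WHERE clause and the ORDER BY are additions that were not in the original query. Removing `WITH n2, ...` also keeps n1 in scope for the RETURN, since a WITH clause only carries forward the variables it lists.

```python
# Cypher from the corrected query, minus the unused collect() steps.
JACCARD_QUERY = """
MATCH (n1:Mol {DrugBank_ID: 'DB00146'})
MATCH (n2:Mol)
WHERE n2 <> n1  // addition: skip comparing the node with itself
RETURN n1.DrugBank_ID AS source,
       n2.DrugBank_ID AS target,
       gds.similarity.jaccard(toIntegerList(n1.MACCS), toIntegerList(n2.MACCS)) AS jaccard
ORDER BY jaccard DESC  // addition: most similar molecules first
"""

# gph_conn is the connection object from the first post; uncomment to run
# against a live database.
# result = gph_conn.query(JACCARD_QUERY)
```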

Calculate Jaccard similarity