11-15-2022 04:49 PM
I have a file with node IDs and a property called MACCS made up of 0s and 1s. I want to calculate Jaccard similarity. What is an efficient way to do it? I have attached the file link here. I want to load the file; the query I am using is below, where gph_conn is the connection.
gph_conn.query("""
// USING PERIODIC COMMIT 100
LOAD CSV WITH HEADERS FROM 'file:///D:/Github/dbtest.csv' AS row
UNWIND SPLIT(row.MACCS, ',') AS i
CREATE (m:Mol {DrugBank_ID: row.DrugBank_ID,
MACCS:toInteger(i)
}
)
""")
Then I want to call gds.similarity.jaccard to compute the similarity between one node and all of the other nodes. The query below doesn't work because of the format of the
MATCH (n1:Mol {DrugBank_ID: 'DB00146'})
WITH n1, collect(n1:MACCS) AS fp1
MATCH (n2:Mol)
WITH n2, collect(n2:MACCS) as fp2
RETURN n1,n2,
gds.similarity.jaccard(toIntegerList(n1.ECFP4), toIntegerList(n2.ECFP4)) AS jaccard;
The query above should return similarity values. Is there a way to calculate the similarity faster with indexes? I want to do this for 10 million rows.
11-18-2022 09:28 AM
The correct query is below:
MATCH (n1:Mol {DrugBank_ID: 'DB00146'})
WITH n1, collect(n1:MACCS) AS fp1
MATCH (n2:Mol)
WITH n2, collect(n2:MACCS) as fp2
RETURN n1,n2,
gds.similarity.jaccard(toIntegerList(n1.MACCS), toIntegerList(n2.MACCS)) AS jaccard;
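For anyone checking the numbers that come back from the query above, here is a minimal Python sketch of Jaccard similarity for 0/1 fingerprints such as MACCS. It assumes the fingerprint is a comma-separated string of bits (as the SPLIT in the load query suggests) and compares the positions of the 1-bits as sets. The helper names on_bits and jaccard are made up for this illustration and are not part of GDS. As far as I understand, gds.similarity.jaccard compares its two lists as collections of values rather than position by position, so passing on-bit positions instead of raw 0/1 values may be the safer input; please verify that against the GDS documentation for your version.

```python
from typing import Set


def on_bits(maccs: str, sep: str = ",") -> Set[int]:
    """Return the positions of the 1-bits in a 0/1 fingerprint string."""
    return {i for i, bit in enumerate(maccs.split(sep)) if bit.strip() == "1"}


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity |A & B| / |A | B|; taken as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Two toy 5-bit fingerprints (illustrative data, not from the attached file).
fp1 = on_bits("1,0,1,1,0")  # bits 0, 2, 3 are set
fp2 = on_bits("1,1,0,1,0")  # bits 0, 1, 3 are set

# 2 bits are set in both, 4 bits are set in at least one: 2 / 4 = 0.5
similarity = jaccard(fp1, fp2)
```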
11-18-2022 12:41 PM
There doesn’t seem to be a need to collect fp1 and fp2, since they are not used and they should be empty.
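Following that suggestion, a possible version of the query without the collect steps could look like the sketch below. It is only a sketch under two assumptions: each Mol node is one molecule, and its MACCS property is already stored as a list of integers (the load query in the original post creates one node per bit instead, so it would need to change first). The function name similarity_query is hypothetical, and whether gph_conn.query accepts a parameter dictionary depends on how that wrapper is written.

```python
def similarity_query(drug_id: str):
    """Build the Cypher text and parameters for a one-vs-all Jaccard comparison."""
    cypher = """
    MATCH (n1:Mol {DrugBank_ID: $drug_id})
    MATCH (n2:Mol)
    WHERE n2 <> n1
    RETURN n1.DrugBank_ID AS source,
           n2.DrugBank_ID AS target,
           gds.similarity.jaccard(n1.MACCS, n2.MACCS) AS jaccard
    ORDER BY jaccard DESC
    """
    # The ID is passed as a parameter rather than pasted into the query text.
    return cypher, {"drug_id": drug_id}


cypher, params = similarity_query("DB00146")
# Possibly run as gph_conn.query(cypher, params), if the wrapper supports parameters.
```

Keeping both MATCH clauses without an intermediate WITH leaves n1 and n2 in scope for the RETURN. An index on DrugBank_ID for Mol nodes would only speed up finding n1; the comparison itself still has to visit every other Mol node.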