Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
02-12-2020 11:54 AM
Hi,
I am trying to optimize the running time of the following code:
Attempt 1:
CALL apoc.periodic.iterate(
"MATCH path=((a:alias) -- (c1:citation) -[p1]-> (t:BIOTERM) <-[p2]- (c2:citation) -- (b:alias))
WHERE id(a) < id(b) AND id(c1) <> id(c2)
WITH a, b, c1.CITATION as CITATION1, c2.CITATION as CITATION2, p1, p2, t, 2 as precision
WITH a, b, CITATION1, CITATION2, p1, p2, t, 10^precision as factor
WITH a, b, CITATION1, CITATION2, t, round(factor* (1/(2+p1.weight+p2.weight))) / factor as weight
RETURN a, b, CITATION1, CITATION2, t, weight",
"CREATE (a)-[e:through_topic]->(b)
SET e.weight= weight
SET e.topic = t.MESH_TERM
Set e.citation1 = CITATION1
Set e.citation2 = CITATION2", {batchSize:2500})
YIELD batches, total, errorMessages
Attempt 2:
CALL apoc.periodic.iterate(
"MATCH path=((c1:citation) -[p1]-> (t:BIOTERM) <-[p2]- (c2:citation))
WHERE id(c1) < id(c2)
WITH c1.CITATION as CITATION1, c2.CITATION as CITATION2, p1, p2, t, 2 as precision
MATCH (a:alias) --> (c1)
WITH a, CITATION1, CITATION2, p1, p2, t, 10^precision as factor
MATCH (b:alias) --> (c2)
WHERE id(a) <> id(b)
WITH a, b, CITATION1, CITATION2, t, round(factor* (1/(2+p1.weight+p2.weight))) / factor as weight
RETURN a, b, CITATION1, CITATION2, t, weight",
"CREATE (a)-[e:through_topic]->(b)
SET e.weight= weight
SET e.topic = t.MESH_TERM
Set e.citation1 = CITATION1
Set e.citation2 = CITATION2", {batchSize:5000})
YIELD batches, total, errorMessages
I feel I can do much better so the second and third MATCH statements could be parallelized. Is there a more elegant way?
Thanks,
Lavanya
Solved! Go to Solution.
02-14-2020 04:40 PM
If the relationship name uniquely identifies the next term is BIOTERM etc then using relationship names would be optimal than using labels.
For ex
MATCH ((c1:citation) -[p1:HAS_BIO_TERM]-> (t) <-[p2:HAS_BIO_TERM]- (c2))
....
MATCH (a) -[:HAS_ALIAS]-> (c1)
can be faster.
Whenever a label is specified during traversal Cypher has to validate the node it reached has the specified label or not.
02-13-2020 08:07 AM
Hi Lavanya,
Just trying to understand the context a bit better. Are you tring to find citations to bioterms and start to match aliases (which are indirectly linked through citation) based on citation's weight to the bioterm? Can there be more than one alias related to a citation or more than one citation to alias? (e.g. does (cit1)-(alias)-(cit2) or (a:alias)-(cit)-(b:alias) exist) if not it maybe easier to get a distinct list for each Bioterm <-[p]-(c:citation)-(a:alias) and process from that?
Not sure I fully grasp what you are doing, so if I am talking rubbish please ignore me, but I have a feeling this pattern (a:alias) -- (c1:citation) -[p1]-> (t:BIOTERM) <-[p2]- (c2:citation) -- (b:alias) is what you base your calculation off, but not really the pattern you are looking for which it sounds like is:
[(a:alias) -- (c:citation) -[p]-> (t:BIOTERM)]
as you may get a list of say 10 of these for a particular term and want to calculate each permutation
02-13-2020 08:27 AM
I am trying to replace each of these paths: (a:alias) -- (c1:citation) -[p1]-> (t:BIOTERM) <-[p2]- (c2:citation) -- (b:alias) where id(a) < id(b) and id(c1) <> id(c2) with an edge from (a:alias) to (b:alias) as I do in Attempt 1. I am trying to optimize Attempt 1 since there are a lot of such paths.
My Attempt 2 is slightly off as you rightly catched (there need not be a distinct alias connected to a citation). A right approach would be trying to get all paths of the form ((c1:citation) -[p1]-> (t:BIOTERM) <-[p2]- (c2:citation)) and then get all (a:alias) -- (c1:citation) and (c2:citation) -- (b:alias) and connect all combinations of a and b - can this be parallelized? Maybe I can add an auxilliary edge between (c1:citation) and (c2:citation) and work from there?
I like your suggestions of first getting all [(a:alias) -- (c:citation) -[p]-> (t:BIOTERM)] or rather just [(c:citation) -[p]-> (t:BIOTERM)] and then looking for all combinations of citations connected to a same bioterm and then connecting all pairs of aliases connected to each combination of citations. Can this be done within a query or is it best to export the [(c:citation) -[p]-> (t:BIOTERM)] and [(c:citation) -- (a:alias)] relations as csv files and to do it outside cypher?
There are lot of possible approaches and I want to know which one would be the best.
Thanks,
Lavanya
02-14-2020 02:36 PM
Update: Finally figured out a solution using "OPTIONAL MATCH":
CALL apoc.periodic.iterate(
"MATCH ((c1:citation) -[p1]-> (t:BIOTERM) <-[p2]- (c2:citation))
WHERE id(c1) < id(c2)
WITH c1, c2, p1.weight as w1, p2.weight as w2, t, 2 as precision
WITH c1, c2, w1, w2, t, 10^precision as factor
WITH c1, c2, t, round(factor* (1/(2+w1+w2))) / factor as weight
OPTIONAL MATCH (a) --> (c1)
OPTIONAL MATCH (b) --> (c2)
RETURN a, b, weight, t
",
"
CREATE (a)-[e:through_topic]->(b)
SET e.weight= weight,
e.topic = t.MESH_TERM,
e.citation1 = CITATION1
e.citation2 = CITATION2", {batchSize:5000})
YIELD batches, total, errorMessages
Still need to try it for my larger graph instance - but I am sure this will be faster.
02-14-2020 04:40 PM
If the relationship name uniquely identifies the next term is BIOTERM etc then using relationship names would be optimal than using labels.
For ex
MATCH ((c1:citation) -[p1:HAS_BIO_TERM]-> (t) <-[p2:HAS_BIO_TERM]- (c2))
....
MATCH (a) -[:HAS_ALIAS]-> (c1)
can be faster.
Whenever a label is specified during traversal Cypher has to validate the node it reached has the specified label or not.
All the sessions of the conference are now available online