
Dependence of execution speed on the batch size of passed parameters

Hello. I'm using Neo4j to identify connections between nodes of different labels, with point-in-time (valid_from/valid_to) relationships.
Neo4j 4.4.4 Community Edition

The database is deployed in a Docker container orchestrated by Kubernetes. This is the query I'm running:

MATCH (source_node: Person) WHERE source_node.name in $inputs
MATCH (source_node)-[r]->(child_id:InternalId)
WHERE r.valid_from <= datetime($actualdate) < r.valid_to
WITH [type(r), toString(date(r.valid_from)), child_id.id] as child_path, child_id, false as filtered
CALL apoc.do.when(filtered,
'RETURN child_path as full_path, NULL as issuer_id',
'OPTIONAL MATCH p_path = (child_id)-[:HAS_PARENT_ID*0..50]->(parent_id:InternalId)
WHERE all(a in relationships(p_path) WHERE a.valid_from <= datetime($actualdate) < a.valid_to) AND
NOT EXISTS{ MATCH (parent_id)-[q:HAS_PARENT_ID]->() WHERE q.valid_from <= datetime($actualdate) < q.valid_to}
WITH DISTINCT last(nodes(p_path)) as i_source,
reduce(st = [], q IN relationships(p_path) | st + [type(q), toString(date(q.valid_from)), endNode(q).id])
as parent_path, CASE WHEN length(p_path) = 0 THEN NULL ELSE parent_id END as parent_id, child_path

OPTIONAL MATCH (i_source)-[r:HAS_ISSUER_ID]->(issuer_id:IssuerId)
WHERE r.valid_from <= datetime($actualdate) < r.valid_to
RETURN DISTINCT CASE issuer_id WHEN NULL THEN child_path + parent_path + [type(r), NULL, "NOT FOUND IN RELATION"]
ELSE child_path + parent_path + [type(r), toString(date(r.valid_from)), toInteger(issuer_id.id)]
END as full_path, issuer_id, CASE issuer_id WHEN NULL THEN true ELSE false END as filtered',
{filtered: filtered, child_path: child_path, child_id: child_id, actualdate: $actualdate}
)
YIELD value
RETURN value.full_path as full_path, value.issuer_id as issuer_id, value.filtered as filtered

1) When the query is executed for a large number of input names (Person), it is processed quickly: 100,000 inputs take ~2.5 seconds. However, if those 100,000 names are split into batches and the query is executed for each batch sequentially (roughly as in the sketch below), the overall processing time increases dramatically:
batches of 100 names: ~2 min overall
batches of 1,000 names: ~10 sec overall
Could you please give me a clue why exactly this happens? And is it possible to get the same execution time as for the entire dataset, regardless of the batch size?
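For context, the batched runs are executed roughly like this (a simplified sketch; the URI, credentials, date and batch size are placeholders, and QUERY stands for the Cypher statement above):

from neo4j import GraphDatabase

URI = "bolt://localhost:7687"        # placeholder
AUTH = ("neo4j", "password")         # placeholder
ACTUAL_DATE = "2022-06-01T00:00:00"  # placeholder
BATCH_SIZE = 1000
QUERY = "..."                        # the Cypher query shown above

def run_in_batches(names):
    driver = GraphDatabase.driver(URI, auth=AUTH)
    results = []
    with driver.session() as session:
        # split the full name list into fixed-size batches and run the query once per batch
        for i in range(0, len(names), BATCH_SIZE):
            batch = names[i:i + BATCH_SIZE]
            records = session.run(QUERY, inputs=batch, actualdate=ACTUAL_DATE)
            results.extend(r.data() for r in records)
    driver.close()
    return results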

2) Is there any way to split the transactions across multiple processes? I tried Python multiprocessing with the Neo4j driver (roughly as in the sketch below). It runs faster, but still does not reach the target execution time of ~2.5 sec.
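This is roughly what the multiprocessing attempt looks like (again a simplified sketch; worker count, batch size and connection details are placeholders, and each worker creates its own driver since driver objects are not shared across processes):

from multiprocessing import Pool
from neo4j import GraphDatabase

URI = "bolt://localhost:7687"        # placeholder
AUTH = ("neo4j", "password")         # placeholder
ACTUAL_DATE = "2022-06-01T00:00:00"  # placeholder
QUERY = "..."                        # the Cypher query shown above

def run_batch(batch):
    # each worker process opens its own connection to the database
    driver = GraphDatabase.driver(URI, auth=AUTH)
    with driver.session() as session:
        rows = [r.data() for r in session.run(QUERY, inputs=batch, actualdate=ACTUAL_DATE)]
    driver.close()
    return rows

if __name__ == "__main__":
    names = [...]                    # full list of 100,000 person names
    batch_size = 1000
    batches = [names[i:i + batch_size] for i in range(0, len(names), batch_size)]
    with Pool(processes=4) as pool:
        results = pool.map(run_batch, batches)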

3) Is there any way to keep the entire graph in memory for the whole lifecycle of the container? Could that help with the execution speed when running multiple batches instead of the entire dataset?
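What I have in mind is something like warming the page cache once after the container starts, assuming APOC is installed and the page cache is sized to hold the store (just a sketch, connection details are placeholders):

from neo4j import GraphDatabase

URI = "bolt://localhost:7687"        # placeholder
AUTH = ("neo4j", "password")         # placeholder

# run once at container start so the store files are pulled into the page cache
driver = GraphDatabase.driver(URI, auth=AUTH)
with driver.session() as session:
    session.run("CALL apoc.warmup.run()").consume()
driver.close()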

Essentially, the goal is to process the entire dataset in batches that are as small as possible.

Thank you. 

PS: Any suggestions for improving the query are very welcome.
