Neo4j

mengjiakang2017 · ‎04-05-2022

Hello all,

I was checking on the query running time for same type of question on 2 different graph schemas. I found that the query in model1 drops dramatically (10 times faster) in running time while model2 does not (model and query , run time below). Is this expected?

Thanks in advace!

model1 cypher

match path=(c:ConditionOccurrence)<-[r1:HAS_CONDITION_OCCURRENCE]-(p1:Person)-[r0:HAS_DRUG_EXPOSURE]->(d:DrugExposure)<-[r2:HAS_DRUG_EXPOSURE]-(p1:Person)-[r3:HAS_PROCEDURE_OCCURRENCE]->(p4:ProcedureOccurrence)
where r1.visit_occurrence_id=r2.visit_occurrence_id and r2.visit_occurrence_id=r3.visit_occurrence_id and d.drug_concept_name = 'dexamethasone' and p4.procedure_concept_name contains 'chest'  and c.condition_concept_name ='Spontaneous pneumothorax' 
return path;

running time are 241 ms, 11 ms, 14 ms

model2 cypher

match path = (v1)<-[:ASSOCIATED_DURING_VISIT]-(c:ConditionOccurrence)<-[r0:HAS_CONDITION_OCCURRENCE]-(p:Person)-[r:HAS_DRUG_EXPOSURE]->(d:DrugExposure)-[r1:ASSOCIATED_DURING_VISIT]->(v2)-[r2:ASSOCIATED_DURING_VISIT]-(p2:ProcedureOccurrence)
where d.drug_concept_name = 'dexamethasone' and p2.procedure_concept_name contains 'chest' and c.condition_concept_name ='Spontaneous pneumothorax' 
and v1=v2  
return path;

running time are 52267 ms 47962 ms 48248 ms.

model1

model2

dana_canzano · ‎04-05-2022

@mengjiakang2017

if is 10 times faster it could be how much of the query is getting its data from RAM vs the filesystem and if so this then speaks to conf/neo4j.conf parameter dbms.memory.pagecache.size. Do both environments have the same value? Do both environments have the same graph size? Are both environments running the same Neo4j version? For each query if you preface the cypher with explain are similar plans generated

mengjiakang2017 · ‎04-05-2022

Thank you @dana.canzano for your timely reply!

Both databases are in the same local DBMS and same version, so the configuration is the same and the dbms.memory.pagecache.size=16G. the size of graph1 is smaller than graph4 and attached is the query plan (looks like model1 has some parallel while model4 is all sequential)

v1

v2

dana_canzano · ‎04-06-2022

@mengjiakang2017

From the Explain of v2 and the 4th to last block from the bottom we see a ExpandALL() and resulting in 55k rows surviving. And we have to do an ExpandAll given model2 includes

match path = (v1)<-[:ASSOCIATED_DURING_VISIT]-(c:ConditionOccurrence)<-[r0:HAS_CONDITION_OCCURRENCE]-(p:Person)-[r:HAS_DRUG_EXPOSURE]->(d:DrugExposure)-[r1:ASSOCIATED_DURING_VISIT]->(v2)-[r2:ASSOCIATED_DURING_VISIT]-(p2:ProcedureOccurrence)
where d.drug_concept_name = 'dexamethasone' and p2.procedure_concept_name contains 'chest' and c.condition_concept_name ='Spontaneous pneumothorax' 
and v1=v2  
return path;

and specifically -(c:ConditionOccurrence)<-[r0:HAS_CONDITION_OCCURRENCE]-(p:Person) and as the WHERE clause does restrict on c.condition_concept_name ='Spontaneous pneumothorax' there is no restriction on (p:Person) and so we are going to have to expand all relationships

mengjiakang2017 · ‎04-06-2022

Thank you @dana.canzano !

Are you saying that the expandALL() step is causing the query cache not working on v2? but we can see there are also see 4 ExpandALL() steps in v1 (query cache worked well on this one though)

Also a quick update that when I listed the specific label I need from (v) to (v:VisitOccurrence) for model2, the query plan updated to below

match path = (v1:VisitOccurrence)<-[:ASSOCIATED_DURING_VISIT]-(c:ConditionOccurrence)<-[r0:HAS_CONDITION_OCCURRENCE]-(p:Person)-[r:HAS_DRUG_EXPOSURE]->(d:DrugExposure)-[r1:ASSOCIATED_DURING_VISIT]->(v2:VisitOccurrence)-[r2:ASSOCIATED_DURING_VISIT]-(p2:ProcedureOccurrence)
where d.drug_concept_name = 'dexamethasone' and p2.procedure_concept_name contains 'chest' and c.condition_concept_name ='Spontaneous pneumothorax' 
and v1=v2  
return path;

The query running time are 40664ms, 37287 ms, 36787 ms (seems about 3k ms faster with query cache - is this an expected accelerate by the cache?)
Thanks!

bennu_neo · ‎04-06-2022

Hi @mengjiakang2017 !

I see what you are trying here. Can you open the first Explain as well so we can see what's happening on every branch?

Cypher planner will try his best base on Label Stores cardinalities and Indexes. On that note, can you add an index on :DrugExposure.drug_concept_name, ProcedureOccurrence:procedure_concept_name and :c:ConditionOccurrence.condition_concept_name before sharing the new planners? This time with the Profile instruction

Bennu

Oh, y’all wanted a twist, ey?

mengjiakang2017 · ‎04-06-2022

Hi @bennu.neo ,

Thank you for your suggestions!
just want to clarify that both versions have indexes on the same type of nodes on (concept_id) so I don't think it is the index caused the difference in the query cache in the 2 models.
but you are right that for this query, it makes sense to add the indexes on (concept_name). please see below plan (not sure why at the bottom it returns 0 rows but when I ran the query, there are patterns returned) after adding the indexes you mentioned for model2

Also wanted to add the background that I'm comparing between two graph models so really want to know what caused the query cache to work well on one model but not on the other (with same setup and same question

bennu_neo · ‎04-06-2022

Hi @mengjiakang2017 !

Can you share the Profile on both queries? So we can see the db hits on each step and compare them

Bennu

Oh, y’all wanted a twist, ey?

mengjiakang2017 · ‎04-06-2022

Of course, please see below plan for model1.

Thank you for your timely help!

bennu_neo · ‎04-06-2022

Thanks @mengjiakang2017!

First part looks the same for both plans now. How are both of them having timewise?

About Profile, it's the same as Explain but using the word PROFILE instead.

This will give us some real metrics of the execution.

Bennu

Oh, y’all wanted a twist, ey?

dana_canzano · ‎04-06-2022

couple of issues.

a. prefacing a query with Explain will simply provide a query plan based upon statistics in the database relative to Node counts etc. Profile is effectively the same except it is more accurate picture as it will also the run the query.

b. with reference to know what caused the query cache to work well maybe semantics but there really isnt a query cache.

Query and their related plans are cached. For example if on post cold start of Neo4j you run a query he 1st thing we do is determine if this query plan, i.e the plan itself no the results, not the data, but rather f the plan are in the query plan cache. If the query s not in the query plan cache then the query needs to be planned. However if the query exists in the plan then we will re-use the plan, and thus avoid re-planning the query. Yes there is a edge case that even if the plan is in the query we may not use he cached plan if Neo4j determine the data has significant changed. For example if the query plan cache entry was added when there were 100k :Person nodes but now the same query is re-run but there is 5M :Person nodes then the cached plan will be thrown out and we will thus plan.
the other 'cache' is what is described by dbms.memory.pagecache.size, which allocates a portion of RAM to record the data in the graph. When a query accesses data if it accesses the data from the filesystem, it will then add this data to the pagecache, i.e. RAM. And so if you re-run the query, the 2nd run should be faster since presumably much of the data is in the pagecache/RAM. And the pagecache is set up in a LRU (least recently used) design. So if you define the pagecache with a size of 10G, and you run a query which accesses 15G of data then in theory the pagecache will contain the last used 10G of that 15G access. Also it should be noted that your pagecache is automatically reloaded across Neo4j restarts. For example if the pageache has all data from 2021 and you restart Neo4j, then upon restart we automatically put back all the data into the pagecache which was in the pagecahce prior to stop.
@mengjiakang2017 @bennu.neo

Beyond that there are no other caches.

mengjiakang2017 · ‎04-07-2022

Thank you @bennu.neo and @dana.canzano for the suggestions and insights!

Indeed after adding the 3 indexes on concept_name, model2 ran much faster while model1 ran slower (compared to 241 ms as in the first post - could be due to server status)

model1 - 305ms, 49 ms, 23 ms
model2 - 865 ms, 95 ms, 88 ms (much faster with additional index - looks like the query cache highly depend on the index?)

model1 has less db hits than model2 too (attached the profile for both)
model1-

model2-

bennu_neo · ‎04-07-2022

Hello @mengjiakang2017,

As @dana.canzano highlighted before, I don't think this difference was related with cache at all. Previous planners weren't optimized, due to lack of indexes so some steps weren't the 'best' possible for your query.

One thing I notice is that even if the first part of the plan is the same for both queries, the rows and hits are different. Beside the model change, is the cardinality of objects the same?

e.g What's the result of

MATCH(c:ConditionOccurrence {condition_concept_name  : 'Spontaneous pneumothorax' })
return c

on each model?

Is there more of one on model 2?

Bennu

Oh, y’all wanted a twist, ey?

mengjiakang2017 · ‎04-07-2022

Thank you @bennu.neo , I agree that the original plan was not optimized due to limited indexes. for the cardinality of objects, can you please clarify?

as for the cypher results, model1 returns count of 1 while model2 returns count of 15, which is expected becuase mdoel2 loads occurrence to the Condition node while model1 only load the unique condition_concept.

when checking the path of below - model1 also returned 15

MATCH path=(p:Person)-[]->(c:ConditionOccurrence {condition_concept_name  : 'Spontaneous pneumothorax' })
return count(path)

Neo4j

Query cache for speeding up queries