cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

SQL to Cypher having multiple nested inner join

nahmed
Node Link

I am trying to convert this long SQL query to cypher. I have two types of node: patient and apachepatientresult. Both having a common relation using node property as patientunitstayid. How can I achieve this?. Please help

drop materialized view if exists labels cascade;
create materialized view labels as
  -- select all the data we need from the apache predictions table, plus patient identifier and hospital identifier
  -- information because we only want to select one episode per patient (more on this later)
  with all_labels as (
    select p.uniquepid, p.patienthealthsystemstayid, apr.patientunitstayid, p.unitvisitnumber,
      apr.predictedhospitalmortality, apr.actualhospitalmortality, apr.predictediculos, apr.actualiculos
      from patient as p
      inner join apachepatientresult as apr
        on p.patientunitstayid = apr.patientunitstayid
      -- only use the most recent apache prediction model and exclude anyone who doesn't have at least 24 hours of data
      where apr.apacheversion = 'IVa' and apr.actualiculos >= 1
    )
  select al.patientunitstayid, al.predictedhospitalmortality, al.actualhospitalmortality,
    al.predictediculos, al.actualiculos
    from all_labels as al
    -- 'selection' is a table which will choose a random hospital stay (the lowest number is fine because the stays
    -- are randomly ordered). In the case of multiple ICU stays within that hospital admission, it will choose the
    -- first ICU stay that satisfies the 24 hours of data requirement. The rationale is that the model should be
    -- applied as soon as there is 24 hours of continuous data within the hospital. This query extracts 89143 stays.
    inner join (
      select p.uniquepid, p.patienthealthsystemstayid, min(p.unitvisitnumber) as unitvisitnumber
        from patient as p
        inner join (
          select uniquepid, min(patienthealthsystemstayid) as patienthealthsystemstayid
            from all_labels
            group by uniquepid
          ) as intermediate_selection
          on p.patienthealthsystemstayid = intermediate_selection.patienthealthsystemstayid
        group by p.uniquepid, p.patienthealthsystemstayid
      ) as selection
      on al.patienthealthsystemstayid = selection.patienthealthsystemstayid
      and al.unitvisitnumber = selection.unitvisitnumber;

I  

1 ACCEPTED SOLUTION

I don't think there is a way around matching the patient nodes three times.  The following query is the same, but breaking out the part that calculates the 'patienthealthsystemstayid' and 'unitvisitnumber' into a 'call' subquery makes it more understandable.

call {
    match(p:patient)-[:HAS_RESULT]->(apr:apachepatientresult)
    where apr.apacheversion = 'IVa' and apr.actualiculos >= 1
    with p.uniquepid as uniquepid, min(p.patienthealthsystemstayid) as patienthealthsystemstayid
    match(p:patient) 
    where p.patienthealthsystemstayid = patienthealthsystemstayid
    with p.uniquepid as uniquepid, patienthealthsystemstayid, min(p.unitvisitnumber) as unitvisitnumber
    return  patienthealthsystemstayid, unitvisitnumber
}
match(p:patient)-[:HAS_RESULT]->(apr:apachepatientresult)
where apr.apacheversion = 'IVa' and apr.actualiculos >= 1
and p.patienthealthsystemstayid = patienthealthsystemstayid
and p.unitvisitnumber = unitvisitnumber
return p

By the way, I believe your original algorithm can produce more than one episode per patient.  This can happen when a patient has the same combination of patienthealthsystemstayid and unitvisitnumber that is the same as another patient's minimum combination, but is not the same as this patient's combination.  I refactored the above query to eliminate that possibility by returning only one record per patient which is the one that has their minimum combination of patienthealthsystemstayid and unitvisitnumber. 

match(p:patient)-[:HAS_RESULT]->(apr:apachepatientresult)
where apr.apacheversion = 'IVa' and apr.actualiculos >= 1
with p.uniquepid as uniquepid, min(p.patienthealthsystemstayid) as patienthealthsystemstayid
call {
    with uniquepid, patienthealthsystemstayid
    match(p:patient) 
    where p.uniquepid = uniquepid and p.patienthealthsystemstayid = patienthealthsystemstayid 
    return min(p.unitvisitnumber) as unitvisitnumber
}
match(p:patient)
where p.uniquepid = uniquepid 
and p.patienthealthsystemstayid = patienthealthsystemstayid
and p.unitvisitnumber = unitvisitnumber
return p

View solution in original post

9 REPLIES 9

I am assuming you have a data model similar to this: 'match(p:Patient)-[:HAS_RESULT]->(apr:apachepatientresult)'

 
The 'all_labels' subquery is providing a list of patients who have patient results with the following criteria: 
apr.apacheversion = 'IVa' and apr.actualiculos >= 1

The actual query is selecting from the above population each patient's row that corresponds to the patient's row that has the minimum 'patienthealthsystemstayid' and minimum 'unitvisitnumber'. If the 'patient' table represents, there would be only one record each, and these minimums would be the single values. As such, I am think the 'patient' table does not represent the patient information, but does it contain the patient's encounter information, and the other table is the results from each encounter?

Can you provide the relevant part of the database schema, i.e. these two tables?

Hello Ninja, 

Thank you for your response. I have attached the database schema. 

Does the 'patent' table in sql represent a patient, i.e. one record per patient, or is it like a patient encounter table and the uniquepid is a foreign key that references the patient information?  This would make more sense from the sql query. My confusing is how the query is finding the min values if it was a patient table. I can see how it would if 'patient' is really an encounter table, thus there would be multiple records in the 'patient' table for one real patient represented by the patient's uniquepid.  Is this correct?   If so, do you have another neo4j node that has the patient demographics?

The 'patient' table in sql does not represent a patient per record, but rather multiple stays of every patient in a hospital. Here is the link of the actual schema which I am following. https://eicu-crd.mit.edu/eicutables/patient/ . 'patientUnitStayID' is the PK and 'patientHealthSystemStayID' is the FK. I think my graph model for this sql query is wrong. I have designed each node as a patient stay that has all the demographic information of the patient. And for your last question only patient table has all the demographics. You can visit the link for more info.

Thanks 

This is making a lot more sense.  Maybe you shoukd relabel your Patient node to PatientVisit or something similar to distinguish it from the Patient. 
Is ‘patientHealthSystemStayID’ a foreign key to health provider table? 
I will review the schema. I think I can write the query now. 

'patientHealthSystemStayID' is connected to the apacheApsVar table. I will recreate the model having patient node with the uniquepid, and seperate node for the PatientVisit. Thank you so much. But what would be the query then?

 

I want to review it some more and look for optimizations, but you can give it a try to see what results you get.

match(p:patient)-[:HAS_RESULT]->(apr:apachepatientresult)
where apr.apacheversion = 'IVa' and apr.actualiculos >= 1
with p.uniquepid as uniquepid, min(p.patienthealthsystemstayid) as patienthealthsystemstayid
with uniquepid, patienthealthsystemstayid
match(p:patient) 
where p.patienthealthsystemstayid = patienthealthsystemstayid
with p.uniquepid as uniquepid, patienthealthsystemstayid, min(p.unitvisitnumber) as unitvisitnumber
with  patienthealthsystemstayid, unitvisitnumber
match(p:patient)-[:HAS_RESULT]->(apr:apachepatientresult)
where apr.apacheversion = 'IVa' and apr.actualiculos >= 1
and p.patienthealthsystemstayid = patienthealthsystemstayid
and p.unitvisitnumber = unitvisitnumber
return p

I don't think there is a way around matching the patient nodes three times.  The following query is the same, but breaking out the part that calculates the 'patienthealthsystemstayid' and 'unitvisitnumber' into a 'call' subquery makes it more understandable.

call {
    match(p:patient)-[:HAS_RESULT]->(apr:apachepatientresult)
    where apr.apacheversion = 'IVa' and apr.actualiculos >= 1
    with p.uniquepid as uniquepid, min(p.patienthealthsystemstayid) as patienthealthsystemstayid
    match(p:patient) 
    where p.patienthealthsystemstayid = patienthealthsystemstayid
    with p.uniquepid as uniquepid, patienthealthsystemstayid, min(p.unitvisitnumber) as unitvisitnumber
    return  patienthealthsystemstayid, unitvisitnumber
}
match(p:patient)-[:HAS_RESULT]->(apr:apachepatientresult)
where apr.apacheversion = 'IVa' and apr.actualiculos >= 1
and p.patienthealthsystemstayid = patienthealthsystemstayid
and p.unitvisitnumber = unitvisitnumber
return p

By the way, I believe your original algorithm can produce more than one episode per patient.  This can happen when a patient has the same combination of patienthealthsystemstayid and unitvisitnumber that is the same as another patient's minimum combination, but is not the same as this patient's combination.  I refactored the above query to eliminate that possibility by returning only one record per patient which is the one that has their minimum combination of patienthealthsystemstayid and unitvisitnumber. 

match(p:patient)-[:HAS_RESULT]->(apr:apachepatientresult)
where apr.apacheversion = 'IVa' and apr.actualiculos >= 1
with p.uniquepid as uniquepid, min(p.patienthealthsystemstayid) as patienthealthsystemstayid
call {
    with uniquepid, patienthealthsystemstayid
    match(p:patient) 
    where p.uniquepid = uniquepid and p.patienthealthsystemstayid = patienthealthsystemstayid 
    return min(p.unitvisitnumber) as unitvisitnumber
}
match(p:patient)
where p.uniquepid = uniquepid 
and p.patienthealthsystemstayid = patienthealthsystemstayid
and p.unitvisitnumber = unitvisitnumber
return p

nahmed
Node Link

Thank you so much. the first query you posted works fine. And after creating an index of the node attributes, it is also much faster.