
What is the concurrency of Cypher subqueries ('CALL {}')?

tms
Graph Buddy

Are Cypher subqueries (`CALL {}`) synchronous or asynchronous? I thought they were synchronous,
meaning that if a query has two subqueries, execution of the second would
not begin until the first is complete. I also thought that the results of the
transaction implied by the second would be available to the first after each
completion of the second.

At least with the way I'm calling the subqueries, this is not the case. I
invite suggestions about what I need to do differently to accomplish my goal.

I apologize for the length of this post; I don't know how to shorten it.

Everything that follows runs on neo4j-enterprise v4.4.6 on a well-provisioned
and robust AWS EC2 instance running RockyLinux. The Python code runs under
Python v3.9.6.

I have an interconnected set of processes (described below) where the results
of the second process are needed by the first process. I have been driving
these in Neo4j from a Python process that uses the standard Python driver.

There are two steps to the calculation. There is feedback from the second step
to the first step, and that feedback is the source of the issue I see in my
combined cypher.

Until now, the Python code performs the first task for a given date. It then
performs the second task for the same date. This means that the results of the
second task are available to the first task when it runs for the next date.

For example, instances of PredictedToRise are computed for 2022-02-03. Then
instances of HotSpot are computed for 2022-02-03. Instances of PredictedToRise
are then computed for 2022-02-04. The PredictedToRise instances created for
2022-02-04 are restricted by the HotSpot list created for 2022-02-03.

Since it is being driven from Python, the processing for a given date (such as
2022-02-03) is always finished before the processing for the subsequent date
(such as 2022-02-04) begins.

For a variety of reasons, including performance, I attempted to do all this in
Cypher. An all-Cypher approach must produce the same results as the current
approach using the Python driver. I appreciate any guidance about what I must
do differently to accomplish this.

I migrated the two steps into subqueries (using the `CALL{}` directive),
making the apparently incorrect assumption that when processing a sequence of
date instances (produced by the `UNWIND` directive), the lexically second
subquery does not begin until the lexically first subquery has completed. I
also assumed that processing of the second date in the UNWIND sequence does
not begin until both subqueries have completed for the first date.

I've shown pseudocode for the Cypher I use below. The code runs without
complaint. It generates instances of PredictedToRise and HotSpot, and it adds
the expected recentHotSpotsJSON property to each instance of HotSpotMetadata.

Sadly, it produces results that are dramatically different from the current
Python ingestor. In particular, it behaves as if the recentHotSpotsJSON
generated by the second subquery on a previous date is not restricting the
first subquery for the subsequent date.

The instance counts for PredictedToRise and HotSpot are the same for the first
date that is processed. For every date after the first, those counts are
dramatically higher in the all-Cypher approach. It appears that the
recentHotSpotsJSON value written by the second subquery is not available (in
time) to the first subquery. The recentHotSpotsJSON values are all present
when the all-Cypher approach finishes, but they are much larger than expected.

The code in question is handling a sequence of datapoints, where each datapoint
is the value of a specific property (cumulative COVID case count) measured on a
given date (the "pertainsDate") and reported for one of the 3K+ US counties.
Each county is identified by a "stateCountyFIPSCode" assigned by the US Census
Bureau.

For each date, instances of "PredictedToRise" are generated based on a complex
calculation of current and historical data. "PredictedToRise" is a labeled node
that contains a pertainsDate (the date of the prediction) and a
stateCountyFIPSCode. The existence of an instance of PredictedToRise signifies
that the given county is expected to show a statistically significant increase
on the given date.

After PredictedToRise instances are generated for a given date, those instances
are used to generate instances of "HotSpot", another labeled node that contains
a pertainsDate and stateCountyFIPSCode. The county identified by each instance
of PredictedToRise has geographic neighbors that it may infect. Each infected
neighbor of a PredictedToRise instance (with some complicated filtering) is a
HotSpot for that date.

The feedback is that the PredictedToRise instances for a given date are
filtered by the HotSpot instances for the immediately previous date.

This means that after the HotSpot instances are generated for a given date,
they are collected in a list for that date that is kept separately. The list is
stored and queried as a JSON string for implementation convenience (and to
reduce the memory footprint of all this).
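To make the feedback mechanism concrete, here is a minimal sketch of how the list round-trips through JSON, using the HotSpotMetadata label and recentHotSpotsJSON property from the pseudo-cypher below (the date and the FIPS code '25017' are illustrative values, and the codes are assumed to be stored as strings):

MATCH (hotSpot:HotSpot {pertainsDate: date('2022-02-03')})
WITH COLLECT(DISTINCT hotSpot.stateCountyFIPSCode) AS recentFIPSCodes
MATCH (hotSpotMetadata:HotSpotMetadata {pertainsDate: date('2022-02-03')})
SET hotSpotMetadata.recentHotSpotsJSON = apoc.convert.toJson(recentFIPSCodes);

// The next day, test whether a candidate county was a recent hot spot:
MATCH (hotSpotMetadata:HotSpotMetadata {pertainsDate: date('2022-02-03')})
RETURN '25017' IN apoc.convert.fromJsonList(hotSpotMetadata.recentHotSpotsJSON)
    AS wasRecentHotSpot;

An exact membership test like this avoids any chance of one FIPS code matching inside another in the JSON string; the substring check with apoc.text.indexesOf in the pseudo-cypher is equivalent as long as that cannot happen.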

Here is the pseudo-cypher:

UNWIND sequenceOfDateInstances AS processDate
CALL { // Add instances of PredictedToRise
    WITH
        processDate
    WITH
        processDate,
        processDate - duration({days: 1}) AS restrictionDate
    MATCH (metadata:HotSpotMetadata {pertainsDate: restrictionDate})
    WITH
        processDate,
        metadata.recentHotSpotsJSON AS recentHotSpotsJSON
    WITH
        processDate,
        recentHotSpotsJSON,
        <complex-calculation> AS stateCountyFIPSCode
    WHERE
        <complex calculation>
        AND SIZE(apoc.text.indexesOf(recentHotSpotsJSON, neighborAdjustedCaseCount.stateCountyFIPSCode)) = 0
    CREATE (predictedToRise:PredictedToRise)
    SET
        predictedToRise.pertainsDate = processDate,
        predictedToRise.stateCountyFIPSCode = stateCountyFIPSCode
}

CALL { // Add instances of HotSpot
    WITH
        processDate
    MATCH (predictedToRise:PredictedToRise {pertainsDate: processDate})
    WITH
        processDate,
        <complex-calculation using predictedToRise> AS hotSpotStateCountyFIPSCode
    CREATE (hotSpot:HotSpot)
    SET
        hotSpot.pertainsDate = processDate,
        hotSpot.stateCountyFIPSCode = hotSpotStateCountyFIPSCode
    WITH
        processDate,
        lookbackPeriod  // instance of duration
    WITH
        processDate,
        processDate - lookbackPeriod AS startDate

    MATCH (hotSpotMetadata:HotSpotMetadata {pertainsDate: processDate})
    MATCH (hotSpot:HotSpot)
    WHERE
        hotSpot.pertainsDate >= startDate
        AND hotSpot.pertainsDate <= processDate
    WITH
        hotSpotMetadata,
        hotSpot.stateCountyFIPSCode AS hotSpotStateCountyFIPSCode
    ORDER BY
        hotSpotStateCountyFIPSCode
    WITH
        hotSpotMetadata,
        apoc.convert.toJson(COLLECT(DISTINCT hotSpotStateCountyFIPSCode)) AS recentHotSpotsJSON
    SET
        hotSpotMetadata.recentHotSpotsJSON = recentHotSpotsJSON
}

 

 

1 ACCEPTED SOLUTION

'Call' subqueries are executed sequentially, and the side effects of the first are available to subsequent subqueries. Take the following query for example:

create(:Test{id:1})
with 1 as dummy
call {
    match(x:Test{id:1})
    set x.id=100
}
call {
    match(m:Test)
    where m.id = 100
    return m.id as id
}
return id

In the above, the result is 'id' equal to 100.

The behavior becomes unexpected when the query begins with an 'unwind' clause. I modified the query and added an 'unwind' at the beginning.

unwind [1,2,3,4] as id
call {
    with id
    merge(x:Test)
    set x.id=id
}
call {
    match(m:Test)
    return m.id as node_id
}
return node_id

In this case I was expecting node_id to be 1, 2, 3, and 4. Instead, I got '4, 4, 4, and 4.' I did some other tests where I inserted the creation of other nodes in the query so I could track the execution order. The results implied that the 'unwind' is done and the first 'call' subquery is executed all four times before the second subquery is called. This is validated by the query plan, which shows a forEach loop operating over the 'merge' of the first 'call' subquery. The results of the first subquery are then linked to the second subquery, explaining the output of '4, 4, 4, and 4.'

[attached query plan: call.png]

I then wrapped the two 'call' subqueries in their own subquery to see if that would resolve the issue, and it did. This time I got back ids equal to '1, 2, 3, and 4.'

unwind [1,2,3,4] as id
call {
    with id
    call {
        with id
        merge(x:Test)
        set x.id=id
    }
    call {
        match(m:Test)
        return m.id as node_id
    }
    return node_id
}
return node_id

I also tried removing the two 'call' subqueries entirely to see if that resolved the issue. Again, I got a result of '4, 4, 4, and 4.'

unwind [1,2,3,4] as id
merge(n:Test) set n.id = id
with 1 as dummy
match(n:Test)
with n.id as nodes_id
return nodes_id

Based on all the observations, I think this behavior can be explained by realizing that the query is executed in parts, which are separated by each 'with' clause. In our examples, the first part consists of the 'unwind' and the first 'call' subquery, or the 'unwind' and the first 'merge' (in the simplified query). The results are sent to the second part through the 'with' clause, so the side effects of executing the first part for every value in the 'unwind' are visible to the second part.

In conclusion, I think your issue can be resolved if you wrap your two 'call' subqueries in one outer 'call' subquery.


7 REPLIES

glilienfield
Ninja

Call subqueries are computed sequentially and within the same transaction, so any side effects in one subquery will be known to subsequent subqueries, as is the case for the outer query. As an example, the query below creates a node each time it is executed, the first subquery updates its property, and the second subquery matches on all the nodes and filters on the value of the property that was updated in the first subquery. As the results show, one node is returned with 'id' equal to 100, as expected.

 

create(:Test{id:1})
with 1 as dummy
call {
    match(x:Test{id:1})
    set x.id=100
}
call {
    match(m:Test)
    where m.id = 100
    return m.id as id
}
return id

 

The results become unexpected when an 'unwind' is inserted before the subqueries. Take the following query, which mimics the flow of your query:

 

unwind [1,2,3,4] as id
call{
    with id
    merge(n:Test) set n.id = id
}
call {
    match(n:Test)
    return n.id as nodes_id
}
return id, nodes_id

 

Like you, I expected the entire query to execute individually for each value of id, so I expected rows with nodes_id equal to 1, 2, 3, and 4. Instead, I got four rows, all with nodes_id equal to 4. I inserted some other create-node operations along the path so I could track the order of execution. The data implies that the first 'call' subquery gets executed four times, and only then does the second 'call' subquery get executed for each value of 'id.' This behavior seems validated by the attached 'explain' output. Notice there is a forEach operation that looks like it is iterating over the first 'call' subquery, resulting in its 'merge' operation occurring once for each value of 'id' in the list. It looks like, by the time the second 'call' subquery is executed, the one Test node has had its 'id' set to 4.

[attached query plan: call.png]
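For reference, the extra create-node tracking mentioned above might look something like the following sketch (the 'Trace' label and its properties are invented for illustration and were not part of the actual test):

unwind [1,2,3,4] as id
call {
    with id
    merge(n:Test) set n.id = id
    // record that the first subquery ran for this id
    create (:Trace {step: 'first', forId: id})
}
call {
    with id
    match(m:Test)
    // record what the second subquery saw for this id
    create (:Trace {step: 'second', forId: id, sawId: m.id})
}
return id

Reading the Trace nodes back afterwards (for example, ordered by their internal node ids) should show all four 'first' entries before any 'second' entry, and every 'second' entry recording sawId = 4, which matches the forEach behavior in the plan.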

I was able to get the expected result by wrapping the two 'call' subqueries in another subquery, so the two inner subqueries are run sequentially, once per value of 'id.' The result of this query is ids equal to 1, 2, 3, and 4.

 

unwind [1,2,3,4] as id
call {
    with id
    call {
        with id
        merge(n:Test) set n.id = id
    }
    call {
        match(n:Test)
        return n.id as nodes_id
    }
    return nodes_id
}
return id, nodes_id

 

I thought another way around this would be to remove the two 'call's and execute the code directly. Here is that query:

 

unwind [1,2,3,4] as id
merge(n:Test) set n.id = id
with 1 as dummy
match(n:Test)
with n.id as nodes_id
return nodes_id

 

Unexpectedly, I also got '4, 4, 4, 4' as the result. I again assumed it would run the query entirely for each value of 'id'. The only way to explain the behavior is to realize the query is executed in parts, and the parts are separated by each 'with' clause. In this case, the 'unwind' and the 'merge' are one part, which is completed entirely before passing through the 'with' clause to the second part of the query. If true, the 'unwind' results in four rows with id equal to 1, 2, 3, and 4, and each row then executes the 'merge' statement, leaving the single node's 'id' finally equal to 4. The four rows are then passed through the 'with' clause, and the match is done for each of them, each time seeing the node with 'id' equal to 4.
This explains the behavior with the two 'call' subqueries if we assume there is an implied 'with' statement separating them.

 The conclusion for your situation is to wrap your two 'call' subqueries in a subquery. 


Just a note: you could probably get by with removing the two inner subqueries and just leaving one wrapped around the entire code, to force the unwind to behave the way you need it to. I say this because it doesn't look like your code uses the subqueries as subqueries, nor in transactions, so they don't seem necessary.
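Applied to the Test example, that simplification might look something like this (a sketch I have not run, so verify it the same way as above):

unwind [1,2,3,4] as id
call {
    with id
    // what used to be the first subquery
    merge(n:Test) set n.id = id
    // what used to be the second subquery
    with id
    match(m:Test)
    return m.id as node_id
}
return id, node_id

The idea is the same as the wrapped version above: the subquery body runs once per incoming row, so the write and the read happen together for each value of 'id'.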

I need to ensure that the result of the second subquery is available to the first subquery when the pair is run for the second date. I'll try the code without the two subqueries and see how it behaves. I appreciate your attention.

I'll try wrapping this as suggested and see if it solves the issue. I appreciate your attention.

tms
Graph Buddy

I'll explore both of these helpful suggestions and follow up when I've had a chance to examine the results.

tms
Graph Buddy

Wrapping the two parts in a single top-level subquery solves my problem.

I experienced some unrelated misbehavior from the Cypher parser, which complained about duplicated variable names, so I kept the subquery wrappers around each of the two interior parts.
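For anyone who lands here later, the final shape is roughly the following skeleton (the CREATE lines are placeholders for the real PredictedToRise and HotSpot steps shown earlier, and sequenceOfDateInstances is supplied as a parameter):

UNWIND $sequenceOfDateInstances AS processDate
CALL {
    WITH processDate
    CALL {
        WITH processDate
        // placeholder for step 1: create the PredictedToRise instances for
        // processDate, filtered by the previous date's recentHotSpotsJSON
        CREATE (:PredictedToRise {pertainsDate: processDate})
    }
    CALL {
        WITH processDate
        // placeholder for step 2: create the HotSpot instances for processDate
        // and refresh HotSpotMetadata.recentHotSpotsJSON
        CREATE (:HotSpot {pertainsDate: processDate})
    }
    RETURN count(*) AS rowsDone
}
RETURN count(*) AS datesProcessed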

I've marked the original response from @glilienfield as the solution; I appreciate the incredibly helpful reply.

 

 
