Neo4j

felipe · ‎04-17-2020

Hello.

My team is working on a government project to help detect and prevent virus transmission.

Using different technologies (Wifi, bluetooth, gps, manual input) we are able to detect human to human contact events.

Based on the nature and characteristics of each contact event, we are able to determine for each contact event the probability of transmission of the virus.

In general terms, each contact between 2 humans have a date and a probability of transmission (weight).

My team is fairly new with graphs but after doing some research we are certain that this is the technology we need in order to answer the questions we want to ask to our data.

We can map all this humans and contacts between then in a graph such as the example shown in the diagram.

The main question we want to be able to answer is:
If human X is tested positive por covid19 on date Y, give me the probability of transmission to other humans taking into account the contacts with all other humans (direct and indirect having into account the date of contact).

How can I approach this problem in neo4js?
How can I build a recursive (or something like that) cypher query that is able to find all its possible transmission paths so I can find this potential sick people by transmission of this human X?

Thanx in advance

William_Lin · ‎04-17-2020

Hi, felipe
Not sure how did you determine the probability, but it seems the patient will show the symptom between 14~20 days.
maybe you can set something like the contact date threshold (14~20 days).

For example: If you know the first patient spread the virus in the country (if contact date threshold > 40 days ago), obviously he/she will not be direct transmits the virus to the patient who has confirmed diagnosis today.

For those non-symptomatic patients, maybe you can trace back by those patients who are infected by non-symptomatic patients.

I hope this will help you.

felipe · ‎04-17-2020

Hello @William_Lin , thanx for your reply.

The probability stuff has been determined by a epidemiology team based on the charactersitics of each contact event (based on distance, duration, etc).

So, the data represents the probability of transmission of that contact event if the person X was positive.

So, the question we need to answer is:
When a positive is reported (real one), we need to evaluate all the possible transmissions that might have occured based on the contacts we have gathered.

andrew_bowman · ‎04-19-2020

For this I'd imagine you'd want a query with a variable-length pattern match, with some logic to ensure that all contacts are after some starting date (when you believe the subject became contagious), with extra filtering afterwards to ensure that each relationship in the path came before the previous relationship in the path (no need to check paths out from a person before they came in contact with a suspected spreader).

Once all that filtering is done, then it's a matter of using a reduce() function over the relationships of the path to multiply out the probabilities, then do any additional filtering if you aren't interested in paths with probabilities below a certain threshold.

Here's an example, limiting the distances to 15 hops, given parameters for the id of the confirmed infected individual, a starting date for when you believe they acquired the virus and became infectious.

MATCH path = (start:Human {id:$infectedId})-[:CONTACT_WITH*..15]-(potential:Human)
WHERE all(rel in relationships(path) WHERE rel.date >= $acquireDate)
   // now ensure each node occurs once in the path, no doubling back
  AND all(node in nodes(path) WHERE single(single in nodes(path) WHERE node = single))
WITH start, potential, relationships(path) as rels, [rel in relationships(path) | rel.date] as contactDates
// now keep only paths where contactDates are in ascending order
WHERE all(index in range(1, size(contactDates) - 1) WHERE contactDates[index] >= contactDates[index - 1])
WITH potential, reduce(prob=1, rel in rels | prob * rel.pOfTrans) as infectedProbability
WHERE infectedProbability >= $lowerThreshold
RETURN potential, infectedProbability
ORDER BY potential ASC, infectedProbability DESC

If you want to do additional rollups for a potential's probability if there are multiple contact paths to them with varying levels of probability, you'd do that at the end before the RETURN, but you haven't given an indication that this is needed or what to do in this case.

If using APOC Procedures we can leverage some functions to make some of the operations a bit more concise:

MATCH path = (start:Human {id:$infectedId})-[:CONTACT_WITH*..15]-(potential:Human)
WHERE all(rel in relationships(path) WHERE rel.date >= $acquireDate)
   // now ensure each node occurs once in the path, no doubling back
  AND apoc.coll.different(nodes(path))
WITH start, potential, relationships(path) as rels, [rel in relationships(path) | rel.date] as contactDates
// now keep only paths where contactDates are in ascending order
WHERE apoc.coll.sort(contactDates) = contactDates
WITH potential, reduce(prob=1, rel in rels | prob * rel.pOfTrans) as infectedProbability
WHERE infectedProbability >= $lowerThreshold
RETURN potential, infectedProbability
ORDER BY potential ASC, infectedProbability DESC

And if you really need to squeeze out performance and want to filter ascending dates of contact during expansion rather than filtering after, you may want to look at creating your own custom procedure and making use of the Traversal API.

felipe · ‎04-20-2020

Hi @andrew.bowman, wow thats great! That is exactly the type of approach I was looking for. Thanx for all the details. I will give this a try and let you guys know how it goes.

Thanx for taking the time!

felipe · ‎04-20-2020

Hey @andrew.bowman, I did some testing with your query and I am finding some really interesting results.

Have couple of questions:

What do you mean with this comment (and instruction) in the query:
"now ensure each node occurs once in the path, no doubling back"

And algo regarding this comment:
"If you want to do additional rollups for a potential's probability if there are multiple contact paths to them with varying levels of probability, you'd do that at the end before the RETURN"
We do have multiple contacts between each person. We constantly track bluetooth interaction events between devices. So if I visit my dad every day, we create a "CONTACT_WITH" relationship between those 2 nodes every time we track a contact.

andrew_bowman · ‎04-20-2020

Paths in Neo4j have the restriction that relationships must be unique per path...you can't traverse the same relationship more than once.

However, there are no restrictions on doubling back to an already visited node, provided that you're using different relationships (no relationship is used more than once).

Your example graph shows multiple :CONTACT_WITH relationships between the same two nodes, and even without that in the graph it's possible for paths to double-back to previously visited nodes in the path.

Depending on the query, it's often not useful to include those kinds of results, however on second thought it might be important in your case to keep them, provided that multiple paths of decreasing probability are useful for your calculations. It may make your queries more expensive as a result, however. You'd have to confirm this on your side.

That makes sense. The point is that for a query like this you will have multiple paths between an infected person and those they have potentially infected. What is the approach you want to use when multiple paths with differing probabilities are found?

felipe · ‎04-21-2020

@andrew.bowman thanx once again for your help.

Regarding multiple contacts between 2 individuals, it is one of the most complex behaviours to model (I think). That is where we are struggling!

This is mainly because every new contact between 2 individuals must take into account the state of the individual transmitting in that moment in time (which could be the addition of multiple previous contact events).

Lets see the following simple example:

So, lets say Mark is tested positive covid19 in Day0. Lets say we want to calculate the probability of Ann being sick due of this contacts.

The contact paths for Ann are:

Mark -> day1 -> Tom -> day2 -> Ann
Mark -> day3 -> Tom -> day4 -> Ann

Mark is in contact with Tom in day1, transmitting the virus with a certain probability p1, and later Tom is in contact with Ann in day2, transmitting the virus with a certain probabilty p2.

In this initial sequence (path1) we can calculate the probability of Ann being infected as
P(AnnSickContact2)= p1•p2
which are all values that we can get from the path itself.

Now, Mark contacts again with Tom in day3 with a transmission probability of p3, and later in day4 Tom contacts again with Ann with a transmission probability of p4.

In this second sequence (path2), the resulting probability of Ann being infected due to this contact in day4 is NOT p3•p4. If we just take into account p3•p4 we will be assuming that the initial contact of Mark and Tom in day1 didnt exist. In day4 Tom transmitts to Ann his state at that later moment, which is a result of being contacted by Mark 2 times.

The true probability of Ann being infected due to contact in day4 is (assuming events are statistically independent for simplicity) is
P(AnnSickContact4)= p4•(p1+p3 - (p1•p3))

This is because the probability of Tom being infected in the second (later state) path is:
(p1+p3 - (p1•p3))
(theory of calculating the probability of 2 independent events)

So, what I find challenging is how to represent this is the Cypher queries. As you can see, in order to calculate the probability of Ann being sick in the second path involves knowing information of the first path (we need p1).

In addition to that, we need to calculate the overall probability of Ann being sick, which is the result of 2 independent events too: transmission in contact in day2 and transmission in contact in day 4. We can again apply the same formula of addition of independent events:

Overall probability of Ann being sick =
P(AnnSickContact2) + P(AnnSickContact4) - [P(AnnSickContact2)•P(AnnSickContact4)]

In this example I used 2 doble contacts of the same people (I could've done an even simpler example), but in general terms the issue we want to address is that the "state" being transmited by personB to personC in a given contact event cannot be evaluated taking only into account the state of personB as a result of that path but as a result of all the transmission paths that led to him in previous moments.

andrew_bowman · ‎04-21-2020

Yes, this is the toughest part.

Keep in mind that the Cypher query would actually generate 3 paths as-is, not 2:

Mark -> day1 -> Tom -> day2 -> Ann
Mark -> day1 -> Tom -> day4 -> Ann
Mark -> day3 -> Tom -> day4 -> Ann

This is because all dates of contact in each path are still temporally ordered.

Given these 3 paths, does that allow any opportunity to derive your correct answer?

felipe · ‎04-21-2020

Hello @andrew.bowman,

Well, with the filtering technique that you showed me in the first query we are able to filter out that path that do not represent a real transmission path due to the fact that events do not occurr in a sequetial date order.

andrew_bowman · ‎04-22-2020

I may be a bit confused.

day1 -> Tom -> day4 -> Ann is in temporal order, in that for each contact relationship in the path, the dates are ascending. I don't think there was any advice I gave that would filter this out. Maybe I'm not fully understanding your use cases, but it does seem like a real transmission path. There's a probability Mark infected Tom, and since that initial contact, two contacts with Ann later. The last contact with Ann also included one additional contact between Mark and Tom beforehand.

Given 3 paths with 3 separate probabilities, is it possible to derive the expected results? Or is this something that needs to be thrown out in favor of a different approach?

felipe · ‎04-22-2020

Hello @andrew.bowman

You are right, I missed that path in my previous description, sorry for causing that confusion!
(the ones we filter with the query are the ones that do not occur in sequential date order, this is still good).

That missing path might help me collect the data I need in order to find the resulting probability of Ann, by combining the individual results of all the paths that led to her.

Going to work on that and let you know,

felipe · ‎04-22-2020

Hello once again.

So, I was reviewing again the problem with my team.

The main challenge we are facing is that in order to calculate the probability of a node (person) being sick involves combining information of multiple paths in a way that I find difficult to buid a cypher query (hope it is just a result of my lack of knowledge).

So, I have here a better description of our situation:

Mark is tested positive in day0, we have the following possible paths for Ann:

Mark -> day1 -> Tom -> day2 -> Ann
Mark -> day1 -> Tom -> day4 -> Ann
Mark -> day3 -> Tom -> day4 -> Ann

note:
For calculating the probability of a person being sick after multiple contacts we assume that each contact event is independent. When there are a lot of contact events to a person, it is easier to calculate the probability of NOT being sick as a result of NOT being transmitted by ANY of the contacts, and then calculate the probability of being sick as 1-(probability of not being sick)

So, if I calculate the probability using the reduce function in the cypher query for each individual path for Ann I get:

p(sick) = p(trans1)*p(trans2)
p(sick) = p(trans1)*p(trans4)
p(sick) = p(trans3)*p(trans2)

This individual results of the paths are not useful for me to get the real final state of Ann. This is due to the reason that the transmission in contact of day4 towards Ann is a mix result of Mark and Tom's contact in day1 and day3 (which we cannot represent by mixing the multiple paths arriving to Ann).

We cannot evaluate paths individuall until the end and then combine/aggregate results. We must do a more complex analysis where one node transmits its state as a result of all the paths that led to that node previously.

Its like there is no distributive property.

andrew_bowman · ‎04-22-2020

With the complexity of these probability rollups per node over time, I'm thinking that Cypher isn't the right tool to use.

You may need to drop down into a custom procedure in Java and code it out, then call your procedure via Cypher with the necessary parameters.

Another case that may complicate implementation of this....what about multiple contacts between the same two people?

We've been looking at the chances of Ann being infected, but what about Tom?

Contact transmission paths (though maybe not useful for probability rollup) are:

Mark - day1 > Tom
Mark - day1 > Tom - day2 > Ann - day 4 > Tom

Naturally in this scenario Ann can only be infected if she got this from Tom, so Ann's contact with Tom on day 4 should not be counted and should not influence Tom's infectiousness. But if there was a separate transmission path through Ann to Tom for day 4 via some other person (Mark to Dave to Ann to Tom), then you would have to consider Ann's infectiousness, but exclude the elements in her probability rollup of contracting it from Tom.

But this is a simple scenario, it could get much more complex. Especially if we have to evaluate this and track the spread of multiple positives at once, not just from a single origin.

felipe · ‎04-23-2020

Hello @andrew.bowman

Yes, that could happen too. Just like in the example.

Yes, Tom is also important as a result. We would need to define the probability of being sick of all the affected nodes. In the example I did focus on Ann, but we do need all results.

That is right, and makes ir more complex.

Well, that is right. But for our initial approach we will only focus on the analysis of a single origin. This process will be triggered when we get a report of a covid-19 positive test.

I think what I can do is use Cypher to get all the posible paths (with some probability threshold so it doesnt go in too deep, and the date order restriction) and then compute the probability in some external operation. Graph & cypher do excel at finding those paths very easily, bacuse doing that in a traditional relational sql approach will be extremely painful and compute resource consuming.

Or maybe we have to propose a even more simpler approach where we dont determine the probability of being sick in such a detailed manner, we just generate a list of potential sick people based on a deepth threshold and overall path probability.

andrew_bowman · ‎04-23-2020

I think that makes sense, using Cypher to find the paths, and then processing in an external operation based on that.

The custom procedure approach is always open if you need to interact with the graph in a more programmatic manner.

Neo4j

COVID-19 determine probability of virus transmission