cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Aggregating functions and coVariance analysis

Hello everyone, sorry I'm a Neo4j newbie here, trying to solve a practical problem to see if neo4j is fit for the application.
Let's say I have a database with nodes of two types Candyjars and Candies. Every Candyjar (Candyjar1, Candyjar2...) has different number of candies of different types: CandyRed, CandyGreen etc..

Now let's say the end game here is to find how much is the probability of the various types of candies to occur together, and the covariance among them. Then I want to have relationships between each CandyType with an associated probabilities of co-occurence and covariance. Let's call this relationships OCCURS_WITH so that Candtype1 -[OCCURS_WITH]->Candytype2 and Candytype1 -[COVARIES]->Candytype2

I'd make a database with CandieTypes and CandyJars as nodes, make a relationship (cj:CandyJar)-[r:CONTAINS]->(ct:Candytype) where r can have an attribute to set "how many" candy of a type are cotained in the jar.

Ideally,having a a structure of the form:

(ct1:Candytype)<-[r1:CONTAINS]-(cj:CandyJar)-[r2:CONTAINS]->(ct2:Candytype)

I would want in output a table that is structured like A or B:
A) ct1.name,r1.amount,cj.id, r2.amount,ct2.name, COUNTS(r1)
B) ct1.name, cj.id , ct2.name, LIST(r1), LIST(r2), COUNTS(r1)

Where counts(r1) has to be the amount of total relationships that match the pattern and LIST(r1) is a list of all the relationship objects.
I cannot seem to obtain either of such tables, as if I ask for r1 or cj then count is always 1, and something similar happens with COLLECT, not outputting a list.

Can anybody help or point me to relevant examples?

3 REPLIES 3

The root cause of your issue is in the grouping for the aggregate functions. Aggregate functions perform their operation over a group of rows. The grouping is determined by the parameters not included within the aggregate function's parameters. Every unique combination of these grouping parameters forms a group of rows the aggregate functions is performed over. In your 'A' scenario, you have ct1.name, r1.amount, cj.id, r2.amount, and ct2.name as your grouping values. This is producing one row for each combination, so your count is always one. The same is true for your scenario 'B' grouping, so your collections have a single element and the count is one.

You may want to use the following match, so you can group items by jars:

match (c:Candytype)<-[r:CONTAINS]-(j:CandyJar)
return j.id as jar, count(*) as noOfCandyTypes, collect({name: c.name, count: r.amount})

The above will give you a row for each jar, which includes the number of different candies in the jar and a list of each candy in the jar, organized as a map with the candy's name and count.

You can go a little farther and calculate the relative frequency of each candy in a single jar and add the value to the candy's map.

The query below does that. It does the same thing as the first query, but this time it calculates the total amount in the jar, then updates each items map to include the relative frequency.

match(n:CandyJar)-[r:CONTAINS]->(m:Candytype)
with n.name as jar, count(*) as typesOfCandies, sum(r.amount) as totalItems, collect({name: m.name, count: r.amount}) as items
return jar, typesOfCandies, totalItems, [i in items | i{.*, relFreq: toFloat(i.count) / toFloat(totalItems)}] as items

I hope this helps. Let us know if you have any further questions.

This helps a lot indeed! So, correct me if I'm wrong, the problem is only due to the implicit behaviour of the aggregating functions as I understand, and there is no way to change that behaviour. Now your statements were indeed very useful examples, and probably I should add the total amount of candies as a property of the jar and the relative amount of a mineral as an extra relationship. In the end though I only want to query the Candytypes that co-occur together, so keeping your example in mind and the fact that i need a match of the type:

(m1:Candytype)<-[r1:CONTAINS]-(n:CandyJar)-[r2:CONTAINS]->(m2:Candytype)

I think the statment that im look for is something like:

MATCH (m1:Candytype)<-[r1:CONTAINS]-(n:Jar)-[r2:CONTAINS]->(m2:Candytype)
WITH n.id as id, m1,m2,collect({candytypel1:m1.name,amount1:r1.amount,candytype2:m2.name,amount2:r2.amount}) as items
WHERE m1.name='CandyRed' AND m2.name='Candyblue'
RETURN  id, items

In which case though I wouldn't know how to get a count of the Jars in this same query, but I can do it with python if it's not possible with Cypher.

So, two question on the matter:
A)In this case I would then have to iterate the query over every possible pair of candies. Would there be a way to instead do it in one go?
B)The syntax of the bit [i in items | i{.*, relFreq: toFloat(i.count) / toFloat(totalItems)}] is not very clear to me, where would I find an explanation of that kind of syntax?

You are correct. You need to understand the grouping mechanism when using the aggregate functions.

You do not need the 'collect' in your new query because the grouping formed by id, m1 and m2, will only produce one row to collect over.

You can generalize your query to get all pairwise candies per jar. The query below should work. The 'where' predicate ensures you get two different candies and don't get both ordered pairs, i.e. (candy 1, candy 2) and (candy 2, candy 1)

MATCH (m1:Candytype)<-[r1:CONTAINS]-(n:Jar)-[r2:CONTAINS]->(m2:Candytype)
WHERE id(m1)<id(m2)
RETURN n.id as id, {candytypel1:m1.name, amount1:r1.amount, candytype2:m2.name, amount2:r2.amount} as items

If you want to format the items in an array of candy elements:

MATCH (m1:Candytype)<-[r1:CONTAINS]-(n:Jar)-[r2:CONTAINS]->(m2:Candytype)
WHERE id(m1)<id(m2)
RETURN n.id as id, [{name: m1.name, amount: r1.amount}, {name: m2.name, amount: r2.amount}] as items

If you want to count the number of distinct jars and number of pairs, you will have to group the items. The following is one example approach:

MATCH (m1:Candytype)<-[r1:CONTAINS]-(n:Jar)-[r2:CONTAINS]->(m2:Candytype)
WHERE id(m1)<id(m2)
RETURN count(distinct id(n)) as numOfJars, count(*) as numOfPairs, collect({jar: n.id, items: [{name:m1.name, amount:r1.amount}, {name: m2.name, amount:r2.amount}]})as items

There are two concepts occurring in the syntax you referenced. One is 'list comprehension'. I find it very powerful. See section 2 in the following reference:

The second is 'Map Projections'. See section 2 in the following reference: