cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Aggregate on distinct nodes

I am currently trying to return some aggregated data on a set of communities, however due to the way that my graph is set up, I am getting a duplicate of some nodes.

MATCH (l:Order)-[]-(a:Equipment) 
WITH COLLECT(DISTINCT a) AS equipment, COLLECT(DISTINCT a.name) as equipmentNames, a.community_id AS cid
MATCH (l:Order)-[]-(b:Equipment)
WHERE b.name IN equipmentNames
MATCH (l)-[]-(o:OPCODE)
WIND equipment as equip
WITH DISTINCT b, o, l
WITH b.community_id as cid
	, SUM(DISTINCT o.valueToAggregate) + SUM(DISTINCT b.valueToAggregate) AS o.valueToAggregate
	, COLLECT(DISTINCT b.name) as equipment
	, COLLECT(DISTINCT b {.*}) AS equipmentProperties
RETURN equipment, o.valueToAggregate, equipmentProperties

If I include the SUM(DISTINCT o.valueToAggregate) and SUM(DISTINCT b.valueToAggregate) then if multiple equipment has the same valueToAggregate value, then I only see one of them. However I don't want the same piece of equipment (node) to get factored in twice. It is possible for the equipment a to be feeding itself b and then each one of those gets counted two times without the DISTINCT clause

Is it possible to do something to the affect of:
SUM(valueToAggregate FOR DISTINCT b) in this case?

3 REPLIES 3

There are a few things here that I need help with. First is what is your intent with the first four lines. From my understanding, the collect statements in the second line will only collect over each value of a.community_id. It is like a GROUP BY clause in SQL. As such, the collects will be singleton lists. You can verify this by taking the first two lines and executing them after replacing the WITH with RETURN.

I believe the result of the first four lines is the same as if you execute 'MATCH (l:Order)--(b:Equipment)' instead.

Is it our requirement to sum up the aggregateValue properties on the Orders and OPCODEs per each piece of equipment? If so, the following will aggregate them:

MATCH (l:Order)--(b:Equipment)
MATCH (l)--(o:OPCODE)
WITH DISTINCT b, collect(distinct o) as co, collect(distinct l) as cl

You can then calculate the sum of the aggregateValue properties for a specific value of b, using the reduce operator on each collection co and cl:

sum of co elements for a specific b: reduce(s = 0, i in co | s + i.valueToAggregate)
sum of cl elements for a specific b: reduce(s = 0, i in cl | s + i.valueToAggregate)

What are you looking for with the WITH clause on line 8? Since you are specifying b.community_id, all the sums and collections should be restricted to that specific value of b. Again, a GROUP BY operation. As a result, I think those COLLECT statements are also singleton lists and the SUM(DISTINCT b.valueToAggregate) is the sum of one value, the b.valueToAggregate for the b value for this row.

Sorry for my confusion.

I think that rewriting it with that change to the first four and using the reduce operator simplified the query and its still the same data. I went through and tried to validate the data by hand, looks to be correct.

MATCH (l:Order)--(b:Equipment)
MATCH (l)--(o:OPCODE)
WITH DISTINCT b, collect(distinct o) as co, collect(distinct l) as cl
WITH b.community_id as cid
	, REDUCE(s = 0, o in co | s + o.activeDelay) + SUM(b.activeDelay) AS activeDelay
	, COLLECT(DISTINCT b.name) as equipment
	, COLLECT(DISTINCT b {.*}) AS equipmentProperties
RETURN equipment, activeDelay, equipmentProperties

Is your new query giving you want you want? I would think the collects are returning single values and the SUM(b.activeDelay) is contributing a single value.