cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Count distinct pairs i.e. Count rows "after" RETURN distinct a, b

Hello everyone!

I'm new to neo4j and I've doing a lot of research about queries syntax, and I really need your help on this one. It's probably very simple but I'm stuck on it.

I have a query that ends with the line below and the table is the result.

[...]
RETURN distinct u2.id, u3.id
╒═══════╤═══════╕
│"u2.id"│"u3.id"│
╞═══════╪═══════╡
│1137   │444    │
├───────┼───────┤
│1137   │770    │
├───────┼───────┤
│1137   │192    │
├───────┼───────┤
│192    │1137   │
├───────┼───────┤
│444    │1137   │
├───────┼───────┤
│444    │770    │
├───────┼───────┤
│770    │1137   │
├───────┼───────┤
│770    │444    │
└───────┴───────┘
8 rows 

It represents pairs of nodes that are directly connected (actually that are multiples edges between each pair of nodes but I'm interested only the pair of nodes).

I simply want to count the number of distinct pairs, i.e., the number of rows of this result, which is 8 in this example.

I've tried a few things without success.

Attempt 1

[...]
RETURN count(distinct u2.id, u3.id)

Attempt 2

[...]
RETURN size(distinct u2.id, u3.id)

Attempt 3

[...]
WITH (distinct u2.id, u3.id) as rel
RETURN count(rel)

Attempt 4

[...]
RETURN distinct u2.id, u3.id, count(*)

The Neo4j version is 3.5.14.
Really appreciate your help!!!

.
PS: If necessary my whole query is:

MATCH (u1 {id:522})-[r1:InteractsWith]-(u2)
WITH collect(distinct u2.id) as neighbours
MATCH (u1 {id: 522})-[r1:InteractsWith]-(u2)-[r2:InteractsWith]-(u3)
WHERE u3.id in neighbours
RETURN distinct u2.id, u3.id
5 REPLIES 5

You were so close! It took me a while to figure this one out the first time I encountered it.

Aggregation functions are calculated per row. To get the total, you have to kinda run the query twice, with two aggregations. One for one row counting the records, and another for the results you want to return.

Simplified Example:

WITH [1,3] as x, [1,2] as y
unwind x as a
unwind y as b
WITH distinct a, b
WITH count(*) as total, [1,3] as x, [1,2] as y
unwind x as a
unwind y as b
RETURN distinct a, b, total

Result:

a b total
1 1 4
1 2 4
3 1 4
3 2 4

Applied to your query

Notes:

  • Were it not for the need for distinct, we could reuse neighbours in the second MATCH.
  • The result includes a total on each row, duplicating data.
  • These results can be though of as "distinct pairs," or as removing multiple rels between pairs. If you reframe your problem to look at Relationships instead of Nodes, you might find a better solution.
MATCH (u1 {id:522})-[r1:InteractsWith]-(u2)
WITH collect(distinct u2.id) as neighbours
MATCH (u1 {id: 522})-[r1:InteractsWith]-(u2)-[r2:InteractsWith]-(u3)
WHERE u3.id in neighbours
WITH distinct u2.id, u3.id
WITH count(*) as total

// deja vu +total
MATCH (u1 {id:522})-[r1:InteractsWith]-(u2)
WITH total, collect(distinct u2.id) as neighbours
MATCH (u1 {id: 522})-[r1:InteractsWith]-(u2)-[r2:InteractsWith]-(u3)
WHERE u3.id in neighbours
RETURN distinct u2.id, u3.id, total

Hi Tony.

Thank you for your great insight and quick reply!!
The "total" gave me the value that I needed, BUT what I sent was part of my code which includes a UNWIND in the beginning. So, is there any alternative without a deja vu?

Maybe you can better help with knowing my full code:

# My topusers list is actually bigger
WITH [522,192] as topusers
UNWIND topusers as topuser 
MATCH (u1 {id: topuser})-[r1:InteractsWith]-(u2)
WITH topuser, collect(distinct u2.id) as neighbours, count(distinct u2.id) as neighboursCount
MATCH (u1 {id: topuser})-[r1:InteractsWith]-(u2)-[r2:InteractsWith]-(u3)
WHERE u3.id in neighbours
WITH topuser, neighboursCount, count(distinct u2.id, u3.id)/2+neighboursCount as links
# The part "count(distinct u2.id, u3.id)" doesn't work, and should be the "total" we are talking about
WITH topuser, tofloat(links)/(neighboursCount*(neighboursCount-1)) as coefficient
RETURN topuser, round(100*coefficient)/100 as coefficient order by coefficient desc

Thanks again!

.

The absolute best would be to write a plugin with a procedure specifically for your case. It's easier than it sounds, though does take a good IDE and learning some Java.

There are also several ways to accomplish this using one of the existing drivers for other languages... but that'll still take coding.

With pure vanilla Cypher, there is not an efficient way to extract the total without either a "deja vu," or mutating the data a bit.

Deja Vu

WITH [522,192] AS topusers
UNWIND topusers AS topuser 
MATCH (u1 {id: topuser})-[r1:InteractsWith]-(u2)
WITH topuser, collect(distinct u2.id) AS neighbours, count(distinct u2.id) AS neighboursCount
MATCH (u1 {id: topuser})-[r1:InteractsWith]-(u2)-[r2:InteractsWith]-(u3)
WHERE u3.id IN neighbours
WITH distinct u2.id, u3.id

// deja vu
WITH count(*) AS countDistinct, [522,192] AS topusers
UNWIND topusers AS topuser 
MATCH (u1 {id: topuser})-[r1:InteractsWith]-(u2)
WITH countDistinct, topuser, collect(distinct u2.id) AS neighbours, count(distinct u2.id) AS neighboursCount
MATCH (u1 {id: topuser})-[r1:InteractsWith]-(u2)-[r2:InteractsWith]-(u3)
WHERE u3.id IN neighbours
WITH topuser, neighboursCount, countDistinct/2+neighboursCount AS links
WITH topuser, tofloat(links)/(neighboursCount*(neighboursCount-1)) AS coefficient

RETURN topuser, round(100*coefficient)/100 AS coefficient
ORDER BY coefficient DESC

There must be a better way...

It would require more careful attention and knowledge of your graph, but there might be a better way to solve your problem.

You're trying to get a normalized score of users who's friends know eachother. (Friendship triangles get a point).

I'm taking a wild shot in the dark, and you'll probably have to monkey with this a bit to get it working...

UNWIND [522,192] AS topuser
MATCH (u1 {id: topuser})
MATCH p=(u1)-[:InteractsWith]->(u2)-[:InteractsWith]->(u3)-[:InteractsWith]->(u1)
// you won't get duplicates, so no need for distinct
// you've got all the data wrapped up where you need it

Use labels and indices

Also, where's your labels!? No labels, no indices. No indices, slow big queries.

MATCH (u) 
WHERE exists(u.id)
SET u :User
;

CREATE INDEX ON :User(id)
;

Hi Tony.

I found a way to do what I needed!!!!

First of all, thanks again for taking the time to help me and provide great insights. I really appreciate it!!

This is my graph for one topuser.

For a selected cluster (topuser=522 and direct neighbours=444,770,1320,1137,192), I needed to calculate a “clustering coefficient”, which should be total existing relationships between this cluster (“links” variable in my code = 12 in this cluster) divided by the total of possible relationships (for 5 direct neighbours, there are “5*(5-1)=20 ones”).

At per my first post, I expected to count the number of unique pair of node.ids. To make it easier to see what I did I make a list of attempts below.

MATCH (u1 {id:522})-[r1:InteractsWith]-(u2)
WITH collect(distinct u2.id) as neighbours
MATCH (u2)-[r2:InteractsWith]-(u3)
WHERE u2.id in neighbours AND u3.id in neighbours
RETURN .....

Attempt 101

RETURN u2.id, u3.id

=	╒═══════╤═══════╕
	│"u2.id"│"u3.id"│
	╞═══════╪═══════╡
	│192    │1137   │
	├───────┼───────┤
	│192    │1137   │
	├───────┼───────┤
	│444    │1137   │
	├───────┼───────┤
	│444    │1137   │
	├───────┼───────┤
	...
	├───────┼───────┤
	│770    │1137   │
	├───────┼───────┤
	│770    │1320   │
	└───────┴───────┘
*74 rows*

Attempt 102

RETURN [u2.id, u3.id]

= [u2.id, u3.id]
  [192, 1137]
  [192, 1137]
  [444, 1137]
  [444, 1137]
...
  [770, 1137]
  [770, 1320]
*74 rows*

Attempt 103

RETURN distinct [u2.id, u3.id]

= [u2.id, u3.id]
  [192, 1137]
  [444, 1137]
...
  [770, 444]
*14 rows*

Attempt 104

RETURN collect(distinct [u2.id, u3.id])

= [192, 1137], [444, 1137], [444, 1320], [444, 770], [1320, 1137], [1320, 770], [1320, 444], [1137, 444], [1137, 770], [1137, 1320], [1137, 192], [770, 1320], [770, 1137], [770, 444]]
*1 row (with 14 pairs)*

Attempt 1001

RETURN size(collect(distinct [u2.id, u3.id]))

=14

Then, my final code was....

WITH [394,2067,1087,209,554,1627,999,516,461,668] as topusers
UNWIND topusers as topuser 
MATCH (u1 {id: topuser})-[r1:InteractsWith]-(u2)
WITH topuser, collect(distinct u2.id) as neighbours, count(distinct u2.id) as neighboursCount
# So far, select a topuser, list the “direct neighbours” and count them.
MATCH (u2)-[r2:InteractsWith]-(u3)
WHERE u2.id in neighbours AND u3.id in neighbours
# So far, consider only the neighbours’ neighbours who are part of the “direct neighbours” list.
WITH topuser, neighboursCount, size(collect(distinct [u2.id, u3.id]))/2+neighboursCount as links
# So far, select all pair of neighbours (nodes’ id), then select distinct pair (to ignore multiple edges between them), collect all unique pair as a list (meaning unique relations between these neighbours), calculate this list size (how many relations there are), divide by two (to remove bi-directional relations,i.e., A-B, B-A must be counted as 1), add number of relations between the topuser and the “direct neighbours”, and finally save the total existing links between the all “direct neighbours” including topuser.
WITH topuser, tofloat(links)/(neighboursCount*(neighboursCount-1)) as coefficient
RETURN topuser, round(100*coefficient)/100 as coefficient order by coefficient desc
# Finally, for each topuser calculate the coefficient with two decimal digits

Labels and scheme

Finally, I do have labels. I just didn’t use any usage in my query. Am I missing something?

Labels:

CREATE CONSTRAINT ON (u:User) ASSERT u.id IS UNIQUE; 
CREATE CONSTRAINT ON (t:Team) ASSERT t.id IS UNIQUE; 
CREATE CONSTRAINT ON (c:TeamChatSession) ASSERT c.id IS UNIQUE; 
CREATE CONSTRAINT ON (i:ChatItem) ASSERT i.id IS UNIQUE;

Simplification of Graph Scheme:
(u)-[:CreateChat]->(i)-[:PartOf]->(c)-[:OwnedBy]->(t)
(u)-[:CreatesSession] ->(c)
(u)-[:Joins] ->(c)
(u)-[:Leaves] ->(c)
(u)<-[:Mentioned]-(i)

(i1)-[:ResponseTo]->(i2)
(u1)-[:InteractsWith]->(u2)

Final considerations

I know I’ve written a lot but I wanted to properly provide a feedback on your help and thoughts. And hopefully help others that might have the same doubt.

Thanks again!!!!!

.

Congrats! This is a little bit of work that can be useful in many places, and has direct relevance to Centrality and Community Detection in the Neo4j Graph Algorithms plugin.

That's a clever little bit there.

Labels

You have :User, :Team, :TeamChatSession, and :ChatItem, all with a CONSTAINT on id. I would suggest also do CREATE INDEX ON :User(id); for all of those labels too, it will make things a bit faster.

Use the labels in your queries.
MATCH (u1 {id: topuser}) has to check all Nodes, and can't use constraints or indices to optimize. As your graph gets bigger, this will become a RAM problem. Additionally, you have a chance of collision. For example, if a :Team or :ChatItem has the same id as a topuser you are selecting, it too will be included in that result set. It may not matter for this dataset or query, but can cause problems in most other cases.

MATCH (u1:User {id: topuser}) will be faster and smoother, especially if you add indices.

:InteractsWith counts

One more avenue worth considering...

Instead of creating multiple -[:InteractsWith]-> between the same two nodes, would having a count on that relationship simplify the query at all?

MATCH (a:User {id: 522}), (b:User {id: 1137})
CREATE (a)-[:InteractsWith]->(b)
CREATE (a)-[:InteractsWith]->(b)
CREATE (a)-[:InteractsWith]->(b)
CREATE (a)-[:InteractsWith]->(b)

Could instead be

MATCH (a:User {id: 522}), (b:User {id: 1137})
MERGE (a)-[r:InteractsWith]->(b)
ON CREATE SET r.count = 1
ON MATCH SET r.count = r.count + 1