Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
05-05-2020 07:44 AM
Hello everyone!
I'm new to neo4j and I've doing a lot of research about queries syntax, and I really need your help on this one. It's probably very simple but I'm stuck on it.
I have a query that ends with the line below and the table is the result.
[...]
RETURN distinct u2.id, u3.id
╒═══════╤═══════╕
│"u2.id"│"u3.id"│
╞═══════╪═══════╡
│1137 │444 │
├───────┼───────┤
│1137 │770 │
├───────┼───────┤
│1137 │192 │
├───────┼───────┤
│192 │1137 │
├───────┼───────┤
│444 │1137 │
├───────┼───────┤
│444 │770 │
├───────┼───────┤
│770 │1137 │
├───────┼───────┤
│770 │444 │
└───────┴───────┘
8 rows
It represents pairs of nodes that are directly connected (actually that are multiples edges between each pair of nodes but I'm interested only the pair of nodes).
I simply want to count the number of distinct pairs, i.e., the number of rows of this result, which is 8 in this example.
I've tried a few things without success.
[...]
RETURN count(distinct u2.id, u3.id)
[...]
RETURN size(distinct u2.id, u3.id)
[...]
WITH (distinct u2.id, u3.id) as rel
RETURN count(rel)
[...]
RETURN distinct u2.id, u3.id, count(*)
The Neo4j version is 3.5.14.
Really appreciate your help!!!
.
PS: If necessary my whole query is:
MATCH (u1 {id:522})-[r1:InteractsWith]-(u2)
WITH collect(distinct u2.id) as neighbours
MATCH (u1 {id: 522})-[r1:InteractsWith]-(u2)-[r2:InteractsWith]-(u3)
WHERE u3.id in neighbours
RETURN distinct u2.id, u3.id
05-05-2020 11:27 AM
You were so close! It took me a while to figure this one out the first time I encountered it.
Aggregation functions are calculated per row. To get the total, you have to kinda run the query twice, with two aggregations. One for one row counting the records, and another for the results you want to return.
WITH [1,3] as x, [1,2] as y
unwind x as a
unwind y as b
WITH distinct a, b
WITH count(*) as total, [1,3] as x, [1,2] as y
unwind x as a
unwind y as b
RETURN distinct a, b, total
Result:
a b total 1 1 4 1 2 4 3 1 4 3 2 4
Notes:
neighbours
in the second MATCH.MATCH (u1 {id:522})-[r1:InteractsWith]-(u2)
WITH collect(distinct u2.id) as neighbours
MATCH (u1 {id: 522})-[r1:InteractsWith]-(u2)-[r2:InteractsWith]-(u3)
WHERE u3.id in neighbours
WITH distinct u2.id, u3.id
WITH count(*) as total
// deja vu +total
MATCH (u1 {id:522})-[r1:InteractsWith]-(u2)
WITH total, collect(distinct u2.id) as neighbours
MATCH (u1 {id: 522})-[r1:InteractsWith]-(u2)-[r2:InteractsWith]-(u3)
WHERE u3.id in neighbours
RETURN distinct u2.id, u3.id, total
05-05-2020 12:42 PM
Hi Tony.
Thank you for your great insight and quick reply!!
The "total" gave me the value that I needed, BUT what I sent was part of my code which includes a UNWIND in the beginning. So, is there any alternative without a deja vu?
Maybe you can better help with knowing my full code:
# My topusers list is actually bigger
WITH [522,192] as topusers
UNWIND topusers as topuser
MATCH (u1 {id: topuser})-[r1:InteractsWith]-(u2)
WITH topuser, collect(distinct u2.id) as neighbours, count(distinct u2.id) as neighboursCount
MATCH (u1 {id: topuser})-[r1:InteractsWith]-(u2)-[r2:InteractsWith]-(u3)
WHERE u3.id in neighbours
WITH topuser, neighboursCount, count(distinct u2.id, u3.id)/2+neighboursCount as links
# The part "count(distinct u2.id, u3.id)" doesn't work, and should be the "total" we are talking about
WITH topuser, tofloat(links)/(neighboursCount*(neighboursCount-1)) as coefficient
RETURN topuser, round(100*coefficient)/100 as coefficient order by coefficient desc
Thanks again!
.
05-05-2020 01:44 PM
The absolute best would be to write a plugin with a procedure specifically for your case. It's easier than it sounds, though does take a good IDE and learning some Java.
There are also several ways to accomplish this using one of the existing drivers for other languages... but that'll still take coding.
With pure vanilla Cypher, there is not an efficient way to extract the total without either a "deja vu," or mutating the data a bit.
WITH [522,192] AS topusers
UNWIND topusers AS topuser
MATCH (u1 {id: topuser})-[r1:InteractsWith]-(u2)
WITH topuser, collect(distinct u2.id) AS neighbours, count(distinct u2.id) AS neighboursCount
MATCH (u1 {id: topuser})-[r1:InteractsWith]-(u2)-[r2:InteractsWith]-(u3)
WHERE u3.id IN neighbours
WITH distinct u2.id, u3.id
// deja vu
WITH count(*) AS countDistinct, [522,192] AS topusers
UNWIND topusers AS topuser
MATCH (u1 {id: topuser})-[r1:InteractsWith]-(u2)
WITH countDistinct, topuser, collect(distinct u2.id) AS neighbours, count(distinct u2.id) AS neighboursCount
MATCH (u1 {id: topuser})-[r1:InteractsWith]-(u2)-[r2:InteractsWith]-(u3)
WHERE u3.id IN neighbours
WITH topuser, neighboursCount, countDistinct/2+neighboursCount AS links
WITH topuser, tofloat(links)/(neighboursCount*(neighboursCount-1)) AS coefficient
RETURN topuser, round(100*coefficient)/100 AS coefficient
ORDER BY coefficient DESC
It would require more careful attention and knowledge of your graph, but there might be a better way to solve your problem.
You're trying to get a normalized score of users who's friends know eachother. (Friendship triangles get a point).
I'm taking a wild shot in the dark, and you'll probably have to monkey with this a bit to get it working...
UNWIND [522,192] AS topuser
MATCH (u1 {id: topuser})
MATCH p=(u1)-[:InteractsWith]->(u2)-[:InteractsWith]->(u3)-[:InteractsWith]->(u1)
// you won't get duplicates, so no need for distinct
// you've got all the data wrapped up where you need it
Also, where's your labels!? No labels, no indices. No indices, slow big queries.
MATCH (u)
WHERE exists(u.id)
SET u :User
;
CREATE INDEX ON :User(id)
;
05-06-2020 06:53 AM
Hi Tony.
I found a way to do what I needed!!!!
First of all, thanks again for taking the time to help me and provide great insights. I really appreciate it!!
For a selected cluster (topuser=522 and direct neighbours=444,770,1320,1137,192), I needed to calculate a “clustering coefficient”, which should be total existing relationships between this cluster (“links” variable in my code = 12 in this cluster) divided by the total of possible relationships (for 5 direct neighbours, there are “5*(5-1)=20 ones”).
At per my first post, I expected to count the number of unique pair of node.ids. To make it easier to see what I did I make a list of attempts below.
MATCH (u1 {id:522})-[r1:InteractsWith]-(u2)
WITH collect(distinct u2.id) as neighbours
MATCH (u2)-[r2:InteractsWith]-(u3)
WHERE u2.id in neighbours AND u3.id in neighbours
RETURN .....
RETURN u2.id, u3.id
= ╒═══════╤═══════╕
│"u2.id"│"u3.id"│
╞═══════╪═══════╡
│192 │1137 │
├───────┼───────┤
│192 │1137 │
├───────┼───────┤
│444 │1137 │
├───────┼───────┤
│444 │1137 │
├───────┼───────┤
...
├───────┼───────┤
│770 │1137 │
├───────┼───────┤
│770 │1320 │
└───────┴───────┘
*74 rows*
RETURN [u2.id, u3.id]
= [u2.id, u3.id]
[192, 1137]
[192, 1137]
[444, 1137]
[444, 1137]
...
[770, 1137]
[770, 1320]
*74 rows*
RETURN distinct [u2.id, u3.id]
= [u2.id, u3.id]
[192, 1137]
[444, 1137]
...
[770, 444]
*14 rows*
RETURN collect(distinct [u2.id, u3.id])
= [192, 1137], [444, 1137], [444, 1320], [444, 770], [1320, 1137], [1320, 770], [1320, 444], [1137, 444], [1137, 770], [1137, 1320], [1137, 192], [770, 1320], [770, 1137], [770, 444]]
*1 row (with 14 pairs)*
RETURN size(collect(distinct [u2.id, u3.id]))
=14
WITH [394,2067,1087,209,554,1627,999,516,461,668] as topusers
UNWIND topusers as topuser
MATCH (u1 {id: topuser})-[r1:InteractsWith]-(u2)
WITH topuser, collect(distinct u2.id) as neighbours, count(distinct u2.id) as neighboursCount
# So far, select a topuser, list the “direct neighbours” and count them.
MATCH (u2)-[r2:InteractsWith]-(u3)
WHERE u2.id in neighbours AND u3.id in neighbours
# So far, consider only the neighbours’ neighbours who are part of the “direct neighbours” list.
WITH topuser, neighboursCount, size(collect(distinct [u2.id, u3.id]))/2+neighboursCount as links
# So far, select all pair of neighbours (nodes’ id), then select distinct pair (to ignore multiple edges between them), collect all unique pair as a list (meaning unique relations between these neighbours), calculate this list size (how many relations there are), divide by two (to remove bi-directional relations,i.e., A-B, B-A must be counted as 1), add number of relations between the topuser and the “direct neighbours”, and finally save the total existing links between the all “direct neighbours” including topuser.
WITH topuser, tofloat(links)/(neighboursCount*(neighboursCount-1)) as coefficient
RETURN topuser, round(100*coefficient)/100 as coefficient order by coefficient desc
# Finally, for each topuser calculate the coefficient with two decimal digits
Finally, I do have labels. I just didn’t use any usage in my query. Am I missing something?
Labels:
CREATE CONSTRAINT ON (u:User) ASSERT u.id IS UNIQUE;
CREATE CONSTRAINT ON (t:Team) ASSERT t.id IS UNIQUE;
CREATE CONSTRAINT ON (c:TeamChatSession) ASSERT c.id IS UNIQUE;
CREATE CONSTRAINT ON (i:ChatItem) ASSERT i.id IS UNIQUE;
Simplification of Graph Scheme:
(u)-[:CreateChat]->(i)-[:PartOf]->(c)-[:OwnedBy]->(t)
(u)-[:CreatesSession] ->(c)
(u)-[:Joins] ->(c)
(u)-[:Leaves] ->(c)
(u)<-[:Mentioned]-(i)
(i1)-[:ResponseTo]->(i2)
(u1)-[:InteractsWith]->(u2)
I know I’ve written a lot but I wanted to properly provide a feedback on your help and thoughts. And hopefully help others that might have the same doubt.
Thanks again!!!!!
.
05-06-2020 01:23 PM
Congrats! This is a little bit of work that can be useful in many places, and has direct relevance to Centrality and Community Detection in the Neo4j Graph Algorithms plugin.
That's a clever little bit there.
You have :User
, :Team
, :TeamChatSession
, and :ChatItem
, all with a CONSTAINT
on id
. I would suggest also do CREATE INDEX ON :User(id);
for all of those labels too, it will make things a bit faster.
Use the labels in your queries.
MATCH (u1 {id: topuser})
has to check all Nodes, and can't use constraints or indices to optimize. As your graph gets bigger, this will become a RAM problem. Additionally, you have a chance of collision. For example, if a :Team
or :ChatItem
has the same id
as a topuser you are selecting, it too will be included in that result set. It may not matter for this dataset or query, but can cause problems in most other cases.
MATCH (u1:User {id: topuser})
will be faster and smoother, especially if you add indices.
One more avenue worth considering...
Instead of creating multiple -[:InteractsWith]->
between the same two nodes, would having a count on that relationship simplify the query at all?
MATCH (a:User {id: 522}), (b:User {id: 1137})
CREATE (a)-[:InteractsWith]->(b)
CREATE (a)-[:InteractsWith]->(b)
CREATE (a)-[:InteractsWith]->(b)
CREATE (a)-[:InteractsWith]->(b)
Could instead be
MATCH (a:User {id: 522}), (b:User {id: 1137})
MERGE (a)-[r:InteractsWith]->(b)
ON CREATE SET r.count = 1
ON MATCH SET r.count = r.count + 1
All the sessions of the conference are now available online