cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Newbie question: How to aggregate and count identical query results?

I am trying to do some text analysis in Neo4j and I want to write a query where it sorts the number of results in a descending order. My data is structured: (Word)->[next]->(Word)->[Next] etc, I want to write a query which says which are the most popular 3 word combinations, 4 word combinations, etc. I tried this but it always gives a count of one for word combinations:

MATCH p = (w1:Word)-[r:NEXT]->(w2:Word)-[r2:NEXT]->(w3:Word) 
WITH [w1.name,w2.name,w3.name] AS word_pair 
RETURN COUNT(word_pair) as frequency, word_pair 
ORDER BY frequency DESC LIMIT 50
1 ACCEPTED SOLUTION

You're both right and wrong.

  1. if there were duplicate nodes/paths in the graph your query should work
  2. but your import statements make sure that there are no duplicate nodes and rels in the graph

So what you need to do is to take either the word or relationship frequencies into account and sum them up.

e.g.

MATCH p = (w1:Word)-[r:NEXT]->(w2:Word)-[r2:NEXT]->(w3:Word)
RETURN [w1.name,w2.name,w3.name] AS word_pair,
sum(r.count + r2.count) as frequency
ORDER BY frequency DESC LIMIT 50

If you want to you can additionally sum up the counts of the words, or actually use a proper formula for computing a relevance score.

View solution in original post

4 REPLIES 4

I think it's your data

This should actually work if you have the w1-w2-w3 pattern appearing in your graph more than once.

Can you share your data or an example?

This is how I loaded the data, so there is only one node for a unique word, but the relationships repeat. I thought there was a way to take an output of a query and count how many times it repeats within the dataset.

WITH split(tolower("My cat eats fish on Saturday")," ") as text
Unwind range(0,size(text)-2) as i
MERGE (w1:Word {name: text[i]})
ON CREATE SET w1.count = 1 ON MATCH SET w1.count=w1.count+1
MERGE (w2:Word {name: text[i+1]})
ON CREATE SET w2.count = 1 ON MATCH SET w2.count=w2.count+1
MERGE (w1)-[r:NEXT]->(w2)
ON CREATE SET r.count = 1 ON MATCH SET r.count=r.count+1
RETURN w1,r, w2

second statement

WITH split(tolower("My cat eats cat food on Mondays")," ") as text
Unwind range(0,size(text)-2) as i
MERGE (w1:Word {name: text[i]})
ON CREATE SET w1.count = 1 ON MATCH SET w1.count=w1.count+1
MERGE (w2:Word {name: text[i+1]})
ON CREATE SET w2.count = 1 ON MATCH SET w2.count=w2.count+1
MERGE (w1)-[r:NEXT]->(w2)
ON CREATE SET r.count = 1 ON MATCH SET r.count=r.count+1

You're both right and wrong.

  1. if there were duplicate nodes/paths in the graph your query should work
  2. but your import statements make sure that there are no duplicate nodes and rels in the graph

So what you need to do is to take either the word or relationship frequencies into account and sum them up.

e.g.

MATCH p = (w1:Word)-[r:NEXT]->(w2:Word)-[r2:NEXT]->(w3:Word)
RETURN [w1.name,w2.name,w3.name] AS word_pair,
sum(r.count + r2.count) as frequency
ORDER BY frequency DESC LIMIT 50

If you want to you can additionally sum up the counts of the words, or actually use a proper formula for computing a relevance score.

Thanks, that's really helpful!