09-30-2020 01:55 PM
Hi,
I have a relatively simple graph.
I first want to store the TF (term frequency) part as a property on the relationship.
I am testing things out with a very simple query:
MATCH (:patent{num:7863179})<-[r:Is_in]-(a:Word)
RETURN a.term, r.count as Num,1.0*r.count/sum(r.count)
There are 14 relationships in this query, with a total of 26 words (the sum of r.count).
The issue is that sum(r.count) is not going over all the relationships and only sees a single relationship (1 or 2). It looks like I am running into the issue of the RETURN statement having both a grouping key and an aggregating function. So how do I get the aggregating function resolved before the grouping happens? How do I get or pass a global sum (26) for the division?
Andy
09-30-2020 02:19 PM
I haven't tested this, but have you looked into using the collect function in between your MATCH and RETURN statements, e.g. WITH ..., collect(...)?
09-30-2020 02:59 PM
Hi,
I am not sure how collect, which is an aggregating function similar to sum, would be used.
If I do this:
MATCH (:patent{num:7863179})<-[r:Is_in]-(a:Word)
RETURN sum(r.count)
It returns a value of 26 which is correct.
If I do this:
MATCH (:patent{num:7863179})<-[r:Is_in]-(a:Word)
RETURN r.count,sum(r.count)
It groups by the possible values of r.count and gives this.
r.count | sum(r.count) |
---|---|
2 | 24 |
1 | 2 |
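To see where 24 and 2 come from, here is a small Python model of the implicit grouping (a sketch of the semantics, not Cypher itself). The list of counts is an assumption reconstructed from the numbers in this thread: 14 relationships summing to 26 implies twelve with count 2 and two with count 1.

```python
from collections import defaultdict

# Assumed data: 12 relationships with count 2 and 2 with count 1 (14 total, sum 26).
counts = [2] * 12 + [1] * 2


def grouped_sum(values):
    """Model RETURN r.count, sum(r.count).

    The non-aggregated column r.count acts as the grouping key,
    so sum() is evaluated once per distinct value.
    """
    groups = defaultdict(int)
    for v in values:
        groups[v] += v
    return dict(groups)


print(sum(counts))          # global sum, like RETURN sum(r.count)
print(grouped_sum(counts))  # one sum per distinct r.count value
```

The global sum only appears when nothing else is returned next to the aggregate.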
If I try to include an aggregating step:
MATCH (:patent{num:7863179})<-[r:Is_in]-(a:Word)
WITH r,sum(r.count) AS fred
RETURN r.count, fred
It still groups by r, so each sum covers a single relationship:
r.count | fred |
---|---|
2 | 2 |
2 | 2 |
2 | 2 |
1 | 1 |
2 | 2 |
2 | 2 |
If I remove the r in the WITH clause, I get an error.
Andy
10-01-2020 02:24 PM
Bumping this.
ANY HELP?
How do I get both the individual value and the sum of the values in the RETURN statement so I can calculate a simple ratio?
Andy
10-02-2020 06:34 AM
Hello @andy.hegedus
Please give an example of the calculation and the expected result, and I will try to write the query for you.
Regards,
Cobra
10-02-2020 08:12 AM
Hi,
The objective here is to implement TF-IDF (Term Frequency, Inverse Document Frequency), where the document group will later be defined by a query. The first part is to calculate all the term frequencies (the TF part), since they will not change as part of later queries.
The data model contains two node types, Document and Word, with one relationship, [:Is_in], since not all words are in all documents and most words will be in multiple documents.
The relationship [:Is_in] has a property, count, that defines how many times a word is in a given document. To calculate the term frequency I need to know two factors: the total number of words in a document and how many times that specific word is included. So, for the example of a single document (I will need to expand it later to run through all documents):
MATCH (:document{num:7863179})<-[r:Is_in]-(a:Word)
RETURN a.term,r.count, 1.0*r.count/sum(r.count)
Returns a value of 1.0 for the ratio which is not the intent. It appears the function sum(r.count) is being segmented by a and r and does not reflect the global sum count.
a.term | r.count | 1.0*r.count/sum(r.count) |
---|---|---|
"improved" | 2 | 1.0 |
"produce" | 2 | 1.0 |
"decrease" | 2 | 1.0 |
"thickness" | 1 | 1.0 |
"film" | 2 | 1.0 |
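A small Python model (not Cypher) of why every ratio comes out as 1.0. It assumes, as the result suggests, that the non-aggregated columns a.term and r.count together form the grouping key; the rows are the five shown in the table above, not the full result.

```python
from collections import defaultdict

# The five rows shown in the table above (the real result has more rows).
rows = [("improved", 2), ("produce", 2), ("decrease", 2),
        ("thickness", 1), ("film", 2)]


def ratio_with_implicit_grouping(rows):
    """Model RETURN a.term, r.count, 1.0*r.count/sum(r.count).

    a.term and r.count are not aggregated, so they form the grouping key
    and sum() only sees the rows that share that key.
    """
    groups = defaultdict(list)
    for term, count in rows:
        groups[(term, count)].append(count)
    return [(term, count, 1.0 * count / sum(members))
            for (term, count), members in groups.items()]


for row in ratio_with_implicit_grouping(rows):
    print(row)
```

Each group holds exactly one row, so the sum equals that row's own count and the division always gives 1.0.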
If I try to calculate earlier in the query I cannot propagate the a and r variables
MATCH (:document{num:7863179})<-[r:Is_in]-(a:Word)
WITH sum(r.count) as total
RETURN a.term,r.count,1.0*r.count/total
This returns an error about a and r not being defined.
If I include a and r in the WITH clause, I get the same result as the first attempt.
a.term | r.count | 1.0*r.count/total |
---|---|---|
"improved" | 2 | 1.0 |
"produce" | 2 | 1.0 |
"decrease" | 2 | 1.0 |
"thickness" | 1 | 1.0 |
"film" | 2 | 1.0 |
This query returns the correct number of total words
MATCH (:patent{num:7863179})<-[r:Is_in]-(a:Word)
WITH sum(r.count) as total
RETURN total
total |
---|
26 |
Andy
10-02-2020 08:15 AM
You have to collect() things if you want to propagate them:
MATCH (:patent{num:7863179})<-[r:Is_in]-(w:Word)
WITH sum(r.count) AS total, collect(w) AS words, collect(r) AS relations
RETURN total, words, relations
Regards,
Cobra
10-02-2020 10:58 AM
Hi Cobra,
That sort of worked, but not quite, since I cannot access the individual count values in the RETURN statement.
I modified it a bit, using your suggestion of collect and then adding an UNWIND.
MATCH (:patent{num:7863179})<-[r:Is_in]-(w:Word)
WITH sum(r.count) AS total, collect(r) AS texts
UNWIND texts as target
RETURN target.count, 1.0*target.count/total as TF
It does return values that make sense on the face of it, though I don't know how to also get the word property value, w.term.
target.count | TF |
---|---|
2 | 0.07692307692307693 |
2 | 0.07692307692307693 |
2 | 0.07692307692307693 |
1 | 0.038461538461538464 |
10-02-2020 01:05 PM
MATCH (:patent{num:7863179})<-[r:Is_in]-(w:Word)
WITH sum(r.count) AS total, collect({r:r, w:w}) AS texts
UNWIND texts AS target
RETURN target.w.term AS term, target.r.count AS count, 1.0*target.r.count/total AS TF
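As a rough Python model (not Cypher) of what this collect-then-UNWIND pattern does: the aggregation collapses everything into a single row holding the total and a list of pairs, and unwinding restores one row per pair with the total still attached. The rows are the five sample rows from the tables earlier in this thread, so the total here is 9 rather than the 26 of the full result; the function name is hypothetical.

```python
# Five sample rows from the thread's tables (the full result has 14 rows).
rows = [("improved", 2), ("produce", 2), ("decrease", 2),
        ("thickness", 1), ("film", 2)]


def term_frequencies(rows):
    """Model: WITH sum(...) AS total, collect({r: r, w: w}) AS texts, then UNWIND."""
    # No grouping key is left in the WITH, so sum() covers every row.
    total = sum(count for _, count in rows)
    # collect({...}) keeps each word/count pair together in one list.
    texts = [{"w": term, "r": count} for term, count in rows]
    # UNWIND texts AS target: one row per pair, each still carrying total.
    return [(t["w"], t["r"], 1.0 * t["r"] / total) for t in texts]


for term, count, tf in term_frequencies(rows):
    print(term, count, tf)
```

Because the total is computed once over all rows, the TF values of a document add up to 1.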
10-02-2020 01:33 PM
Hi Cobra,
Thank you. That collect notation is definitely new to me and not obvious from the documentation. The argument of the collect function is only listed as an expression, and the expressions page in the documentation does not clarify much, since an expression can be basically anything.
One slight tweak in the code provided:
1.0*target.count/total AS TF
should be
1.0*target.r.count/total AS TF
Thank you again.
Andy
10-03-2020 01:41 AM
No problem, I corrected the query. You can collect and build a map (dict) at the same time, which is very practical.
Hope this helped you solve your problem.
Regards,
Cobra