Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
10-07-2020 11:24 AM
Hi,
Working on a TF-IDF graph with a bit of dynamic scoring. Testing out some functionality with this code:
Match (a:patent)-[r1:Is_in]-(b:Word)-[r2:Is_in]-(:patent)
WITH count(DISTINCT a) as Total_Doc, Collect({r1:r1, r2:r2}) as Texts
Unwind Texts as test
RETURN test.r1,test.r1.TF,count(test.r2), Total_Doc, log(1.0*Total_Doc/count(test.r2)) as idf, log(1.0*Total_Doc/count(test.r2))*test.r1.TF as TFidf
So I can look at the data and hand calculate to verify that code is correct It looks like the final TFidf is correct and now I want to set it as property key into the first relationship.
Match (a:patent)-[r1:Is_in]-(b:Word)-[r2:Is_in]-(:patent)
WITH count(DISTINCT a) as Total_Doc, Collect({r1:r1, r2:r2}) as Texts
Unwind Texts as test
SET test.r1.TFidf= log(1.0*Total_Doc/count(test.r2))*test.r1.TF
And it is returning an error complaining about the count function. How do I work around this?
Andy
10-07-2020 05:47 PM
Use a WITH clause to project the result of the count() to a variable, then use it in a SET afterwards.
Typically you'll only want to use aggregation functions within WITH or RETURN clauses.
10-08-2020 10:54 AM
Hi,
I have scoured the documentation many a time and it is really deficient in describing how the aggregation process with return and with work. Is there another document you could point to?
Basically I need three levels of aggregation:
And to calculate for each word in a document a metric that relates it to the number of other documents and the total number.
In my query with return that did seem to give the correct result uses a collect statement to give the correct group. I cannot include a count(r2) in the collect statement because an aggregation function cannot be embedded in aggregation function.
Match (a:patent)-[r1:Is_in]-(b:Word)-[r2:Is_in]-(:patent)
WITH count(DISTINCT a) as Total_Doc, Collect({r1:r1, r2:r2}) as Texts
Unwind Texts as test
RETURN test.r1,test.r1.TF,count(test.r2), Total_Doc, log(1.0*Total_Doc/count(test.r2)) as idf, log(1.0*Total_Doc/count(test.r2))*test.r1.TF as TFidf
I have tried this and it does NOT give the results needed. Total_Docs and Other _Docs are just the total number of relationships.
Match (a:patent)-[r1:Is_in]-(b:Word)-[r2:Is_in]-(:patent)
With count(a) as Total_Docs, Collect(r1) as Texts, count(r2) as Other_Docs
Unwind Texts as test
RETURN test.r1, Total_Docs, Other_Docs
Andy
10-08-2020 11:33 AM
Here's the documentation on aggregations:
The most important thing to be aware of is that when you aggregate, the non-aggregation variables present become the grouping key, which is the context for the aggregation. If you don't have your grouping key right, then you will be counting the wrong things.
Thanks for the detailed description of what you need. Let's see how to get all 3.
First, the total number of documents. We should get that separately from the rest, and if we keep it simple it will leverage the counts store and get a quick count for us:
MATCH (a:patent)
WITH count(a) as Total_Doc
...
There. That will be faster than your current approach.
Now to get the number of words per document. If we're just counting the :Is_in relationships per document, we can use the size(<pattern>)
which will get the degree of that relationship from the node, which is pretty quick (note that size() isn't an aggregation function):
MATCH (a:patent)
WITH count(a) as Total_Doc
MATCH (a:patent)
WITH Total_Doc, a, size((a)-[:Is_in]-()) as Words_per_Doc
...
If you need the count of distinct words in the doc then that's different, we can't use the degree approach. We would have to MATCH out and count () the distinct words per document:
...
MATCH (a:patent)-[:Is_in]-(b:Word)
WITH Total_Doc, a, count(DISTINCT b) as Distinct_Words_per_Doc
...
Note that a
has to be part of the grouping key so the words counted are per document, and not just the total words across all documents.
That said, I'm guessing you already handled this so that :Is_in only occurs once between a document and a distinct word. That may make the last part easier. We can use the degree of the :Word node -1 to get the number of other documents the word occurs in (we're not counting the current document).
MATCH (a:patent)
WITH count(a) as Total_Doc
MATCH (a:patent)
WITH Total_Doc, a, size((a)-[:Is_in]-()) as Words_per_Doc
MATCH (a)-[r1:Is_in]-(b:Word)
WITH Total_Doc, a, Words_per_Doc, r1, size((b)-[:Is_in]-()) - 1 as Other_Docs
WITH Total_Doc, r1, r1.TF as TF, Other_Docs, log(1.0*Total_Doc/Other_Docs) as idf
RETURN Total_Doc, r1, TF, idf, idf * TF as TFidf
10-08-2020 12:56 PM
Hi Andrew,
Thank you very much for your very clear assistance. I have gotten it to work and able to set the values in the relationship.
Thank you again
Andy
All the sessions of the conference are now available online