03-18-2021 10:00 PM
https://neo4j.com/docs/cypher-manual/current/administration/indexes-for-full-text-search/
The score is useful; however, it can be well above or below 1. Is it possible to get it as a float between 0 and 1? That would make it easier to set a threshold for retaining or discarding results based on relevance.
03-19-2021 10:35 AM
You could create a normalized score: just divide all the scores by the maximum score in the result set. Note that this is a "search-relative" normalization. From my fairly limited (but very recent) experience with fulltext indexes, I believe the maximum possible score depends on the specific fulltext index design and is also data dependent.
A quick google turned up this page, which appears to confirm that.
That page also links to the Lucene page about how the score is calculated; I'll include it here for convenience.
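A minimal Cypher sketch of that search-relative normalization, assuming a hypothetical fulltext index named `myIndex` and a placeholder query string:

```
// Query the fulltext index, then divide each score by the maximum
// score in this result set ("search-relative" normalization).
CALL db.index.fulltext.queryNodes('myIndex', 'aaa aaab')
YIELD node, score
WITH collect({node: node, score: score}) AS hits, max(score) AS maxScore
UNWIND hits AS hit
RETURN hit.node AS node, hit.score / maxScore AS normScore
ORDER BY normScore DESC
```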
03-19-2021 11:21 PM
Maybe you mean divide all the scores by the sum of all scores to normalize? My purpose is to set a threshold value for deciding whether results should be kept or discarded. If I divide by the max score, the top score's normalized value will always be 1, which doesn't serve my purpose. I need a way to compress all the scores into the range 0 < x < 1. So, for example, if the normalized score is > 0.8 I want to keep it, and discard all the others.
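Something like the sketch below is what I have in mind; the index name, query string, and the 0.8 cutoff are just placeholders:

```
// Divide each score by the sum of all scores, then keep only hits
// whose normalized score clears the threshold.
CALL db.index.fulltext.queryNodes('myIndex', 'aaa aaab')
YIELD node, score
WITH collect({node: node, score: score}) AS hits, sum(score) AS total
UNWIND hits AS hit
WITH hit.node AS node, hit.score / total AS normScore
WHERE normScore > 0.8
RETURN node, normScore
ORDER BY normScore DESC
```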
03-20-2021 02:03 PM
The normalized range is 0.0 to 1.0, but if you really want it to be 0.0 to < 1.0, I guess there are a variety of ways to fudge that.
I'll give an example (similar to the scores I see for my index). If I return only the top 4 scores and they are
7.0, 4.0, 1.0, 1.0
The max score is 7.0, so then the normalized scores would be
1.0, 0.5714, 0.1429, 0.1429 (rounded to four digits)
As a fudge (though I don't understand why you want to do this), you could simply multiply those scores by 0.99, yielding the following scores, now forced into the range 0 to < 1.0:
0.99, 0.5657, 0.1414, 0.1414 (rounded to four digits...)
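Expressed in Cypher on those same example numbers (purely illustrative; no index involved):

```
// Normalize the example scores by their max, then scale by 0.99
// so every value lands strictly below 1.0.
UNWIND [7.0, 4.0, 1.0, 1.0] AS s
WITH collect(s) AS scores, max(s) AS maxScore
UNWIND scores AS sc
RETURN sc AS score, sc / maxScore AS normScore, 0.99 * sc / maxScore AS fudgedScore
```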
03-20-2021 09:57 PM
That works as a transformation, but my real point is how to decide whether a result's relevance is strong enough to keep it. In your example, suppose the max score of 7.0 comes from a node and query of:
aaa abacad
I may not want to keep it, because the similarity is not good enough. Instead, if the 7.0 comes from:
aaa aaab
Then this result is a lot better in terms of similarity. The scores only rank the results; they don't say how relevant (similar) a result is to the query. Even if the very top result is barely similar to the query, it is still the top-ranked result, which is correct, but that isn't indicative of how relevant it is to the query. In your example, if the 4 scores could be transformed to either:
0.45, 0.31, 0.11, 0.09
0.87, 0.47, 0.3, 0.12
I would certainly want to discard the first set of results entirely and keep only the first result of the second set, since it is above 0.8. That's the 'threshold' I was talking about.
How do you use the scores you modified down the road?