
261 points fzliu | 4 comments
1. mech4lunch ◴[] No.42163961[source]
The colab measures dot product values 0.428 and 0.498, describing them as "...similarity value is quite high." Is that high? Can you design a system that confidently labels data with a 0.4 threshold?
replies(3): >>42164339 #>>42165357 #>>42165524 #
2. brokensegue ◴[] No.42164339[source]
The raw output value is generally irrelevant. What matters is its position in the distribution of outputs.
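
As a rough sketch of what "position in the distribution" could mean in practice: score the query against a pool of documents, then ask what fraction of that pool a given score beats. The `percentile_rank` helper and the Gaussian background scores below are made up for illustration; they are not taken from the colab.

```python
import random


def percentile_rank(score, background):
    """Return the fraction of background scores strictly below `score`."""
    if not background:
        raise ValueError("background must be non-empty")
    below = sum(1 for s in background if s < score)
    return below / len(background)


# Hypothetical background: similarity scores between one query and
# 10,000 unrelated documents (synthetic numbers, assumed for this sketch).
random.seed(0)
background = [random.gauss(0.10, 0.08) for _ in range(10_000)]

# The same raw value can be ordinary or exceptional depending on the pool.
print(percentile_rank(0.428, background))
print(percentile_rank(0.428, [0.40, 0.45, 0.50, 0.55]))
```

Against the synthetic background a 0.428 sits near the very top, while against a pool where everything scores 0.4 to 0.55 it sits near the bottom, so a fixed threshold only makes sense once it is tied to a particular model and corpus.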
3. fzliu ◴[] No.42165357[source]
While the raw similarity score does matter, what typically matters more is the score relative to other documents. In the case of the examples in the notebook, those values were the highest in relative terms.

I can see why this may be unclear/confusing -- we will correct it. Thank you for the feedback!

4. minimaxir ◴[] No.42165524[source]
A 0.4 with cosine similarity is not the same as a 0.4 with sigmoid thresholding.

0.4 cosine similarity is pretty good for real-world data that isn't a near-identical duplicate.
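
A minimal sketch of why the two 0.4s are not comparable, using only the standard definitions (the helper names are mine, not from the notebook). Cosine similarity ranges over [-1, 1] and measures the angle between two vectors; a sigmoid output lies in (0, 1) and is usually read against a 0.5 decision boundary.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; result is in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def sigmoid(z):
    """Logistic function; maps any real logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Inverse of sigmoid for p in (0, 1)."""
    return math.log(p / (1.0 - p))


# A cosine of 0.4 corresponds to an angle between the two vectors.
angle_deg = math.degrees(math.acos(0.4))

# A sigmoid output of 0.4 corresponds to a negative logit, i.e. it falls
# on the "no" side of the usual 0.5 boundary.
z = logit(0.4)

print(angle_deg, z)
```

So a 0.4 from a sigmoid classifier would normally be a rejection, whereas a 0.4 cosine is just a point on a different scale whose meaning depends on how the embedding model spreads out its scores.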