
113 points alexmolas | 3 comments
Xenoamorphous No.45143322
Question for the experts: a few years back (even before COVID times?) I was tasked with building a news aggregator. Universal Sentence Encoder was new, and we didn’t even have BERT back then. It felt magical (at least to a regular software dev) to see how heavily the cosine similarity score correlated with how semantically similar two given pieces of text were. That plus some clustering algorithm got the job done.

A few months ago I happened to play with OpenAI’s embedding models (can’t remember which ones) and I was shocked to see that the cosine similarity of most texts was super close, even when the texts had nothing in common. It’s like the wide 0-1 range that USE (and later BERT) gave me was compressed to perhaps a 0.2-wide one. Why is that? Does it mean those embeddings are not great for semantic similarity?
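A minimal NumPy sketch (synthetic vectors, not any real model) of one common explanation for this, anisotropy: if every embedding shares a large common direction, pairwise cosine similarities all crowd into a narrow high band regardless of content, and subtracting the mean vector spreads them back out. All numbers here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy anisotropic embeddings: every vector shares a large common
# component, with only a small text-specific part on top.
common = rng.normal(size=256)
vecs = [common + 0.2 * rng.normal(size=256) for _ in range(50)]

sims = [cosine(vecs[i], vecs[j])
        for i in range(50) for j in range(i + 1, 50)]
print(min(sims), max(sims))  # a narrow, high band, even for unrelated "texts"

# Centering removes the shared direction and widens the spread.
mean = np.mean(vecs, axis=0)
centered = [v - mean for v in vecs]
csims = [cosine(centered[i], centered[j])
         for i in range(50) for j in range(i + 1, 50)]
print(min(csims), max(csims))
```

Whether centering is appropriate for a given production model is a separate question; this only shows that a compressed score range can arise from the geometry of the embedding space rather than from the texts themselves.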

replies(2): >>45143411 #>>45143449 #
1. minimaxir No.45143411
It's likely because the definition of "similar" varies, and it doesn't necessarily mean semantic similarity. Depending on how the embedding model was trained, texts that merely share a format/syntax can indeed be "similar" along that axis.

The absolute value of the cosine similarity isn't critical (only the relative order when comparing multiple candidates is), but if you finetune an embedding model for a specific domain, it will give a wider range of cosine similarities, since it learns which attributes specifically make texts similar or dissimilar.
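The point about absolute values can be sketched quickly: any monotone rescaling of the scores (here a made-up affine squeeze into a 0.80-0.85 band) leaves the retrieval ranking untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
query = rng.normal(size=64)
docs = rng.normal(size=(10, 64))

def cosine_scores(q, m):
    # Cosine similarity of one query against each row of a matrix.
    q = q / np.linalg.norm(q)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    return m @ q

raw = cosine_scores(query, docs)

# Simulate a "compressed" model whose scores all land in [0.80, 0.85].
compressed = 0.80 + 0.05 * (raw - raw.min()) / (raw.max() - raw.min())

# Top-k retrieval is identical, because the mapping is monotone.
print(np.argsort(-raw)[:3])
print(np.argsort(-compressed)[:3])
```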

replies(1): >>45143509 #
2. teepo No.45143509
Thanks - that helped it click a bit more. If the relative ordering is correct, it doesn't matter that the scores look so compressed.
replies(1): >>45145430 #
3. clickety_clack No.45145430
That’s not necessarily true. If the embedding model hasn’t been trained on the kind of data you care about, similarity might be dominated by features you don’t care about. Maybe you want documents that mention pizza toppings, but the embedding similarity could actually be dominated by tone, word complexity, and the use of em dashes relative to your query. In that case, the relative ordering won’t turn out the way you want either.
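A contrived 4-dimensional sketch of that failure mode: pretend two dimensions encode topic and two encode style, and imagine a model that puts more weight (larger magnitudes) on the style dimensions. The vectors and their meanings are entirely hypothetical.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical layout: dims 0-1 encode topic ("pizza toppings" vs other),
# dims 2-3 encode style (tone, word complexity). Style gets larger weight.
query     = np.array([1.0, 0.0,  5.0,  5.0])  # pizza topic, casual style
on_topic  = np.array([1.0, 0.0, -5.0, -5.0])  # pizza topic, formal style
off_topic = np.array([0.0, 1.0,  5.0,  5.0])  # unrelated topic, same style

print(cosine(query, on_topic))   # low: style disagrees, despite matching topic
print(cosine(query, off_topic))  # high: style agrees, topic doesn't
```

With these weights, the stylistically similar but off-topic document outranks the on-topic one, which is exactly why relative ordering from a model trained on the wrong notion of "similar" can't be trusted.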