gdiamos No.45069705
Their idea is that the representational capacity of even 4096-dimensional vectors limits retrieval performance.

Sparse models like BM25 have a huge dimensionality (the entire vocabulary) and thus don't suffer from this limit, but they don't capture semantics and can't follow instructions.

It seems like the holy grail is a sparse semantic model. I wonder how SPLADE would do?
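As a point of reference for why the sparse side scales, here is a minimal BM25 scorer in Python (a toy sketch, assuming pre-tokenized documents and the usual k1/b defaults, not a production implementation). The effective dimension of the sparse space is the whole vocabulary, yet each document touches only a handful of coordinates:

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Score one tokenized doc against a tokenized query over corpus `docs`."""
    avgdl = sum(len(d) for d in docs) / len(docs)  # average document length
    tf = Counter(doc)                              # term frequencies in this doc
    score = 0.0
    for term in query:
        df = sum(term in d for d in docs)          # document frequency of term
        idf = math.log((len(docs) - df + 0.5) / (df + 0.5) + 1)
        num = tf[term] * (k1 + 1)
        den = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * num / den
    return score

docs = [["sparse", "models", "scale"], ["dense", "vectors", "saturate"]]
print(bm25_score(["sparse"], docs[0], docs))  # nonzero only on lexical overlap
```

SPLADE produces vectors in this same vocabulary-sized space, but with learned term expansions and weights instead of raw counts, which is what makes it a candidate for "sparse but semantic".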

CuriouslyC No.45070552
We already have "sparse" embeddings, in a sense. Google's Matryoshka embedding scheme can scale embeddings from ~150 dimensions to >3k, and it's the same embedding with nested layers of representational meaning. Imagine decomposing an embedding along its principal components and streaming the components in descending order of eigenvalue; that's roughly the idea.
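To make the truncation mechanics concrete, a minimal numpy sketch (random unit vectors stand in for real model outputs; with an actual MRL-trained model the leading dimensions are the most informative by construction):

```python
import numpy as np

rng = np.random.default_rng(0)
full = rng.normal(size=(2, 3072))                    # stand-in for model outputs
full /= np.linalg.norm(full, axis=1, keepdims=True)  # unit-normalize rows

def truncate(emb, k):
    """Keep the first k dimensions and renormalize, Matryoshka-style."""
    head = emb[:, :k]
    return head / np.linalg.norm(head, axis=1, keepdims=True)

small = truncate(full, 256)   # the SAME leading 256 dims for every document
print(small @ small.T)        # cosine similarities remain directly comparable
```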
miven No.45075319
Correct me if I'm misinterpreting something in your argument, but as I see it Matryoshka embeddings just sort the basis vectors of the output space roughly by their importance for the task, PCA-style. So when you truncate your 4096-dimensional embedding down to, say, 256 dimensions, those are the exact same 256 basis vectors doing the core job of encoding the important information for every sample. You're back to dense retrieval on 256-dimensional vectors; all that's been trimmed away is the minor miscellaneous slack that's useful for only a very small fraction of queries.

True sparsity would imply keeping different important basis vectors for different documents, but MRL doesn't magically shuffle basis vectors around depending on what your document contains. Were that the case, cosine similarity between the resulting document embeddings would simply make no sense as a similarity measure.
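A toy numpy contrast of the two regimes (hypothetical 16-dimensional vectors, with per-document top-4 magnitude selection standing in for a learned sparsifier): MRL truncation keeps one shared coordinate subset for all documents, whereas true sparsity keeps a different subset per document, so a plain cosine over the kept coordinates is only well-defined in the shared case:

```python
import numpy as np

rng = np.random.default_rng(1)
docs = rng.normal(size=(3, 16))

# MRL-style truncation: identical support (dims 0..3) for every document,
# so similarity on the kept dims is an honest cosine in one shared subspace.
mrl_support = [set(range(4)) for _ in docs]

# "True" sparsity: each document keeps its own top-4 coordinates, so any
# similarity must come from support overlap (as in inverted-index scoring).
sparse_support = [{int(i) for i in np.argsort(-np.abs(d))[:4]} for d in docs]

print(mrl_support)     # three identical index sets
print(sparse_support)  # generally different index sets per document
```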