Sparse models like BM25 have a huge dimension and thus don’t suffer from this limit, but they don’t capture semantics and can’t follow instructions.
It seems like the holy grail is a sparse semantic model. I wonder how splade would do?
Sparse models like BM25 have a huge dimension and thus don’t suffer from this limit, but they don’t capture semantics and can’t follow instructions.
It seems like the holy grail is a sparse semantic model. I wonder how splade would do?
euclidean embedding
hyperbolic embedding
sparse BM25 / SPLADE lexical search
optional multi-vector signatures
↓ merge & deduplicate candidates
followed by weight scoring, expansion (graph) & rerank (LLM)?In fact, I think we can do it in d=2k dimensions, if we're willing to have arbitrarily precise query vectors.
Embed our points as (sin(theta), cos(theta), sin(2 x theta), cos(2 x theta)..., sin(k x theta), cos(k x theta)), with theta uniformly spaced around the circle, and we should be able to select any k of them.
Using a few more dimensions we can then ease the precision requirements on the query.
In the end they still rely on Maximum Inner Product Search, just with several lookups for smaller partitions of the full embedding, and the largest dataset is Books, where this paper suggests you'd need more than 512 embedding dimensions, and MoL with 256-dimensional embeddings split into 8 parts of 32 each has an abysmal hit rate.
So that's hardly a demonstration that arbitrary high-rank distributions can be approximated well. MoL seems to approximate it better than other approaches, but all of them are clearly hampered by the small embedding size.
Let H(z) = (z- exp(i theta)) (z - exp(i phi))
H(z) is zero only for our two points. We'll now take the norm^2, to get something that's minimized on only our two chosen points.
|H(exp(i x))|^2 = H(z) x conj(H(z)) = sum from -2 to 2 of c_j x exp(i j x)
For some c_j (just multiply it out). Note that this is just a Fourier series with highest harmonic 2 (or k in general), so in fact our c_j tell us the four coefficients to use.
For a simpler, but numerically nastier construction, instead use (t, t^2, t^3, ...), the real moment curve. Then given k points, take the degree k poly with those zeros. Square it, negate, and we get a degree 2k polynomial that's maximized only on our selected k points. The coefficients of this poly are the coefficients of our query vector, as once expanded onto the moment curve we can evaluate polynomials by dot product with the coefficients.
The complex version is exactly the same idea but with z^k and z ranging on the unit circle instead.
When a human tries to retrieve information in a library, they first locate a book by category or by using a metadata keyword search. Then, they open the table of contents (ToC) to find the relevant section, and repeat this process as needed. Therefore, I believe the future of AI retrieval systems should mimic this process. The recently popular PageIndex approach (see this discussioin: https://news.ycombinator.com/item?id=45036944) also belongs to this category, as it generates a table-of-contents–like tree for LLMs to reason over. Again, it is a form of lossy compression, so its limitations can be discussed. However, this approach is the closest to how humans perform retrieval.
No lunch theorem only works, because they assume you care about every single value of noise. Nobody does. There's a free lunch to be had, and it's order. You don't care about a single pixel difference between two cat pictures, NFL does.
Lossy compression is precisely where NFL does not apply.
embedding search (0.4)
lexical/keyword search (0.4)
fuzzy search (0.2)
might indeed achieve the best of all worldsI recommend Judith Flanders' "A Place for Everything" as a both a history and survey of the constraints in sorting and organising information in an alphabetic language. It's also a fun read!
tl;dr why would we want an LLM do something as inefficiently as a human?
I mean, I hear, about experiments of running Turing machine simulation on NN, or even simulation of some physics on NN, but I have not seen any good survey on these topics, and they could be very interest on subject.
True sparsity would imply keeping different important vector bases for different documents, but MRL doesn't magically shuffle vector bases around depending on what's your document contains, were that the case cosine similarity between the resulting documents embeddings would simply make no sense as a similarity measure.
As an example, current AI is famously very poor at learning relationships between non-scalar types, like complex geometry, which humans learn with ease. That isn’t too surprising because the same representation problem exists in non-AI computer science.