The Theoretical Limitations of Embedding-Based Retrieval

(arxiv.org)

147 points fzliu | 2 comments | 29 Aug 25 20:25 UTC

mingtianzhang ◴[30 Aug 25 09:33 UTC] No.45073267[source]▶

We are always looking for representations that can capture the meaning of information. However, most representations that compress information for retrieval are lossy. For example, embeddings are a form of lossy compression. Similar to the no-free-lunch theorem, no lossy compression method is universally better than another, since downstream tasks may depend on the specific information that gets lost. Therefore, the question is not which representation is perfect, but which representation is better aligned with an AI system. Because AI evolves rapidly, it is difficult to predict the limitations of the next generation of LLMs. For this reason, a good representation for information retrieval in future LLM systems should be closer to how humans represent knowledge.

When a human tries to retrieve information in a library, they first locate a book by category or by using a metadata keyword search. Then, they open the table of contents (ToC) to find the relevant section, and repeat this process as needed. Therefore, I believe the future of AI retrieval systems should mimic this process. The recently popular PageIndex approach (see this discussion: https://news.ycombinator.com/item?id=45036944) also belongs to this category, as it generates a table-of-contents–like tree for LLMs to reason over. Again, it is a form of lossy compression, so its limitations can be discussed. However, this approach is the closest to how humans perform retrieval.
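
To make the library analogy concrete, here is a rough sketch of a top-down walk over a ToC-like tree. It is not PageIndex's actual API: `TocNode`, `keyword_overlap`, and `retrieve` are hypothetical names, and the keyword-overlap score is only a stand-in for the step where an LLM would reason about which branch to open.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TocNode:
    """One entry in a table-of-contents-like tree (hypothetical structure)."""
    title: str
    text: str = ""
    children: List["TocNode"] = field(default_factory=list)


def keyword_overlap(query: str, title: str) -> int:
    """Toy relevance score: count of lowercase words shared by query and title.

    Assumption: a real system would ask an LLM to choose the branch instead.
    """
    return len(set(query.lower().split()) & set(title.lower().split()))


def retrieve(
    root: TocNode,
    query: str,
    score: Callable[[str, str], int] = keyword_overlap,
) -> List[str]:
    """Descend from the root to a leaf, opening the best-scoring child each time.

    Returns the list of titles visited, like category -> book -> section.
    """
    path = [root.title]
    node = root
    while node.children:
        node = max(node.children, key=lambda child: score(query, child.title))
        path.append(node.title)
    return path


library = TocNode("Library", children=[
    TocNode("Cooking", children=[
        TocNode("Baking bread", text="Notes on dough and ovens."),
        TocNode("Grilling fish", text="Notes on heat and timing."),
    ]),
    TocNode("Information retrieval", children=[
        TocNode("Keyword search", text="Notes on inverted indexes."),
        TocNode("Embedding search limits", text="Notes on lossy vectors."),
    ]),
])

path = retrieve(library, "limits of embedding retrieval")
print(" > ".join(path))
```

The walk only ever sees titles, so anything a title fails to mention is lost to the search, which is the lossy-compression caveat above in miniature.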

replies(5): >>45073303 #>>45073320 #>>45073346 #>>45073570 #>>45074064 #

1. quadhome ◴[30 Aug 25 12:31 UTC] No.45074064[source]▶

>>45073267 #

Humans only retrieve information in a library in that way due to the past limitations on retrieval and processing. The invention of technologies like tables of contents or even the Dewey Decimal Classification is strongly constrained by fundamental technologies like ... the alphabet! And remember, not all languages are alphabetic. And embeddings aren't alphabetic and don't share the same constraints.

I recommend Judith Flanders' "A Place for Everything" as both a history and survey of the constraints in sorting and organising information in an alphabetic language. It's also a fun read!

tl;dr why would we want an LLM to do something as inefficiently as a human?

replies(1): >>45077746 #

2. mingtianzhang ◴[30 Aug 25 20:28 UTC] No.45077746[source]▶

>>45074064 (TP) #

"why would we want an LLM to do something as inefficiently as a human?" -- That is a good point. Maybe we should rename artificial intelligence (AI) to super-artificial intelligence (SAI).
