An LLM is a lossy encyclopedia

(simonwillison.net)

509 points tosh | 1 comments | 29 Aug 25 09:40 UTC

(the referenced HN thread starts at https://news.ycombinator.com/item?id=45060519)

GuB-42 ◴[02 Sep 25 10:29 UTC] No.45101186[source]▶

>>45062046 (OP) #

There are a lot of parallels between AI and compression.

In fact the best compression algorithms and LLMs have in common that they work by predicting the next word. Compression algorithms take an extra step called entropy coding to encode the difference between the prediction and the actual data efficiently, and the better the prediction, the better the compression ratio.

What makes an LLM "lossy" is that you don't have the "encode the difference" step.

And yes, it means you can turn an LLM into a (lossless) compression algorithm, and I think a really good one in terms of compression ratio on huge data sets. You can also turn a compression algorithm like gzip into a language model! A very terrible one, but the output is better than a random stream of bytes.
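
A minimal sketch of the "better prediction, better compression" point, assuming a toy adaptive byte model rather than an LLM or any real compressor. The function name `ideal_code_length_bits` and the add-one smoothing are illustrative choices, not something from this thread. It only adds up the cost an ideal entropy coder would pay, which is -log2(p) bits for each byte the model predicted with probability p; it does not produce an actual compressed stream.

```python
import math
from collections import defaultdict


def ideal_code_length_bits(data: bytes, order: int) -> float:
    """Total bits an ideal entropy coder would spend on `data` when driven
    by an adaptive model that predicts each byte from the previous
    `order` bytes (hypothetical toy model, add-one smoothing)."""
    counts = defaultdict(lambda: defaultdict(int))  # context -> byte -> count
    totals = defaultdict(int)                       # context -> bytes seen
    bits = 0.0
    for i, sym in enumerate(data):
        ctx = data[max(0, i - order):i]
        # Probability the model assigned to the byte that actually occurred.
        p = (counts[ctx][sym] + 1) / (totals[ctx] + 256)
        # Cost of encoding the gap between prediction and reality.
        bits += -math.log2(p)
        # Update the model only after coding, so a decoder could mirror it.
        counts[ctx][sym] += 1
        totals[ctx] += 1
    return bits


if __name__ == "__main__":
    sample = b"ab" * 500
    print("order 0:", round(ideal_code_length_bits(sample, 0)), "bits")
    print("order 1:", round(ideal_code_length_bits(sample, 1)), "bits")
```

On this repetitive sample the order-1 model, which sees one byte of context, should need fewer bits than the order-0 model. Dropping the `bits += ...` step and just emitting the model's most likely byte is the "lossy" case: you get the prediction without the correction.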

replies(3): >>45101276 #>>45102534 #>>45103227 #

layer8 ◴[02 Sep 25 12:58 UTC] No.45102534[source]▶

>>45101186 #

One difference is that compression gives you one and only one thing when decompressing. Decompression isn't a function taking arbitrary additional input and producing potentially arbitrary, nondeterministic output based on it.

We would have very different conversations if LLMs were things that merely exploded into a singular lossy-expanded version of Wikipedia, but where looking at the article for any topic X would give you the exact same article each time.

replies(1): >>45102611 #

withinboredom ◴[02 Sep 25 13:06 UTC] No.45102611[source]▶

>>45102534 #

LLMs deliberately insert randomness. If you run a model locally (or sometimes via API), you can turn that off and get the same response for the same input every time.
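
A rough sketch of what "turning that off" means, assuming a toy list of logits standing in for a model's output at one step. The name `pick_token` and the convention that temperature 0 means greedy selection are assumptions made for illustration, not any particular library's API.

```python
import math
import random


def softmax(logits, temperature):
    """Convert logits to probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, temperature, rng):
    """Pick the index of the next token (hypothetical helper).

    temperature == 0 is treated as greedy decoding: no randomness at all.
    Otherwise a token is sampled from the softmax distribution using `rng`.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


if __name__ == "__main__":
    logits = [1.0, 3.0, 2.0]
    # Greedy: always index 1, the largest logit.
    print([pick_token(logits, 0, random.Random()) for _ in range(5)])
    # Sampling: varies between runs unless the seed is fixed.
    rng = random.Random(42)
    print([pick_token(logits, 1.0, rng) for _ in range(5)])
```

With temperature 0 the same logits always give the same token. With sampling enabled, fixing the seed makes a run repeatable, but a different seed can give a different sequence.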

replies(1): >>45103024 #

layer8 ◴[02 Sep 25 13:43 UTC] No.45103024{3}[source]▶

>>45102611 #

True, but I'd argue that you can't get the definite knowledge of an LLM by turning off randomness, or fixing the seed. Otherwise that would be a routinely employed feature, to determine what an LLM "truly knows", removing any random noise distorting that knowledge, and instead randomness would only be turned on for tasks requiring creativity, not when merely asking factual questions. But it doesn’t work that way. Different seeds will uncover different "knowledge", and it's not the case that one is a truer representation of an LLM's knowledge than another.

Furthermore, even in the absence of randomness, asking an LLM the same question in different ways can yield different, potentially contradictory answers, even when the difference in prompting is perfectly benign.

replies(1): >>45105145 #

1. withinboredom ◴[02 Sep 25 16:20 UTC] No.45105145{4}[source]▶

>>45103024 #

This is because the knowledge is encoded in a multi-dimensional space, and a seed doesn’t change the knowledge, only the expression of it. If you ask me what E=mc^2 means, I’ll give you different answers depending on whether I think you are a curious lay-person vs. a physicist testing my response.

You see this with humans, who encode physical space into a physical matrix in the brain. When asked for directions, people have to traverse this matrix until the route is memorized; after that the matrix is no longer used and only the rote data is referenced.
