
An LLM is a lossy encyclopedia

(simonwillison.net)
509 points | tosh | 1 comment

(the referenced HN thread starts at https://news.ycombinator.com/item?id=45060519)
latexr ◴[] No.45101170[source]
A lossy encyclopaedia should be missing information and be obvious about it, not making it up without your knowledge and changing the answer every time.

When you have a lossy piece of media, such as a compressed sound or image file, you can always see the resemblance to the original and note the degradation as it happens. You never have a clear JPEG of a lamp, compress it, and get a clear image of the Milky Way, then reopen the image and get a clear image of a pile of dirt.

Furthermore, an encyclopaedia is something you can reference and learn from without a goal; it allows you to peruse information you have no concept of. Not so with LLMs, which you have to query to get an answer.

replies(10): >>45101190 #>>45101267 #>>45101510 #>>45101793 #>>45101924 #>>45102219 #>>45102694 #>>45104357 #>>45108609 #>>45112011 #
gjm11 ◴[] No.45102219[source]
Lossy compression does make things up. We call them compression artefacts.

In compressed audio these can be things like clicks and boings and echoes and pre-echoes. In compressed images they can be ripply effects near edges, banding in smoothly varying regions, but there are also things like https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres... where one digit is replaced with a nice clean version of a different digit, which is pretty on-the-nose for the LLM failure mode you're talking about.

Compression artefacts generally affect small parts of the image or audio or video rather than replacing the whole thing -- but in the analogy, "the whole thing" is an encyclopaedia and the artefacts are affecting little bits of that.

Of course the analogy isn't exact. That would be why S.W. opens his post by saying "Since I love collecting questionable analogies for LLMs,".

replies(3): >>45102280 #>>45102368 #>>45103467 #
moregrist ◴[] No.45103467[source]
> Lossy compression does make things up. We call them compression artefacts.

I don’t think this is a great analogy.

Lossy compression of images or signals tends to throw out information based on how humans perceive it, keeping the perceptually most important parts and discarding the less important ones. For example, JPEG essentially discards high-frequency components of an image, because most of the perceptually relevant information is in the low-frequency parts. Similarly, POTS phone encoding and MP3 both compress audio signals based on how humans perceive audio frequency.

The perceived degradation of most lossy compression increases gradually with the amount of compression, and is not typically what someone means when they say “make things up.”
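To make the "gradual" point concrete, here is a minimal Python sketch of frequency-domain lossy compression. It is a simplification under stated assumptions, not how JPEG is actually implemented: it works on a 1-D toy signal and simply zeroes the high-frequency DCT coefficients, whereas JPEG quantizes coefficients of 8x8 image blocks. The names `dct_matrix`, `lossy_roundtrip`, `rms_error` and the test signal are made up for illustration.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis as an n x n matrix (row k = frequency k)."""
    k = np.arange(n).reshape(-1, 1)  # frequency index
    i = np.arange(n).reshape(1, -1)  # sample index
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0, :] = np.sqrt(1.0 / n)  # DC row needs a different scale factor
    return m


def lossy_roundtrip(signal: np.ndarray, keep: int) -> np.ndarray:
    """Keep only the `keep` lowest-frequency coefficients, then decode."""
    d = dct_matrix(len(signal))
    coeffs = d @ signal
    coeffs[keep:] = 0.0  # hypothetical "compression": drop high frequencies
    return d.T @ coeffs  # inverse transform (d is orthogonal)


def rms_error(signal: np.ndarray, keep: int) -> float:
    """Root-mean-square difference between original and decoded signal."""
    decoded = lossy_roundtrip(signal, keep)
    return float(np.sqrt(np.mean((signal - decoded) ** 2)))


# Toy signal (an assumption for illustration): a slow wave plus a fast ripple.
n = 64
t = np.arange(n) / n
signal = np.sin(2 * np.pi * 2 * t) + 0.3 * np.sin(2 * np.pi * 13 * t)

# Compress harder and harder by keeping fewer coefficients.
keeps = (64, 32, 16, 8, 4)
errors = [rms_error(signal, keep) for keep in keeps]
for keep, err in zip(keeps, errors):
    print(f"keep={keep:2d}  rms_error={err:.4f}")
```

Because the transform is orthonormal, the squared error equals the energy of the dropped coefficients, so the error can only grow as `keep` shrinks. The decoded signal gets blurrier; it never jumps to an unrelated signal.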

LLM hallucinations aren’t gradual and the compression doesn’t seem to follow human perception.

replies(2): >>45104296 #>>45105018 #
1. baq ◴[] No.45104296[source]
LLM confabulations may well be gradual in the latent space. I don’t think lossy is synonymous with perceptual, and the high-frequency components rather easily translate to less popular data.