
An LLM is a lossy encyclopedia

(simonwillison.net)
509 points | tosh | 1 comment

(the referenced HN thread starts at https://news.ycombinator.com/item?id=45060519)
1. rob_c No.45100787
Yes, and working out how to disentangle the information-storage mechanisms from, say, the language processing is a massive area of interest. The only problem with Attention Transformers, imo, is that they're a bit too good :p

Imagine a slightly lossy compression algorithm that could store 10x or 100x more than the current best lossless schemes while maintaining 99.999% fidelity on recall. Probably, very probably, a pipe dream. But then why do large on-device models seem to be able to remember just about everything from Wikipedia, and store it in a smaller format than a direct archive of the source material? (Look at the current best diffusion models as well.)
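
A rough back-of-envelope sketch of the size comparison being gestured at here (not from the comment; every figure below is my own order-of-magnitude assumption: a compressed English Wikipedia text dump of roughly ~20 GB versus a hypothetical 8-billion-parameter on-device model quantized to 4 bits):

    # Back-of-envelope comparison: lossy "knowledge store" (model weights)
    # vs. a lossless archive (compressed Wikipedia text).
    # All figures are assumed, order-of-magnitude values for illustration only.

    wiki_compressed_gb = 22.0   # assumed: compressed English Wikipedia text dump
    model_params = 8e9          # assumed: 8-billion-parameter on-device model
    bytes_per_param = 0.5       # assumed: 4-bit quantization = 0.5 bytes/param

    model_size_gb = model_params * bytes_per_param / 1e9
    ratio = wiki_compressed_gb / model_size_gb

    print(f"quantized model weights:   ~{model_size_gb:.0f} GB")
    print(f"compressed Wikipedia text: ~{wiki_compressed_gb:.0f} GB")
    print(f"model is ~{ratio:.1f}x smaller, but its recall is lossy, not exact")

Under those assumptions the weights come out several times smaller than the compressed text they were (partly) trained on, which is the sense in which the weights behave like a very lossy, very high-ratio compression of the corpus rather than an archive of it.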