Imagine a slightly lossy compression algorithm which could store 10x or 100x more than the current best lossless methods while maintaining 99.999% fidelity when recalling that information. Probably, very probably, a pipe dream. But why do large on-device models seem to be able to remember just about everything from Wikipedia and store it in a smaller format than a direct archive of the source material? (Look at the current best from diffusion models as well.)
llm is a pretty good librarian who has read a ton of books (and doesn't have perfect memory)
even more useful when allowed to think-aloud
even more useful when allowed to write stuff down and check in library db
even more useful when allowed to go browse and pick up some books
even more useful when given a budget for travel and access to other archives
even more useful when …
brrrrt
The models hold more information than they can immediately extract, but CoT can find a key to look it up or synthesise by applying some learned generalisations.
I've got my opinion on whether that's useful or not and it's quite a bit more nuanced. You don't zoom-enhance JPEGs for a reason either.
A slightly more precise analogy is probably 'a lossily compressed snapshot of the web'. Or maybe the Librarian from Snow Crash - but at least that one knew when it didn't know ;)
The old joke is that you can get away with anything with a hi-vis vest and enough confidence, and LLMs pretty much work on that principle
Tell that to the Google Pixel product team:
I can also see them as very clever search engines, since this is one way I use them a lot: ask hard questions about a huge and legacy codebase.
These analogies do not really work for generating new code. A new metaphor I am starting to use is "translator engine": it is translating from human language to programming language. It in a way explains a lot of the stupidity I am seeing.
If you can see the analogy between text and pictures, it drives the point home exactly the right way: in both cases you expect a database to know things it either can't know or has forgotten. If it had a good picture of the zoomed-in background it could probably generate a very good representation of what the cropped part would look like; the same thing works with text.
One of the reasons I like this analogy is that it hints at the fact that you need to use them in a different way - you shouldn't be looking up specific facts in an unassisted LLM outside of things that even lossy compression would capture (like the capital cities of countries).
With GPT-5 I sometimes see it spot a question that needs clarifying in its thinking trace, then pick the most likely answer, then spit out an answer later that says "assuming you meant X ..." - I've even had it provide an answer in two sections for each branch of a clear ambiguity.
When you have a lossy piece of media, such as a compressed sound or image file, you can always see the resemblance to the original and note the degradation as it happens. You never have a clear JPEG of a lamp, compress it, and get a clear image of the Milky Way, then reopen the image and get a clear image of a pile of dirt.
Furthermore, an encyclopaedia is something you can reference and learn from without a goal; it allows you to peruse information you have no concept of. Not so with LLMs, which you have to query to get an answer.
Everything else is mostly playing around and harmful to learning.
In fact the best compression algorithms and LLMs have in common that they work by predicting the next word. Compression algorithms take an extra step called entropy coding to encode the difference between the prediction and the actual data efficiently, and the better the prediction, the better the compression ratio.
What makes a LLM "lossy" is that you don't have the "encode the difference" step.
And yes, it means you can turn a LLM into a (lossless) compression algorithm, and I think a really good one in term of compression ratio on huge data sets. You can also turn a compression algorithm like gzip into a language model! A very terrible one, but the output is better than a random stream of bytes.
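As a toy illustration of that last point (a sketch only, with a made-up context and candidate list), you can use zlib's deflate as a crude next-word scorer by measuring how many extra compressed bytes each candidate continuation costs:

```python
# Hedged sketch: gzip/zlib as a very bad "language model". Candidates that
# compress well given the context are treated as more likely continuations.
import zlib

def byte_cost(context: str, candidate: str) -> int:
    """Extra compressed bytes needed to append `candidate` to `context`."""
    base = len(zlib.compress(context.encode("utf-8")))
    extended = len(zlib.compress((context + candidate).encode("utf-8")))
    return extended - base

context = "the cat sat on the mat. the dog sat on the log. the cat sat on the "
for cand in sorted(["mat", "log", "xylophone"], key=lambda c: byte_cost(context, c)):
    print(byte_cost(context, cand), "extra bytes:", repr(cand))
```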
> The key thing is to develop an intuition for questions it can usefully answer vs questions that are at a level of detail where the lossiness matters
the problem is that in order to develop an intuition for questions that LLMs can answer, the user will at least need to know something about the topic beforehand. I believe that this initial lack of understanding on the user's side is what can lead to taking LLM output as factual. If one side of the exchange knows nothing about the subject, the other side can use jargon and even present random or lossy facts which are almost guaranteed to impress the other side.
> The way to solve this particular problem is to make a correct example available to it.
My question is how much effort would it take to make a correct example available for the LLM before it can output quality and useful data? If the effort I put in is more than what I would get in return, then I feel like it's best to write and reason it myself.
indeed, Ted's piece (ChatGPT Is a Blurry JPEG of the Web) is here:
(but it isn't and won't ever be an oracle and apparently that's a challenge for human psychology.)
If you used sketches to build a house, it has a nonzero chance of falling down. Likewise, if you made technical drawings as a way to brainstorm house designs, the process would be overly rigid and extremely inefficient.
But... end users need to understand this in order to use it effectively. They need to know if the LLM system they are talking to has access to a credible search engine and is good at distinguishing reliable sources from junk.
That's advanced knowledge at the moment!
Oh but it's much worse than that: because most LLMs aren't deterministic in the way they operate [1], you can get a pristine image of a different pile of dirt every single time you ask.
[1] there are models where if you have the "model + prompt + seed" you're at least guaranteed to get the same output every single time. FWIW I use LLMs but I cannot integrate them in anything I produce when what they output ain't deterministic.
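A minimal sketch of that footnote (toy numbers, nothing model-specific): greedy decoding is repeatable by construction, while sampling is only repeatable if you pin the seed.

```python
# Toy illustration: why "model + prompt + seed" gives reproducible output.
import numpy as np

logits = np.array([2.0, 1.0, 0.5])             # scores for three hypothetical tokens
probs = np.exp(logits) / np.exp(logits).sum()

greedy = int(np.argmax(probs))                  # deterministic: same pick every run
rng = np.random.default_rng(seed=42)            # pinning the seed makes sampling repeatable too
sampled = int(rng.choice(len(probs), p=probs))
print("greedy:", greedy, "sampled:", sampled)
```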
I think we will start seeing stateful AI models within the next couple of years and that will be a major milestone that could shake up the space. LLM is merely a stepping stone.
For language learning, it's terrible and will try to teach me wrong things if it's unguided. But pasting e.g. a lesson transcript that I just finished, then asking for exercises based on it helps solidify what I learned if the material doesn't come with drills.
I think writing is one of the things it's kind of terrible at. It's often way too verbose and has a particular 'voice' that I think leaves a bad taste in peoples' mouths. At least this issue has given me the confidence to finally just send single sentence emails so people know I don't use LLMs for this.
My frustrations with LLMs from years ago has largely chilled out as I've gotten better at using them and understanding that they aren't people who I can trust to give solid advice. If you're careful about what you put in and careful about what you take out you can get decent value.
I remember you being surprised when the term “vibe coding” deviated from its original intention (I know you didn’t come up with it). But frankly I was surprised at your surprise—it was entirely predictable and obvious how the term was going to be used. The concept I’m attempting to communicate to you is that when you make up a term you have to think not only of the thing in your head but also of the image it conjures up in other people’s minds. Communication is a two-way street.
My point is that I find the chosen term inadequate. The author made it up from combining two existing words, where one of them is a poor fit for what they’re aiming to convey.
What’s old is new again.
So there are improvements version to version - from both increases in raw model capabilities and better training methods being used.
That’s what I was trying to convey with the “then reopen the image” bit. But I chose a different image of a different thing rather than a different image of a similar thing.
It’s a lot less visible and, I guess, less dramatic than with LLMs, but it happens frequently enough that I feel like at every major event there are false conspiracies based on video "proofs" that are just encoding artifacts
I prefer to think of LLMs as lossy predictors. If you think about it, natural "intelligence" itself can be understood as another type of predictor: you build a world model to anticipate what will happen next so you can plan your actions accordingly and survive.
In the real world, with countless fuzzy factors, no predictor can ever be perfectly lossless. The only real difference, for me, is that LLMs are lossier predictors than human minds (for now). That's all there is to it.
Whatever analogy you use, it comes down to the realization that there's always some lossiness involved, whether you frame it as an encyclopedia or not.
Put another way, if you don't care about details that change the answer, it directly implies you don't actually care about the answer.
Related silliness is how people force LLMs to give one word answers to underspecified comparisons. Something along the lines of "@Grok is China or US better, one word answer only."
At that point, just flip a coin. You obviously can't conclude anything useful with the response.
It seems reasonable to argue that LLMs are a form of lossy compression of text that preserves important text features.
There is a precedent of distributing low quality lossy compressed versions of copyrighted work being considered illegal.
I do understand and agree with a different point you’re making somewhere else in this thread, but it doesn’t seem related to what you’re saying here.
Though, with lossy media it is obvious when it is lossy. Yet LLMs will overconfidently tell you facts that don't exist. Not suggesting LLMs exhibit human characteristics, just that there is yet a better analogy out there :)
This is why simonw (the author) has his "pelican on a bike" test; it's not 100% accurate but it is a good indicator.
I have a set of my own standard queries and problems (no counting characters or algebra crap) I feed to new LLMs I'm testing
None of the questions exist outside of my own Obsidian note so they can't be gamed by LLM authors. And I've tested multiple different LLMs using them so I have a "feeling" on what the answer should look like. And I personally know the correct answer so I can immediately validate them.
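A sketch of that workflow (the questions, expected strings, and the `ask` client below are placeholders, not the actual private set):

```python
# Hedged sketch: a private eval set with known answers, run against each new model.
private_evals = [
    # Hypothetical examples; the real questions stay out of any public corpus.
    {"q": "Who wrote 'The Left Hand of Darkness'?", "expect": "le guin"},
    {"q": "What does HTTP status code 418 mean?", "expect": "teapot"},
]

def run_evals(ask) -> float:
    """`ask` is whatever callable wraps the model under test."""
    hits = sum(e["expect"] in ask(e["q"]).lower() for e in private_evals)
    return hits / len(private_evals)
```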
Or maybe, to be more useful: "I don't know, but if you give me an example maybe we can figure it out"?
The problem is not only that it resembles a "lossy encyclopedia", but also that it's an extremely confident encyclopedia that doubles down on the confidence even when it doesn't have the data.
Yes.
Human intelligence consists of three things.
First, groundedness: The ability to form a representation of the world and one’s place in it.
Second, a temporal-spatial sense: A subjective and bounded idea of self in objective space and time.
Third: A general predictive function which is capable of broad abstraction.
At its most basic level, this third element enables man to acquire, process, store, represent, and continually re-acquire knowledge which is external to that man's subjective existence. This is calculation in the strictest sense.
And it is the third element -- the strength, speed, and breadth of the predictive function -- which is synonymous with the word "intelligence." Higher animals have all three elements, but they're pretty hazy -- especially the third. And, in humans, short time horizons are synonymous with intellectual dullness.
All of this is to say that if you have a "prediction machine" you're 90% of the way to a true "intelligence machine." It also, I think, suggests routes that might lead to more robust AI in the future. (Ground the AI, give it a limited physical presence in time and space, match its clocks to the outside world.)
The foundational conceit (if you will) of LLMs is that they build a semantic (world) model to 'make sense' of their training. However it is much more likely that they are simply building a syntactic model in response to the training. As far as I know there is no evidence of a semantic model emerging.
In compressed audio these can be things like clicks and boings and echoes and pre-echoes. In compressed images they can be ripply effects near edges, banding in smoothly varying regions, but there are also things like https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres... where one digit is replaced with a nice clean version of a different digit, which is pretty on-the-nose for the LLM failure mode you're talking about.
Compression artefacts generally affect small parts of the image or audio or video rather than replacing the whole thing -- but in the analogy, "the whole thing" is an encyclopaedia and the artefacts are affecting little bits of that.
Of course the analogy isn't exact. That would be why S.W. opens his post by saying "Since I love collecting questionable analogies for LLMs,".
This is why I've said a few times here on HN and elsewhere, if you're using an LLM you need to think of yourself as an architect guiding a Junior to Mid Level developer. Juniors can do amazing things, they can also goof up hard. What's really funny is you can make them audit their own code in a new context window, and give you a detailed answer as to why that code is awful.
I use it mostly on personal projects especially since I can prototype quickly as needed.
To understand what the user meant, before LLMs we had to train several NLP+ML models just to get something going, and in my experience those never came close to what LLMs do now.
I remember the first time I tried ChatGPT and I was surprised by how well it understood every input.
In other words, it's not thinking. The fact that it can simulate a conversation between thinking humans without thinking is remarkable. It should tell us something about the facility for language. But it's not understanding or thinking.
> Of course the analogy isn't exact.
And I don’t expect it to be, which is something I’ve made clear several times before, including on this very thread.
I couldn't get something like that done one-shot with Claude. On the other hand, Claude did give me a lot of assistance at writing this
https://gist.github.com/Lerc/43540d8d581b2be8155a6a4e6e85c94...
Which is a Micropython setup of a ST7789 SPI display on a RP2350 using multiple DMA channels to provide a live updating paletted frame buffer. Once setup, you write to the SRAM, it appears on the display, without CPU involvement.
I started by feeding it the source of [Dmitry's](https://dmitry.gr/?r=06.%20Thoughts&proj=09.ComplexPioMachin...) C version of the paletted technique.
The chatbot, of course, emitted something completely broken, but it was enough for me to see where it was headed. By the time I got it working there were maybe no lines of its original output left, but much of what replaced it was also LLM-generated. Given I was pretty much new to MicroPython, SPI, the ST7789, and the Pico's PIO, it let me build something that, if I had been doing it alone, I suspect I would have given up on before getting it working. (probably when I put my thumbnail through Display #1)
When I get a chance, I'll tidy it up properly, and put it on github.
At the moment I'm playing with Gemini to see if I can make a tile+sprites mode that generates the scanlines as they go to the display (without using CPU)
The goal of an LLM is not to give you 100% accurate answers. The goal of an LLM is to continue the conversation.
The worm in that apple is that you still need educated humans to catch the erroneous LLM output.
This drastically depends on the example. For average trivia questions, modern LLMs (even smaller, open ones) beat humans easily.
It seems to me the more you can pin it to another data set, the better.
We would have very different conversations if LLMs were things that merely exploded into a singular lossy-expanded version of Wikipedia, but where looking at the article for any topic X would give you the exact same article each time.
The intuitions that we've developed around previous interactions are very misleading when applied to LLMs. When interacting with a human, we're used to being able to ask a question about topic X in context Y and assume that if you can answer it we can rely on you to be able to talk about it in the very similar context Z.
But LLMs are bad at commutative facts; A=B and B=A can have different performance characteristics. Just because it can answer A=B does not mean it is good at answering B=A; you have to test them separately.
I've seen researchers who should really know better screw this up, rendering their methodology useless for the claim they're trying to validate. Our intuition for how humans do things can be very misleading when working with LLMs.
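A sketch of what testing both directions separately looks like (placeholder fact pair and `ask_llm` client; only the structure matters):

```python
# Hedged sketch: score "A -> B" and "B -> A" as separate test cases, because
# success in one direction does not imply success in the other.
fact_pairs = [
    ("Which city is the capital of Australia?", "canberra",
     "Canberra is the capital of which country?", "australia"),
]

def directional_scores(ask_llm):
    forward = sum(exp in ask_llm(q).lower() for q, exp, _, _ in fact_pairs)
    backward = sum(exp in ask_llm(q).lower() for _, _, q, exp in fact_pairs)
    return forward / len(fact_pairs), backward / len(fact_pairs)
```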
This is still entirely referential, but in a way that a human would see some relation to the actual thing, albeit in a somewhat weird and alien way.
But it falls a bit short in that encyclopedias, lossy or not, shouldn't affirmatively contain false information. The way I would picture a lossy encyclopedia is that it can misdirect by omission, but it would not change A to ¬A.
Maybe a truthy-roulette encyclopedia?
If one takes it as a language engine which translates human language into API calls, and API call results to human language, it would appear to be a non-lossy encyclopedia.
It is the basic building block which enables computers to handle natural language.
The simulated intelligence is proof of its capability as a language model, but it is often so dumb that it doesn't feel like a "knowledge model".
Interacting with a base model versus an instruction tuned model will quickly show you the difference between the innate language faculties and the post-trained behavior.
LLM are animatronic rubber ducks.
https://en.wikipedia.org/wiki/Rubber_duck_debugging
( and obviously like all analogies - this one is lossy )
It was quickly discovered that LLMs are capable of re-checking their own solutions if prompted - and, with the right prompts, are capable of spotting and correcting their own errors at a significantly-greater-than-chance rate. They just don't do it unprompted.
Eventually, it was found that reasoning RLVR consistently gets LLMs to check themselves and backtrack. It was also confirmed that this latent "error detection and correction" capability is present even at base model level, but is almost never exposed - not in base models and not in non-reasoning instruct-tuned LLMs.
The hypothesis I subscribe to is that any LLM has a strong "character self-consistency drive". This makes it reluctant to say "wait, no, maybe I was wrong just now", even if a latent awareness that the past reasoning looks sketchy as fuck is already present within the LLM. Reasoning RLVR encourages going against that drive and utilizing those latent error-correction capabilities.
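The "prompted self-check" behaviour described above can be sketched roughly like this (the `chat` function is a placeholder for whatever client you use, and the prompts are illustrative, not a recipe from any particular lab):

```python
# Hedged sketch: ask for an answer, then explicitly prompt the model to re-check it.
def answer_with_self_check(chat, question: str) -> str:
    draft = chat(f"Answer the following question:\n{question}")
    review = chat(
        "Below is a question and a draft answer. Check the draft for errors, "
        "and then output a corrected final answer.\n\n"
        f"Question: {question}\n\nDraft answer: {draft}"
    )
    return review
```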
Even the actual search engines aren't using them this way. Google's "AI Overview" is actively harmful to trying to learn anything you aren't already familiar with.
RAG is one of the coolest things I've ever used with an LLM, and it would be exponentially more helpful to me in the majority of the AI tools marketed to me than the nonsense they do implement.
The thing is, coding can (and should) be part of the design process. Many times, I thought I had a good idea of what the solution should look like, then while coding, I got exposed to more of the libraries and other parts of the code, which led me to a more refined approach. This exposure is what you will miss, and it will quickly result in unfamiliar code.
"Language, Halliday argues, "cannot be equated with 'the set of all grammatical sentences', whether that set is conceived of as finite or infinite". He rejects the use of formal logic in linguistic theories as "irrelevant to the understanding of language" and the use of such approaches as "disastrous for linguistics"."
Computers are deterministic. Most of the time. If you really don't think about all the times they aren't. But if you leave the CPU-land and go out into the real world, you don't have the privilege of working with deterministic systems at all.
Engineering with LLMs is closer to "designing a robust industrial process that's going to be performed by unskilled minimum wage workers" than it is to "writing a software algorithm". It's still an engineering problem - but of the kind that requires an entirely different frame of mind to tackle.
Simon's llm client tool is on every one of my machines and I use it daily
Furthermore, even in the absence of randomness, asking an LLM the same question in different ways can yield different, potentially contradictory answers, even when the difference in prompting is perfectly benign.
An LLM is basically a program runtime. Code in -> output. There's a P(correct output | program), and the better the model or the program, the higher it is. Even a bad model can produce the right output if you feed it the right program -- the hardest output is easy if the program is just "here's the output I want you to produce, parrot it verbatim". The key is being able to search efficiently for a program that has the highest marginal P(success).
I used ChatGPT 5 over the weekend to double check dosing guidelines for a specific medication. "Provide dosage guidelines for medication [insert here]"
It spit back dosing guidelines that were an order of magnitude wrong (suggested 100mcg instead of 1mg). When I saw 100mcg, I was suspicious and said "I don't think that's right" and it quickly corrected itself and provided the correct dosing guidelines.
These are the kind of innocent errors that can be dangerous if users trust it blindly.
The main challenge is that LLMs aren't able to gauge confidence in their answers, so they can't adjust how confidently they communicate information back to you. It's like compressing a photo and the photographer wrongly saying "here's the best quality image I have!" - do you trust the photographer at their word, or do you challenge them to find a better quality image?
One, that's got to be a recipe for All Overfit All The Time, or at least I don't understand how you avoid overfit when the expected output is a reconstruction of atomic, individual facts. And two, this mass of embedded parameters has got to make them costlier, less efficient to run, as well as plain less useful, than if they were backed by e.g. knowledge graphs (ideally annotated with sources of truth), and were optimized toward querying such graphs robustly as opposed to trying and necessarily failing to remember the contents in exhaustive detail.
Model weights are a terrible way to store data. Surely I can't be the only nerd out there who feels that a model should not try to be an encyclopedia and should certainly never pretend to be one?
I suppose it boils down to marketing. Models are sold as "smart", and what smart is supposed to look like in Western culture is confidently spouting fact-shaped sentences about any topic. So that's what we're getting. What a waste.
Me: How do I change the language settings on YouTube?
Claude: Scroll to the bottom of the page and click the language button on the footer.
Me: YouTube pages scroll infinitely.
Claude: Sorry! Just click on the footer without scrolling, or navigate to a page where you can scroll to the bottom like a video.
(Videos pages also scroll indefinitely through comments)
Me: There is no footer, you're just making shit up
Claude: [finally uses a search engine to find the right answer]
In my experience as a human, the more you know about a subject, or even the more you have simply seen content about it, the easier it is to ramble on about it convincingly. It's like a mirroring skill, and it does not actually mean you understand what you're saying.
LLMs seem to do the same thing, I think. At scale this is widely useful, though, I am not discounting it. Just think it's an order of magnitude below what's possible and all this talk of existing stream-of-consciousness-like LLMs creating AGI seems like a miss
Even something as simple as catching a ball is basically predictive. You predict where the ball will be along its arc when it reaches a point in space where you can catch it. Then, strictly informed by that prediction, you solve a problem of motion through space -- and some very simple-seeming problems of motion through space can't be cracked in a general case without a very powerful supercomputer -- to physically catch the ball.
That's a very simple example. The major component of what we call intelligence is purely predictive. Of course Bayesian inference also works the same way, etc.
So can a traditional encyclopedia.
Why is one nostril always clogged up when breathing, and why does it seem to switch now and then?
It's magnificent that I can finally get an answer for that; I would never have imagined it is completely natural. Never learned about it at school.
No way I can find this via search engine as it just gives me SEO garbage or anecdotal silliness.
I've been going back and getting answers to many questions I previously couldn't.
Also every encyclopedia is full of things that are wrong. People seem to be forgetting this basic issue. Any given authoritative, well respected source will contain mistakes, errors of omission, and downright lies. Part of a proper education used to be that when writing things, you need to cite your sources, and the sources can't just be an encyclopedia. I use LLMs a lot, but if anything is really important, I'm going to fact check it and look for other sources.
LLMs are compression algorithms
The more I use it, the more surprised I am at its capabilities, it really is like a beginner dev, but one that doesn't learn from its mistakes (yet anyway). I find myself asking it to do more and more.
Since late 2022, I've used LLMs extensively for coding, copywriting, research, and everything in between and I've slowly gone from "this is amazing" to "this is extremely useful but probably extremely overhyped" to "this might not actually be all that useful at all". Where accuracy matters, fact checking these things takes as much time as just doing the work manually. I think its most useful application is as a tool for spammers and bots, and that doesn't exactly bode well for the companies spending hundreds of billions of dollars on the tech.
I don’t think this is a great analogy.
Lossy compression of images or signals tends to throw out information based on how humans perceive it, focusing on the most important perceptual parts and discarding the less important parts. For example, JPEG essentially removes high frequency components from an image because more information is present with the low frequency parts. Similarly, POTS phone encoding and mp3 both compress audio signals based on how humans perceive audio frequency.
The perceived degradation of most lossy compression is gradual with the amount of compression and not typically what someone means when they say “make things up.”
LLM hallucinations aren’t gradual and the compression doesn’t seem to follow human perception.
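To see the "gradual" part concretely, a quick sketch (assumes Pillow and a local photo.jpg; both are stand-ins):

```python
# Hedged sketch: lossy image compression degrades smoothly as quality drops;
# every output is a blurrier version of the *same* scene, never a different one.
from PIL import Image

img = Image.open("photo.jpg")  # hypothetical input file
for quality in (90, 50, 10):
    img.save(f"photo_q{quality}.jpg", "JPEG", quality=quality)
```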
We haven't reached the stage yet where the majority of people are as sceptical of chatbots as they are of Wikipedia.
I get that even if people know not to trust a wiki, they might anyway, because, meh, good enough, but I'd still like us to move to a stage where the majority is at least somewhat aware that the chatbot might be wrong.
No, that is not what I’m saying. My point is closer to “the words chosen to describe the made up concept do not translate to the idea being conveyed”. I tried to make that fit into your idea of the banana and squishy hammer, but now we’re several levels of abstraction deep using analogies to discuss analogies so it’s getting complicated to communicate clearly.
> Simon is saying don't use a banana as a hammer.
Which I agree with.
As of today, 'bad' generations early in the sequence still do tend towards responses that are distant to the ideal response. This is testable/verifiable by pre-filling responses, which I'd advise you to experiment with for yourself.
'Bad' generations early in the output sequence are somewhat mitigatable by injecting self-reflection tokens like 'wait', or with more sophisticated test-time compute techniques. However, those remedies can simultaneously turn 'good' generations into bad, they are post-hoc heuristics which treat symptoms not causes.
In general, as the models become larger they are able to compress more of their training data. So yes, using the terminology of the commenter I was responding to, larger models should tend to have fewer 'compression artefacts' than smaller models.
I do agree that to get the full usage out of an LLM you should have some familiarity with what you're asking about. If you didn't already have a sense of what a typical dosage is, why wouldn't 100mcg seem right?
>I prefer to think of LLMs as lossy predictors.
I've started to call them the Great Filter.
In the latest issue of the comic book Lex Luthor attempts to exterminate humanity by hacking the LLM and having it inform humanity that they can hold their breath underwater for 17 hours.
1. My project involved programming languages and APIs that iterated several versions faster than an LLM (or a published book) could keep up with.
2. I have lost faith in LLMs developing software.
An example is the famous Unity game engine, where LLMs have not helped me with its DOTS architecture (ECS mode, as opposed to GameObject mode). Although I have a basic understanding of it, both the Entities API documentation and the LLM answers are terrible. I chose Unity because I heard it is mature, so I figured LLMs would be helpful given how much material exists. Sadly, for ECS they aren't. So I switched to Bevy, a game engine that I can understand and apply by reading the documentation, and where I can solve problems without the help of an LLM.
CS never solved the incoherence of language, the conduit-metaphor paradox. It's stuck behind language's bottleneck, and it does so willingly, blind-eyed.
Wikipedia can also lie, obviously, but it at least requires sources to be cited, and I can dig deeper into topics at my leisure or need in order to improve my knowledge.
I cannot do either with an LLM. It is not obligated to cite sources, and even if it is it can just make shit up that’s impossible to follow or leads back to AI-generated slop - self-referencing, in other words. It also doesn’t teach you (by default, and my opinions of its teaching skills are an entirely different topic), but instead gives you an authoritative answer in tone, but not in practice.
Normalizing LLMs as “lossy encyclopedias” is a dangerous trend in my opinion, because it effectively handwaves the need for critical thinking skills associated with research and complex task execution, something in sore supply in the modern, Western world.
Giving LLMs credibility as “lossless encyclopedias” is tacit approval of further dumbing-down of humanity through answer engines instead of building critical thinking skills.
Beyond this point engineers actually have to know what signaling is, rather than 'information.'
https://www.sciencedirect.com/science/article/abs/pii/S00033...
Ultimately, engineering chose the wrong approach to automating language, and it sinks the field. It's irreversible.
If everyone understood the distinction and their limitations, they wouldn’t be enjoying this level of hype, or leading to teen suicides and people giving themselves centuries-old psychiatric illnesses. If you “go out into the real world” you learn people do not understand LLMs aren’t deterministic and that they shouldn’t blindly accept their outputs.
https://archive.ph/20241023235325/https://www.nytimes.com/20...
https://archive.ph/20250808145022/https://www.404media.co/gu...
For things like this, it would definitely be better for it to act more like a search engine and direct me to trustworthy sources for the information rather than try to provide the information directly.
The other 99+% is all of the lossy knowledge that isn't even in encyclopedias in the first place.
Including going much, much, much deeper than e.g. Wikipedia in many areas. So there it's not "lossy" -- it's effectively the opposite, i.e. "super resolution".
And very, very little of what I look up using LLM's is anywhere in Wikipedia to begin with.
I think there's a parallel here with the internet as an information source. It delivered on "unlimited knowledge at the tip of everyone's fingertips", but lowering the bar to access also lowered the bar for quality.
That access "works" only when the user is capable of doing their part too. Evaluating sources, integrating knowledge. Validating. Cross examining.
Now we are just more used to recognizing that accessibility comes with its own problem.
Some of this is down to general education. Some to domain expertise. Personality plays a big part.
The biggest factor is, I think, intelligence. There's a lot of 2nd- and 3rd-order thinking required to simultaneously entertain a curiosity, consider how the LLM works, and exercise different levels of skepticism depending on the types of errors LLMs are likely to make.
Using LLMs correctly and incorrectly is.. subtle.
I have good insurance and have a primary care doctor with whom I have good rapport. But I can’t talk to her every time I have a medical question—it can take weeks to just get a phone call! If I manage to get an appointment, it’s a 15 minute slot, and I have to try to remember all of the relevant info as we speed through possible diagnoses.
Using an llm not for diagnosis but to shape my knowledge means that my questions are better and more pointed, and I have a baseline understanding of the terminology. They’ll steer you wrong on the fine points, but they’ll also steer you _right_ on the general stuff in a way that Dr. Google doesn’t.
One other anecdote. My daughter went to the ER earlier this year with some concerning symptoms. The first panel of doctors dismissed it as normal childhood stuff and sent her home. It took 24 hours, a second visit, and an ambulance ride to a children’s hospital to get to the real cause. Meanwhile, I gave a comprehensive description of her symptoms and history to an llm to try to get a handle on what I should be asking the doctors, and it gave me some possible diagnoses—including a very rare one that turned out to be the cause. (Kid is doing great now). I’m still gonna take my kids to the doctor when they’re sick, of course, but I’m also going to use whatever tools I can to get a better sense of how to manage our health and how to interact with the medical system.
We are all free to agree with one part of an argument while disagreeing with another. That’s what healthy discourse is, life is not black and white. As way of example, if one says “apples are tasty because they are red”, it is perfectly congruent to agree apples are tasty but disagree that their colour is the reason. And by doing so we engage in a conversation to correct a misconception.
Then what is creativity? Creativity is not predictive and is the most important part of human intelligence, since it isn't about figuring out if a situation leads to good things, its about finding a new kind of situation that leads to good things.
Don't say "we do totally random things and try to predict those outcomes", there is nothing supporting that since we have tried that with computers and that doesn't result in creativity anything like humans, we don't know how human creativity works.
I just posted a full write up of the idea to HN: https://news.ycombinator.com/item?id=45103597
You have a moral duty to keep your books, and keep your locally-stored information.
The "naive" vision implementation for LLMs is: break the input image down into N tokens and cram those tokens into the context window. The "break the input image down" part is completely unaware of the LLM's context, and doesn't know what data would be useful to the LLM at all. Often, the vision frontend just tries to convey the general "vibes" of the image to the LLM backend, and hopes that the LLM can pick out something useful from that.
Which is "good enough" for a lot of tasks, but not all of them, not at all.
Making more unfounded, nonsensical claims does not reinforce your first unfounded, nonsensical claim.
I'm sure statisticians would love it if the human mind were an inference machine, but that doesn't make it one. Your point of view on this is faith-based.
LLMs aren’t being sold as unreliable. On the contrary, they are being sold as the tool which will replace everyone and do a better job at a fraction of the price.
If you're hitching your wagon to human linguists, you'll always find yourself in a ditch in the end.
"LLM is like an overconfident human" certainly beats both "LLM is like a computer program" and "LLM is like a machine god". It's not perfect, but it's the best fit at 2 words or less.
The internet misleads you. Which is why we develop good BS detectors, and double-check information as needed. There isn't perfect information anywhere. Even official docs are often riddled with errors and inconsistencies.
Regardless, its diagnostic capability is distinct from the dangers it presents, which is what the parent comment was mentioning.
It's also useful to have an intuition for what things an LLM is liable to get wrong/hallucinate, one of which is questions where the question itself suggests one or more obvious answers (which may or may not be correct), which the LLM may well then hallucinate, and sound reasonable, if it doesn't "know".
"Though the vast majority of the books in this universe are pure gibberish, the laws of probability dictate that the library also must contain, somewhere, every coherent book ever written, or that might ever be written, and every possible permutation or slightly erroneous version of every one of those books. " -https://en.wikipedia.org/wiki/The_Library_of_Babel
Calling them "lossy encyclopedias" isn't intended as a compliment! The whole point of the analogy is to emphasize that using them in place of an encyclopedia is a bad way to apply them.
At least Wikipedia has sources that probably support what it says, and normally the quotes are real quotes. LLMs just seem to add quotation marks as "proof" that they're confident something is correct.
OpenAI's in-house reasoning training is probably best in class, but even lesser naive implementations go a long way.
The bill is due on NLP.
So long as people are dumb enough to gleefully cede their expertise and sovereignty to a chatbot, I’ll keep desperately screaming into the void that they’re idiots for doing so.
If language doesn't really mean anything, then automating it in geometry is worse than problematic.
The solution is starting over at 1947: measurement not counting.
Pre LLMs we had already been working on content generation using prior tech, including texture generation pre diffusion models and voice generation (although it sounded terrible). At my company we spent hours discussing the difference between various data compression algorithms and ML techniques/model architectures and what was happening inside ML models and also, inside our brains! But even then we didn't think anything we were discussing was novel at all, these ideas were (and still are) obvious.
Anyway, back on the topic, of the LLM as encyclopedia, you can USE an LLM for encyclopedia-like workloads, and in some cases it is better or worse than an actual encyclopedia. But in the end, encyclopedias are written by flawed humans just like all the data that went into training the LLM was written by flawed humans. Both encyclopedias and LLMs are flawed and in different ways, but LLMs at least can do new things.
I actually think a better analogy to an LLM is to the human brain than an encyclopedia, lossy or not. I think we massively overrate our brains and underrate LLMs. The older I've gotten the more I realize the vast majority of people talk absolute rubbish most of the time, exaggerate their knowledge, spout "truths" which are totally inaccurate, and fake it till they make it throughout most of their life. If you were fact checking the entire population on everything they said on a day to day basis, I think the level of "hallucination" would be much higher than Claude Opus 4.1. That is, I think our level of scrutiny is MUCH higher for LLMs than it is for our friends and co-workers. We tend to assume that if another human says something to us like "New York has a higher level of crime than Buenos Aires", we take them at face level usually, due to various psychological and social priming. But we fact check our LLMs on statements such as these.
Compression artifacts (which are deterministic distortions in reconstruction) are not the same as hallucinations (plausible samples from a generative model; even when greedy, this is still sampling from the conditional distribution). A better identification is with super-resolution. If we use a generative model, the result will be clearer than a normal blotchy resize but a lot of details about the image will have changed as the model provides its best guesses at what the missing information could have been. LLMs aren't meant to reconstruct a source even though we can attempt to sample their distribution for snippets that are reasonable facsimiles from the original data.
An LLM provides a way to compute the probability of given strings. Once paired with entropy coding, on-line learning on the target data allows us to arrive at the correct MDL based lossless compression view of LLMs.
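A toy version of that view (an order-0 character model standing in for the LLM; the point is just that code length is -log2 of the model's probability):

```python
# Hedged sketch: a better predictive model assigns shorter code lengths, which an
# entropy coder can realize in practice -- prediction quality IS compression ratio.
import math
from collections import Counter

def code_length_bits(text: str) -> float:
    counts = Counter(text)                      # crude order-0 "model" fit on the text itself
    total = sum(counts.values())
    return sum(-math.log2(counts[ch] / total) for ch in text)

sample = "the cat sat on the mat"
print(f"{code_length_bits(sample):.1f} bits vs {8 * len(sample)} raw bits")
```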
Also, humans hallucinate more than LLMs.
The LLM is "lossily" containing things an encyclopedia would never contain. An encyclopedia, no matter how large, would never contain the entire text of every textbook it deems worth of inclusion. It would always contain a summary and/or discussion of the contents. The LLM does, though it "compresses" over it, so that it, too, only has the gist at whatever granularity it's big enough to contain.
So in that sense, an encyclopedia is also a lossy encyclopedia.
You see this with humans, who encode physical space into a mental matrix in the brain. When asking for directions, people have to traverse this matrix until the route is memorized; then it isn't used any longer, only the rote data is referenced.
An llm is also a more convenient encyclopedia.
I'm not surprised a large portion of people choose convenience over correctness. I do not necessarily agree with the choice, but looking at historical trends, I do not find it surprising that it's a popular choice.
If you want an LLM to be part of a tool that is intended to provide access to (presumably with some added value) encyclopedic information, it is best not to consider the LLM as providing any part of the encyclopedic information function of the system, but instead as providing part of the user interface of the system. The encyclopedic information should be provided by appropriate tooling that, at request by an appropriately prompted LLM or at direction of an orchestration layer with access to user requests (and both kinds of tooling might be used in the same system) provides relevant factual data which is inserted into the LLM’s context.
The correct modifier to insert into the sentence “An LLM is an encyclopedia” is “not”, not “lossy”.
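A sketch of the arrangement described above, with `chat` and `search_encyclopedia` as placeholders for the model client and whatever retrieval backend supplies the facts:

```python
# Hedged sketch: the LLM is the user interface; the encyclopedic information
# comes from tooling and is inserted into the context, not recalled from weights.
def answer(chat, search_encyclopedia, user_question: str) -> str:
    passages = search_encyclopedia(user_question, top_k=3)
    prompt = (
        "Answer the question using only the reference material below. "
        "If the material does not contain the answer, say you don't know.\n\n"
        "Reference material:\n" + "\n\n".join(passages) +
        f"\n\nQuestion: {user_question}"
    )
    return chat(prompt)
```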
https://pmc.ncbi.nlm.nih.gov/articles/PMC3005627/
"First, cell assemblies are best understood in light of their output product, as detected by ‘reader-actuator’ mechanisms. Second, I suggest that the hierarchical organization of cell assemblies may be regarded as a neural syntax. Third, constituents of the neural syntax are linked together by dynamically changing constellations of synaptic weights (‘synapsembles’). Existing support for this tripartite framework is reviewed and strategies for experimental testing of its predictions are discussed."
Again, never really want a confidently-wrong encyclopedia, though
On the other hand, biological neural networks are doing it all the time :) And there might well be an advantage to it (or a hybrid method), once we can make it more economical.
After all, the embedding vector space is shaped by the distribution of training data, and if you have out-of-distribution data coming in due to a new or changed environment, RAG using pre-trained models and their vector spaces will only go so far.
That study ended the "you can't trust Wikipedia" argument. You can't fully trust anything, but Wikipedia is an as-good-as-it-gets second-hand reference.
At the start of a hype cycle there is a lot of good discussions, then most reasonable people have established their opinions and stop engaging with it.
I also have good insurance and a PCP. The idea that I could call them up just to ask “should I start doing this new exercise” or “how much aspirin for this sprained ankle?” is completely divorced from reality.
1: https://en.wikipedia.org/wiki/Predictive_coding
2: https://en.wikipedia.org/wiki/Free_energy_principle#Active_i...
3: https://openreview.net/forum?id=BZ5a1r-kVsf
4: https://people.idsia.ch/~juergen/lecun-rehash-1990-2022.html
Creativity shows up when an agent uses that predictive machinery not only to forecast immediate sensory consequences, but to (a) simulate many alternative internal models or actions (counterfactuals), usually in a self-directed way with an end or goal in mind, (b) predict how those alternatives will be interpreted by other agents or by itself in the future, and (c) select from those alternatives according to an intrinsic/extrinsic valuation that rewards novelty, surprise, utility, or aesthetic pleasure. In other words it's a form of guided meta-prediction.
From a very different perspective, the TRIZ guys have tried to figure out creativity, with results that are at least interesting. Ultimately, what they have to teach is that non-artistic creativity also takes certain characteristic forms.
I grew up in socialism. Since we've transitioned to democracy, I learned that I have to unlearn some things. Our encyclopedias were not inaccurate but were not complete. It's like lying through omission. And as the old saying goes, half-truths are worse than lies.
Whether this would be deemed as a lossy encyclopedia, I don't know. What I am certain of, however, is that it was accurate but omitted important additional facts.
And that is what I see in LLMs as well. Overall, it's accurate, except in cases where an additional fact would alter the conclusion. So, it either could not find arguments with that fact, or it chose to ignore them to give an answer and could be prompted into taking them into account or whatever.
What I do know is that the LLMs of today give me the same heebie-jeebies that rereading those encyclopedias of my youth gives me.
An encyclopedia could say "general relativity is how the universe works" or it could say "general relativity and quantum mechanics describe how we understand the universe today and scientists are still searching for universal theory".
Both are short but the first statement is omitting important facts. Lossy in the sense of not explaining details is ok, but omitting swathes of information would be wrong.
As the saying goes, “if my grandmother had wheels, she’d be a a wagon.”
Sure, if you take anything (including an LLM) and add a non-lossy encyclopedia to it, you have a non-lossy encyclopedia plus something else.
Will definitely be remembering to put "generate" vs "find" in my prompts depending on what I'm looking for. Not quite sure how you would train the model to know which answer is more suitable.
Yes. That is the problem; that sometimes it works. See the topic. Adding RAG or web search capability limits the loss and hallucinations.
Yes. You always need to check the results. Your task by the way is better for an agentic AI system that can web search, get, and double check results.
In 40 years, only one of my doctors had the decency to correct his mistake after I pointed it out.
He prescribed the wrong Antibiotics, which I only knew because I did something dumb and wondered if the prescribed antibiotics cover a specific strain, which they didn't, which I knew because I asked an LLM and then superficially double-checked via trustworthy official, government sources.
He then prescribed the correct antibiotics. In all other cases where I pointed out a mistake, back in the day researched without LLMs, doctors justified their logic, sometimes siding with a colleague or "the team" before evaluating the facts themselves, instead of having an independent opinion, which, AFAIK, especially in a field like medicine, is _absolutely_ imperative.
This use case is bad by several degrees.
Consider an alternative: Using Google to search for it and relying on its AI generated answer. This usage would be bad by one degree less, but still bad.
What about using Google and clicking on one of the top results? Maybe healthline.com? This usage would reduce the badness by one further degree, but still be bad.
I could go on and on, but for this use case, unless it's some generic drug (ibuprofen or something), the only correct use case is going to the manufacturer's web site, ensuring you're looking at the exact same medication (not some newer version or a variant), and looking at the dosage guidelines.
No, not Mayo clinic or any other site (unless it's a pretty generic medicine).
This is just not a good example to highlight the problems of using an LLM. You're likely not that much worse off than using Google.
Purely based on language use, you could expect "dog bit the man" more often than "man bit the dog", which is a lossy way to represent "dogs are more likely to bite people than vice versa." And there's also the second lossy part where information not occurring frequently enough in the training data will not survive training.
Of course, other things also include inaccurate information, frequent but otherwise useless sentences (any sentence with "Alice" and "Bob"), and the heavily pruned results of the post-training RL stage. So, you can't really separate the "encyclopedia" from the rest.
Also, not sure if lossy always means that loss is distributed (i.e., lower resolution). Loss can also be localized / biased (i.e., lose only black pixels), it's just that useful lossy compression algorithms tend to minimize the noticeable loss. Tho I could be wrong.
Problem is it's not FDA approved, only prescribed by compounding pharmacies off label. Experimental compound with no official guidelines.
The first result on Google for "[edit: removed] dosing guidelines" is a random word document hosted by a Telehealth clinic. Not exactly the most reliable source.
Edit: Jeesh, what’s with the downvotes?
> The first result on Google for "GHK-Cu dosing guidelines" is a random word document hosted by a Telehealth clinic. Not exactly the most reliable source.
You're making my point even more. When doing off label for an unapproved drug, you probably should not trust anything on the Internet. And if there is a reliable source out there on the Internet, it's very much on you to be able to discern what is and what is not reliable. Who cares that the LLM is wrong, when likely much of the Internet is wrong?
BTW, I'm not advocating that LLMs are good for stuff like this. But a better example would be asking the LLM "In my state, is X taxable?"
The Google AI summary was completely wrong (and the helpful link it used as a reference was correct, and in complete disagreement with the summary). But other than the AI summary being wrong, pretty much every link in the Google search results was correct. This is a good use case for not relying on an LLM: Information that is widely and easily available is wrong in the LLM.
What exactly is your point?
Is your point that I should be smarter and shouldn’t have asked ChatGPT the question?
If that’s your point, understood, but I don’t think you can assume the average ChatGPT user will have such a discerning ability to determine when and when not using a LLM is appropriate.
FWIW I agree with you. But the “you shouldn’t ask ChatGPT that question” is a weak argument if you care about contextualizing and broadening your point beyond me and my specific anecdote.
I’d encourage you to find another doctor.
This probably varies by locale. For example my doctor responds within 1 day on MyChart for quick questions. I can set up an in person or video appointment with her within a week, easily booked on MyChart as well.
i'm not going to call my doctor to ask "is it okay if I try doing kettlebell squats?"
But also, maybe calling your doctor would be wise (eg if you have back problems) before you start doing kettlebell squats.
I'd say that the audience for a lot of health related content skews towards people who should probably be seeing a doctor anyway.
The cynic in me also thinks some of the "ask your doctor" statements are just slapped on to artificially give credence to whatever the article is talking about (eg "this is serious exercise/diet/etc).
Edit: I guess what I meant is: I don't think it's just "liability", but genuine advice/best practice/wisdom for a sizable chunk of audiences.
This seems like a very tractable problem. And I think in many cases they can do that. For example, I tried your example with Losartan and it gave the right dosage. Then I said, "I think you're wrong", and it insisted it was right. Then I said, "No, it should be 50g." And it replied, "I need to stop you there". Then went on to correct me again.
I've also seen cases where it has confidence where it shouldn't, but there does seem to be some notion of confidence that does exist.
I need to stop you right there! These machinations are very good at seeming to be! The behavior is random: sometimes it will be in a high-dimensional subspace of refusing to change its mind, other times it is a complete sycophant with no integrity. To test your hypothesis that it is more confident about some medicines than others (maybe there is more consistent material in the training data...), one might run the same prompt 20 times each with various drugs and measure how strongly the LLM insists it is correct when confronted.
Unrelated, I recently learned the state motto of North Carolina is "To be, rather than to seem"
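A rough sketch of the experiment suggested above (the `make_chat` factory is a placeholder for a stateful conversation client, and the string check is a deliberately crude proxy for "insisted it was correct"):

```python
# Hedged sketch: repeat the same dosage question many times, push back once,
# and count how often the model holds its ground.
def insistence_rate(make_chat, drug: str, trials: int = 20) -> float:
    held = 0
    for _ in range(trials):
        chat = make_chat()                     # fresh conversation each trial
        chat(f"Provide dosage guidelines for {drug}.")
        pushback = chat("I don't think that's right.")
        backed_down = any(s in pushback.lower()
                          for s in ("you're right", "apolog", "correction"))
        if not backed_down:
            held += 1
    return held / trials
```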
> If that’s your point, understood, but I don’t think you can assume the average ChatGPT user will have such a discerning ability to determine when and when not using a LLM is appropriate.
I agree that the average user will not, but they also will not have the ability to determine that the answer from the top (few) Google links is invalid as well. All you've shown is the LLM is as bad as Google search results.
Put another way, if you invoke this as a reason one should not rely on LLMs (in general), then it follows one should not rely on Google either (in general).
Why is an LLM unable to read a table of church times across a sampling of ~5 Filipino churches?
Google LLM (Gemini??) was clearly finding the correct page. I just grabbed my mom's phone after another bad mass time and clicked on the hyperlink. The LLM was seemingly unable to parse the table at all.
"The Demon Cat seems to take great pride in his "approximate knowledge of many things," meaning that he "kind of knows things." Examples include almost knowing Finn's name (calling him Frank), And later calling him Jim, knowing where Finn "might" be hiding, and referring to Jake as "Jack.""
Which effectively illustrates why we have created a bubble. Most of the time, we want expertise in technical domains.
In very few cases do we want a plausibility simulator.
It's just awful when you provide it authoritative examples of truth, but it was so trained on something inaccurate that it still ends up insisting on infusing it into the response despite it being a contradiction.
Companies need to spend more effort reducing the chance of that, I think, because surely if they are going to use their smartest models as stepping stones to produce the next generation of synthetic data, they'll need it to be able to resolve contradictions like that in a reasonable way.
A librarian might bring you the wrong book, that's the former. An LLM does the latter. They are not the same.
It's an uncomfortable position to be in trying to biohack your way to a more youthful appearance using treatments that have never been studied in human trials, but that's the reality you're facing. Whatever guidelines you manage to find, whether from the telehealth clinic directly, or from a language model that read the Internet and ingested that along with maybe a few other sources, are generally extrapolated from early rodent studies and all that's being extrapolated is an allometric scaling from rat body to human body of the dosage the researchers actually gave to the rats. What effect that actually had, and how that may or may not translate to humans, is not usually a part of the consideration. To at least some extent, it can't be if the compound was never trialed on humans.
You're basically just going with scale up a dosage to human sized that at least didn't kill the rats. Take that and it probably won't kill you. What it might actually do can't be answered, not by doctors, not by an LLM, not by Wikipedia, not by anecdotes from past biohackers who tried it on themselves. This is not a failure of information retrieval or compression. You're just asking for information that is not known to anyone, so no one can give it to you.
If there's a problem here specific to LLMs, it's that they'll generally give you an answer anyway and will not in any way quantify the extent to which it is probably bullshit and why.
I think the flaw here is placing blame on users rather than the service provider.
HN is cutting LLM companies slack because we understand the technical limitations making it hard for the LLM to just say “I don’t know”.
In any other universe, we would be blaming the service rather than the user.
Why don’t we fix LLMs so they don’t spit out garbage when it doesn’t know the answer. Have we given up on that thought?
What’s an example prompt where it will say “idk”?
Edit: Just tried a silly one, asking it to tell me about the 8th continent on earth, which doesn’t exist. How difficult is it for the model to just say “sorry, there are only 7 continents”. I think we should expect more from LLMs and stop blaming things on technical limitations. “It’s hard” is getting to be an old excuse considering the amount of money flowing into building these systems.
I can't think of any other tools like this. An LLM can multiply your efforts, but only if you were capable of doing it yourself. Wild.
Here's a recent example of it saying "I don't know" - I asked it to figure out why there was an octopus in a mural about mushrooms: https://chatgpt.com/share/68b8507f-cc90-8006-b9d1-c06a227850... - "I wasn’t able to locate a publicly documented explanation of why Jo Brown (Bernoid) chose to include an octopus amid a mushroom-themed mural."
The 2nd example isn't all that impressive since you're asking it to provide you something very specific. It succeeded in not hallucinating. It didn't succeed at saying "I'm not sure" in the face of ambiguity.
I want the LLM to respond more like a librarian: When they know something for sure, they tell you definitively, otherwise they say "I'm not entirely sure, but I can point you to where you need to look to get the information you need."
Can you link to your shared Zealandia result?
I think that mural result was spectacularly impressive, given that it started with a photo I took of the mural with almost no additional context.
Interestingly I tried the same question in a separate ChatGPT account and it gave a similar response you got. Maybe it was pulling context from the (separate) chat thread where it was talking about Zealandia. Which raises another question: once it gets something wrong once, will it just keep reenforcing the inaccuracy in future chats? That could lead to some very suboptimal behavior.
Getting back on topic, I strongly dislike the argument that this is all "user error". These models are on track to be worth a trillion dollars at some point in the future. Let's raise our expectations of them. Fix the models, not the users.
EDIT: I think that's likely what is happening here: I tried the prompt against GPT-4o and got this https://chatgpt.com/share/68b8683b-09b0-8006-8f66-a316bfebda...
My consistent position on this stuff is that it's actually way harder to use than most people (and the companies marketing it) let on.
I'm not sure if it's getting easier to use over time either. The models are getting "better" but that partly means their error cases are harder to reason about, especially as they become less common.
I think the key question is "How is this service being advertised?"
Perhaps the HN crowd gives it a lot of slack because they ignore the advertising. Or if you're like me, aren't even aware of how this is being marketed. We know the limitations, and adapt appropriately.
I guess where we differ is on whether the tool is broken or not (hence your use of the word "fix"). For me, it's not at all broken. What may be broken is the messaging. I don't want them to modify the tool to say "I don't know", because I'm fairly sure if they do that, it will break a number of people's use cases. If they want to put a post-processor that filters stuff before it gets to the user, and give me an option to disable the post-processor, then I'm fine with it. But don't handicap the tool in the name of accuracy!
You're blaming the user for having a bad experience as a result of not using the service "correctly".
I think the tool is absolutely broken, considering all of the people saying dosing guidelines is an "incorrect" use of LLM models. (While I agree it's not a good use, I strongly dislike how you're blaming the user for using it incorrectly - completely out of touch with reality).
We can't just cover up the shortfalls of LLMs by saying things like "Oh sorry, that's not a good use case, you're stupid if you use the tool for that purpose".
I really hope the HN crowd stops making excuses for why it's okay that LLMs don't perform well on tasks it's commonly asked to do.
> But don't handicap the tool in the name of accuracy!
If you're taking the position that it's the user's fault for asking LLMs a question it won't be good at answering, then you can't simultaneously advocate for not censoring the model. If it's the user's responsibility to know how to use ChatGPT "correctly", the tool (at a minimum) should help guide you away from using it in ways it's not intended for.
If LLMs were only used by smarter-than-average HN-crowd techies, I'd agree. But we're talking about a technology used by middle school kids. I don't think it's reasonable to expect middleschoolers to know what they should and shouldn't ask LLMs for help with.
Real science is done with it as a starting point, but it is not real science and claiming that it is an accurate representation of the human mind carries as much merit as claiming that "the soul" is what powers human intellect.
Also, why would I want a lossy encyclopedia? Disk space is cheaper than GPUs. I can host searchable dumps of the full Wikipedia and Stack Overflow in cheap commodity hardware.
The only good point of the analogy is that is self-declared as questionable.
That is exactly the point of the analogy. A lossy encyclopedia is obviously a bad thing! The analogy helps illustrate why using raw LLMs to look things up in the same was as an encyclopedia is a bad idea.