
385 points by vessenes | 1 comment

So, LeCun has been quite public in saying that he believes LLMs will never fix hallucinations because, essentially, the token-choice method at each step leads to runaway errors -- these can't be damped mathematically.

Instead, he offers the idea that we should have something like an 'energy minimization' architecture; as I understand it, this would have a concept of the 'energy' of an entire response, and training would try to minimize that.

Which is to say, I don't fully understand this. That said, I'm curious to hear what ML researchers think about LeCun's take, and whether there's any engineering being done around it. I can't find much after the release of I-JEPA from his group.

bitwize ◴[] No.43366061[source]
Ever hear of Dissociated Press? If not, try the following demonstration.

Fire up Emacs and open a text file containing a lot of human-readable text. Something off Project Gutenberg, say. Then say M-x dissociated-press and watch it spew hilarious, quasi-linguistic garbage into a buffer for as long as you like.

Dissociated Press is a language model. A primitive, stone-knives-and-bearskins language model, but a language model nevertheless. When you feed it input text, it builds up a statistical model based on a Markov chain, assigning probabilities to each character that might occur next, given a few characters of input. If it sees 't' and 'h' as input, the most likely next character is probably going to be 'e', followed by maybe 'a', 'i', and 'o'. 'r' might find its way in there, but 'z' is right out. And so forth. It then uses that model to generate output text by picking characters at random given the past n input characters, resulting in a firehose of things that might be words or fragments of words, but don't make much sense overall.
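
A toy version of that model fits in a few lines of Python. This is just a sketch of the idea, not the actual Emacs implementation:

    import random
    from collections import defaultdict, Counter

    def build_model(text, order=3):
        # Count which character follows each `order`-character context.
        counts = defaultdict(Counter)
        for i in range(len(text) - order):
            counts[text[i:i + order]][text[i + order]] += 1
        return counts

    def generate(model, seed, order=3, length=300):
        out = list(seed)
        for _ in range(length):
            context = "".join(out[-order:])
            followers = model.get(context)
            if not followers:
                break  # context never seen in the training text
            chars, weights = zip(*followers.items())
            out.append(random.choices(chars, weights=weights)[0])
        return "".join(out)

    # text = open("some_gutenberg_book.txt").read()
    # print(generate(build_model(text), seed="The "))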

LLMs are doing the same thing. They're picking the next token (word or word fragment) given a certain number of previous tokens. And that's ALL they're doing. The only differences are really matters of scale: the tokens are larger than single characters, the model considers many, many more tokens of input, and the model is a huge deep-learning model with oodles more parameters than a simple Markov chain.

So while Dissociated Press churns out obvious nonsensical slop, ChatGPT churns out much, much more plausible-sounding nonsensical slop. But it's still just rolling the dice over and over and choosing from among the top candidates for "most plausible-sounding next token" according to its actuarial tables. It doesn't think. Any thinking it appears to do has been pre-done by humans, whose thoughts are then harvested off the internet and used to perform macrodata refinement on the statistical model.

Accordingly, if you ask ChatGPT a question, it may well be right a lot of the time. But when it's wrong, it doesn't know it's wrong, and it doesn't know what to do to make things right. Because it's just reaching into a bag of refrigerator magnet poetry tiles, weighted by probability of sounding good given the current context, and slapping whatever it finds onto the refrigerator. Over and over.
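
In pseudocode, that dice-rolling loop is roughly the following, with score_next_token standing in for the model's actual forward pass (the names are made up, not any real library's API):

    import math, random

    def sample_next(tokens, score_next_token, k=40, temperature=1.0):
        scores = score_next_token(tokens)  # {token: raw score} over the vocabulary
        top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
        weights = [math.exp(s / temperature) for _, s in top]
        return random.choices([t for t, _ in top], weights=weights)[0]

    def generate(prompt_tokens, score_next_token, max_new=100):
        tokens = list(prompt_tokens)
        for _ in range(max_new):
            # Commit to one token at a time; earlier choices are never revisited.
            tokens.append(sample_next(tokens, score_next_token))
        return tokens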

What I think Yann LeCun means by "energy" above is "implausibility". That is, the LLM would instead grab a fistful of tiles -- enough to form many different responses -- and from those, start with a single candidate response and then, through gradient descent or something like it, optimize it to minimize some statistical "bullshit function" over the entire response, rather than just choosing the most plausible single tile each go. Even that may not fix the hallucination issue, but it may produce results with fewer obvious howlers.
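
In code-shaped terms, one possible reading of that (and it's only my reading, not anything from LeCun's papers) is scoring whole candidate responses and keeping the least implausible one. propose_responses and energy are made-up placeholder functions:

    def decode_by_energy(prompt, propose_responses, energy, n_candidates=16):
        # Draft several complete answers, then keep the one whose overall
        # "implausibility" (energy) is lowest, instead of committing token by token.
        candidates = propose_responses(prompt, n_candidates)
        return min(candidates, key=lambda response: energy(prompt, response))

The real proposal presumably learns the energy function and optimizes in a continuous representation space rather than re-ranking discrete drafts, but the re-ranking version captures the "score the whole response, not one tile at a time" idea.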

replies(1): >>43367185 #
vhantz ◴[] No.43367185[source]
+1

But there's a fundamental difference between Markov chains and transformers that should be noted. Markov chains only learn how likely it is for one token to follow another. Transformers learn how likely it is for a set of tokens to be seen together. Transformers add a wider context to the Markov chain. That quantitative change leads to a qualitative improvement: transformers generate text that is semantically plausible.
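
The difference in what's being conditioned on looks roughly like this (hypothetical function names, just to show the shape of the two conditionals):

    def markov_next_prob(bigram_counts, prev_token, next_token):
        # Markov chain: P(next | the single previous token), read off pairwise counts.
        row = bigram_counts[prev_token]
        return row[next_token] / sum(row.values())

    def transformer_next_prob(model, context_tokens, next_token):
        # Transformer: P(next | the entire preceding context), from a learned model
        # that attends over all of context_tokens at once.
        return model(context_tokens)[next_token]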

replies(1): >>43367848 #
1. wnoise ◴[] No.43367848[source]
Yes, but k-token lookup was already a thing with Markov chains. Transformers are indeed better, but only because they model the language distribution better than a mostly-empty array of size (token count)^(context length).
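
To put a number on how empty those arrays get, assuming a 50,000-token vocabulary (a made-up but typical figure):

    # Size of an order-k Markov transition table over a 50,000-token vocabulary.
    # Nearly all of these contexts never occur in any real corpus.
    vocab = 50_000
    for k in range(1, 5):
        print(f"order {k}: {vocab ** k:.3e} possible contexts")
    # order 1: 5.000e+04
    # order 2: 2.500e+09
    # order 3: 1.250e+14
    # order 4: 6.250e+18

A transformer with a long context window never enumerates those contexts at all; it compresses the conditioning into a fixed set of learned parameters, which is the "models the distribution better" part.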