
385 points vessenes | 6 comments

So, LeCun has been quite public in saying that he believes LLMs will never fix hallucinations because, essentially, the token-by-token choice at each step leads to runaway errors -- errors that can't be damped mathematically.

Instead, he proposes an 'energy minimization' architecture; as I understand it, this would have a concept of the 'energy' of an entire response, and training would try to minimize that.

Which is to say, I don't fully understand this. That said, I'm curious to hear what ML researchers think about LeCun's take, and whether there's any engineering being done around it. I can't find much after the release of I-JEPA from his group.

ActorNightly ◴[] No.43325670[source]
Not an official ML researcher, but I do happen to understand this stuff.

The problem with LLMs is that the output is inherently stochastic - i.e. there isn't an "I don't have enough information" option. This is because LLMs are basically just giant lookup maps with interpolation.

Energy minimization is more of an abstract approach in which you can use architectures that don't rely on things like differentiability. True AI won't be solely feedforward architectures like current LLMs. To give an answer, they will basically determine an algorithm on the fly that includes computation and search. To learn that algorithm (or its parameters) at training time, you need something that doesn't rely on continuous values but still converges to the right answer. So instead you assign a fitness score, like memory use or compute cycles, and optimize based on that. This is basically how search works with genetic algorithms or PSO.
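To make that concrete, here's a minimal sketch in Python of fitness-driven, gradient-free search in the spirit of a genetic algorithm -- the toy "program", the fitness function, and the compute-cost proxy are all made up for illustration, not from any real system:

    import random

    TARGET = 42  # toy "task": find parameters whose program output lands on this value

    def run_program(params):
        # Toy stand-in for "an algorithm determined on the fly": a few discrete steps.
        x, steps = params
        for _ in range(steps):
            x = x * 2 + 1
        return x, steps  # program output, plus a crude compute-cost proxy

    def fitness(params):
        output, cost = run_program(params)
        return -abs(output - TARGET) - 0.1 * cost  # higher is better

    def mutate(params):
        x, steps = params
        return (x + random.choice([-1, 0, 1]),
                max(0, steps + random.choice([-1, 0, 1])))

    population = [(random.randint(0, 10), random.randint(0, 5)) for _ in range(20)]
    for _ in range(50):
        population.sort(key=fitness, reverse=True)
        parents = population[:5]  # keep the fittest, discard the rest
        population = parents + [mutate(random.choice(parents)) for _ in range(15)]

    population.sort(key=fitness, reverse=True)
    print("best params:", population[0], "fitness:", fitness(population[0]))

No gradients anywhere -- selection pressure on the fitness score does all the work, which is the property being pointed at.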

replies(10): >>43365410 #>>43366234 #>>43366675 #>>43366830 #>>43366868 #>>43366901 #>>43366902 #>>43366953 #>>43368585 #>>43368625 #
seanhunter ◴[] No.43365410[source]
> The problem with LLMs is that the output is inherently stochastic - i.e. there isn't an "I don't have enough information" option. This is because LLMs are basically just giant lookup maps with interpolation.

I don't think this explanation is correct. The output at the end of the decoder stack, after all the attention heads etc. (as I understand it), is a probability distribution over tokens. So the model as a whole does have the ability to signal low confidence in something by assigning it a low probability.

The problem is that that distribution is over tokens (parts of words). So the LLM can say "I don't have enough information" about the next part of a word, but it has no ability to say "I don't know what on earth I'm talking about" in general, i.e. not tied to any particular token.
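To illustrate, a toy example with a made-up four-token vocabulary and made-up logits (not from any actual model): the per-step distribution can be nearly flat, which is a "low confidence" signal, but it is only ever about the next token:

    import math

    vocab = ["Paris", "London", "Rome", "idk"]
    logits = [2.1, 1.9, 1.8, 0.2]           # raw scores for the *next* token only

    exps = [math.exp(z) for z in logits]     # softmax: exponentiate and normalise
    probs = [e / sum(exps) for e in exps]

    # entropy of the next-token distribution: a per-step uncertainty signal
    entropy = -sum(p * math.log(p) for p in probs)

    for tok, p in zip(vocab, probs):
        print(f"{tok:>8}: {p:.3f}")
    print(f"next-token entropy: {entropy:.3f} nats")

High entropy here means "unsure about the next word", not "unsure whether the whole answer is grounded" -- which is exactly the gap being described.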

replies(5): >>43365608 #>>43365655 #>>43365953 #>>43366351 #>>43366485 #
1. estebarb ◴[] No.43365608[source]
The problem is exactly that: the probability distribution. The network has no way to say: 0% for everything, this is nonsense, backtrack everything.

Other architectures, like energy-based models or Bayesian ones, can assess uncertainty. Transformers simply cannot do it (yet). Yes, there are ways to do it, but we are already spending millions to get coherent phrases; few will burn billions to train a model that can make that kind of assessment.
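A toy illustration of why (numbers made up): softmax renormalises whatever logits it gets, so even if the model scores every token as terrible, the result still sums to 1 and some token gets emitted:

    import math

    def softmax(logits):
        exps = [math.exp(z) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]

    confident = softmax([9.0, 0.0, 0.0, 0.0])          # one clear winner
    clueless  = softmax([-50.0, -50.1, -49.9, -50.2])  # "everything looks wrong"

    print(round(sum(confident), 6), round(sum(clueless), 6))  # both 1.0
    print(round(max(clueless), 2))  # ~0.29 -- a token still has to be picked

There is no slot in that output for "reject the whole trajectory".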

replies(1): >>43365684 #
2. ortsa ◴[] No.43365684[source]
Has anybody ever messed with adding a "backspace" token?
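For concreteness, a hypothetical decode loop where one reserved id acts as a backspace and pops the previous token instead of appending -- the BKSP id, the EOS id, and the model_step callable are placeholders for this sketch, not from any published system:

    BKSP = 0  # reserved id for the hypothetical backspace token
    EOS = 1   # end-of-sequence id (also just a placeholder)

    def decode(model_step, prompt_ids, max_len=64):
        # model_step(ids) -> next token id; any callable stands in for a real model
        out = list(prompt_ids)
        for _ in range(max_len):
            tok = model_step(out)
            if tok == EOS:
                break
            if tok == BKSP:
                if len(out) > len(prompt_ids):  # never erase the prompt itself
                    out.pop()                   # undo the previous token
            else:
                out.append(tok)
        return out

Training a model to emit that token usefully is the hard part; the reply below links to prior work on exactly this.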
replies(1): >>43365782 #
3. refulgentis ◴[] No.43365782[source]
Yes. (https://news.ycombinator.com/item?id=36425375, believe there's been more)

There's quite an intense backlog of new stuff that hasn't made it to prod. (In 2023 I would have told you that we'd have, e.g., switched to Mamba-like architectures in at least one leading model by now.)

Broadly, it's probably unhelpful that:

- absolutely no one wants the PR hit of releasing a model that isn't competitive with the latest peers

- absolutely everyone wants to release an incremental improvement, yesterday

- entities with no PR constraint, and no revenue repercussions when reallocating funds from surely-productive to experimental, don't show a significant improvement in results for the new things they try (I'm thinking of, e.g., the Allen Institute)

Another odd property I can't quite wrap my head around: the battlefield is littered with corpses that eval okay-ish and should give order-of-magnitude improvements in some areas (I'm thinking of RWKV, and how it should be faster at inference), yet they're not really in the conversation either.

Makes me think either A) I'm getting old and don't really understand ML from a technical perspective anyway, or B) hey, I've been maintaining a llama.cpp wrapper that works on every platform for a year now, so I should trust my instincts: the real story is that UX is king, and none of these things actually improve the experience of a user even if the benchmarks are roughly equal.

replies(2): >>43365962 #>>43367533 #
4. vessenes ◴[] No.43365962{3}[source]
For sure read Stephenson's essay on path dependence; it lays out a lot of these economic and social dynamics. TL;DR: most likely we will need a major improvement before something novel picks up steam.
replies(1): >>43367515 #
5. Ericson2314 ◴[] No.43367515{4}[source]
Yeah, everyone spending way too much money on things we barely understand is a recipe for insane path dependence.
6. ortsa ◴[] No.43367533{3}[source]
Oh yeah, that's exactly what I was thinking of! Seems like it would be very useful for expert models in domains with more definite "edges" (if I'm understanding it right).

As for the fragmentation of progress, I guess that's just par for the course for any tech with such a heavy private/open-source split. It would take a huge amount of work to trawl through this constant stream of 'breakthroughs' and put them all together.