
385 points | vessenes | 2 comments

So, LeCun has been quite public in saying that he believes LLMs will never fix hallucinations because, essentially, choosing tokens one step at a time leads to runaway errors -- errors that can't be damped mathematically.
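
To make that concrete (with a made-up per-token error rate, purely for illustration), the claim is roughly that independent per-token mistakes compound geometrically with sequence length:

    # Toy illustration of the compounding-error argument; e is a made-up number.
    e = 0.01  # hypothetical probability that any single sampled token is "wrong"
    for n in (10, 100, 1000):
        p_ok = (1 - e) ** n
        print(f"{n} tokens: P(no error) ~ {p_ok:.3g}")
    # -> 10 tokens: ~0.904, 100 tokens: ~0.366, 1000 tokens: ~4.3e-05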

Instead, he proposes something like an 'energy minimization' architecture; as I understand it, this would have a concept of the 'energy' of an entire response, and training would try to minimize that.
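
I'm not claiming this is LeCun's actual design, but the rough picture I have is a scalar 'energy' scored over a whole (prompt, response) pair, with generation as minimization over responses rather than token-by-token sampling. A toy sketch, where energy() stands in for a learned network:

    # Toy sketch only -- not LeCun's architecture. energy() is a placeholder
    # for a learned scorer that assigns low energy to good whole responses.
    def energy(prompt: str, response: str) -> float:
        # Hypothetical learned model; here just a dummy heuristic.
        return abs(len(response.split()) - 20)

    def generate(prompt: str, candidates: list[str]) -> str:
        # Generation becomes search/minimization over entire responses.
        return min(candidates, key=lambda r: energy(prompt, r))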

Which is to say, I don't fully understand this. That said, I'm curious to hear what ML researchers think about LeCun's take, and whether there's any engineering work being done around it. I can't find much after the release of I-JEPA from his group.

hnfong ◴[] No.43365660[source]
I'm not an insider and I'm not sure whether this is directly related to "energy minimization", but "diffusion language models" have apparently gained some popularity in recent weeks.

https://arxiv.org/abs/2502.09992

https://www.inceptionlabs.ai/news

(these are results from two different teams/orgs)

It sounds kind of like what you're describing, and nobody else has mentioned it yet, so take a look and see whether it's relevant.

replies(1): >>43366132 #
hnuser123456 ◴[] No.43366132[source]
And they seem to be about 10x as fast as similarly sized transformers.
replies(1): >>43366608 #
317070 ◴[] No.43366608[source]
No, 10x fewer sampling steps. Whether or not that means 10x faster remains to be seen, as a diffusion step tends to be more expensive than an autoregressive step.
replies(1): >>43366804 #
1. littlestymaar ◴[] No.43366804[source]
If I understood correctly, in practice they show actual speed improvements on high-end cards, because autoregressive LLMs are bandwidth-limited rather than compute-bound, so switching to an approach that is more compute-expensive but less memory-bandwidth-heavy works well on current hardware.
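
Rough roofline-style arithmetic for why that can hold (all numbers invented, nothing measured):

    # Illustrative numbers only, not benchmarks.
    params_bytes = 14e9   # e.g. a 7B-parameter model in fp16
    bandwidth    = 2e12   # ~2 TB/s of HBM bandwidth
    flops_peak   = 1e15   # ~1 PFLOP/s of fp16 compute
    tokens       = 256

    # Autoregressive: each token requires reading (roughly) all weights once,
    # so at batch size 1 decoding is bandwidth-bound.
    ar_time = tokens * params_bytes / bandwidth

    # Diffusion-style: say ~10x fewer steps, each processing all positions at
    # once, so each step pays whichever of bandwidth/compute is the bottleneck.
    steps = tokens // 10
    flops_per_step = 2 * 7e9 * tokens
    diff_time = steps * max(params_bytes / bandwidth, flops_per_step / flops_peak)

    print(f"autoregressive ~ {ar_time:.2f}s, diffusion ~ {diff_time:.2f}s")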
replies(1): >>43369185 #
2. AlexCoventry ◴[] No.43369185[source]
The SEDD architecture [1] probably allows for parallel sampling of all tokens in a block at once, which may be faster in wall-clock time but not necessarily cheaper in total compute (runtime times resources used).

[1] Which Inception Labs's new models may be based on; one of the cofounders is a co-author. See equations 18-20 in https://arxiv.org/abs/2310.16834
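
Purely for intuition (this is schematic, not the actual SEDD sampler), the difference in model calls looks like: one call per token autoregressively, versus one call per denoising step, where each call proposes updates for every masked position in the block at once:

    # Schematic only; not the sampler from the SEDD paper.
    import random

    VOCAB, MASK = ["the", "cat", "sat", "on", "mat"], "<mask>"

    def model_propose(block):
        # Stand-in for one forward pass that scores every position in parallel.
        return [random.choice(VOCAB) if t == MASK else t for t in block]

    def block_sample(block_len=16, steps=4):
        block = [MASK] * block_len
        for _ in range(steps):
            proposal = model_propose(block)   # one call updates the whole block
            # Toy unmasking schedule: fill each masked slot with prob 1/steps.
            block = [p if b == MASK and random.random() < 1 / steps else b
                     for p, b in zip(proposal, block)]
        return model_propose(block)           # final call fills any leftovers
    # ~steps+1 model calls per block, vs block_len calls done autoregressively.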