
385 points | vessenes | 1 comment

So, LeCun has been quite public in saying that he believes LLMs will never fix hallucinations because, essentially, the token-by-token choice at each step leads to runaway errors -- these can't be damped mathematically.

In its place, he offers the idea that we should have something like an 'energy minimization' architecture; as I understand it, this would have a concept of the 'energy' of an entire response, and training would try to minimize that.

Which is to say, I don't fully understand this. That said, I'm curious to hear what ML researchers think about LeCun's take, and whether there's any engineering being done around it. I can't find much after the release of I-JEPA from his group.
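For what it's worth, here is the toy contrast I have in mind (just my own sketch -- the "model" and the energy function below are made-up stand-ins, not LeCun's actual architecture): greedy decoding commits to one token at a time, while an energy view scores whole candidate responses and keeps the lowest-energy one.

    # Toy contrast only -- my own sketch, not LeCun's actual architecture;
    # token_prob() and energy() are made-up stand-ins for a real model.
    import math
    import random

    VOCAB = ["the", "cat", "sat", "on", "mat", "ran", "away"]

    def token_prob(prefix, token):
        # Stand-in for a language model's next-token score.
        random.seed(hash((tuple(prefix), token)) % (2 ** 32))
        return random.random()

    def greedy_decode(length=5):
        # (a) Autoregressive: commit to the locally best token at each step.
        # An early bad choice is never revisited, so errors can compound.
        out = []
        for _ in range(length):
            out.append(max(VOCAB, key=lambda t: token_prob(out, t)))
        return out

    def energy(response):
        # (b) Hypothetical energy of a *whole* response, lower is better.
        # Here just a negative sum of log-scores, purely as a placeholder.
        return -sum(math.log(token_prob(response[:i], t) + 1e-9)
                    for i, t in enumerate(response))

    def energy_decode(num_candidates=50, length=5):
        # Score complete candidate responses and keep the lowest-energy one,
        # instead of committing token by token.
        candidates = [[random.choice(VOCAB) for _ in range(length)]
                      for _ in range(num_candidates)]
        return min(candidates, key=energy)

    print("greedy:", greedy_decode())
    print("energy:", energy_decode())

The attraction, as I read it, is that the objective is defined over the whole response rather than over each next token, so a single bad early step isn't locked in.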

1. prats226 | No.43369868
I feel like the success of LLMs has been a combination of multiple factors coming together favourably:

1) Hardware becoming cheap enough to train models beyond the size where we start to see emergent properties -- and it keeps getting cheaper.

2) A model architecture that can look at all inputs at the same time in a computationally cheap way (see the toy sketch below). CNNs and RNNs succeeded at smaller scale because they added inductive bias favourable to the input modality, but that also made them less generic. Attention is computationally simpler to scale and has lower inductive bias.

3) Unsupervised text from the internet as the data source: it needs only light pre-processing and almost no annotation effort, so it reaches the scale that scaling laws demand for large models. Text is also diverse enough to cover a huge variety of topics and thoughts, versus something like ImageNet, which is highly specific and costly to produce.
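To make point 2 concrete, a toy sketch (my own illustration in plain NumPy, not any particular model's code): attention computes all pairwise token interactions in one matrix multiply, while an RNN has to walk the sequence one step at a time.

    # Toy sketch of point 2 -- my own illustration, not any model's real code.
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # Every position attends to every other position in one shot:
        # an (n, n) score matrix from a single matrix multiply.
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        return softmax(scores) @ V

    def rnn_step_through(X, W_h, W_x):
        # Sequential baseline: the state at step t depends on step t-1,
        # so the sequence must be processed one token at a time.
        h = np.zeros(W_h.shape[0])
        for x in X:
            h = np.tanh(W_h @ h + W_x @ x)
        return h

    n, d = 6, 4  # toy sequence length and hidden width
    X = np.random.randn(n, d)
    print(attention(X, X, X).shape)    # (6, 4), all positions handled in parallel
    print(rnn_step_through(X, np.random.randn(d, d), np.random.randn(d, d)).shape)  # (4,)

The all-at-once formulation parallelizes trivially on modern hardware, whereas the recurrent one can't avoid the step-by-step dependency -- which is a big part of why attention scaled so well.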

Assuming that text-only models hit a bottleneck, then for the next generation of models we need not just a new architecture but also a dataset that is even more generic and much richer in modalities, plus an architecture that can natively ingest it?

However, something that is not predictable is how much further the emergent properties can scale with model size. Maybe a few more unlocks -- the model retaining information well despite a really long context, or the ability to SFT on super complex reasoning tasks without disturbing the weights enough to lose the unsupervised learning -- might take us much further?