
385 points | vessenes | 1 comment

So, LeCun has been quite public in saying that he believes LLMs will never fix hallucinations because, essentially, choosing one token at a time leads to runaway errors that can't be damped mathematically.
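
For what it's worth, here's a toy sketch of the compounding-error argument as I understand it (my own back-of-envelope version, not LeCun's actual formulation; it assumes each token's error is independent, which real models don't satisfy): if each token has some small probability of going "off track", the chance a long response stays fully on track decays exponentially with length.

    # Toy illustration of compounding per-token error (assumes
    # independence across tokens, which is a simplification).
    def p_on_track(per_token_error: float, n_tokens: int) -> float:
        """Probability an n-token autoregressive response contains no error."""
        return (1.0 - per_token_error) ** n_tokens

    for n in (10, 100, 1000):
        print(n, p_on_track(0.01, n))
    # 10   -> ~0.90
    # 100  -> ~0.37
    # 1000 -> ~0.00004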

Instead, he proposes an 'energy minimization' architecture; as I understand it, this would assign an 'energy' to an entire response, and training would try to minimize that.

Which is to say, I don't fully understand this. That said, I'm curious to hear what ML researchers think about LeCun's take, and whether there's any engineering being done around it. I can't find much after the release of I-JEPA from his group.

bravura ◴[] No.43368085[source]
Okay I think I qualify. I'll bite.

LeCun's argument is this:

1) You can't learn an accurate world model just from text.

2) Multimodal learning (vision, language, etc) and interaction with the environment is crucial for true learning.

He and people like Hinton and Bengio have been saying for a while that there are tasks mice can do that AI can't, and that even achieving mouse-level intelligence would be a breakthrough, one we cannot reach through language learning alone.

A simple example from "How Large Are Lions? Inducing Distributions over Quantitative Attributes" (https://arxiv.org/abs/1906.01327) is this: Learning the size of objects using pure text analysis requires significant gymnastics, while vision demonstrates physical size more easily. To determine the size of a lion you'll need to read thousands of sentences about lions, or you could look at two or three pictures.

LeCun isn't saying that LLMs aren't useful. He's just concerned with bigger problems, like AGI, which he believes cannot be solved purely through linguistic analysis.

The energy minimization architecture is more about joint multimodal learning.

(Energy minimization is a very old idea. LeCun has been on about it for a while and it's less controversial these days. Back when everyone tried to have a probabilistic interpretation of neural models, it was expensive to compute the normalization term / partition function. Energy minimization basically said: Set up a sensible loss and minimize it.)
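
(For the curious, here's a minimal sketch of that framing, my own toy example rather than LeCun's JEPA work: a network scores the compatibility of an (x, y) pair with a scalar energy, and training pushes energy down on observed pairs and up on mismatched ones via a margin loss, with no partition function to compute.)

    import torch, torch.nn as nn

    # Toy energy-based model: E(x, y) is a scalar "incompatibility" score.
    # Training lowers energy on real (x, y) pairs and raises it on
    # mismatched pairs (a contrastive / margin loss); no normalization
    # term is ever computed.
    energy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(energy.parameters(), lr=1e-3)
    margin = 1.0

    x = torch.randn(32, 4)            # observations
    y_pos = x @ torch.randn(4, 4)     # compatible targets (toy)
    y_neg = torch.randn(32, 4)        # incompatible targets

    e_pos = energy(torch.cat([x, y_pos], dim=1))
    e_neg = energy(torch.cat([x, y_neg], dim=1))
    loss = (e_pos + torch.relu(margin - e_neg)).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()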

replies(16): >>43368212 #>>43368251 #>>43368801 #>>43368817 #>>43369778 #>>43369887 #>>43370108 #>>43370284 #>>43371230 #>>43371304 #>>43371381 #>>43372224 #>>43372695 #>>43372927 #>>43373240 #>>43379739 #
ninetyninenine ◴[] No.43370108[source]
> 1) You can't learn an accurate world model just from text.

> 2) Multimodal learning (vision, language, etc) and interaction with the environment is crucial for true learning.

LLMs can be trained on multimodal data. Language is just tokens, and pixel and sound data can be encoded into tokens as well. All data can be serialized. You can train this thing on data we can't even comprehend.
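
As a concrete (if crude) illustration of "all data can be serialized", here's a byte-level scheme where text, image, and audio buffers all end up in one token stream. Real multimodal models use learned encoders (patch embeddings, audio codecs, etc.) rather than raw bytes; this is just to show the principle.

    import numpy as np

    # Crude byte-level "tokenizer": any modality becomes integers in
    # [0, 255], prefixed by a modality-marker token outside that range.
    MARKERS = {"text": 256, "image": 257, "audio": 258}

    def to_tokens(kind: str, data: bytes) -> list[int]:
        return [MARKERS[kind]] + list(data)

    text  = "a lion is big".encode("utf-8")
    image = np.random.randint(0, 255, size=(8, 8), dtype=np.uint8).tobytes()
    audio = np.random.randint(0, 255, size=16, dtype=np.uint8).tobytes()

    stream = to_tokens("text", text) + to_tokens("image", image) + to_tokens("audio", audio)
    print(len(stream), stream[:10])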

Here's the big question. It's clear we need less data than an LLM. But I think that's because evolution has pretrained our brains, so we come geared toward specific things: walking, talking, reading, in the same way a cheetah is geared toward ground speed more than toward flight.

If we placed a human and an LLM in a completely unfamiliar space and tried to train both on the same data, which would perform better?

And I mean a completely unfamiliar space: say, a non-Euclidean space where the only way to perceive it is sonar. Something totally foreign to reality as humans know it.

I honestly think the LLM would beat us in this environment. We might have already succeeded in creating AGI; it's just that the G is too much. It's so general that it learns everything from scratch and can't catch up to us.

Maybe what we need is to figure out how to give the AI the same biases humans have, so it thinks the way we do.

replies(3): >>43373437 #>>43377444 #>>43440653 #
1. MITSardine ◴[] No.43440653[source]
Humans are more adaptable than you think:

- echolocation in blind humans https://en.wikipedia.org/wiki/Human_echolocation

- sight through signals sent on tongue https://www.scientificamerican.com/article/device-lets-blind...

In the latter case, I recall reading that the people involved ended up perceiving these signals as a "first-order" sense (not consciously processed information, but intuitive, like hearing or vision).