
385 points | vessenes

So, LeCun has been quite public in saying that he believes LLMs will never stop hallucinating because, essentially, picking one token at a time lets errors compound step by step, and those errors can't be damped mathematically.
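
A back-of-the-envelope sketch of that compounding argument (my own toy model, with illustrative numbers and an independence assumption made purely for simplicity): if each token is sampled "correctly" with probability 1 - e, the chance an n-token answer never steps off track shrinks exponentially with n.

    # Toy model of the divergence argument: per-token error rate e,
    # token errors assumed independent purely for illustration.
    for e in (0.01, 0.001):
        for n in (100, 1000):
            p_on_track = (1 - e) ** n
            print(f"e={e}, n={n}: P(no error anywhere) = {p_on_track:.3f}")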

Instead, he proposes that we should have something like an 'energy minimization' architecture; as I understand it, this would have a concept of the 'energy' of an entire response, and training would try to minimize that.

Which is to say, I don't fully understand this. That said, I'm curious to hear what ML researchers think about LeCun's take, and whether there's any engineering being done around it. I can't find much after the release of I-JEPA from his group.

bravura | No.43368085
Okay I think I qualify. I'll bite.

LeCun's argument is this:

1) You can't learn an accurate world model just from text.

2) Multimodal learning (vision, language, etc) and interaction with the environment is crucial for true learning.

He and people like Hinton and Bengio have been saying for a while that there are tasks that mice can understand but an AI can't. Even achieving mouse-level intelligence would be a breakthrough, but we cannot get there through language learning alone.

A simple example from "How Large Are Lions? Inducing Distributions over Quantitative Attributes" (https://arxiv.org/abs/1906.01327) is this: learning the size of objects through pure text analysis requires significant gymnastics, while vision conveys physical size directly. To determine the size of a lion you'd need to read thousands of sentences about lions, or you could look at two or three pictures.
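
To make that sample-efficiency gap concrete, here's a toy simulation (not the paper's method; every number below is invented): treat each sentence as a very noisy estimate of a lion's shoulder height, and a photo with a known reference object as a nearly direct measurement.

    import random, statistics

    random.seed(0)
    true_height_cm = 110  # invented "ground truth" for the toy example

    # Text route: each sentence yields a very noisy hint about the size.
    text_mentions = [random.gauss(true_height_cm, 40) for _ in range(1000)]
    print("1000 sentences ->", round(statistics.mean(text_mentions)), "cm")

    # Vision route: one photo with a reference object is nearly direct.
    print("1 photo ->", round(random.gauss(true_height_cm, 5)), "cm")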

LeCun isn't saying that LLMs aren't useful. He's just concerned with bigger problems, like AGI, which he believes cannot be solved purely through linguistic analysis.

The energy minimization architecture is more about joint multimodal learning.

(Energy minimization is a very old idea. LeCun has been on about it for a while, and it's less controversial these days. Back when everyone tried to give neural models a probabilistic interpretation, it was expensive to compute the normalization term / partition function. Energy minimization basically said: set up a sensible loss and minimize it.)
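
A minimal sketch in that spirit (toy energy function; the margin loss here is one common contrastive choice, not necessarily LeCun's exact formulation): score (input, output) pairs with a scalar energy, train observed pairs to sit below mismatched ones, and never touch a partition function.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 4))  # toy parameters of the energy function

    def energy(x, y):
        # Scalar compatibility score: low energy = good (x, y) pair.
        return float(-x @ W @ y)

    def margin_loss(x, y_pos, y_neg, margin=1.0):
        # Contrastive objective: the observed pair should score at least
        # `margin` below a mismatched pair. No normalization term needed.
        return max(0.0, margin + energy(x, y_pos) - energy(x, y_neg))

    # Inference is just energy minimization over candidates.
    x = rng.normal(size=4)
    candidates = [rng.normal(size=4) for _ in range(5)]
    best = min(candidates, key=lambda y: energy(x, y))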

jcims | No.43369778
Over the last few years I’ve become exceedingly aware of how insufficient language really is. It feels like a 2D plane: no matter how many projections you attempt to create from it, they are ultimately limited in the fidelity of the information they transfer.

Just a lay opinion here, but to me each mode of input creates a new, largely orthogonal dimension for the network to grow into. The experience of your heel slipping on a cold sidewalk can be explained in clinical fashion, but an android’s association of that with the powerful dynamic response required to even attempt a recovery will give a newfound association and power to the word ‘slip’.

ninetyninenine | No.43369801
LLM is just the name. You can encode anything into the "language", including pictures, video, and sound.
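
For instance (a deliberately crude sketch; real systems learn the codebook, e.g. VQ-VAE-style tokenizers, but the principle is the same): chop an image into patches, map each patch to a discrete ID, and the "language" stream carries pixels and words alike.

    import numpy as np

    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(32, 32))  # stand-in grayscale image

    # Crude tokenizer: 8x8 patches, each quantized to one of 256 IDs by
    # mean brightness. Learned codebooks do this better; same principle.
    patches = image.reshape(4, 8, 4, 8).swapaxes(1, 2).reshape(-1, 64)
    image_tokens = [int(p.mean()) for p in patches]

    text_tokens = [101, 2023, 102]       # made-up text token IDs
    stream = text_tokens + image_tokens  # one token "language" for both
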
kryogen1c | No.43374022
> You can encode anything into the "language"

I'm just a layman here, but I don't think this is true. Language is an abstraction, an interpretive mechanism for reality. A reproduction of reality, like a picture, by definition holds more information than its abstraction does.

sturza | No.43374970
A picture is also an abstraction. If you take a picture of a tree, you have more details than the word "tree". What I think the parent is saying is that all the information in a picture of a tree can be encoded in language, for example as a description of the tree using words. Both are abstractions, but if you describe the tree well enough with text (and comprehend the description), it might have the same "value" as a picture (not for a human, but for a machine). Also, the text describing the tree might be smaller than the picture.
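
In the strictest sense this is trivially true: any picture can be serialized into text without losing a single bit, as the sketch below shows (the file path "tree.png" is hypothetical). Whether a plain-prose description carries the same value is the harder question.

    import base64

    # Lossless "picture as text": every bit of the image survives the trip.
    # "tree.png" is a hypothetical path, not a real file.
    with open("tree.png", "rb") as f:
        as_text = base64.b64encode(f.read()).decode("ascii")

    recovered = base64.b64decode(as_text)  # byte-identical to the original
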
kryogen1c | No.43375873
> all the information in a picture of a tree can be encoded in language

What words would you write that would identify this tree as uniquely as a picture does, distinguishing it from every other tree in the world?

Now repeat for everything in the picture, like the time of day, weather, dirt on the ground, etc.