
385 points by vessenes | 48 comments

So, LeCun has been quite public in saying that he believes LLMs will never fix hallucinations because, essentially, the token choice method at each step leads to runaway errors -- these can't be damped mathematically.
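A rough back-of-the-envelope version of that compounding argument (my own sketch with a made-up per-token error rate, not LeCun's exact math):

    # If each generated token independently "goes wrong" with probability eps,
    # the chance that an n-token answer stays error-free decays exponentially:
    # errors accumulate instead of damping out.
    eps = 0.01  # hypothetical per-token error rate
    for n in (10, 100, 1000):
        p_ok = (1 - eps) ** n
        print(f"{n:5d} tokens: P(no error) ~ {p_ok:.5f}")
    # -> 10 tokens: ~0.90, 100 tokens: ~0.37, 1000 tokens: ~0.00004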

Instead, he offers the idea that we should have something that is an 'energy minimization' architecture; as I understand it, this would have a concept of the 'energy' of an entire response, and training would try to minimize that.

Which is to say, I don't fully understand this. That said, I'm curious to hear what ML researchers think about LeCun's take, and whether there's any engineering being done around it. I can't find much after the release of I-JEPA from his group.

bravura ◴[] No.43368085[source]
Okay I think I qualify. I'll bite.

LeCun's argument is this:

1) You can't learn an accurate world model just from text.

2) Multimodal learning (vision, language, etc) and interaction with the environment is crucial for true learning.

He and people like Hinton and Bengio have been saying for a while that there are tasks a mouse can understand that an AI can't, and that even achieving mouse-level intelligence would be a breakthrough, but that we cannot get there through language learning alone.

A simple example from "How Large Are Lions? Inducing Distributions over Quantitative Attributes" (https://arxiv.org/abs/1906.01327) is this: Learning the size of objects using pure text analysis requires significant gymnastics, while vision demonstrates physical size more easily. To determine the size of a lion you'll need to read thousands of sentences about lions, or you could look at two or three pictures.

LeCun isn't saying that LLMs aren't useful. He's just concerned with bigger problems, like AGI, which he believes cannot be solved purely through linguistic analysis.

The energy minimization architecture is more about joint multimodal learning.

(Energy minimization is a very old idea. LeCun has been on about it for a while and it's less controversial these days. Back when everyone tried to have a probabilistic interpretation of neural models, it was expensive to compute the normalization term / partition function. Energy minimization basically said: Set up a sensible loss and minimize it.)
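To make "set up a sensible loss and minimize it" concrete, here is a minimal toy sketch of an energy-based objective (my own illustration in PyTorch with made-up dimensions, not LeCun's actual JEPA architecture): score (x, y) pairs with a scalar energy, push observed pairs down and mismatched pairs up.

    import torch
    import torch.nn as nn

    class EnergyModel(nn.Module):
        def __init__(self, x_dim, y_dim, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(x_dim + y_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1))

        def forward(self, x, y):
            # one scalar "energy" per (x, y) pair: low = compatible
            return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

    model = EnergyModel(x_dim=16, y_dim=16)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    margin = 1.0

    for step in range(100):
        x = torch.randn(32, 16)            # stand-in "context" batch
        y = x.clone()                      # stand-in compatible targets
        y_bad = y[torch.randperm(32)]      # mismatched (negative) targets
        e_pos = model(x, y)                # want these energies low
        e_neg = model(x, y_bad)            # want these pushed above the margin
        loss = e_pos.mean() + torch.relu(margin - e_neg).mean()
        opt.zero_grad(); loss.backward(); opt.step()

The "energy" is just a learned scalar compatibility score; at inference you would pick (or search for) the response y with the lowest energy for a given x, rather than sampling it token by token.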

replies(16): >>43368212 #>>43368251 #>>43368801 #>>43368817 #>>43369778 #>>43369887 #>>43370108 #>>43370284 #>>43371230 #>>43371304 #>>43371381 #>>43372224 #>>43372695 #>>43372927 #>>43373240 #>>43379739 #
1. codenlearn ◴[] No.43368251[source]
Doesn't Language itself encode multimodal experiences? Take this case: when we write text, we have the skill and opportunity to encode the visual, tactile, and other sensory experiences into words. And the fact is, LLMs trained on massive text corpora are indirectly learning from human multimodal experiences translated into language. This might be less direct than firsthand sensory experience, but potentially more efficient by leveraging human-curated information. Text can describe simulations of physical environments. Models might learn physical dynamics through textual descriptions of physics, video game logs, scientific papers, etc. A sufficiently comprehensive text corpus might contain enough information to develop reasonable physical intuition without direct sensory experience.

As I'm typing this, one thing is becoming clear to me: the quality and completeness of the data fundamentally determine how well an AI system will work. With text alone this is hard to achieve, so a multimodal experience is a must.

Thank you for explaining it in simple enough terms that I could understand.

replies(7): >>43368477 #>>43368489 #>>43368509 #>>43368574 #>>43368699 #>>43370974 #>>43373409 #
2. furyofantares ◴[] No.43368477[source]
> Doesn't Language itself encode multimodal experiences?

When communicating between two entities with similar brains who have both had many thousands of hours of similar types of sensory experiences, yeah. When I read text I have a lot more than other text to relate it to in my mind; I bring to bear my experiences as a human in the world. The author is typically aware of this and effectively exploits this fact.

3. danielmarkbruce ◴[] No.43368489[source]
> Doesn't Language itself encode multimodal experiences

Of course it does. We immediately encode pictures/words/everything into vectors anyway. In practice we don't have great text datasets to describe many things in enough detail, but there isn't any reason we couldn't.

replies(1): >>43369425 #
4. not2b ◴[] No.43368509[source]
I'm reminded of the story of Helen Keller, and how it took a long time for her to realize that the symbols her teacher was signing into her hand had meaning, as she was blind and deaf and only experienced the world via touch and smell. She didn't get it until her teacher spelled the word "water" as water from a pump was flowing over her hand. In other words, a multimodal experience. If the model only sees text, it can appear to be brilliant but is missing a lot. If it's also fed other channels, if it can (maybe just virtually) move around, if it can interact, the way babies do, learning about gravity by dropping things and so forth, it seems that there's lots more possibility to understand the world, not just to predict what someone will type next on the Internet.
replies(2): >>43368558 #>>43369222 #
5. PaulDavisThe1st ◴[] No.43368558[source]
at least a few decades ago, this idea was called "embodied intelligence" or "embodied cognition". just FYI.
replies(1): >>43368846 #
6. ThinkBeat ◴[] No.43368574[source]
No.

> The sun feels hot on your skin.

No matter how many times you read that, you cannot understand what the experience is like.

> You can read a book about Yoga and read about the Tittibhasana pose

But by just reading you will not understand what it feels like. And unless you are in great shape, with great balance, you will fail for a while before you get it right (which is only human).

I have read what shooting up with heroin feels like, from a few different sources. I am certain that I will have no real idea unless I try it (and I don't want to do that).

Waterboarding. I have read about it. I have seen it on TV. I am certain that is all abstract compared to having someone actually do it to you.

Hand-eye coordination, balance, color, taste, pain, and so on: how we encode things draws on all our senses, our state of mind, and our experiences up until that time.

We also forget and change what we remember.

Many songs take me back to a certain time, a certain place, a certain feeling. Taste is the same. Location too.

The way we learn and the way we remember things is incredibly more complex than text.

But if you have shared experiences, then when you write about them, other people will know. Most people have felt the sun hot on their skin.

To different extents this is also true for animals. Now, I don't think most mice can read, but they do learn with many different senses, and remember some combination or permutation.

replies(6): >>43369173 #>>43369490 #>>43370066 #>>43370431 #>>43373489 #>>43440558 #
7. mystified5016 ◴[] No.43368699[source]
Imagine I give you a text of any arbitrary length in an unknown language with no images. With no context other than the text, what could you learn?

If I told you the text contained a detailed theory of FTL travel, could you ever construct the engine? Could you even prove it contained what I told you?

Can you imagine that given enough time, you'd recognize patterns in the text? Some sequences of glyphs usually follow other sequences, eventually you could deduce a grammar, and begin putting together strings of glyphs that seem statistically likely compared to the source.

You can do all the analysis you like and produce text that matches the structure and complexity of the source. A speaker of that language might even be convinced.

At what point do you start building the space ship? When do you realize the source text was fictional?

There are many human languages across history whose scripts went undeciphered. Famously, ancient Egyptian hieroglyphs: we had lots and lots of source text, but all context relating the text to the world had been lost. It wasn't until we found a translation on the Rosetta Stone that we could understand the meaning of the language.

Text alone has historically proven not to be enough for humans to extract meaning from an unknown language. Machines might hypothetically change that, but I'm not convinced.

Just think of how much effort it takes to establish bidirectional spoken communication between two people with no common language. You have to be taught the word for apple by being given an apple. There's really no exception to this.

replies(2): >>43369941 #>>43370198 #
8. robwwilliams ◴[] No.43368846{3}[source]
Enactivist philosophy. Karl Friston is testing this approach as CTO of an AI startup in LA.
9. spyder ◴[] No.43369173[source]
> No.

Huh, text definitely encodes multimodal experiences; it's just not as accurate or as rich an encoding as the encodings of real sensations.

replies(5): >>43369232 #>>43369251 #>>43369409 #>>43371234 #>>43373812 #
10. bmitc ◴[] No.43369222[source]
It is important to note that Helen Keller was not born blind and deaf, though. (I am not reducing the struggle she went through; this is just commentary on embodied cognition and learning.) She had around 19 months of normal speech and hearing development until then, as well as 3D spatial traversal and object manipulation.
replies(1): >>43369640 #
11. bmitc ◴[] No.43369232{3}[source]
It's just a description, not an encoding.
12. ambicapter ◴[] No.43369251{3}[source]
I don't think GP is asserting that the multimodal encoding is "more rich" or "more accurate", I think they are saying that the felt modality is a different thing than the text modality entirely, and that the former isn't contained in the latter.
13. heyjamesknight ◴[] No.43369409{3}[source]
Text describes semantic space. Not everything maps to semantic space losslessly.
14. heyjamesknight ◴[] No.43369425[source]
There are absolutely reasons that we cannot capture the entirety—or even a proper image—of human cognition in semantic space.

Cognition is not purely semantic. It is dynamic, embodied, socially distributed, culturally extended, and conscious.

LLMs are great semantic heuristic machines. But they don't even have access to those other components.

replies(1): >>43369757 #
15. deepGem ◴[] No.43369490[source]
Doesn't this imply that the future of AGI lies not just in vision and text but in tactile feelings and actions as well?

Essentially, engineering the complete human body and mind including the nervous system. Seems highly intractable for the next couple of decades at least.

replies(1): >>43370574 #
16. ◴[] No.43369640{3}[source]
17. danielmarkbruce ◴[] No.43369757{3}[source]
The LLM embeddings for a token cover much more than semantics. There is a reason the embedding dimension for a single token is so large.

You are conflating the embedding layer in an LLM with an embedding model for semantic search.
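A toy sketch of that distinction (my own illustration with made-up sizes, not any particular model):

    import torch
    import torch.nn as nn

    vocab_size, d_model = 50_000, 4096                    # made-up sizes
    token_embedding = nn.Embedding(vocab_size, d_model)   # the "embedding layer" inside an LLM

    token_ids = torch.tensor([[17, 923, 4011]])           # a single 3-token input
    per_token_vectors = token_embedding(token_ids)        # shape (1, 3, 4096), fed into the transformer stack

    # A semantic-search "embedding model" instead reduces a whole passage to one
    # vector, typically by pooling the model's final hidden states; mean-pooling
    # the raw token vectors here just to show the shape change:
    sentence_vector = per_token_vectors.mean(dim=1)       # shape (1, 4096)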

replies(1): >>43384612 #
18. pessimizer ◴[] No.43369941[source]
I'm optimistic about this. I think enough pictures of an apple, chemical analyses of the air, the ability to arbitrarily move around in space, a bunch of pressure sensors, or a bunch of senses we don't even have, will solve this. I suspect there might be a continuum of more concept understanding that comes with more senses. We're bathed in senses all the time, to the point where we have many systems just to block out senses temporarily, and to constantly throw away information (but different information at different times.)

It's not a theory of consciousness, it's a theory of quality. I don't think that something can be considered conscious that is constantly encoding and decoding things into and out of binary.

19. csomar ◴[] No.43370066[source]
All of these "experiences" are encoded in your brain as electricity. So "text" can encode them, though English words might not be the proper way to do it.
replies(3): >>43370354 #>>43370552 #>>43370994 #
20. CamperBob2 ◴[] No.43370198[source]
A few GB worth of photographs of hieroglyphs? OK, you're going to need a Rosetta Stone.

A few PB worth? Relax, HAL's got this. When it comes to information, it turns out that quantity has a quality all its own.

21. chongli ◴[] No.43370354{3}[source]
No, text can only refer to them. There is not a text on this planet that encodes what the heat of the sun feels like on your skin. A person who had never been outdoors could never experience that sensation by reading text.
replies(2): >>43370498 #>>43370903 #
22. golergka ◴[] No.43370431[source]
> No matter how many times you read that, you cannot understand what the experience is like.

OK, so you don't have the qualia. But if you know all the data needed to complete any task that can be related to this knowledge, does it matter?

replies(1): >>43381296 #
23. pizza ◴[] No.43370498{4}[source]
In this case, there kind of is. It’s ‘spicy’. The TRPV1 receptor is activated by capsaicin as if it were being activated by intense heat.
24. the_arun ◴[] No.43370552{3}[source]
If text conveyed the actual message - e.g. the text "This spice is very hot" - then the reader's tongue should feel the heat! Since that doesn't happen, it is left to us to imagine. However, AI doesn't imagine the feeling/emotion - at least, we don't know that it does yet.
25. maigret ◴[] No.43370574{3}[source]
Yes it’s why robotics is so exciting right now
26. tgma ◴[] No.43370903{4}[source]
> There is not a text on this planet that encodes what the heat of the sun feels like on your skin.

> A person who had never been outdoors could never experience that sensation by reading text.

I don't think the latter implies the former as obviously as you make it out to be. Unless you believe in some sort of metaphysical description of humans, you can certainly encode the feeling (as mentioned in another comment, it will be reduced to electrical signals after all). The only question is how much storage you need for that encoding, and at what precision. The latter statement, if true, is simply a constraint of your input device to the brain: you cannot transfer your encoding to the hardware (in this case a human brain) via reading or listening. There could be higher-bandwidth interfaces like Neuralink that might do that to a human brain, and in the case of an AI an auxiliary device might not be needed at all; the encoding could be directly mmap'd.

replies(1): >>43371116 #
27. andsoitis ◴[] No.43370974[source]
Some aspects of experience— e.g. raw emotions, sensory perceptions, or deeply personal, ineffable states—often resist full articulation.

The taste of a specific dish, the exact feeling of nostalgia, or the full depth of a traumatic or ecstatic moment can be approximated in words but never fully captured. Language is symbolic and structured, while experience is often fluid, embodied, and multi-sensory. Even the most precise or poetic descriptions rely on shared context and personal interpretation, meaning that some aspects of experience inevitably remain untranslatable.

replies(1): >>43374325 #
28. tsimionescu ◴[] No.43370994{3}[source]
We don't know how memories are encoded in the brain, but "electricity" is definitely not a good enough abstraction.

And human language is a mechanism for referring to human experiences (both internally and between people). If you don't have the experiences, you're fundamentally limited in how useful human language can be to you.

I don't mean this in some "consciousness is beyond physics, qualia can't be explained" bullshit way. I just mean it in a very mechanistic way: language is like an API to our brains. The API allows us to work with objects in our brain, but it doesn't contain those objects itself. Just like you can't reproduce, say, the Linux kernel just by looking at the syscall API, you can't replace what our brains do by just replicating the language API.

29. chongli ◴[] No.43371116{5}[source]
Electrical signals are not the same as subjective experiences. While a machine may be able to record and play back these signals for humans to experience, that does not imply that the experiences themselves are recorded nor that the machine has any access to them.

A deaf person can use a tape recorder to record and play back a symphony but that does not encode the experience in any way the deaf person could share.

replies(1): >>43373811 #
30. Terr_ ◴[] No.43371234{3}[source]
> text definitely encodes multimodal experiences

Perhaps, but only in the same sense that brown and green wax on paper "encodes" an oak tree.

31. ◴[] No.43373409[source]
32. rcpt ◴[] No.43373489[source]
I can't see as much color as a mantis shrimp or sense electric fields like a shark but I still think I'm closer to AGI than they are
33. mietek ◴[] No.43373811{6}[source]
Those are some strong claims, given that philosophers (e.g. Chalmers vs Dennett) can’t even agree on whether subjective experiences exist or not.
replies(1): >>43374410 #
34. fluidcruft ◴[] No.43373812{3}[source]
Language encodes what people need it to encode to be useful. I heard of an example of colors--there are some languages that don't even have a word for blue.

https://blog.duolingo.com/color-words-around-the-world/

35. im3w1l ◴[] No.43374325[source]
Just because we struggle to verbalize something doesn't mean that it cannot be verbalized. The taste of a specific dish can be broken down into its components. The basic tastes: how sweet, sour, salty, bitter, and savory it is. The smell of it: there are apparently ~400 olfactory receptor types in the nose, so you could describe how strongly each of them is activated. Thermoception: the temperature of the food itself, but also the fake temperature sensation produced by capsaicin and menthol. The mechanoreceptors play a part, detecting both the shape of the food and its texture. The texture also contributes a sound sensation as we hear the cracks and pops when we chew. And that is just the static part of it. Food is actually an interactive experience, where all those impressions change and vary over time as the food is chewed.

It is highly complex, but it can all be described.

36. chongli ◴[] No.43374410{7}[source]
Even if you’re a pure Dennettian functionalist you still commit to a functional difference between signals in transit (or at rest) and signals being processed and interpreted. Holding a cassette tape with a recording of a symphony is not the same as hearing the symphony.

Applying this case to AI gives rise to the Chinese Room argument. LLMs’ propensity for hallucinations invites this comparison.

replies(1): >>43374917 #
37. mietek ◴[] No.43374917{8}[source]
Are LLMs having subjective experiences? Surely not. But if you claim that human subjective experiences are not the result of electrical signals in the brain, then what exactly is your position? Dualism?

Personally, I think the Chinese room argument is invalid. In order for the person in the room to respond to any possible query by looking up the query in a book, the book would need to be infinite and therefore impossible as a physical object. Otherwise, if the book is supposed to describe an algorithm for the person to follow in order to compute a response, then that algorithm is the intelligent entity that is capable of understanding, and the person in the room is merely the computational substrate.

replies(1): >>43375616 #
38. chongli ◴[] No.43375616{9}[source]
The Chinese Room is a perfect analogy for what's going on with LLMs. The book is not infinite, it's flawed. And that's the point: we keep bumping into the rough edges of LLMs with their hallucinations and faulty reasoning because the book can never be complete. Thus we keep getting responses that make us realize the LLM is not intelligent and has no idea what it's saying.

The only part where the book analogy falls down has to do with the technical implementation of LLMs, with their tokenization and their vast sets of weights. But that is merely an encoding for the training data. Books can be encoded similarly by using traditional compression algorithms (like LZMA).
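To make "encoded" concrete, a trivial standard-library example (nothing LLM-specific, just ordinary lossless compression of a book-sized string):

    import lzma

    text = "Call me Ishmael. ..."  # stand-in for any book-length string
    blob = lzma.compress(text.encode("utf-8"))            # opaque encoded form
    assert lzma.decompress(blob).decode("utf-8") == text  # losslessly recoverable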

replies(1): >>43379081 #
39. og_kalu ◴[] No.43379081{10}[source]
>The book is not infinite, it's flawed.

Oh, and the human book is surely infinite and unflawed, right?

>we keep bumping into the rough edges of LLMs with their hallucinations and faulty reasoning

Both things humans also do in excess

The Chinese Room is nonsensical. Can you point to any part of your brain that understands English? I guess you are a Chinese Room then.

replies(1): >>43380249 #
40. chongli ◴[] No.43380249{11}[source]
Humans have the ability to admit when they do not know something. We say “sorry, I don’t know, let me get back to you.” LLMs cannot do this. They either have the right answer in the book or they make up nonsense (hallucinate). And they do not even know which one they’re doing!
replies(1): >>43380799 #
41. og_kalu ◴[] No.43380799{12}[source]
>Humans have the ability to admit when they do not know something.

No not really. It's not even rare that a human confidently says and believes something and really has no idea what he/she's talking about.

>We say “sorry, I don’t know, let me get back to you.” LLMs cannot do this

Yeah they can. And they can do it much better than chance. They just don't do it as well as humans.

>And they do not even know which one they’re doing!

There's plenty of research suggesting that they do, in fact, have some sense of which one they're doing:

https://news.ycombinator.com/item?id=41418486

replies(1): >>43381871 #
42. ◴[] No.43381296{3}[source]
43. chongli ◴[] No.43381871{13}[source]
> No not really. It's not even rare that a human confidently says and believes something and really has no idea what he/she's talking about.

Like you’re doing right now? People say “I don’t know” all the time. Especially children. That people also exaggerate, bluff, and outright lie is not proof that people don’t have this ability.

When people are put in situations where they will be shamed or suffer other social stigmas for admitting ignorance then we can expect them to be less than candid.

As for your links to research showing that LLMs do possess the ability of introspection, I have one question: why have we not seen this in consumer-facing tools? Are the LLMs afraid of social stigma?

replies(1): >>43383387 #
44. og_kalu ◴[] No.43383387{14}[source]
>Like you’re doing right now?

Lol Okay

>When people are put in situations where they will be shamed or suffer other social stigmas for admitting ignorance then we can expect them to be less than candid.

Good thing I wasn't talking about that. There's a lot of evidence that human explanations are regularly post-hoc rationalizations they fully believe in. They're not lying to anyone; they just fully believe the nonsense their brain has concocted.

Experiments on choice and preferences https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3196841/

Split Brain Experiments https://www.nature.com/articles/483260a

>As for your links to research showing that LLMs do possess the ability of introspection, I have one question: why have we not seen this in consumer-facing tools? Are the LLMs afraid of social stigma?

Maybe read any of them? If you weren't interested in evidence to the contrary of your points then you could have just said so, and I wouldn't have wasted my time. The 1st and 6th links make it quite clear that current post-training processes hurt calibration a lot.

45. heyjamesknight ◴[] No.43384612{4}[source]
I don't think we're using the term semantic in the same way. I mean "relating to meaning in language."
replies(1): >>43385144 #
46. danielmarkbruce ◴[] No.43385144{5}[source]
The embedding layer in an LLM deals with much more than meaning. It has to capture syntax, grammar, morphology, style and sentiment cues, phonetic and orthographic relationships, and 500 other things that humans can't even reason about but that exist in word combinations.
replies(1): >>43429400 #
47. heyjamesknight ◴[] No.43429400{6}[source]
I'll give you that. I was including those in "semantic space," but the distinction is fair.

My original point still stands: the space you've described cannot capture a full image of human cognition.

48. MITSardine ◴[] No.43440558[source]
Even beyond sensations (which are never described except circumstantially; "the taste of chocolate" says nothing of the taste itself, only of the circumstances in which the sensation is felt), it very often happens that people don't understand something another person says (typically a work of art) until they have lived the relevant experiences and can connect to the meaning behind it, whatever the medium of communication.