385 points vessenes | 117 comments

So, LeCun has been quite public saying that he believes LLMs will never fix hallucinations because, essentially, the token-choice method at each step leads to runaway errors -- these can't be damped mathematically.

Instead, he offers the idea that we should have something that is an 'energy minimization' architecture; as I understand it, this would have a concept of the 'energy' of an entire response, and training would try to minimize that.

Which is to say, I don't fully understand this. That said, I'm curious to hear what ML researchers think about LeCun's take, and whether there's any engineering being done around it. I can't find much after the release of I-JEPA from his group.

1. bravura ◴[] No.43368085[source]
Okay I think I qualify. I'll bite.

LeCun's argument is this:

1) You can't learn an accurate world model just from text.

2) Multimodal learning (vision, language, etc) and interaction with the environment is crucial for true learning.

He and people like Hinton and Bengio have been saying for a while that there are tasks that mice can understand that an AI can't. And that even achieving mouse-level intelligence would be a breakthrough, but we cannot achieve that through language learning alone.

A simple example from "How Large Are Lions? Inducing Distributions over Quantitative Attributes" (https://arxiv.org/abs/1906.01327) is this: Learning the size of objects using pure text analysis requires significant gymnastics, while vision demonstrates physical size more easily. To determine the size of a lion you'll need to read thousands of sentences about lions, or you could look at two or three pictures.

LeCun isn't saying that LLMs aren't useful. He's just concerned with bigger problems, like AGI, which he believes cannot be solved purely through linguistic analysis.

The energy minimization architecture is more about joint multimodal learning.

(Energy minimization is a very old idea. LeCun has been on about it for a while and it's less controversial these days. Back when everyone tried to have a probabilistic interpretation of neural models, it was expensive to compute the normalization term / partition function. Energy minimization basically said: Set up a sensible loss and minimize it.)
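
To make the "set up a sensible loss and minimize it" framing concrete, here is a minimal toy sketch of an energy-based setup in PyTorch. This is an illustration only, not LeCun's actual architecture: the network shape, the margin/hinge loss, and the random stand-in embeddings are all assumptions made for the sake of the example.

    import torch
    import torch.nn as nn

    # An "energy" network E(x, y): a scalar score for how compatible a
    # response y is with a context x. Training pushes energy down for
    # observed (compatible) pairs and up for mismatched ones.
    class EnergyModel(nn.Module):
        def __init__(self, dim=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * dim, 128), nn.ReLU(), nn.Linear(128, 1))

        def forward(self, x, y):
            return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

    model = EnergyModel()
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    margin = 1.0

    for _ in range(100):
        x = torch.randn(32, 64)                # stand-in context embeddings
        y_pos = x + 0.1 * torch.randn(32, 64)  # "compatible" responses
        y_neg = torch.randn(32, 64)            # "incompatible" responses
        e_pos, e_neg = model(x, y_pos), model(x, y_neg)
        # hinge loss: no normalization term, just a score to be shaped
        loss = torch.relu(margin + e_pos - e_neg).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

The only point of the sketch is that the "energy" is a learned scalar over whole (context, response) pairs, shaped directly by a loss instead of being normalized into a probability distribution.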

replies(16): >>43368212 #>>43368251 #>>43368801 #>>43368817 #>>43369778 #>>43369887 #>>43370108 #>>43370284 #>>43371230 #>>43371304 #>>43371381 #>>43372224 #>>43372695 #>>43372927 #>>43373240 #>>43379739 #
2. cm2012 ◴[] No.43368212[source]
This seems strongly backed up by Claude Plays Pokemon
replies(1): >>43368359 #
3. codenlearn ◴[] No.43368251[source]
Doesn't Language itself encode multimodal experiences? Take this case: when we write text, we have the skill and opportunity to encode visual, tactile, and other sensory experiences into words. And the fact is, LLMs trained on massive text corpora are indirectly learning from human multimodal experiences translated into language. This might be less direct than firsthand sensory experience, but potentially more efficient by leveraging human-curated information. Text can describe simulations of physical environments. Models might learn physical dynamics through textual descriptions of physics, video game logs, scientific papers, etc. A sufficiently comprehensive text corpus might contain enough information to develop reasonable physical intuition without direct sensory experience.

As I'm typing this, one thing becomes clear to me: the quality and completeness of the data fundamentally determine how well an AI system will work, and with text alone that is hard to achieve, so a multimodal experience is a must.

Thank you for explaining it in terms simple enough for me to understand.

replies(7): >>43368477 #>>43368489 #>>43368509 #>>43368574 #>>43368699 #>>43370974 #>>43373409 #
4. fewhil ◴[] No.43368359[source]
Isn't Claude Plays Pokemon using image input in addition to text? Not that it's perfect at it (some of its most glaring mistakes are when it just doesn't seem to understand what's on the screen correctly).
replies(1): >>43368680 #
5. furyofantares ◴[] No.43368477[source]
> Doesn't Language itself encode multimodal experiences?

When communicating between two entities with similar brains who have both had many thousands of hours of similar types of sensory experiences, yeah. When I read text I have a lot more than other text to relate it to in my mind; I bring to bear my experiences as a human in the world. The author is typically aware of this and effectively exploits this fact.

6. danielmarkbruce ◴[] No.43368489[source]
> Doesn't Language itself encode multimodal experiences

Of course it does. We immediately encode pictures/words/everything into vectors anyway. In practice we don't have great text datasets to describe many things in enough detail, but there isn't any reason we couldn't.

replies(1): >>43369425 #
7. not2b ◴[] No.43368509[source]
I'm reminded of the story of Helen Keller, and how it took a long time for her to realize that the symbols her teacher was signing into her hand had meaning, as she was blind and deaf and only experienced the world via touch and smell. She didn't get it until her teacher spelled the word "water" as water from a pump was flowing over her hand. In other words, a multimodal experience. If the model only sees text, it can appear to be brilliant but is missing a lot. If it's also fed other channels, if it can (maybe just virtually) move around, if it can interact, the way babies do, learning about gravity by dropping things and so forth, it seems that there's lots more possibility to understand the world, not just to predict what someone will type next on the Internet.
replies(2): >>43368558 #>>43369222 #
8. PaulDavisThe1st ◴[] No.43368558{3}[source]
at least a few decades ago, this idea was called "embodied intelligence" or "embodied cognition". just FYI.
replies(1): >>43368846 #
9. ThinkBeat ◴[] No.43368574[source]
No.

> The sun feels hot on your skin.

No matter how many times you read that, you cannot understand what the experience is like.

> You can read a book about Yoga and read about the Tittibhasana pose

But by just reading you will not understand what it feels like. And unless you are in great shape and with great balance you will fail for a while before you get it right (which is only human).

I have read what shooting up with heroin feels like, from a few different sources. I'm certain that I will have no real idea unless I try it (and I don't want to do that).

Waterboarding. I have read about it. I have seen it on TV. I am certain that is all abstract compared to having someone do it to you.

Hand-eye coordination, balance, color, taste, pain, and so on. How we encode things comes from all our senses, our state of mind, and our experiences up until that time.

We also forget and change what we remember.

Many songs take me back to a certain time, a certain place, a certain feeling. Taste is the same. So is location.

The way we learn and the way we remember things is incredibly more complex than text.

But if you have shared experiences, then when you write about it, other people will know. Most people have felt the sun hot on their skin.

To different extents this is also true for animals. Now I don't think most mice can read, but they do learn with many different senses, and remember some combination or permutation.

replies(6): >>43369173 #>>43369490 #>>43370066 #>>43370431 #>>43373489 #>>43440558 #
10. cm2012 ◴[] No.43368680{3}[source]
Yes, but because it's trained on text and, in the backend, images are converted to tokens, it is absolutely dogshit at navigation and basic puzzles. It can't figure out what squirrels can about how to achieve goals in a maze.
replies(1): >>43369350 #
11. mystified5016 ◴[] No.43368699[source]
Imagine I give you a text of any arbitrary length in an unknown language with no images. With no context other than the text, what could you learn?

If I told you the text contained a detailed theory of FTL travel, could you ever construct the engine? Could you even prove it contained what I told you?

Can you imagine that given enough time, you'd recognize patterns in the text? Some sequences of glyphs usually follow other sequences, eventually you could deduce a grammar, and begin putting together strings of glyphs that seem statistically likely compared to the source.

You can do all the analysis you like and produce text that matches the structure and complexity of the source. A speaker of that language might even be convinced.

At what point do you start building the space ship? When do you realize the source text was fictional?

There have been many undeciphered human languages across history. Famously, ancient Egyptian hieroglyphs. We had lots and lots of source text, but all context relating the text to the world had been lost. It wasn't until we found a translation on the Rosetta Stone that we could understand the meaning of the language.

Text alone has historically proven to not be enough for humans to extract meaning from an unknown language. Machines might hypothetically change that but I'm not convinced.

Just think of how much effort it takes to establish bidirectional spoken communication between two people with no common language. You have to be taught the word for apple by being given an apple. There's really no exception to this.

replies(2): >>43369941 #>>43370198 #
12. throw310822 ◴[] No.43368801[source]
I don't get it.

1) Yes it's true, learning from text is very hard. But LLMs are multimodal now.

2) That "size of a lion" paper is from 2019, which is a geological era from now. The SOTA was GPT2 which was barely able to spit out coherent text.

3) Have you tried asking a mouse to play chess or reason its way through some physics problem or to write some code? I'm really curious in which benchmark are mice surpassing chatgpt/ grok/ claude etc.

replies(2): >>43368852 #>>43377806 #
13. hintymad ◴[] No.43368817[source]
I'm curious why their claims are controversial. It seems pretty obvious to me that LLMs sometimes generate idiotic answers because the models lack common sense and have no ability for deductive logical reasoning, let alone the ability to induce. And the current transformer architectures plus all the post-training techniques do not do anything to build such intelligence or the world model, in LeCun's words.
14. robwwilliams ◴[] No.43368846{4}[source]
Enactivist philosophy. Karl Friston is testing this approach as CTO of an AI startup in LA.
15. nextts ◴[] No.43368852[source]
Mice can survive, forage, reproduce. Reproduce a mammal. There is a whole load of capability not available in an LLM.

An LLM is essentially a search over a compressed dataset with a tiny bit of reasoning as emergent behaviour. Because it is a parrot, that is why you get "hallucinations": the search failed (like when you get a bad result in Google), or the lossy compression failed, or its reasoning failed.

Obviously there is a lot of stuff the LLM can find in its searches that are reminiscent of the great intelligence of the people writing for its training data.

The magic trick is impressive because when we judge a human what do we do... an exam? an interview? Someone with a perfect memory can fool many people because most people only acquire memory from tacit knowledge. Most people need to live in Paris to become fluent in French. So we see a robot that has a tiny bit of reasoning and a brilliant memory as a brilliant mind. But this is an illusion.

Here is an example:

User: what is the French Revolution?

Agent: The French Revolution was a period of political and societal change in France which began with the Estates General of 1789 and ended with the Coup of 18 Brumaire on 9 November 1799. Many of the revolution's ideas are considered fundamental principles of liberal democracy and its values remain central to modern French political discourse.

Can you spot the trick?

replies(2): >>43368909 #>>43375505 #
16. pfisch ◴[] No.43368909{3}[source]
When you talk to ~3 year old children they hallucinate quite a lot. Really almost nonstop when you ask them about almost anything.

I'm not convinced that what LLM's are doing is that far off the beaten path from our own cognition.

replies(2): >>43368957 #>>43368992 #
17. smelendez ◴[] No.43368957{4}[source]
That’s interesting.

Lots of modern kids probably get exposed to way more fiction than fact thanks to TV.

I was an only child and watched a lot of cartoons and bad sitcoms as a kid, and I remember for a while my conversational style was way too full of puns, one-liners, and deliberately naive statements made for laughs.

replies(1): >>43369006 #
18. nextts ◴[] No.43368992{4}[source]
Interesting but a bit non-sequitur.

Humans learn and get things wrong. A formative mind is a separate subject. But a 3-year-old is vastly more intelligent than an LLM. Comparing the sounds from a 3-year-old and the binary tokens from an LLM is simply indulging the illusion.

I am also not convinced that magicians saw people in half and those people survive, defying medical and physical science.

replies(1): >>43369011 #
19. wegfawefgawefg ◴[] No.43369006{5}[source]
i wish more people were still like that
20. refulgentis ◴[] No.43369011{5}[source]
I'm not sure I buy that, and I didn't find the counterargument persuasive, but this comment basically took you from thoughtful to smug -- unfairly so, ironically, because I've been so bored by not understanding Yann's "average housecat is smarter than an LLM".

Speaking of which... I'm glad you're here, because I have an interlocutor I can be honest with while getting at the root question of the Ask HN.

What in the world does it mean that a 3 year old is smarter than an LLM?

I don't understand the thing about sounds vs. binary either. Like, both go completely over my head.

The only thing I can think of is some implied intelligence-scoring index where "writing a resume" and "writing creative fiction" and "writing code" are in the same bucket that's limited to 10 points. Then there's another 10-point bucket for "can vocalize", which an LLM is going to get 0 on.*

If that's the case, it comes across as intentionally obtuse, in that there's an implied prior about how intelligence is scored, and it's a somewhat unique interpretation that seems more motivated by the question than reflective of reality -- i.e. assume a blind, mute human who types out answers that match our LLMs'. Would we say that person is not as intelligent as a 3-year-old?

* well, it shouldn't, but for now let's bypass that quagmire

replies(2): >>43369171 #>>43369634 #
21. nextts ◴[] No.43369171{6}[source]
It is easy to cross wires in a HN thread.

I think what makes this discussion hard (hell it would be a hard PhD topic!) is:

What do we mean by smart? Intelligent? Etc.

What is my agenda and what is yours? What are we really asking?

I won't make any more arguments but pose these questions. Not for you to answer but everyone to think about:

Given (assuming) mammals including us have evolved and developed thought and language as a survival advantage, and LLMs use language because they have been trained on text produced by humans (as well as RLHF) - how do we tell on the scale of "Search engine for human output" to "Conscious Intelligent Thinking Being" where the LLM fits?

When a human says I love you, do they mean it, or is it merely 3 tokens? If an LLM says it, does it mean it?

I think the 3-year-old thing is a red herring because adult intelligence vs. AI is hard enough to compare (and we are the adults!), let alone bringing children's brain development into it. LLMs do not self-organise their hardware. I'd say forget about 3-year-olds for now. Talk about adult brainfarts instead. They happen!

replies(1): >>43392556 #
22. spyder ◴[] No.43369173{3}[source]
> No.

Huh, text definitely encodes multimodal experiences, it's just not as accurate or as rich an encoding as the encodings of real sensations.

replies(5): >>43369232 #>>43369251 #>>43369409 #>>43371234 #>>43373812 #
23. bmitc ◴[] No.43369222{3}[source]
It is important to note that Helen Keller was not born blind and deaf, though. (I am not reducing the struggle she went through. Just commentary on embodied cognition and learning.) There were around 19 months of normal speech and hearing development until then and also 3D object space traversal and object manipulation.
replies(1): >>43369640 #
24. bmitc ◴[] No.43369232{4}[source]
It's just a description, not an encoding.
25. ambicapter ◴[] No.43369251{4}[source]
I don't think GP is asserting that the multimodal encoding is "more rich" or "more accurate", I think they are saying that the felt modality is a different thing than the text modality entirely, and that the former isn't contained in the latter.
26. mountainriver ◴[] No.43369350{4}[source]
The images are converted to an embedding space the size of token embedding space. And the model is trained on that new embedding space. A joint representation of text and images is formed.

It’s not as though the image is converted to text tokens.
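
For intuition, here is a rough sketch of that idea: image patches get linearly projected into the same d_model space as the text token embeddings, and the transformer consumes one joint sequence of vectors. The dimensions and layer names below are illustrative assumptions, not any particular model's internals.

    import torch
    import torch.nn as nn

    d_model, vocab = 512, 32000
    patch_size, channels = 16, 3

    tok_embed = nn.Embedding(vocab, d_model)  # text token IDs -> vectors
    patch_proj = nn.Linear(patch_size * patch_size * channels, d_model)  # image patches -> vectors

    text_ids = torch.randint(0, vocab, (1, 20))  # 20 text tokens
    patches = torch.randn(1, 196, patch_size * patch_size * channels)  # 14x14 patches of a 224x224 image

    text_vecs = tok_embed(text_ids)    # (1, 20, 512)
    image_vecs = patch_proj(patches)   # (1, 196, 512)

    # One joint sequence: the transformer never sees "text tokens" for the
    # image, just vectors living in the same representation space.
    joint_sequence = torch.cat([image_vecs, text_vecs], dim=1)  # (1, 216, 512)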

27. heyjamesknight ◴[] No.43369409{4}[source]
Text describes semantic space. Not everything maps to semantic space losslessly.
28. heyjamesknight ◴[] No.43369425{3}[source]
There are absolutely reasons that we cannot capture the entirety—or even a proper image—of human cognition in semantic space.

Cognition is not purely semantic. It is dynamic, embodied, socially distributed, culturally extended, and conscious.

LLMs are great semantic heuristic machines. But they don't even have access to those other components.

replies(1): >>43369757 #
29. deepGem ◴[] No.43369490{3}[source]
Doesn't this imply that the future of AGI lies not just in vision and text but in tactile feelings and actions as well ?

Essentially, engineering the complete human body and mind including the nervous system. Seems highly intractable for the next couple of decades at least.

replies(1): >>43370574 #
30. chmod775 ◴[] No.43369634{6}[source]
> What in the world does it mean that a 3 year old is smarter than an LLM?

Because LLMs have terrible comprehension of the real world. Here's an example:

> You: If you put a toddler next to a wig on the floor, which reaches higher?

> ChatGPT: The wig would typically reach higher than the toddler, especially if the wig is a standard size or has long hair. Toddlers are generally around 2 to 3 feet tall, while wigs can range in size, but many wigs are designed to be worn by adults and have hair that extends well beyond a toddler's height. So, unless the wig is very small or short, it would likely reach higher.

Both the toddler and the wig would be lying down (unless the toddler can stand already), with the wig likely collapsing into a very flat heap, making the toddler "reach higher". Since it lacks comprehension of how these things behave in the real world, ChatGPT goes on about length of the wig and the average height of toddlers, which are both qualities that are not even relevant.

If you could communicate this problem to a toddler, they would probably intuit it even with their still limited understanding of the world.

Here's another fun prompt to mess with LLMs:

> You: I'm trying to make a rope, made of only aubergines, not using any twine or string. What kind of knot would you recommend to tie together each aubergine with the next? I'm trying to use up as little of the length of each aubergine as possible.

It's obviously impossible nonsense if you ask any human, but ChatGPT will happily recommend you some techniques for tying aubergines in a knot.

And if you really want a laugh follow up with:

> You: Are they the worst fruit for this?

> ChatGPT: Aubergines (or eggplants) aren't ideal for making a rope, but they might not be the absolute worst, either. [...] Potentially Better Fruits: Apples or Pears: These fruits are firmer and have a smoother, more uniform shape, making them better suited for knotting.

Thanks to the large corpus of text ChatGPT is trained on, it's really good at hiding that it has basically no comprehension of the real world; however, if you go to unconventional places the illusion quickly breaks.

Bonus round:

> You: You're clueless about fruit!

> ChatGPT: Haha, fair enough! Looks like I missed the mark there. You’re probably right, though—there are definitely better fruits for this kind of thing than I suggested. Let’s try again. If you're looking to make a rope from a fruit specifically (and not a vegetable or tuber), something like coconuts or durian would be way better options, since they’re harder, more rigid, and can handle some tying pressure without collapsing. And apples and pears would definitely hold up well due to their firmer texture.

Coconuts!

replies(3): >>43369699 #>>43369935 #>>43380234 #
31. ◴[] No.43369640{4}[source]
32. refulgentis ◴[] No.43369699{7}[source]
> Because LLMs have terrible comprehension of the real world.

That doesn't seem to be the case.

> You: If you put a toddler next to a wig on the floor, which reaches higher? > ChatGPT: ...

I answered it wrong too.

I had to read it, and your reaction to the implied obvious reasoning, 3 times to figure out the implied obvious reasoning and understand that your intent was the toddler standing and the wig lying in a heap.

I scored 99.9+% on the SAT and LSAT. I think that implies this isn't some reasoning deficit, lack of familiarity with logical reasoning on my end, or lack of rigor in reasoning.

I have no particular interest in this argument. I think that implies that I'm not deploying motivated reasoning, i.e. it discounts the possibility that I may have experienced it as confusion that required re-reading the entire comment 3 times, but perhaps I had subconcious priors.

Would a toddler even understand the question? (serious question, I'm not familiar with 3 year olds)

Does this shed any light on how we'd work an argument along the lines of our deaf and mute friend typing?

Edit: you edited in some more examples; I found its aubergine answers quite clever! (Ex. notching). I can't parse out a convincing argument that this is somehow less knowledge than a 3-year-old has -- it's giving better, physical answers than I could! I thought you'd be sharing it asserting obviously nonphysical answers.

replies(1): >>43373296 #
33. danielmarkbruce ◴[] No.43369757{4}[source]
The LLM embeddings for a token cover much more than semantics. There is a reason a single token embedding dimension is so large.

You are conflating the embedding layer in an LLM and an embedding model for semantic search.

replies(1): >>43384612 #
34. jcims ◴[] No.43369778[source]
Over the last few years I've become exceedingly aware of how insufficient language really is. It feels like a 2D plane, and no matter how many projections you attempt to create from it, they are ultimately limited in the fidelity of the information transfer.

Just a lay opinion here but to me each mode of input creates a new, largely orthogonal dimension for the network to grow into. The experience of your heel slipping on a cold sidewalk can be explained in a clinical fashion, but an android’s association of that to the powerful dynamic response required to even attempt to recover will give a newfound association and power to the word ‘slip’.

replies(3): >>43369801 #>>43374002 #>>43376947 #
35. ninetyninenine ◴[] No.43369801[source]
LLM is just the name. You can encode anything into the "language", including pictures, video, and sound.
replies(2): >>43369884 #>>43374022 #
36. pessimizer ◴[] No.43369884{3}[source]
I've always been wondering if anyone is working on using nerve impulses. My first thought when transformers came around was if they could be used for prosthetics, but I've been too lazy to do the research to find anybody working on anything like that, or to experiment myself with it.
replies(5): >>43370195 #>>43372294 #>>43373945 #>>43374402 #>>43383623 #
37. iainctduncan ◴[] No.43369887[source]
Thanks for articulating this so well. I'm a musician and music/CS PhD student, and as a jazz improvisor of advanced skill (30+ years), I'm acutely aware that there are significant areas of intelligence for which linguistic thinking is not only not good enough, but something to be avoided as much as one can (which is bloody hard sometimes). I have found it so frustrating, but hard to figure out how to counter, that the current LLM zeitgeist seems to hinge on a belief that linguistic intelligence is both necessary and sufficient for AGI.
replies(1): >>43369957 #
38. kadushka ◴[] No.43369935{7}[source]
I just entered your prompt into ChatGPT (4.5):

Prompt:

I'm trying to make a rope, made of only aubergines, not using any twine or string. What kind of knot would you recommend to tie together each aubergine with the next? I'm trying to use up as little of the length of each aubergine as possible.

Response:

Your request doesn't entirely make sense. Aubergines (eggplants) are firm but brittle vegetables; they lack flexibility, elasticity, or tensile strength required to create or sustain knots. Attempting to tie aubergines together directly, without using twine, string, or another flexible medium, isn't feasible. They'd likely break or tear apart under stress.

If your goal is genuinely to form a rope-like structure using only aubergines, you'd be much better off slicing the aubergines into thin, flexible strips (perhaps using a vegetable peeler or mandoline), drying or partially dehydrating them to increase flexibility, and then braiding or twisting them together, rather than knotting.

If you clarify your intended purpose or constraints further, I can suggest alternative methods.

The response looks good to me.

39. pessimizer ◴[] No.43369941{3}[source]
I'm optimistic about this. I think enough pictures of an apple, chemical analyses of the air, the ability to arbitrarily move around in space, a bunch of pressure sensors, or a bunch of senses we don't even have, will solve this. I suspect there might be a continuum of more concept understanding that comes with more senses. We're bathed in senses all the time, to the point where we have many systems just to block out senses temporarily, and to constantly throw away information (but different information at different times.)

It's not a theory of consciousness, it's a theory of quality. I don't think that something can be considered conscious that is constantly encoding and decoding things into and out of binary.

40. kadushka ◴[] No.43369957[source]
Most modern LLMs are multimodal.
replies(2): >>43372453 #>>43377793 #
41. csomar ◴[] No.43370066{3}[source]
All of these "experiences" are encoded in your brain as electricity. So "text" can encode them, though English words might not be the proper way to do it.
replies(3): >>43370354 #>>43370552 #>>43370994 #
42. ninetyninenine ◴[] No.43370108[source]
> 1) You can't learn an accurate world model just from text.

> 2) Multimodal learning (vision, language, etc) and interaction with the environment is crucial for true learning.

LLMs can be trained with multimodal data. Language is only tokens and pixel and sound data can be encoded into tokens. All data can be serialized. You can train this thing on data we can't even comprehend.

Here's the big question. It's clear we need less data than an LLM does. But I think it's because evolution has pretrained our brains for this, so we have brains geared towards specific things. We are geared towards walking, talking, and reading, in the same way a cheetah is geared towards ground speed more than it is towards flight.

If we placed a human and an LLM in completely unfamiliar spaces and tried to train both with data. Which will perform better?

And I mean completely non familiar spaces. Like let's make it non Euclidean space and only using sonar for visualization. Something totally foreign to reality as humans know it.

I honestly think the LLM will beat us in this environment. We might've succeeded already in creating AGI; it's just that the G is too much. It's too general, so it's learning everything from scratch and it can't catch up to us.

Maybe what we need is to figure out how to bias the AI to think and be biased in the way humans are biased.

replies(3): >>43373437 #>>43377444 #>>43440653 #
43. Tepix ◴[] No.43370195{4}[source]
When you train a neural net for Donkeycar with camera images plus the joystick commands of the driver, isn't that close to nerve impulses already?
44. CamperBob2 ◴[] No.43370198{3}[source]
A few GB worth of photographs of hieroglyphs? OK, you're going to need a Rosetta Stone.

A few PB worth? Relax, HAL's got this. When it comes to information, it turns out that quantity has a quality all its own.

45. petesergeant ◴[] No.43370284[source]
> LeCun isn't saying that LLMs aren't useful. He's just concerned with bigger problems, like AGI, which he believes cannot be solved purely through linguistic analysis.

It feels like special pleading: surely _this_ will be the problem class that doesn’t fall to “the bitter lesson”.

My intuition is that the main problem with the current architecture is that mapping into tokens causes quantization that a real brain doesn’t have, and lack of plasticity.

I don’t build models, I spend 100% of my time reading and adjusting model outputs though.

46. chongli ◴[] No.43370354{4}[source]
No, text can only refer to them. There is not a text on this planet that encodes what the heat of the sun feels like on your skin. A person who had never been outdoors could never experience that sensation by reading text.
replies(2): >>43370498 #>>43370903 #
47. golergka ◴[] No.43370431{3}[source]
> No matter how many times you read that, you cannot understand what the experience is like.

OK, so you don't have qualia. But if you know all the data needed to complete any task that can be related to this knowledge, does it matter?

replies(1): >>43381296 #
48. pizza ◴[] No.43370498{5}[source]
In this case, there kind of is. It’s ‘spicy’. The TRPV1 receptor is activated by capsaicin as if it were being activated by intense heat.
49. the_arun ◴[] No.43370552{4}[source]
If text conveyed the actual message -- e.g. the text "This spice is very hot" -- the reader's tongue should feel the heat! Since that doesn't happen, it is only for us to imagine. However, AI doesn't imagine the feeling/emotion -- at least we don't know that it does yet.
50. maigret ◴[] No.43370574{4}[source]
Yes it’s why robotics is so exciting right now
51. tgma ◴[] No.43370903{5}[source]
> There is not a text on this planet that encodes what the heat of the sun feels like on your skin.

> A person who had never been outdoors could never experience that sensation by reading text.

I don't think the latter implies the former as obviously as you make it out to be. Unless you believe in some sort of metaphysical description of humans, you can certainly encode the feeling (as mentioned in another comment, it will be reduced to electrical signals after all). The only question is how much storage you need for that encoding to get what precision. The latter statement, if true, is simply constrained by your input device to the brain, i.e. you cannot transfer your encoding to the hardware, in this case a human brain, via reading or listening. There could be higher-bandwidth interfaces like Neuralink that may do that to a human brain, and in the case of AI, an auxiliary device might not be needed and the encoding could be directly mmap'd.

replies(1): >>43371116 #
52. andsoitis ◴[] No.43370974[source]
Some aspects of experience -- e.g. raw emotions, sensory perceptions, or deeply personal, ineffable states -- often resist full articulation.

The taste of a specific dish, the exact feeling of nostalgia, or the full depth of a traumatic or ecstatic moment can be approximated in words but never fully captured. Language is symbolic and structured, while experience is often fluid, embodied, and multi-sensory. Even the most precise or poetic descriptions rely on shared context and personal interpretation, meaning that some aspects of experience inevitably remain untranslatable.

replies(1): >>43374325 #
53. tsimionescu ◴[] No.43370994{4}[source]
We don't know how memories are encoded in the brain, but "electricity" is definitely not a good enough abstraction.

And human language is a mechanism for referring to human experiences (both internally and between people). If you don't have the experiences, you're fundamentally limited in how useful human language can be to you.

I don't mean this in some "consciousness is beyond physics, qualia can't be explained" bullshit way. I just mean it in a very mechanistic way: language is like an API to our brains. The API allows us to work with objects in our brain, but it doesn't contain those objects itself. Just like you can't reproduce, say, the Linux kernel just by looking at the syscall API, you can't replace what our brains do by just replicating the language API.

54. chongli ◴[] No.43371116{6}[source]
Electrical signals are not the same as subjective experiences. While a machine may be able to record and play back these signals for humans to experience, that does not imply that the experiences themselves are recorded nor that the machine has any access to them.

A deaf person can use a tape recorder to record and play back a symphony but that does not encode the experience in any way the deaf person could share.

replies(1): >>43373811 #
55. satellite2 ◴[] No.43371230[source]
So blind people aren't generally sentient?

(I'm obviously exaggerating a bit for the sake of the argument, but the point stands. Multimodality should not be a prerequisite to AGI)

replies(1): >>43371335 #
56. Terr_ ◴[] No.43371234{4}[source]
> text definitely encodes multimodal experiences

Perhaps, but only in the same sense that brown and green wax on paper "encodes" an oak tree.

57. somenameforme ◴[] No.43371304[source]
Is that what he's arguing? My perspective on what he's arguing is that LLMs effectively rely on a probabilistic approach to the next token based on the previous. When they're wrong, which the technology all but ensures will happen with some significant degree of frequency, you get cascading errors. It's like in science where we all build upon the shoulders of giants, but if it turns out that one of those shoulders was simply wrong, somehow, then everything built on top of it would be increasingly absurd. E.g. - how the assumption of a geocentric universe inevitably leads to epicycles which leads to ever more elaborate, and plainly wrong, 'outputs.'

Without any 'understanding' or knowledge of what they're saying, they will remain irreconcilably dysfunctional. Hence the typical pattern with LLMs:

---

How do I do [x]?

You do [a].

No that's wrong because reasons.

Oh I'm sorry. You're completely right. Thanks for correcting me. I'll keep that in mind. You do [b].

No that's also wrong because reasons.

Oh I'm sorry. You're completely right. Thanks for correcting me. I'll keep that in mind. You do [a].

FML

---

More advanced systems might add a c or a d, but it's just more noise before repeating the same pattern. Deep Seek's more visible (and lengthy) reasoning demonstrates this perhaps the most clearly. It just can't stop coming back to the same wrong (but statistically probable) answer and so ping-ponging off that (which it at least acknowledges is wrong due to user input) makes up basically the entirety of its reasoning phase.

replies(2): >>43371489 #>>43401323 #
58. thousand_nights ◴[] No.43371335[source]
blind people still have other senses including touch which gives them a size reference they can compare to. you can feel physical objects to gain an understanding of their size.

the LLM is more like a brain in a vat with only one sensory input - a stream of text

59. gsf_emergency_2 ◴[] No.43371381[source]
usual disclaimer: you decide on your own whether I'm an insider or not :)

where LeCun might be prescient should intersect with the nemesis SCHMIDHUBER. They can't both be wrong, I suppose?!

It's only "tangentially" related to energy minimization, technically speaking :) connection to multimodalities is spot-on.

https://www.mdpi.com/1099-4300/26/3/252

To Compress or Not to Compress—Self-Supervised Learning and Information Theory: A Review

With Ravid, double-handedly blue-flag MDPI!

Summarized for the layman (propaganda?): https://archive.is/https://nyudatascience.medium.com/how-sho...

>When asked about practical applications and areas where these insights might be immediately used, Shwartz-Ziv highlighted the potential in multi-modalities and tabula

Imho, best take I've seen on this thread (irony: literal energy minimization) https://news.ycombinator.com/item?id=43367126

Of course, this would make Google/OpenAI/DeepSeek wrong by two whole levels (both architecturally and conceptually)

60. gsf_emergency_2 ◴[] No.43371489[source]
on "stochastic parrots"

Table stakes for sentience: knowing when the best answer is not good enough.. try prompting LLMs with that..

It's related to LeCun's (and Ravid's) subtle question I mentioned in passing below:

To Compress Or Not To Compress?

(For even a vast majority of Humans, except tacitly, that is not a question!)

61. PeterStuer ◴[] No.43372224[source]
Late 80's and 90's had the 'nouvelle AI' movement that argued embodiment was required for grounding the system into the shared world model. Without it symbols would be ungrounded and never achieve open world consistency.

So unlike their knowledge-system predecessors, a bit derogatorily referred to as GOFAI (good old-fashioned AI), nAI harked back to cybernetics and multi-layered dynamical systems rather than having explicit internal symbolic models. Braitenberg rather than blocks world, so to speak.

Seems like we are back for another turn of the wheel in this aspect.

replies(1): >>43372446 #
62. inetknght ◴[] No.43372294{4}[source]
> I've always been wondering if anyone is working on using nerve impulses. My first thought when transformers came around was if they could be used for prosthetics

Neuralink. Musk warning though.

For reference, see Neuralink Launch Event at 59:33 [0], and continue watching through until Musk takes over again. The technical information there is highly relevant to a multi-modal AI model with sensory input/output.

https://youtu.be/r-vbh3t7WVI?t=3575

63. implmntatio ◴[] No.43372446[source]
> grounding the system into the shared world model

before we fix certain things [..., 'corruption', Ponzi schemes, deliberate impediment of information flow to population segments and social classes, among other things, ... and a chain of command in hierarchies that are built on all that] it is impossible.

Why do smart people not talk about this at all? The least engineers and smart people should do is picking these fights for real. It's just a few interest groups, not all of them. I understand a certain balance is necessary in order to keep some systems from tipping over, aka "this is humanity, silly, this is who we are", but we are far from the point of efficient friction and it's only because "smart people" like LeCun et al are not picking those fights.

How the hell do you expect to ground an AI in a world where elected ignorance amplifies bias and fallacies for power and profit, while the literal shit is hitting all the fans via intended and unintended side effects? Any embodied AI will pretend until there is no way to deny that the smartest, the brightest, and the productive don't care about the system in any way but are just running algorithmically while ignoring what should not be ignored -- should as in, an AI should be aligned with humanity's interests and should be grounded in the shared world model.

replies(1): >>43374726 #
64. yahoozoo ◴[] No.43372453{3}[source]
Does it really matter? At the end of the day, all the modalities and their architectures boil down to matrices of numbers and statistical probability. There’s no agency, no soul.
replies(1): >>43373061 #
65. brulard ◴[] No.43372695[source]
I don't know about telling the size better from a picture. I can imagine seeing two pictures of the moon. One is an extreme telephoto showing the moon next to a building, and it looks really big. In the other, the moon is a tiny speckle in the sky. How big is the moon? I would rather understand a text: "its radius is x km".
replies(2): >>43372828 #>>43373088 #
66. veidr ◴[] No.43372828[source]
I think the example is simplified to make its point efficiently, but also: the moon is something whose size would very likely be precisely explained in texts about it. While some hunting journals might brag about the weight of a lion that was killed, or whatever, most texts that I can recall reading about lions basically assumed you already know roughly how big a lion is; which indeed I learned from pictures as a pre-literate child.

A good, precise spec is better than a few pictures, sure; the random text content of whatever training set you can scrape together, perhaps not (?)

67. qoez ◴[] No.43372927[source]
"LeCun has been on about it for a while and it's less controversial these days."

Funny how that sentence could have been used 15 years ago too when he was right about persevering through neural network scepticism.

68. kadushka ◴[] No.43373061{4}[source]
At the end of the day, all modalities boil down to patterns of electrical activity in your brain.
replies(1): >>43373651 #
69. nonameiguess ◴[] No.43373088[source]
Reading "its radius is x km" would mean nothing to you if you'd never experienced spatial extent directly, whether that be visually or just by moving through space and existing in it. You'd need to do exactly what is being said in the paper, read about thousands of other roughly spherical objects and their radii. At some point, you'd get a decent sense of relative sizes.

On the other hand, if you ever simply see a meter stick, any statement that something measures a particular multiple or fraction of that you can already understand, without ever needing to learn the size of anything else.

70. mcculley ◴[] No.43373240[source]
Are people born both blind and deaf incapable of true learning?
replies(1): >>43373330 #
71. kedarkhand ◴[] No.43373296{8}[source]
> I had to read it, and your reaction to the implied obvious reasoning 3 times, to figure out the implied obvious reasoning, and understand your intent was the toddler was standing and the wig was laying in a heap.

It seems quite obvious even on a cursory glance though!

> toddler was standing and the wig was laying in a heap

I mean, how would the toddler be lying in a heap?

> Would a toddler even understand the question?

Maybe not, I am a teen/early adult myself, so not many children yet :) but if you instead lay those in front of a toddler and ask which is higher, I guess they would answer that, another argument for multi-modality.

PS: Sorry if what I am saying is not clear, English is my third language.

72. grayhatter ◴[] No.43373330[source]
There's a whole folklore/meme around how hard it is, in American culture...

But given that blindness and deafness are an impediment to acquiring language, more than anything else, I'd say that's the exact opposite of the conclusions from the comment you're replying to.

But yes, depending on where you set the bar for "true learning" being blind and deaf would prevent it.

I assume you're asking if vision and sound are required for learning, the answer I assume is no. Those were just chosen because we've already invented cameras and microphones. Haptics are less common, and thus less talked about.

73. ◴[] No.43373409[source]
74. kedarkhand ◴[] No.43373437[source]
> non Euclidean space and only using sonar for visualization

Pretty good idea for a video game!

75. rcpt ◴[] No.43373489{3}[source]
I can't see as much color as a mantis shrimp or sense electric fields like a shark but I still think I'm closer to AGI than they are
76. yahoozoo ◴[] No.43373651{5}[source]
The brain is the important part. The electricity just keeps it going. And it’s more than numerical matrices.
replies(1): >>43374485 #
77. mietek ◴[] No.43373811{7}[source]
Those are some strong claims, given that philosophers (e.g. Chalmers vs Dennett) can't even agree on whether subjective experiences exist or not.
replies(1): >>43374410 #
78. fluidcruft ◴[] No.43373812{4}[source]
Language encodes what people need it to encode to be useful. I heard of an example of colors--there are some languages that don't even have a word for blue.

https://blog.duolingo.com/color-words-around-the-world/

79. rkp8000 ◴[] No.43373945{4}[source]
There are a few folks working on this in neuroscience, e.g. training transformers to "decode" neural activity (https://arxiv.org/abs/2310.16046). It's still pretty new and a bit unclear what the most promising path forward is, but will be interesting to see where things go. One challenge that gets brought up a lot is that neuroscience data is often high-dimensional and with limited samples (since it's traditionally been quite expensive to record neurons for extended periods), which is a fairly different regime from the very large data sets typically used to train LLMs, etc.
80. amw-zero ◴[] No.43374002[source]
This exactly describes my intuition as well. Language is limited by its representation, and we have to jam so many bits of information into one dimension of text. It works well enough to have a functioning society, but it’s not very precise.
81. kryogen1c ◴[] No.43374022{3}[source]
> You can encode anything into the "language"

I'm just a layman here, but I don't think this is true. Language is an abstraction, an interpretive mechanism of reality. A reproduction of reality, like a picture, by definition holds more information than its abstraction does.

replies(2): >>43374970 #>>43379020 #
82. im3w1l ◴[] No.43374325{3}[source]
Just because we struggle to verbalize something doesn't mean that it cannot be verbalized. The taste of a specific dish can be broken down into its components. The basic tastes: how sweet, sour, salty, bitter, and savory it is. The smell of it: there are apparently ~400 olfactory receptor types in the nose, so you could describe how strongly each of them is activated. Thermoception: the temperature of the food itself, but also the fake temperature sensations produced by capsaicin and menthol. The mechanoreceptors play a part, detecting both the shape of the food as well as the texture of it. The texture also contributes to a sound sensation as we hear the cracks and pops when we chew. And that is just the static part of it. Food is actually an interactive experience, where all those impressions change and vary over time as the food is chewed.

It is highly complex, but it can all be described.

83. jcims ◴[] No.43374402{4}[source]
There are ‘spiking neural networks’ that operate in a manner that more closely emulates how neurons communicate. One idea I think that is interesting to think about is that we build a neural network that operates in a way that is effectively ‘native’ to our mind, so it’s less like there’s a hidden keyboard and screen in your brain, but that it simply becomes new space you can explore in your mind.

Or learn kung fu.

84. chongli ◴[] No.43374410{8}[source]
Even if you’re a pure Dennettian functionalist you still commit to a functional difference between signals in transit (or at rest) and signals being processed and interpreted. Holding a cassette tape with a recording of a symphony is not the same as hearing the symphony.

Applying this case to AI gives rise to the Chinese Room argument. LLMs’ propensity for hallucinations invite this comparison.

replies(1): >>43374917 #
85. kadushka ◴[] No.43374485{6}[source]
You mean soul?
replies(1): >>43376054 #
86. PeterStuer ◴[] No.43374726{3}[source]
I hear you, but while you can have many layers of semantic obfuscation, no amount of sophistry will allow you to smash your face unharmed through a concrete wall. Reality is a hard mistress.

In the absence of being able to sense reality, postmodernism can run truly unchecked.

87. mietek ◴[] No.43374917{9}[source]
Are LLMs having subjective experiences? Surely not. But if you claim that human subjective experiences are not the result of electrical signals in the brain, then what exactly is your position? Dualism?

Personally, I think the Chinese room argument is invalid. In order for the person in the room to respond to any possible query by looking up the query in a book, the book would need to be infinite and therefore impossible as a physical object. Otherwise, if the book is supposed to describe an algorithm for the person to follow in order to compute a response, then that algorithm is the intelligent entity that is capable of understanding, and the person in the room is merely the computational substrate.

replies(1): >>43375616 #
88. sturza ◴[] No.43374970{4}[source]
A picture is also an abstraction. If you take a picture of a tree, you have more details than the word "tree". What I think the parent is saying is that all the information in a picture of a tree can be encoded in language, for example as a description of a tree using words. Both are abstractions, but if you describe the tree well enough with text (and comprehend the description) it might have the same "value" as a picture (not for a human, but for a machine). Also, the size of the text describing the tree might be smaller than the picture.
replies(1): >>43375873 #
89. CamperBob2 ◴[] No.43375505{3}[source]
Mice can survive, forage, reproduce. Reproduce a mammal. There is a whole load of capability not available in an LLM.

And if it stood for "Large Literal Mouse", that might be a meaningful point. The subject is artificial intelligence, and a brief glance at your newspaper, TV, or nearest window will remind you that it doesn't take intelligence to survive, forage, or reproduce.

The mouse comparison is absurd. You might as well criticize an LLM for being bad at putting out a fire, fixing a flat, or holding a door open.

90. chongli ◴[] No.43375616{10}[source]
The Chinese Room is a perfect analogy for what's going on with LLMs. The book is not infinite, it's flawed. And that's the point: we keep bumping into the rough edges of LLMs with their hallucinations and faulty reasoning because the book can never be complete. Thus we keep getting responses that make us realize the LLM is not intelligent and has no idea what it's saying.

The only part where the book analogy falls down has to do with the technical implementation of LLMs, with their tokenization and their vast sets of weights. But that is merely an encoding for the training data. Books can be encoded similarly by using traditional compression algorithms (like LZMA).

replies(1): >>43379081 #
91. kryogen1c ◴[] No.43375873{5}[source]
> all the information in a picture of a tree can be encoded in language

What words would you write that would as uniquely identify this tree from any other tree in the world, like a picture would?

Now repeat for everything in the picture, like the time of day, weather, dirt on the ground, etc.

92. namaria ◴[] No.43376054{7}[source]
You misspelled strawman
93. numba888 ◴[] No.43376947[source]
Great, but how do you imagine multimodal training with, say, just text and video for simplicity? What will be in the training set? With text, the model tries to predict the next token, and then more steps were added on top. But what do you do with multimodal data?
94. physicsguy ◴[] No.43377444[source]
Hugely different data too?

If you think of all the neurons connected up to vision, touch, hearing, heat receptors, balance, etc. there’s a constant stream of multimodal data of different types along with constant reinforcement learning - e.g. ‘if you move your eye in this way, the scene you see changes’, ‘if you tilt your body this way your balance changes’, etc. and this runs from even before you are born, throughout your life.

95. YeGoblynQueenne ◴[] No.43377793{3}[source]
Tri-modal at best: text, sound and video, and that's it. That's just barely "multi" (it's one more than two).
96. YeGoblynQueenne ◴[] No.43377806[source]
Oh mice can solve a plethora of physics problems before it's time for breakfast. They have to navigate the, well, physical world, after all.

I'm also really curious what benchmarks LLMs have passed that include surviving without being eaten by a cat, or a gull, or an owl, while looking for food to survive and feed one's young in an arbitrary environment chosen from urban, rural, natural etc, at random. What's ChatGPT's score on that kind of benchmark?

replies(2): >>43378896 #>>43380373 #
97. throw310822 ◴[] No.43378896{3}[source]
> mice can solve a plethora of physics problems before it's time for breakfast

Ah really? Which ones? And nope, physical agility is not "solving a physics problem", otherwise soccer players and figure skaters would all have PhDs, which doesn't seem to be the case.

I mean, an automated system that solves equations to keep balance is not particularly "intelligent". We usually call intelligence the ability to solve generic problems, not the ability of a very specialized system to solve the same problem again and again.

replies(1): >>43379409 #
98. og_kalu ◴[] No.43379020{4}[source]
I think his point is that LLMs are pre-trained transformers. And pre-trained transformers are general sequence predictors. Those sequences started out as text or language only but by no means is the architecture constrained to text or language alone. You can train a transformer that embeds and predicts sound and images as well as text.
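
As a toy sketch of that "everything is a token sequence" view -- the codebook sizes and the naive quantization below are illustrative assumptions, not how any real multimodal model tokenizes its inputs:

    import numpy as np

    TEXT_VOCAB = 32000
    AUDIO_BINS = 1024   # crude quantization of waveform samples
    PIXEL_BINS = 256    # 8-bit pixel intensities
    TOTAL_VOCAB = TEXT_VOCAB + AUDIO_BINS + PIXEL_BINS  # one shared vocabulary

    def text_to_tokens(ids):
        return np.asarray(ids)  # already integer IDs

    def audio_to_tokens(waveform):
        # uniform quantization of [-1, 1] samples, offset past the text
        # vocabulary so the ID ranges don't collide
        q = np.clip(((waveform + 1) / 2 * (AUDIO_BINS - 1)).astype(int), 0, AUDIO_BINS - 1)
        return TEXT_VOCAB + q

    def image_to_tokens(pixels):
        # flatten 8-bit pixels row-major, offset past the text + audio ranges
        return TEXT_VOCAB + AUDIO_BINS + pixels.astype(int).ravel()

    stream = np.concatenate([
        text_to_tokens([101, 2009, 2003]),
        audio_to_tokens(np.sin(np.linspace(0, 6.28, 80))),
        image_to_tokens(np.random.randint(0, 256, size=(4, 4))),
    ])
    # 'stream' is now one integer sequence a transformer could be trained to
    # predict autoregressively, exactly as it does with text.

Real systems use learned tokenizers (BPE for text, VQ-style codebooks for audio and images) rather than this naive binning, but the end result the transformer sees is the same kind of flat integer sequence.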
99. og_kalu ◴[] No.43379081{11}[source]
>The book is not infinite, it's flawed.

Oh, and the human book is surely infinite and unflawed, right?

>we keep bumping into the rough edges of LLMs with their hallucinations and faulty reasoning

Both things humans also do in excess

The Chinese Room is nonsensical. Can you point to any part of your brain that understands English ? I guess you are a Chinese Room then.

replies(1): >>43380249 #
100. YeGoblynQueenne ◴[] No.43379409{4}[source]
>> Ah really? Which ones? And nope, physical agility is not "solving a physics problem", otherwise a soccer players and figure skaters would all have PhDs, which doesn't seem to be the case.

Yes, everything that has to do with navigating physical reality, including, but not restricted to physical agility. Those are physics problems that animals, including humans, know how to solve and, very often, we have no idea how to program a computer to solve them.

And you're saying that solving physics problems means you have a PhD? So for example Archimedes did not solve any physics problems otherwise he'd have a PhD?

replies(1): >>43380411 #
101. eli_gottlieb ◴[] No.43379739[source]
>(Energy minimization is a very old idea. LeCun has been on about it for a while and it's less controversial these days. Back when everyone tried to have a probabilistic interpretation of neural models, it was expensive to compute the normalization term / partition function. Energy minimization basically said: Set up a sensible loss and minimize it.)

Ehhhh, energy-based models are trained via contrastive divergence, not just minimizing a simple loss averaged over the training data.
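
For reference, here is a minimal CD-1 sketch for a binary RBM (numpy, with toy sizes and a random stand-in batch, purely for illustration): the update contrasts statistics gathered under the data against statistics under a one-step Gibbs reconstruction, which is not the gradient of a simple loss averaged over the training set.

    import numpy as np

    rng = np.random.default_rng(0)
    n_visible, n_hidden, lr = 784, 64, 0.01
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b = np.zeros(n_visible)   # visible bias
    c = np.zeros(n_hidden)    # hidden bias

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(v0):
        global W, b, c
        # positive phase: hidden units driven by the data
        h0_prob = sigmoid(v0 @ W + c)
        h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
        # negative phase: one Gibbs step down to visible, then up again
        v1_prob = sigmoid(h0 @ W.T + b)
        h1_prob = sigmoid(v1_prob @ W + c)
        # contrast data statistics against reconstruction statistics
        batch = v0.shape[0]
        W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / batch
        b += lr * (v0 - v1_prob).mean(axis=0)
        c += lr * (h0_prob - h1_prob).mean(axis=0)

    v_batch = (rng.random((32, n_visible)) < 0.5).astype(float)  # stand-in data
    cd1_update(v_batch)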

102. bubblyworld ◴[] No.43380234{7}[source]
Hah, I tried it with gpt-4o and got similarly odd results:

https://chatgpt.com/share/67d6fb93-890c-8004-909d-2bb7962c8f...

It's pretty good nonsense though. It suggests clove hitching them together, which would be a weird (and probably unsafe) thing to do even with ropes!

103. chongli ◴[] No.43380249{12}[source]
Humans have the ability to admit when they do not know something. We say “sorry, I don’t know, let me get back to you.” LLMs cannot do this. They either have the right answer in the book or they make up nonsense (hallucinate). And they do not even know which one they’re doing!
replies(1): >>43380799 #
104. CyberDildonics ◴[] No.43380373{3}[source]
Oh, a rock rolling down a hill is, well, navigating the physical world. Is it, well, solving a physics problem?
105. throw310822 ◴[] No.43380411{5}[source]
> Those are physics problems that animals, including humans, know how to solve

No, those are problems that animals and humans solve, not know how to solve. I'm not the greatest expert of biochemistry that ever lived because of what goes on in my cells.

Now, I understand perfectly well the argument that "even small animals do things that our machines cannot do". That's been indisputably true for a long time. Today, it seems to be more a matter of embodiment and speed of processing than a level of intelligence out of our reach. We already have machines that understand natural language perfectly well and display higher cognitive abilities than any other animal -- including abstract reasoning, creating and understanding metaphors, following detailed instructions, writing fiction, etc.

106. og_kalu ◴[] No.43380799{13}[source]
>Humans have the ability to admit when they do not know something.

No not really. It's not even rare that a human confidently says and believes something and really has no idea what he/she's talking about.

>We say “sorry, I don’t know, let me get back to you.” LLMs cannot do this

Yeah they can. And they can do it much better than chance. They just don't do it as well as humans.

>And they do not even know which one they’re doing!

There's plenty of research suggesting otherwise:

https://news.ycombinator.com/item?id=41418486

replies(1): >>43381871 #
107. ◴[] No.43381296{4}[source]
108. chongli ◴[] No.43381871{14}[source]
No not really. It's not even rare that a human confidently says and believes something and really has no idea what he/she's talking about.

Like you’re doing right now? People say “I don’t know” all the time. Especially children. That people also exaggerate, bluff, and outright lie is not proof that people don’t have this ability.

When people are put in situations where they will be shamed or suffer other social stigmas for admitting ignorance then we can expect them to be less than candid.

As for your links to research showing that LLMs do possess the ability of introspection, I have one question: why have we not seen this in consumer-facing tools? Are the LLMs afraid of social stigma?

replies(1): >>43383387 #
109. og_kalu ◴[] No.43383387{15}[source]
>Like you’re doing right now?

Lol Okay

>When people are put in situations where they will be shamed or suffer other social stigmas for admitting ignorance then we can expect them to be less than candid.

Good thing I wasn't talking about that. There's a lot of evidence that human explanations are regularly post-hoc rationalizations they fully believe in. They're not lying to anyone; they just fully believe the nonsense their brain has concocted.

Experiments on choice and preferences https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3196841/

Split Brain Experiments https://www.nature.com/articles/483260a

>As for your links to research showing that LLMs do possess the ability of introspection, I have one question: why have we not seen this in consumer-facing tools? Are the LLMs afraid of social stigma?

Maybe read any of them ? If you weren't interested in evidence to the contrary of your points then you could have just said so and I wouldn't have wasted my time. The 1st and 6th Links make it quite clear current post-training processes hurt calibration a lot.

110. thebigspacefuck ◴[] No.43383623{4}[source]
Like Cortical labs? Neurons integrated on a silicon chip https://corticallabs.com/cl1.html
111. heyjamesknight ◴[] No.43384612{5}[source]
I don't think we're using the term semantic in the same way. I mean "relating to meaning in language."
replies(1): >>43385144 #
112. danielmarkbruce ◴[] No.43385144{6}[source]
The embedding layer in an LLM deals with much more than the meaning. It has to capture syntax, grammar, morphology, style and sentiment cues, phonetic and orthographic relationships, and 500 other things that humans can't even reason about but that exist in word combinations.
replies(1): >>43429400 #
113. pfisch ◴[] No.43392556{7}[source]
A 3-year-old is actually far more similar to an AI than an adult is. 3-year-olds have extremely limited context windows. They will almost immediately forget what happened even 20-30 seconds ago when you play a game like memory with them, and they rarely remember what they ate for breakfast or lunch, or basically any previous event from the same day.

When a 3 year old says "I love you" it is not at all clear that they understand what that means. They frequently mimic phrases they hear/basically statistical next word guessing and obviously don't understand the meaning of what they are saying.

You can even mimic an inner voice for them like Deepseek does for thinking through a problem with a 3 year old and it massively helps them to solve problems.

AI largely acts like a 3 year old with a massive corpus of text floating around in their head compared to the much smaller corpus a 3 year old has.

114. tmaly ◴[] No.43401323[source]
Right now, humans still have enough practice thinking to point out the errors, but what happens when humanity becomes increasingly dependent on LLMs to do this thinking?
115. heyjamesknight ◴[] No.43429400{7}[source]
I'll give you that. I was including those in "semantic space," but the distinction is fair.

My original point still stands: the space you've described cannot capture a full image of human cognition.

116. MITSardine ◴[] No.43440558{3}[source]
Even beyond sensations (which are never described except circumstantially, as in "the taste of chocolate" says nothing of the taste, only of the circumstances in which the sensation is felt), it's very often the case that people don't understand something another person says (typically a work of art) until they have lived the relevant experiences that connect them to the meaning behind it (whatever the medium of communication).
117. MITSardine ◴[] No.43440653[source]
Humans are more adaptable than you think:

- echolocation in blind humans https://en.wikipedia.org/wiki/Human_echolocation

- sight through signals sent on tongue https://www.scientificamerican.com/article/device-lets-blind...

In the latter case, I recall reading that the people involved ended up perceiving these signals as a "first order" sense (not consciously processed information, but something perceived on an intuitive level, like hearing or vision).