The wall confronting large language models

(arxiv.org)

170 points PaulHoule | 2 comments | 03 Sep 25 11:40 UTC | HN request time: 0s | source

Show context

measurablefunc ◴[03 Sep 25 20:29 UTC] No.45120049[source]▶

There is a formal extensional equivalence between Markov chains & LLMs but the only person who seems to be saying anything about this is Gary Marcus. He is constantly making the point that symbolic understanding can not be reduced to a probabilistic computation regardless of how large the graph gets it will still be missing basic stuff like backtracking (which is available in programming languages like Prolog). I think that Gary is right on basically all counts. Probabilistic generative models are fun but no amount of probabilistic sequence generation can be a substitute for logical reasoning.

replies(16): >>45120249 #>>45120259 #>>45120415 #>>45120573 #>>45120628 #>>45121159 #>>45121215 #>>45122702 #>>45122805 #>>45123808 #>>45123989 #>>45125478 #>>45125935 #>>45129038 #>>45130942 #>>45131644 #

Anon84 ◴[03 Sep 25 21:31 UTC] No.45120628[source]▶

>>45120049 #

There definitely is, but Marcus is not the only one talking about it. For example, we covered this paper in one of our internal journal clubs a few weeks ago: https://arxiv.org/abs/2410.02724

replies(1): >>45122852 #

godelski ◴[04 Sep 25 02:31 UTC] No.45122852[source]▶

>>45120628 #

I just want to highlight this comment and stress how big of a field ML actually is. I think even much bigger than most people in ML research even know. It's really unfortunate that the hype has grown so much that even in the research community these areas are being overshadowed and even dismissed[0]. It's been interesting watching this evolution and how we're reapproaching symbolic reasoning while avoiding that phrase.

There's lots of people doing theory in ML and a lot of these people are making strides which others stand on (ViT and DDPM are great examples of this). But I never expect these works to get into the public eye as the barrier to entry tends to be much higher[1]. But they certainly should be something more ML researchers are looking at.

That is to say: Marcus is far from alone. He's just loud

[0] I'll never let go how Yi Tay said "fuck theorists" and just spent his time on Twitter calling the KAN paper garbage instead of making any actual critique. There seems to be too many who are happy to let the black box remain a black box because low level research has yet to accumulate to the point it can fully explain an LLM.

[1] You get tons of comments like this (the math being referenced is pretty basic, comparatively. Even if more advanced than what most people are familiar with) https://news.ycombinator.com/item?id=45052227

replies(2): >>45124050 #>>45125796 #

calf ◴[04 Sep 25 06:02 UTC] No.45124050[source]▶

>>45122852 #

I hazard to imagine that LLMs are a special subset of Markov chains, and this subset has interesting properties; it seems a bit reductive to dismiss LLMs as "merely' Markov chains. It's what we can do with this unusual subset (e.g. maybe incorporate in a larger AI system) that is the interesting question.

replies(1): >>45124319 #

measurablefunc ◴[04 Sep 25 06:47 UTC] No.45124319[source]▶

>>45124050 #

You don't have to imagine, there is a logically rigorous argument¹ that establishes the equivalence. There is also nothing unusual about neural networks or Markov chains. You've just been mystified by the marketing around them so you think there is something special about them when they're just another algorithm for approximating different kinds of compressible signals & observations about the real world.

¹https://markov.dk.workers.dev/

replies(1): >>45133873 #

calf ◴[05 Sep 25 00:45 UTC] No.45133873[source]▶

>>45124319 #

I'll have you realize you are replying (quite arrogantly by the way) to someone who wrote part of their PhD dissertation on models of computation. Try again :)

Besides, it is patently false. Not every Markov chain is an LLM, an actual LLM outputs human-readable English, while the vast majority of Markov chains do not map onto that set of models.

replies(1): >>45134747 #

measurablefunc ◴[05 Sep 25 03:29 UTC] No.45134747[source]▶

>>45133873 #

Appeals to authority do not change the logical content of an argument. You are welcome to point to the part of the linked argument that is incorrect & present a counter-example to demonstrate the error.

replies(1): >>45136257 #

godelski ◴[05 Sep 25 08:23 UTC] No.45136257[source]▶

>>45134747 #

Calf isn't making an appeal to authority. They are saying "I'm not the idiot you think I am." Two very different things. Likely also a request to talk more mathy to them.

I read your link btw and I just don't know how someone can do all that work and not establish the Markov Property. That's like the first step. Speaking of which, I'm not sure I even understand the first definition of your link. I've never heard the phrase "computably countable" before, but I have head "computable number," which these numbers are countable. This does seem to be what it is referring to? So I'll assume that? (My dissertation wasn't on models of computation, it was on neural architectures) In 1.2.2 is there a reason for strictly uniform noise? It also seems to run counter to the deterministic setting.

Regardless, I agree with Calf, it's very clear MCs are not equivalent to LLMs. That is trivially a false statement. But the question of if an LLM can be represented via a MC is a different question. I did find this paper on the topic[0], but I need to give it a better read. Does look like it was rejected from ICLR[1], though ML review is very noisy. Including the link as comments are more informative than the accept/reject signal.

(@Calf, sorry, I didn't respond to your comment because I wasn't trying to make a comment about the relationship of LLMs and MCs. Only that there was more fundamental research being overshadowed)

[0] https://arxiv.org/abs/2410.02724

[1] https://openreview.net/forum?id=RDFkGZ9Dkh

replies(1): >>45142605 #

measurablefunc ◴[05 Sep 25 19:26 UTC] No.45142605[source]▶

>>45136257 #

If it's trivially false then you should be able to present a counter-example but so far no one has done that but there has been a lot of hand-waving about "trivialities" of one sort or another.

Neural networks are stateless, the output only depends on the current input so the Markov property is trivially/vacuously true. The reason for the uniform random number for sampling from the CDF¹ is b/c if you have the cumulative distribution function of a probability density then you can sample from the distribution by using a uniformly distributed RNG.

¹https://stackoverflow.com/questions/60559616/how-to-sample-f...

replies(1): >>45147349 #

godelski ◴[06 Sep 25 07:32 UTC] No.45147349[source]▶

>>45142605 #

You want me to show that it is trivially false that all Neural Networks are not Markov Chains? I mean we could point to a RNN which doesn't have the Markov Property. I mean another trivial case is when the rows do not sum to 1. I mean the internal states of neural networks are not required to be probability distributions. In fact, this isn't a requirement anywhere in a neural network. So whatever you want to call the transition matrix you're going to have issues.

Or the inverse of this? That all Markov Chains are Neural Networks? Sure. Well sure, here's my transition matrix [1].

I'm quite positive an LLM would be able to give you more examples.

  > the output only depends on the current input so the Markov property is trivially/vacuously true.

It's pretty clear you did not get your PhD in ML.

  > The reason for the uniform random number

I think you're misunderstanding. Maybe I'm misunderstanding. But I'm failing to understand why you're jumping to the CDF. I also don't understand why this answers my question since there are other ways to sample from a distribution knowing only its CDF and without using the uniform distribution. I mean you can always convert to the uniform distribution and there's lots of tricks to do that. Or I mean the distribution in that SO post is the Rayleigh Distribution so we don't even need to do that. My question was not about that uniform is clean, but that it is a requirement. But this just doesn't seem relevant at all.

replies(1): >>45151801 #

measurablefunc ◴[06 Sep 25 18:39 UTC] No.45151801[source]▶

>>45147349 #

Either find the exact error in the proof or stop running around in circles. The proof is very simple so if there is an error in any of it you should be able to find one very easily but you haven't done that. You have only asked for unrelated clarifications & gone on unrelated tangents.

replies(1): >>45161734 #

1. godelski ◴[07 Sep 25 20:15 UTC] No.45161734[source]▶

>>45151801 #

  > Either find the exact error in the proof

I think I did

  > You have only asked for unrelated clarifications & gone on unrelated tangents.

I see the problem...

replies(1): >>45161979 #

2. measurablefunc ◴[07 Sep 25 20:47 UTC] No.45161979[source]▶

>>45161734 (TP) #

> I see the problem

That's great, so you should be able to spell out the error & why it is an error. Go ahead.

↑