
170 points PaulHoule | 1 comment
Scene_Cast2 ◴[] No.45118686[source]
The paper is hard to read. There is no concrete worked-through example, the prose is over the top, and the equations don't really help. I can't make head or tail of this paper.
replies(3): >>45118775 #>>45119154 #>>45120083 #
lumost ◴[] No.45118775[source]
This appears to be a position paper written by authors outside of their core field. The presentation of "the wall" is made only through an analogy to derivatives over the discrete values that computers operate on.
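
If I had to guess at what that analogy amounts to (the paper doesn't spell it out, so this is my reading, not its construction): a finite difference taken over quantized values stops behaving like a derivative once the step size approaches the quantum. A toy sketch, with an arbitrary quantum of 0.25:

    # Toy sketch (my guess at the analogy, not the paper's construction): a
    # finite-difference "derivative" over quantized values, with quantum q.
    def quantize(x: float, q: float = 0.25) -> float:
        """Round x to the nearest multiple of the quantum q."""
        return round(x / q) * q

    def finite_diff(f, x: float, h: float, q: float = 0.25) -> float:
        """Forward difference of the quantized function f at x with step h."""
        return (quantize(f(x + h), q) - quantize(f(x), q)) / h

    f = lambda x: x * x
    for h in (1.0, 0.5, 0.25, 0.1, 0.01):
        # true derivative at x=1 is 2.0; the estimate degrades once h nears q
        print(h, finite_diff(f, 1.0, h))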
replies(2): >>45119119 #>>45119709 #
joe_the_user ◴[] No.45119119[source]
The paper seems to involve a series of analogies and equations. However, I think that if the equations are accepted, the "wall" is actually derived.

The authors are computer scientists and people who work with large-scale dynamical systems. They aren't people who've actually produced an industry-scale LLM. However, I have to note that despite lots of practical progress in deep learning/transformers/etc., all the theory involved is just analogies and equations of a similar sort; it's all alchemy. The people who are really good at producing these models seem to be using a bunch of effective rules of thumb, not any complete or established theory (despite books claiming to offer a mathematical foundation for the enterprise, etc.).

Which is to say, "outside of core competence" doesn't mean as much as it would for medicine or something.

replies(2): >>45119694 #>>45127357 #
lumost ◴[] No.45127357[source]
I will venture my 2 cents: the equations kinda sorta look like something, but they in no way approach a derivation of the wall. Specifically, I would have looked for a derivation which proved, for one of (or all of) the following:

1. Sequence models relying on a Markov chain, with and without summarization to extend beyond fixed-length horizons.
2. All forms of attention mechanisms/dense layers.
3. A specific Transformer architecture.

That there exists a limit on the representational or predictive power of the model, either for tasks of arbitrary input/output token lengths or for a fixed N input tokens and M output tokens, *based on* a derived cost-growth schedule for model size, data size, and compute budget.
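
To make the flavor concrete, here is a sketch of the kind of cost-growth argument I mean (mine, not the paper's): a Chinchilla-style parametric loss L(N, D) = E + A/N^alpha + B/D^beta with a compute budget C ~ 6*N*D gives an explicit schedule in model size N and data size D, plus a floor E that no amount of compute removes. The constants below are roughly the published Chinchilla fits, used purely for illustration; a real derivation of "the wall" would have to establish something like them for the model classes above.

    # Illustrative sketch of a cost-growth schedule (constants are roughly the
    # published Chinchilla fits, not anything derived in this paper):
    # loss(N, D) = E + A/N**alpha + B/D**beta, with compute C ~ 6*N*D.

    E, A, B = 1.69, 406.4, 410.7      # irreducible loss and fit coefficients (assumed)
    alpha, beta = 0.34, 0.28          # fit exponents (assumed)

    def loss(N: float, D: float) -> float:
        """Parametric loss for N parameters trained on D tokens."""
        return E + A / N**alpha + B / D**beta

    def best_loss_at_compute(C: float, grid: int = 2000) -> float:
        """Brute-force the best achievable loss when compute C ~ 6*N*D is fixed."""
        best = float("inf")
        for i in range(1, grid):
            N = 10 ** (6 + 6 * i / grid)   # sweep N from ~1e6 to ~1e12 parameters
            D = C / (6 * N)                # the rest of the budget buys tokens
            if D >= 1:
                best = min(best, loss(N, D))
        return best

    for C in (1e20, 1e22, 1e24, 1e26):
        print(f"C={C:.0e}  best loss ~ {best_loss_at_compute(C):.3f}")
    # Each 100x of compute buys a shrinking improvement; the loss approaches E.

A sweep like this makes the diminishing returns explicit; what I missed in the paper is a proof that an analogous floor exists for the model classes listed above.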

Separately, I would have expected a clear literature review of the existing mathematical studies of LLM capabilities and limitations - of which there are *many* - including studies that purport to show that Transformers can represent any program of finite, pre-determined execution length.
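
(As I understand those results, the core trick is unrolling: if the program's step count T is fixed ahead of time, it can be written as T copies of one transition function applied in sequence, which is the same shape as a network with a fixed number of layers. A toy, non-Transformer sketch of that shape, with made-up names:

    # Toy illustration of the unrolling idea (hypothetical names, not taken
    # from any of those papers): a program whose step count T is fixed in
    # advance becomes T copies of one transition function, the same shape as a
    # network with a fixed number of layers.

    from typing import Tuple

    State = Tuple[int, int]          # (program counter, accumulator) of a toy machine

    def step(state: State) -> State:
        """One transition: add the program counter to the accumulator, advance."""
        pc, acc = state
        return pc + 1, acc + pc

    def unrolled(initial: State, T: int) -> State:
        """Apply the transition T times, like stacking T fixed layers."""
        s = initial
        for _ in range(T):           # T is pre-determined, like model depth
            s = step(s)
        return s

    print(unrolled((0, 0), 5))       # (5, 10): sums 0..4 in exactly 5 unrolled steps

The interesting question is what happens when T is not known in advance, which is where any claimed limit would have to bite.)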