(Not GP, but:)
LLMs' initial training is specifically for token-prediction.
However, this doesn't mean that what they end up doing is specifically token-prediction (except in the sense that anything that generates textual output can be described as doing token-prediction). Nor does it mean that the only things they can do are tasks most naturally described in terms of token-prediction.
For instance, suppose you successfully train something to predict the next token given input of the form "[lengthy number] x [lengthy number] = ", where "successfully" means that the system ends up able to predict correctly almost all the time even when the numbers are ones it hasn't seen before. How could it do that? Only by, in some sense, "learning to multiply". (I haven't checked, but my hazy recollection is that somewhere around GPT-3.5 or GPT-4, LLMs went from not being able to do this at all to being able to do it fairly well on moderate-sized numbers.)
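To make that concrete, here is a rough Python sketch of the sort of test I have in mind. The "complete(prompt)" function is purely hypothetical, standing in for however you actually query the model; the point is that a model scoring well on fresh random operands can't just be recalling memorized pairs.

    import random

    def complete(prompt):
        # Hypothetical stand-in for however you actually query the model.
        raise NotImplementedError

    def multiplication_accuracy(trials=100, digits=8):
        correct = 0
        for _ in range(trials):
            # Fresh random operands each time, so these exact pairs almost
            # certainly never appeared in the training data.
            a = random.randrange(10**(digits - 1), 10**digits)
            b = random.randrange(10**(digits - 1), 10**digits)
            if complete(f"{a} x {b} = ").strip() == str(a * b):
                correct += 1
        return correct / trials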
Or suppose you successfully train something to complete things of the form "The SHA256 hash of [lengthy string] is "; again, a system that could do that correctly would have to have, in some sense, "learned to implement SHA256". (I am pretty sure that today's LLMs cannot do this, though of course they might have learned to call out to a tool that can.)
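Again, just as a sketch (with the same hypothetical "complete" function): the target completion is trivial to compute outside the model with hashlib, but for the model to produce it by token-prediction it would have to be running something equivalent to SHA256 internally.

    import hashlib
    import secrets

    def check_sha256_completion(complete):
        # A random input string, so the (string, hash) pair can't simply
        # have been memorized from training data.
        s = secrets.token_hex(32)
        expected = hashlib.sha256(s.encode()).hexdigest()
        return complete(f"The SHA256 hash of {s} is ").strip() == expected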
If you successfully train something to complete things of the form "One grammatical English sentence whose SHA256 hash is [value] is " then that thing has to have "learned to break SHA256". (I am very sure that today's LLMs cannot do this and I think it enormously unlikely that any ever will be able to.)
If you successfully train something to complete things of the form "The complete source code for a program written in idiomatic Rust that does [difficult task] is " then that thing has to have "learned to write code in Rust". (Today's LLMs can kinda do some tasks like this, and there are a lot of people yelling at one another about just how much they can do.)
That is: some token-prediction tasks can only be accomplished by doing things that we would not normally think of as being about token prediction. This is essentially the point of the "Turing test".
For the avoidance of doubt, I am making no particular claims (beyond the illustrative ones explicitly made above) about what if anything today's LLMs, or plausible near-future LLMs, or other further-future AI systems, are able to do that goes beyond what we would normally think of as token prediction. The point is that whether or not today's LLMs are "just stochastic parrots" in some useful sense, it doesn't follow from the fact that they are trained on token-prediction that that's all they are.