170 points by PaulHoule | 1 comment
Animats No.45121110
That article is weird. They seem obsessed with nuclear reactors. Also, they misunderstand how floating point works.

  > As one learns at high school, the continuous derivative is the limit of the discrete version as the displacement h is sent to zero. If our computers could afford infinite precision, this statement would be equally good in practice as it is in continuum mathematics. But no computer can afford infinite precision; in fact, the standard double-precision IEEE representation of floating-point numbers offers an accuracy around the 16th digit, meaning that numbers below 10^-16 are basically treated as pure noise. This means that upon sending the displacement h below machine precision, the discrete derivatives start to diverge from the continuum value, as roundoff errors then dominate the discretization errors.
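A minimal sketch of the crossover the quoted passage describes (the choice of f = sin and x = 1 is mine, purely for illustration): the forward-difference error first shrinks with h as truncation error (~h) falls, then grows again once roundoff (~1e-16 / h) takes over, bottoming out near h ~ 1e-8.

    import math

    x = 1.0
    exact = math.cos(x)  # true derivative of sin at x = 1

    # Sweep h from 1e-1 down to 1e-16 and watch the error dip, then climb.
    for k in range(1, 17):
        h = 10.0 ** -k
        approx = (math.sin(x + h) - math.sin(x)) / h
        print(f"h = 1e-{k:02d}  error = {abs(approx - exact):.3e}")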

Yes, differentiating data has a noise problem. This is where gradient followers sometimes get stuck. A low-pass filter can help by smoothing the data so the derivatives are less noisy. But is that relevant to LLMs? A big insight in machine learning optimization was that, in a high-dimensional space, there's usually some dimension with a significant signal, which gets you out of local minima. Most machine learning happens in high-dimensional spaces but with low-resolution data points.
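A minimal sketch of the smoothing idea (the sin(t) test signal, noise level, and 11-sample moving average are arbitrary illustration choices, not anything from the comment):

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 2.0 * np.pi, 1000)
    dt = t[1] - t[0]
    y = np.sin(t) + rng.normal(scale=0.01, size=t.size)  # noisy samples

    raw_deriv = np.gradient(y, dt)                 # differentiate the raw data
    kernel = np.ones(11) / 11.0                    # crude moving-average low-pass filter
    y_smooth = np.convolve(y, kernel, mode="same")
    smooth_deriv = np.gradient(y_smooth, dt)       # differentiate the filtered data

    true_deriv = np.cos(t)
    inner = slice(50, -50)  # skip edges distorted by the convolution
    rms = lambda e: float(np.sqrt(np.mean(e ** 2)))
    print("rms error, raw derivative:     ", rms(raw_deriv[inner] - true_deriv[inner]))
    print("rms error, smoothed derivative:", rms(smooth_deriv[inner] - true_deriv[inner]))

Even this crude filter cuts the derivative error substantially; the trade-off is that heavier smoothing also blurs genuine features of the signal.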

godelski No.45122475

  > A big insight in machine learning optimization was that
I think the big insight was how useful this low-order method still is. I think many people don't appreciate how new the study of high-dimensional mathematics (let alone high-dimensional statistics) actually is. I mean, metric space theory didn't really get started until the early 1900s. The big reason these systems are still mostly black boxes is that we still have a long way to go when it comes to understanding these spaces.

But I think it is worth mentioning that low-order approximations can still lock you out of different optima. While I agree the (Latent) Manifold Hypothesis pretty likely applies to many problems, this doesn't change the fact that even relatively low-dimensional spaces (like 10D) are quite complex and have lots of unintuitive properties. With topics like language and images, I think it is safe to say that these still require operating in high dimensions, where you're still going to have to contend with the complexities of concentration of measure (an idea from the 1970s).
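A small numerical sketch of concentration of measure (the point count and dimensions are arbitrary choices for illustration): pairwise distances between uniform random points in [0, 1]^d bunch ever more tightly around their mean as d grows, so contrasts like "nearest" versus "farthest" neighbour wash out.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200  # number of random points per trial

    for d in (2, 10, 100, 10_000):
        pts = rng.random((n, d))  # n uniform points in the unit cube [0, 1]^d
        # Pairwise distances via the identity |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
        sq = (pts ** 2).sum(axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * pts @ pts.T
        dists = np.sqrt(np.clip(d2, 0.0, None))[np.triu_indices(n, k=1)]
        print(f"d = {d:6d}  mean = {dists.mean():8.3f}  std/mean = {dists.std() / dists.mean():.4f}")

The relative spread (std/mean) falls roughly like 1/sqrt(d), which is exactly the kind of unintuitive behaviour described above.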

Still, I don't think anyone expected things to work out as well as they have. If anything, I think it is more surprising we haven't run into issues earlier! I think there are still some pretty grand problems left for AI/ML. Personally, this is why I push back against much of the hype. The hype machine is fine if the end is in sight, but a hype machine creates a bubble. The gamble is whether you can fill the bubble before it pops; if it pops first, it all comes crashing down. It's been a very hot summer, but I'm worried the hype will lead to a winter. I'd rather have had a longer summer than a hotter summer followed by a winter.