170 points PaulHoule | 1 comment
Animats No.45121110
That article is weird. They seem obsessed with nuclear reactors. Also, they misunderstand how floating point works.

As one learns in high school, the continuous derivative is the limit of the discrete version as the displacement h is sent to zero. If our computers could afford infinite precision, this statement would be as good in practice as it is in continuum mathematics. But no computer can afford infinite precision; in fact, the standard double-precision IEEE representation of floating-point numbers offers an accuracy of around 16 digits, meaning that relative differences below 10^-16 are basically treated as pure noise. This means that upon sending the displacement h below machine precision, the discrete derivatives start to diverge from the continuum value, as roundoff errors then dominate the discretization errors.
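
A minimal sketch of that tradeoff (my own toy example, not from the article): the forward-difference error for sin at x = 1 shrinks as h shrinks, until roundoff takes over around h ≈ 1e-8 (roughly the square root of machine epsilon) and the error starts growing again.

    # Toy illustration: discretization error vs. roundoff error
    # for a forward difference of f(x) = sin(x) at x = 1.0.
    import math

    f, x, exact = math.sin, 1.0, math.cos(1.0)
    for k in range(1, 17):
        h = 10.0 ** (-k)                      # shrink the displacement
        approx = (f(x + h) - f(x)) / h        # forward difference
        print(f"h=1e-{k:02d}  error={abs(approx - exact):.2e}")

The printed error falls until about h = 1e-8, then rises, which is the divergence the quoted passage describes.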

Yes, differentiating data has a noise problem. This is where gradient followers sometimes get stuck. A low-pass filter can help by smoothing the data so the derivatives are less noisy. But is that relevant to LLMs? A big insight in machine learning optimization was that, in a high-dimensional space, there's usually some dimension with a significant signal, which gets you out of local minima. Most machine learning happens in high-dimensional spaces, but with low-resolution data points.
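
A quick sketch of the low-pass idea (my own example, assuming NumPy; a moving average stands in for a proper filter): differentiate noisy samples directly and after smoothing, and compare how noisy the derivatives are.

    # Toy sketch: smooth noisy samples with a moving-average (crude low-pass)
    # filter before differentiating, then compare derivative noise.
    import numpy as np

    x = np.linspace(0, 2 * np.pi, 1000)
    noisy = np.sin(x) + np.random.normal(0, 0.01, x.size)

    kernel = np.ones(25) / 25                      # simple moving-average low-pass
    smoothed = np.convolve(noisy, kernel, mode="same")

    d_raw = np.gradient(noisy, x)                  # derivative of raw samples
    d_smooth = np.gradient(smoothed, x)            # derivative after smoothing

    print("raw derivative std:     ", d_raw.std())
    print("smoothed derivative std:", d_smooth.std())

Even tiny amplitude noise blows up under differencing (it gets divided by the small step size), while the smoothed version stays close to the true cos(x).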

replies(3): >>45121287 >>45122475 >>45124430
1. JohnKemeny No.45124430
They are physicists and will therefore talk about things that make sense to them. What they aren't: computational linguists and deep learning experts.