
113 points sethkim | 1 comment
cmogni1 No.44457774
The article does a great job of highlighting the core disconnect in the LLM API economy: linear pricing for a service with non-linear, quadratic compute costs. The traffic analogy is an excellent framing.

One addition: the O(n^2) compute cost is most acute during the one-time prefill of the input prompt. I think the real bottleneck, however, is the KV cache during the decode phase.
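To put rough numbers on the quadratic part, here's a back-of-envelope sketch in Python; the hidden width and layer count below are illustrative assumptions, not any particular model's:

    # Back-of-envelope: attention-score FLOPs in prefill grow with the
    # square of prompt length. Dimensions are illustrative assumptions.
    D_MODEL = 8192   # assumed hidden width
    N_LAYERS = 80    # assumed layer count

    def prefill_attention_flops(n_tokens: int) -> float:
        # QK^T plus attention-weighted V: ~4 * n^2 * d FLOPs per layer.
        return 4.0 * n_tokens**2 * D_MODEL * N_LAYERS

    for n in (8_192, 131_072):
        print(f"{n:>7} tokens: {prefill_attention_flops(n):.1e} attention FLOPs")
    # A 16x longer prompt costs 256x the attention FLOPs.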

For each new token generated, the model must access the intermediate state of all previous tokens. That state is held in the KV cache, which grows linearly with sequence length and consumes an enormous amount of expensive GPU VRAM. Generation speed is therefore limited more by memory bandwidth than by compute.
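For a sense of scale, here is a minimal sizing sketch assuming Llama-2-70B-like dimensions (80 layers, 8 grouped-query KV heads, head dim 128, fp16); real deployments will differ:

    # The KV cache stores one K and one V vector per token, per layer.
    # All dimensions are assumptions (roughly Llama-2-70B with GQA).
    N_LAYERS = 80       # assumed
    N_KV_HEADS = 8      # assumed (grouped-query attention)
    HEAD_DIM = 128      # assumed
    BYTES_PER_ELEM = 2  # fp16

    def kv_cache_bytes(seq_len: int) -> int:
        # 2x for K and V; note the growth is linear in sequence length.
        return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * seq_len

    for n in (4_096, 32_768, 131_072):
        print(f"{n:>7} tokens: {kv_cache_bytes(n) / 2**30:.2f} GiB")
    # ~320 KiB per token: a 128k-token prompt pins ~40 GiB of VRAM
    # per request for the entire generation, however short the output.

Every decode step has to stream that cache through the GPU's memory system, which is why tokens/sec drops as contexts grow.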

Viewed this way, Google's 2x price hike on input tokens is probably related to the KV cache, which supports the article’s “workload shape” hypothesis. A long input prompt creates a huge memory footprint that must be held for the entire generation, even if the output is short.

replies(1): >>44459241
1. trhway No.44459241
That obviously should and will be fixed architecturally.

>For each new token generated, the model must access the intermediate state of all previous tokens.

Not all previous tokens are equal; not all deserve the same attention, so to speak. The farther away a token is, the more opportunity there is to prune it or collapse it with other similarly distant, less meaningful tokens in the given context. So instead of O(n^2) the cost would be more like O(n log n).
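A toy sketch of that idea (purely illustrative, not any production scheme): keep a dense window of recent tokens and retain older ones at exponentially growing strides, so the retained set grows roughly logarithmically past the window:

    # Illustrative distance-based KV pruning: dense recent window plus
    # exponentially strided older positions. Not a production scheme.
    def retained_positions(n: int, window: int = 512) -> list[int]:
        keep = set(range(max(0, n - window), n))  # dense recent window
        pos, stride = n - window, 1
        while pos > 0:
            keep.add(pos)
            pos -= stride
            stride *= 2   # double the stride as tokens get older
        return sorted(keep)

    for n in (1_000, 100_000, 1_000_000):
        kept = len(retained_positions(n))
        print(f"{n:>9} tokens -> attend over {kept} (vs {n} dense)")
    # The retained count grows ~log(n) past the window, so total
    # attention work over a generation looks closer to O(n log n).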

I mean, you'd expect that, for example, "knowledge worker" models (vs., say, "poetry" models) would possess some perturbative stability w.r.t. changes to or pruning of distant previous tokens, at least for those tokens that are less meaningful in the current context.

Personally, I feel the situation is good: performance engineering becomes valuable again as we reach the N where O(n^2) forces management to throw money at engineers instead of at the hardware :)