Extremely doubtful that it boils down to quadratic scaling of attention. That whole issue is a leftover from the days of small BERT models with very few parameters.
For large models, compute is very rarely dominated by attention. Take, for example, this FLOPs calculation from https://www.adamcasson.com/posts/transformer-flops
Compute per token = 2(P + L × W × D)
P: total parameters
L: number of layers
W: context size
D: embedding dimension
For Llama 3 8B, the attention term starts to dominate compute per token only past roughly 61k tokens of context.
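A quick back-of-the-envelope sketch of where that ~61k figure comes from, assuming Llama 3 8B's rough config (~8B parameters, 32 layers, d_model = 4096); swap in another model's config to check it yourself:

```python
# FLOPs-per-token estimate following the formula above:
#   compute_per_token ≈ 2 * (P + L * W * D)
# Values below are assumptions for Llama 3 8B.

P = 8.0e9   # total parameters
L = 32      # number of transformer layers
D = 4096    # embedding (model) dimension

def flops_per_token(context_len: int) -> float:
    """FLOPs per token: 2 * (parameter term + attention term)."""
    return 2 * (P + L * context_len * D)

# Context length where the attention term equals the parameter term,
# i.e. L * W * D = P  =>  W = P / (L * D)
crossover = P / (L * D)
print(f"attention term matches parameter term at ~{crossover:,.0f} tokens")

for w in (4_096, 32_768, 131_072):
    total = flops_per_token(w)
    attn_share = (2 * L * w * D) / total
    print(f"W={w:>7,}: {total:.3e} FLOPs/token, attention share {attn_share:.0%}")
```

With those numbers the crossover lands at about 61k tokens; below that, the parameter term dominates and the quadratic attention cost is mostly noise.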