I think I agree with you. My only rebuttal would be that it's this kind of thinking that's kept any leading players from trying other architectures in the first place. As far as I know, the SOTA for SSMs just doesn't suggest significant enough potential upside to warrant significant R&D, not compared to the tried-and-true, established LLM methods. The decision might be something like: "Pay X to train a competitive LLM" vs. "Pay 2X to MAYBE train a competitive SSM".