I'm a bit bearish on SSMs (and hybrid SSM/transformers) because the leading open weight models (DeepSeek, Qwen, Gemma, Llama) are all transformers. There's just no way none of them tried SSMs.
replies(5):
Additionally, we just don’t have training data at the size and scope that exceeds today’s transformer context lengths. Most training rollouts are fairly information dense. Its not like “look at this camera feed for four hours and tell me what interesting stuff happened”, those are extremely expensive data to generate and train on.