The Tradeoffs of SSMs and Transformers

1. Herring ◴[08 Jul 25 21:04 UTC] No.44504065[source]▶

I'm a bit bearish on SSMs (and hybrid SSM/transformers) because the leading open weight models (DeepSeek, Qwen, Gemma, Llama) are all transformers. There's just no way none of them tried SSMs.

replies(5): >>44504164 #>>44504299 #>>44504738 #>>44505203 #>>44506694 #

2. visarga ◴[08 Jul 25 21:20 UTC] No.44504164[source]▶

>>44504065 (TP) #

Yes, until there is serious adoption I'm reserved too, on both SSMs and diffusion-based LLMs.

3. nextos ◴[08 Jul 25 21:41 UTC] No.44504299[source]▶

>>44504065 (TP) #

Second-generation LSTMs (xLSTM) do have leading performance on zero-shot time series forecasting: https://arxiv.org/abs/2505.23719.

I think other architectures, aside from the transformer, might lead to SOTA performance, but they remain a bit unexplored.

4. programjames ◴[08 Jul 25 22:51 UTC] No.44504738[source]▶

>>44504065 (TP) #

I mean, everyone is still using variational autoencoders for their latent flow models instead of the information bottleneck. It's because it's cheaper (in founder time) to raise 10(0)x more money than to design your own algorithms and architectures for a novel idea that might work in theory but could be a dead end six months down the line. Just look at LiquidAI. Brilliant idea, but it took them ~5 years to do all the research and another year to get their first models to market... which don't yet seem to be any better than models with a similar compute requirement. I find it pretty plausible that none of the "big" LLM companies seriously tried SSMs, because they already have more than enough money to throw at transformers, or took the quick path to a big valuation.

5. mbowcut2 ◴[09 Jul 25 00:19 UTC] No.44505203[source]▶

>>44504065 (TP) #

I think I agree with you. My only rebuttal would be that it's this kind of thinking that's kept any leading players from trying other architectures in the first place. As far as I know, SOTA for SSMs just doesn't suggest significant enough potential upsides to warrant significant R&D. Not compared to the tried-and-true LLM methods. The decision might be something like: "Pay X to train a competitive LLM" vs "Pay 2X to MAYBE train a competitive SSM".

6. aabhay ◴[09 Jul 25 05:49 UTC] No.44506694[source]▶

>>44504065 (TP) #

As Albert mentioned, the benchmarks and data we use today heavily prioritize recall. Transformers are really, really good at remembering parts of the context.

Additionally, we just don’t have training data at a size and scope that exceeds today’s transformer context lengths. Most training rollouts are fairly information dense. It's not like “look at this camera feed for four hours and tell me what interesting stuff happened”; that kind of data is extremely expensive to generate and train on.
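
To put rough numbers on the recall point above: a transformer keeps keys and values for every past token, so its cache grows linearly with context length, while an SSM compresses the whole history into a fixed-size state. The sketch below is only back-of-the-envelope arithmetic under assumed dimensions. The function names, the layer and head counts, and the simplified state shape (one `d_model x state_dim` state per layer) are all hypothetical and not taken from any specific model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate per-sequence KV cache size for a transformer.

    Each layer stores one key and one value vector per KV head for every
    token seen so far, so the total grows linearly with seq_len.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def ssm_state_bytes(n_layers, d_model, state_dim, bytes_per_elem=2):
    """Approximate per-sequence recurrent state size for an SSM.

    Simplifying assumption: each layer keeps a (d_model x state_dim) state.
    The size does not depend on how many tokens have been processed.
    """
    return n_layers * d_model * state_dim * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical dimensions, chosen only for illustration.
    dims = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    ssm = ssm_state_bytes(n_layers=32, d_model=4096, state_dim=16)

    for seq_len in (1_000, 100_000, 1_000_000):
        kv = kv_cache_bytes(seq_len, **dims)
        print(f"{seq_len:>9} tokens: KV cache {kv / 2**20:10.1f} MiB, "
              f"SSM state {ssm / 2**20:.1f} MiB")
```

The flip side of the fixed state is the one raised in the thread: a growing cache lets the model look up any earlier token exactly, which is what recall-heavy benchmarks reward, while a constant-size state has to decide what to keep.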