(goombalab.github.io)

66 points jxmorris12 | 1 comments | 08 Jul 25 19:12 UTC

Herring ◴[08 Jul 25 21:04 UTC] No.44504065[source]▶

I'm a bit bearish on SSMs (and hybrid SSM/transformers) because the leading open-weight models (DeepSeek, Qwen, Gemma, Llama) are all transformers. There's just no way none of them tried SSMs.

1. aabhay ◴[09 Jul 25 05:49 UTC] No.44506694[source]▶

>>44504065 #

As Albert mentioned, the benchmarks and data we use today heavily prioritize recall. Transformers are really, really good at remembering parts of the context.

Additionally, we just don’t have training data at a size and scope that exceeds today’s transformer context lengths. Most training rollouts are fairly information dense. It’s not like “look at this camera feed for four hours and tell me what interesting stuff happened”; that kind of data is extremely expensive to generate and train on.

The Tradeoffs of SSMs and Transformers