
66 points jxmorris12 | 1 comment
Herring No.44504065
I'm a bit bearish on SSMs (and hybrid SSM/transformer models) because the leading open-weight models (DeepSeek, Qwen, Gemma, Llama) are all transformers. There's just no way none of them tried SSMs.
replies(5): >>44504164 >>44504299 >>44504738 >>44505203 >>44506694
1. aabhay No.44506694
As Albert mentioned, the benchmarks and data we use today heavily prioritize recall, and transformers are really, really good at remembering specific parts of the context.
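To make that concrete, here is a minimal toy sketch (my own illustration, not anything from the thread or a real model) of why recall favors attention: a transformer keeps every past token around in its KV cache, while an SSM folds the whole history into a fixed-size state. The dimensions and the scalar state-update rule below are assumptions chosen only for illustration.

    # Toy contrast: growing KV cache vs. fixed-size recurrent state.
    # Shapes and the scalar A, B update are illustrative assumptions.
    import numpy as np

    d, T = 64, 1000                        # hidden size, sequence length (toy values)
    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((T, d))

    # Transformer-style memory: the KV cache grows with the sequence,
    # so a query can retrieve any earlier token essentially verbatim.
    kv_cache = tokens                      # shape (T, d) -- O(T) memory
    query = tokens[123]                    # "recall token 123"
    scores = kv_cache @ query
    recalled = kv_cache[scores.argmax()]   # exact token is still available

    # SSM-style memory: one state vector of shape (d,) is updated
    # recurrently, so all T tokens are compressed into O(1) memory.
    A, B = 0.99, 0.01                      # toy scalar state-space parameters
    state = np.zeros(d)
    for x in tokens:
        state = A * state + B * x          # h_t = A * h_{t-1} + B * x_t

    print(np.allclose(recalled, tokens[123]))   # True: exact recall from the cache
    print(np.linalg.norm(state - tokens[123]))  # large: token 123 is blurred into the state

Recall-heavy benchmarks reward the first behavior directly, which is part of why pure SSMs look worse on them even when their long-sequence efficiency is better.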

Additionally, we just don’t have training data at the size and scope that exceeds today’s transformer context lengths. Most training rollouts are fairly information-dense. It’s not like “look at this camera feed for four hours and tell me what interesting stuff happened”; that kind of data is extremely expensive to generate and train on.