
Zamba2-7B

(www.zyphra.com)
282 points by dataminer | 4 comments
1. iamronaldo
Not transformer based?
2. oatsandsugar
On the page it states:

Our novel shared-attention architecture allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.

so sounds like it is transformer based?
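
(For intuition only, here is a minimal PyTorch sketch of the idea in the quote above: a stack of SSM blocks with a single attention block whose weights are reused wherever it appears. Everything here is my own guess at the shape of the design, not Zyphra's code; MambaBlockStub is a stand-in rather than a real Mamba2 layer, and the sizes, spacing, and residual wiring are assumptions. Causal masking is omitted for brevity.)

    import torch
    import torch.nn as nn

    class MambaBlockStub(nn.Module):
        """Placeholder standing in for a real Mamba2 block."""
        def __init__(self, d):
            super().__init__()
            self.mix = nn.Sequential(nn.Linear(d, 2 * d), nn.SiLU(), nn.Linear(2 * d, d))
        def forward(self, x):
            return x + self.mix(x)

    class SharedAttentionHybrid(nn.Module):
        def __init__(self, d=512, n_mamba=12, attn_every=6, n_heads=8):
            super().__init__()
            self.mamba = nn.ModuleList([MambaBlockStub(d) for _ in range(n_mamba)])
            # One attention block; its weights are reused at every depth where it
            # is applied, so attention adds few extra parameters and most of the
            # parameter budget stays in the (stubbed) Mamba backbone.
            self.shared_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
            self.attn_every = attn_every
        def forward(self, x):
            for i, block in enumerate(self.mamba):
                x = block(x)
                if (i + 1) % self.attn_every == 0:
                    a, _ = self.shared_attn(x, x, x, need_weights=False)  # causal mask omitted
                    x = x + a
            return x

    x = torch.randn(2, 16, 512)                      # (batch, seq, dim)
    print(SharedAttentionHybrid()(x).shape)          # torch.Size([2, 16, 512])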

3. epistasis
Tri Dao and Albert Gu say "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality"

https://arxiv.org/abs/2405.21060

Mamba-2 is used in Zamba2.
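
(A small numeric illustration of the duality in its simplest, unnormalized form, with no softmax and no decay, which is much weaker than the paper's general SSD result: causal linear attention computed as a masked matmul matches the same computation done as a recurrent state update.)

    import numpy as np

    rng = np.random.default_rng(0)
    T, d = 8, 4
    Q, K, V = rng.standard_normal((3, T, d))

    # "Transformer" view: causal (masked) attention matrix applied to V.
    mask = np.tril(np.ones((T, T)))
    y_attn = (mask * (Q @ K.T)) @ V

    # "SSM" view: the same outputs from a recurrent d-by-d state update.
    S = np.zeros((d, d))
    y_rec = np.zeros((T, d))
    for t in range(T):
        S = S + np.outer(K[t], V[t])   # state accumulates k_t v_t^T
        y_rec[t] = Q[t] @ S            # read the state with q_t

    print(np.allclose(y_attn, y_rec))  # True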

4. lhl
Since, judging from the announcement, the model hasn't changed much, here's the Zamba 1 paper for reference: https://arxiv.org/pdf/2405.16712

Zamba 1 has a single shared attention block that is applied every 6 Mamba blocks. For Zamba 2: "Instead of a single shared attention block, we utilize two shared attention blocks which are interleaved in an ABAB pattern throughout the network."
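
(A sketch of the layer schedule only; the block count and the every-6 spacing are carried over from the Zamba 1 description above and may not match Zamba 2 exactly. "A" and "B" are the two shared attention blocks, each reused every time its letter appears, alternating ABAB through the depth.)

    def zamba2_like_schedule(n_mamba=24, attn_every=6):
        layers, turn = [], 0
        for i in range(1, n_mamba + 1):
            layers.append(f"mamba_{i}")
            if i % attn_every == 0:
                layers.append("shared_attn_A" if turn % 2 == 0 else "shared_attn_B")
                turn += 1
        return layers

    print(zamba2_like_schedule())
    # [... 'mamba_6', 'shared_attn_A', ... 'mamba_12', 'shared_attn_B',
    #  ... 'mamba_18', 'shared_attn_A', ... 'mamba_24', 'shared_attn_B']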

Perhaps also of interest: Nvidia released a paper back in June testing hybrid SSM models, and their small-scale (<1B) experiments found that ~8% attention layers (roughly a 12:1 ratio of SSM to attention layers) was optimal. https://research.nvidia.com/publication/2024-06_empirical-st...
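
(Back-of-the-envelope reading of that ratio: one attention layer per ~12 SSM layers works out to roughly the 8% figure. The 52-layer count below is just an example, not from the paper.)

    def attn_positions(total_layers, ssm_per_attn=12):
        # one attention layer after every `ssm_per_attn` SSM layers
        return [i for i in range(total_layers) if (i + 1) % (ssm_per_attn + 1) == 0]

    pos = attn_positions(52)        # 52 layers chosen arbitrarily for the example
    print(pos)                      # [12, 25, 38, 51]
    print(len(pos) / 52)            # ~0.077, i.e. roughly the ~8% mentioned above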

The 8B param/3.5T token model they trained, Mamba2-Hybrid, was also Apache 2.0 licensed: https://huggingface.co/nvidia/mamba2-hybrid-8b-3t-128k