Our novel shared-attention architecture allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.
So it sounds like it's transformer-based?
https://arxiv.org/abs/2405.21060
Mamba-2 is used in Zamba2.
Zamba 1 has a single shared attention block that is applied every 6 Mamba blocks. For Zamba 2: "Instead of a single shared attention block, we utilize two shared attention blocks which are interleaved in an ABAB pattern throughout the network."
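The sharing pattern is easy to sketch. Below is a minimal, hypothetical PyTorch illustration (Mamba2Block, SharedAttentionBlock, and ZambaStyleBackbone are simplified stand-ins I made up, not Zyphra's actual code; the every-6 interval and dimensions are illustrative):

    import torch
    import torch.nn as nn

    class Mamba2Block(nn.Module):
        # Stand-in for a real Mamba-2 block (just a residual projection here).
        def __init__(self, d=512):
            super().__init__()
            self.proj = nn.Linear(d, d)

        def forward(self, x):
            return x + self.proj(x)

    class SharedAttentionBlock(nn.Module):
        # One set of attention weights, reused at many depths.
        def __init__(self, d=512, heads=8):
            super().__init__()
            self.norm = nn.LayerNorm(d)
            self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

        def forward(self, x):
            h = self.norm(x)
            out, _ = self.attn(h, h, h, need_weights=False)
            return x + out

    class ZambaStyleBackbone(nn.Module):
        # n_shared=1 ~ Zamba 1 (one block reused at every insertion point);
        # n_shared=2 ~ Zamba 2 (two blocks interleaved A, B, A, B, ...).
        def __init__(self, n_mamba=24, every=6, n_shared=2, d=512):
            super().__init__()
            self.mamba = nn.ModuleList(Mamba2Block(d) for _ in range(n_mamba))
            self.shared = nn.ModuleList(
                SharedAttentionBlock(d) for _ in range(n_shared))
            self.every = every

        def forward(self, x):
            uses = 0
            for i, block in enumerate(self.mamba):
                x = block(x)
                if (i + 1) % self.every == 0:
                    # Cycle through the shared blocks: ABAB...
                    x = self.shared[uses % len(self.shared)](x)
                    uses += 1
            return x

    x = torch.randn(2, 128, 512)
    y = ZambaStyleBackbone()(x)  # attention weights exist twice, applied 4x

This is also where the "more parameters for the Mamba2 backbone" claim comes from: however deep the stack, only n_shared attention blocks' worth of parameters exist, so the saved budget can go to the Mamba layers.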
Perhaps of relevant interest: Nvidia released a paper back in June testing hybrid SSM models, and their small-scale (<1B) experiments found that making roughly 8% of layers self-attention (about a 12:1 ratio of SSM to attention layers) was optimal. https://research.nvidia.com/publication/2024-06_empirical-st...
The 8B-param/3.5T-token model they trained, Mamba2-Hybrid, was also Apache 2.0 licensed: https://huggingface.co/nvidia/mamba2-hybrid-8b-3t-128k