
Zamba2-7B (www.zyphra.com)
282 points by dataminer | 1 comment
iamronaldo No.41843139
Not transformer based?
replies(3): >>41843175 >>41843177 >>41843268
1. lhl No.41843268
Since it looks, from the announcement, like the model hasn't changed much, here's the Zamba 1 paper for reference: https://arxiv.org/pdf/2405.16712

Zamba 1 has a single shared attention block that is applied every 6 Mamba blocks. For Zamba 2: "Instead of a single shared attention block, we utilize two shared attention blocks which are interleaved in an ABAB pattern throughout the network."
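To make the layout concrete, here's a rough PyTorch sketch of how I read it (mine, not the authors' code): attention blocks with weights shared across depth, applied every few Mamba layers, with two blocks alternating A/B as in Zamba 2 (Zamba 1 would reuse a single block). The Mamba block is stubbed with an MLP so it runs without the mamba_ssm package, and details from the paper such as the concatenated embedding inputs to the shared block are omitted.

    # Rough sketch only: shared attention inserted every `period` Mamba layers,
    # with two shared blocks reused in an A, B, A, B pattern (Zamba 2 style).
    import torch
    import torch.nn as nn

    class MambaBlockStub(nn.Module):
        """Stand-in for a real Mamba/SSM block (e.g. from mamba_ssm)."""
        def __init__(self, d_model: int):
            super().__init__()
            self.norm = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))

        def forward(self, x):
            return x + self.mlp(self.norm(x))

    class SharedAttentionBlock(nn.Module):
        """One attention block whose parameters are reused at several depths."""
        def __init__(self, d_model: int, n_heads: int = 8):
            super().__init__()
            self.norm = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

        def forward(self, x):
            h = self.norm(x)
            out, _ = self.attn(h, h, h, need_weights=False)
            return x + out

    class HybridStack(nn.Module):
        def __init__(self, d_model: int, n_mamba: int = 12, period: int = 6):
            super().__init__()
            self.mamba_blocks = nn.ModuleList([MambaBlockStub(d_model) for _ in range(n_mamba)])
            # Two shared attention blocks, reused in an A, B, A, B pattern.
            self.shared_attn = nn.ModuleList([SharedAttentionBlock(d_model),
                                              SharedAttentionBlock(d_model)])
            self.period = period

        def forward(self, x):
            calls = 0
            for i, mamba in enumerate(self.mamba_blocks):
                if i % self.period == 0:
                    # Same parameters on every visit; only the A/B choice alternates.
                    x = self.shared_attn[calls % 2](x)
                    calls += 1
                x = mamba(x)
            return x

    if __name__ == "__main__":
        model = HybridStack(d_model=256)
        print(model(torch.randn(2, 16, 256)).shape)  # (batch, seq, d_model)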

Perhaps also of interest: Nvidia released a paper back in June testing hybrid SSM models, and in their small-scale (<1B) experiments they found that making ~8% of layers attention (roughly a 12:1 SSM-to-attention ratio) was optimal. https://research.nvidia.com/publication/2024-06_empirical-st...
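For the ratio, the arithmetic (mine, just to make the ~8% figure concrete) is simply 1 / (ratio + 1):

    # My arithmetic, not a number from either paper: attention fraction
    # implied by a given SSM-to-attention layer ratio.
    for name, ssm_per_attn in [("Zamba-style (1 shared attn per 6 Mamba blocks)", 6),
                               ("Nvidia sweep optimum (~12:1)", 12)]:
        print(f"{name}: {1 / (ssm_per_attn + 1):.1%} attention layers")
    # -> roughly 14.3% and 7.7%, i.e. ~8% for the 12:1 case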

The 8B param/3.5T token model they trained, Mamba2-Hybrid, was also Apache 2.0 licensed: https://huggingface.co/nvidia/mamba2-hybrid-8b-3t-128k