Our novel shared-attention architecture allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.
So it sounds like it's transformer-based?
https://arxiv.org/abs/2405.21060
Mamba-2 is used in Zamba2.
Zamba 1 has a single shared attention block that is applied every 6 Mamba blocks. For Zamba 2: "Instead of a single shared attention block, we utilize two shared attention blocks which are interleaved in an ABAB pattern throughout the network."
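The sharing pattern is easy to sketch. Below is a minimal, hypothetical PyTorch illustration (Mamba2Block, SharedAttentionBlock, and ZambaStyleBackbone are simplified stand-ins I made up, not Zyphra's actual code; the every-6 interval and dimensions are illustrative):

    import torch
    import torch.nn as nn

    class Mamba2Block(nn.Module):
        # Stand-in for a real Mamba-2 block (just a residual projection here).
        def __init__(self, d=512):
            super().__init__()
            self.proj = nn.Linear(d, d)

        def forward(self, x):
            return x + self.proj(x)

    class SharedAttentionBlock(nn.Module):
        # One set of attention weights, reused at many depths.
        def __init__(self, d=512, heads=8):
            super().__init__()
            self.norm = nn.LayerNorm(d)
            self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

        def forward(self, x):
            h = self.norm(x)
            out, _ = self.attn(h, h, h, need_weights=False)
            return x + out

    class ZambaStyleBackbone(nn.Module):
        # n_shared=1 ~ Zamba 1 (one block reused at every insertion point);
        # n_shared=2 ~ Zamba 2 (two blocks interleaved A, B, A, B, ...).
        def __init__(self, n_mamba=24, every=6, n_shared=2, d=512):
            super().__init__()
            self.mamba = nn.ModuleList(Mamba2Block(d) for _ in range(n_mamba))
            self.shared = nn.ModuleList(
                SharedAttentionBlock(d) for _ in range(n_shared))
            self.every = every

        def forward(self, x):
            uses = 0
            for i, block in enumerate(self.mamba):
                x = block(x)
                if (i + 1) % self.every == 0:
                    # Cycle through the shared blocks: ABAB...
                    x = self.shared[uses % len(self.shared)](x)
                    uses += 1
            return x

    x = torch.randn(2, 128, 512)
    y = ZambaStyleBackbone()(x)  # attention weights exist twice, applied 4x

This is also where the "more parameters for the Mamba2 backbone" claim comes from: however deep the stack, only n_shared attention blocks' worth of parameters exist, so the saved budget can go to the Mamba layers.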
Perhaps of relevant interest: Nvidia released a paper back in June testing hybrid SSM models, and their small-scale (<1B) experiments found that making roughly 8% of layers self-attention (about a 12:1 ratio of SSM to attention layers) was optimal. https://research.nvidia.com/publication/2024-06_empirical-st...
The 8B-param/3.5T-token model they trained, Mamba2-Hybrid, was also Apache 2.0 licensed: https://huggingface.co/nvidia/mamba2-hybrid-8b-3t-128k