adityashankar No.46176854
Due to perverse incentives and the long history of models over-claiming accuracy, it's very hard to believe anything until it's open source and can be tested out

that being said, I do very much believe that the computational efficiency of models is going to go up drastically over the coming months, which does pose interesting questions about Nvidia's throne

credit_guy No.46176899
Like this?

https://huggingface.co/amd/Zebra-Llama-8B-8MLA-24Mamba-SFT

jychang No.46177471
Or like this: https://api-docs.deepseek.com/news/news251201

I don't know what's so special about this paper.

- They claim to use MLA to reduce the KV cache by 90%. Yeah, DeepSeek invented that for DeepSeek V2 (and it's also in V3, DeepSeek R1, etc.); see the back-of-the-envelope sketch after this list

- They claim to use a hybrid linear attention architecture. So does DeepSeek V3.2, and that was weeks ago. Or Granite 4, if you want to go even further back. Or Kimi Linear. Or Qwen3-Next.

- They claim to save a lot of money by not doing a full multi-million-dollar pre-training run. Well, so did DeepSeek V3.2... DeepSeek hasn't done a full $5.6M pretraining run since DeepSeek V3 in 2024. DeepSeek R1 is just a $294k post-train on top of the expensive V3 pretrain run, and DeepSeek V3.2 is just a hybrid linear attention post-train run; I don't know the exact price, but it's probably just a few hundred thousand dollars as well.
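
For anyone who hasn't seen MLA: the cache saving is basically just dimensional. Here's a rough back-of-the-envelope sketch of why caching one low-rank latent per token, instead of full per-head K/V, shrinks the cache by that much. All numbers below are made up for illustration; they are not DeepSeek's (or this paper's) actual configuration.

    # KV-cache comparison: vanilla multi-head attention vs. MLA-style latent caching.
    # Illustrative numbers only, not any real model's config.

    def mha_kv_bytes(layers, heads, head_dim, seq_len, bytes_per_elem=2):
        # Vanilla MHA caches a full K and a full V vector per head, per token, per layer.
        return layers * seq_len * 2 * heads * head_dim * bytes_per_elem

    def mla_kv_bytes(layers, latent_dim, seq_len, bytes_per_elem=2):
        # MLA caches one shared low-rank latent per token, per layer;
        # K and V are re-projected from that latent at attention time.
        return layers * seq_len * latent_dim * bytes_per_elem

    cfg = dict(layers=60, seq_len=32_768)
    full = mha_kv_bytes(heads=64, head_dim=128, **cfg)
    latent = mla_kv_bytes(latent_dim=576, **cfg)  # latent_dim << 2 * heads * head_dim
    print(f"MHA cache: {full / 2**30:.1f} GiB")   # ~60 GiB
    print(f"MLA cache: {latent / 2**30:.1f} GiB") # ~2 GiB
    print(f"reduction: {1 - latent / full:.1%}")  # ~96%

The part the papers actually spend their pages on is doing that down-projection without losing quality; the cache math itself is the easy headline number.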

Hell, GPT-5, o3, o4-mini, and gpt-4o are all post-trains on top of the same expensive GPT-4o pre-training run from 2024. That's why they all have the same knowledge cutoff date.

I don't really see anything new or interesting in this paper that DeepSeek V3.2 hasn't already sort of done, just at a bigger scale. It's not exactly the same, but is there anything amazingly new here that isn't in DeepSeek V3.2?

1. T-A No.46178034
From your link: DeepSeek-V3.2 Release 2025/12/01

From Zebra-Llama's arXiv page: Submitted on 22 May 2025

2. jychang No.46179594
That's still behind the times. Even the ancient dinosaur IBM had released a Mamba model [1] before this paper was even put out.

> Granite-4.0-Tiny-Base-Preview is a 7B-parameter hybrid mixture-of-experts (MoE) language model featuring a 128k token context window. The architecture leverages Mamba-2, superimposed with a softmax attention for enhanced expressiveness, with no positional encoding for better length generalization. Release Date: May 2nd, 2025

I mean, good for them for shipping, I guess. But seriously, I'd expect any postgrad student to be able to train a similar model with some rented GPUs. They literally teach MLA to undergrads in the basic LLM class at Stanford [2], so this isn't exactly some obscure, never-heard-of concept.

[1] https://huggingface.co/ibm-granite/granite-4.0-tiny-base-pre...

[2] https://youtu.be/Q5baLehv5So?t=6075
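
To make "hybrid" concrete, here's a toy sketch of the kind of layer stack that Granite card is describing: mostly Mamba-2 (linear-time state-space) blocks, with a softmax-attention layer interleaved every few layers and no positional encoding. The block classes are stubs, and the layer counts and dimensions are invented for illustration; this is not IBM's, AMD's, or DeepSeek's actual code.

    # Toy hybrid stack: mostly Mamba-2 blocks, occasional softmax attention.
    # Only the interleaving pattern is the point; the mixer internals are stubbed.
    import torch
    import torch.nn as nn

    class Mamba2Block(nn.Module):
        """Stand-in for a Mamba-2 (state-space) mixer: O(n) in sequence length."""
        def __init__(self, d_model):
            super().__init__()
            self.mixer = nn.Linear(d_model, d_model)  # placeholder for the real SSM scan
        def forward(self, x):
            return x + self.mixer(x)

    class AttentionBlock(nn.Module):
        """Stand-in for full softmax attention: O(n^2), used sparingly, no positional encoding."""
        def __init__(self, d_model, n_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        def forward(self, x):
            out, _ = self.attn(x, x, x, need_weights=False)
            return x + out

    def build_hybrid_stack(d_model=512, n_layers=24, attn_every=6):
        # Every attn_every-th layer is softmax attention; the rest are Mamba-2.
        return nn.Sequential(*[
            AttentionBlock(d_model) if (i + 1) % attn_every == 0 else Mamba2Block(d_model)
            for i in range(n_layers)
        ])

    model = build_hybrid_stack()
    x = torch.randn(1, 128, 512)  # (batch, seq_len, d_model)
    print(model(x).shape)         # torch.Size([1, 128, 512])

The attention-to-Mamba ratio is the knob all of these models tune differently; only the few attention layers keep a KV cache, which is where most of the long-context memory savings come from.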

3. Palmik No.46180680
DeepSeek's MLA paper was published in 2024: https://arxiv.org/abs/2405.04434

DeepSeek's Sparse Attention paper was published in February 2025: https://arxiv.org/abs/2502.11089

DeepSeek V3.2 Exp (combining MLA and DSA) was released in September 2025.

You also had several other Chinese hybrid models, like Qwen3-Next and MiniMax-M1.