
111 points mirrir | 1 comment | source
adityashankar ◴[] No.46176854[source]
Due to perverse incentives and the long history of models over-claiming accuracy, it's very hard to believe anything until it is open source and can be tested out

that being said, I do very much believe that the computational efficiency of models is going to go up* drastically over the coming months, which poses interesting questions about Nvidia's throne

*previously miswrote this as computational efficiency going down

replies(3): >>46176877 #>>46176899 #>>46177234 #
credit_guy ◴[] No.46176899[source]
Like this?

https://huggingface.co/amd/Zebra-Llama-8B-8MLA-24Mamba-SFT

replies(4): >>46176995 #>>46177440 #>>46177468 #>>46177471 #
jychang ◴[] No.46177471[source]
Or like this: https://api-docs.deepseek.com/news/news251201

I don't know what's so special about this paper.

- They claim to use MLA to reduce the KV cache by 90%. Yeah, Deepseek invented that for Deepseek V2 (and it's also used in V3, Deepseek R1, etc.); a rough back-of-the-envelope sketch of that reduction follows this list.

- They claim to use a hybrid linear attention architecture. So does Deepseek V3.2 and that was weeks ago. Or Granite 4, if you want to go even further back. Or Kimi Linear. Or Qwen3-Next.

- They claim to save a lot of money by not doing a multi-million-dollar full pre-train run. Well, so did Deepseek V3.2... Deepseek hasn't done a full $5.6M pretraining run since Deepseek V3 in 2024. Deepseek R1 is just a $294k post-train on top of the expensive V3 pretrain run. Deepseek V3.2 is just a hybrid linear attention post-train run; I don't know the exact price, but it's probably just a few hundred thousand dollars as well.
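
For anyone curious where a "reduce KV cache by 90%" number can even come from, here's a rough back-of-the-envelope sketch. Nothing here is from the paper being discussed; the dimensions are illustrative assumptions, roughly in the ballpark of what the Deepseek V2 paper reports (128 heads of dim 128, a 512-dim compressed KV latent plus a 64-dim decoupled RoPE key):

    # Rough KV-cache arithmetic: standard multi-head attention (MHA) vs.
    # MLA-style latent compression. All dimensions here are illustrative
    # assumptions, not numbers taken from the paper being discussed.

    def kv_cache_bytes_mha(n_layers, n_heads, head_dim, seq_len, bytes_per_val=2):
        # Standard MHA caches a full K and V vector per head, per layer, per token.
        return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_val

    def kv_cache_bytes_mla(n_layers, latent_dim, rope_dim, seq_len, bytes_per_val=2):
        # MLA caches one compressed KV latent plus a small decoupled RoPE key
        # per layer, per token; full K and V are reconstructed from the latent.
        return n_layers * (latent_dim + rope_dim) * seq_len * bytes_per_val

    mha = kv_cache_bytes_mha(n_layers=60, n_heads=128, head_dim=128, seq_len=32768)
    mla = kv_cache_bytes_mla(n_layers=60, latent_dim=512, rope_dim=64, seq_len=32768)
    print(f"MHA: {mha / 2**30:.1f} GiB, MLA: {mla / 2**30:.1f} GiB")
    print(f"reduction: {1 - mla / mha:.1%}")  # ~98% with these assumed numbers

With those assumed numbers the cache shrinks by closer to 98%, so a 90%+ reduction claim isn't surprising on its own; the point is Deepseek already shipped it.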

Hell, GPT-5, o3, o4-mini, and gpt-4o are all post-trains on top of the same expensive 2024 pre-train run that produced gpt-4o. That's why they all have the same knowledge cutoff date.

I don't really see anything new or interesting in this paper that Deepseek V3.2 hasn't already sort of done (just on a bigger scale). It's not exactly the same, but is there anything genuinely new here that isn't in Deepseek V3.2?

replies(5): >>46178034 #>>46178243 #>>46178315 #>>46179397 #>>46179630 #
SilverElfin ◴[] No.46179397[source]
How did you get all this info about how each is trained? Is that something they admit now or is it through leaks?
replies(1): >>46179673 #
jychang ◴[] No.46179673[source]
Deepseek? It's literally in their research papers.

OpenAI? The OpenAI head of research @markchen90 straight up admitted it in a podcast.

https://x.com/petergostev/status/1995744289079656834

"In the last 2 years we've put so much resourcing into, into reasoning and one byproduct of that is you lose a little bit of muscle on pre training and post training." "In the last six months, @merettm and I have done a lot of work to build that muscle back up." "With all the focus on RL, there's an alpha for us because we think there's so much room left in pre training." "As a result of these efforts, we've been training much stronger models. And that also gives us a lot of confidence carrying into Gemini 3 and other releases coming this end of the year."

Note, "alpha" in the quote above is referring to https://en.wikipedia.org/wiki/Alpha_(finance)

But it's pretty clear that the last full pretrain run they've released was for gpt-4o, two years ago*, and since then they've just been iterating RL on their models. You don't need any insider information to notice that; it's pretty obvious.

*Excluding GPT-4.5 of course, but even OpenAI probably wants us to forget about that.

replies(1): >>46181264 #
nl ◴[] No.46181264[source]
SemiAnalysis also believes they haven't done a full pretraining run since 4o (except for GPT-4.5): https://open.substack.com/pub/semianalysis/p/tpuv7-google-ta...