
Grok 3: Another win for the bitter lesson

(www.thealgorithmicbridge.com)
129 points by kiyanwang | 43 comments
1. smy20011 ◴[] No.43112235[source]
Did they? Deepseek spent about 17 months achieving SOTA results with a significantly smaller budget. While xAI's model isn't a substantial leap beyond Deepseek R1, it utilizes 100 times more compute.

Given $3 billion, xAI would choose to invest $2.5 billion in GPUs and $0.5 billion in talent. Deepseek would invest $1 billion in GPUs and $2 billion in talent.

I would argue that the latter approach (Deepseek's) is more scalable. It's extremely difficult to increase compute by 100 times, but with sufficient investment in talent, achieving a 10x increase in compute is more feasible.

replies(10): >>43112269 #>>43112330 #>>43112430 #>>43112606 #>>43112625 #>>43112895 #>>43112963 #>>43115065 #>>43116618 #>>43123381 #
2. sigmoid10 ◴[] No.43112269[source]
>It's extremely difficult to increase compute by 100 times, but with sufficient investment in talent, achieving a 10x increase in compute is more feasible.

The article explains how in reality the opposite is true. Especially when you look at it long term. Compute power grows exponentially, humans do not.

replies(4): >>43112294 #>>43112314 #>>43112378 #>>43112615 #
3. smy20011 ◴[] No.43112294[source]
Humans do write code that scales with compute.

Performance is always raw performance * software efficiency. You can use shitty software and waste all those FLOPs.

4. llm_trw ◴[] No.43112314[source]
If the bitter lesson were true we'd be getting SOTA results out of two-layer neural networks using tanh as the activation function.

It's a lazy blog post that should be thrown out after a minute of thought by anyone in the field.

replies(1): >>43120854 #
5. PeterStuer ◴[] No.43112330[source]
It's not an either/or. Hiring talent is only limited by your GPU spend if the GPU bill leaves you no money left to hire with.

In reality pushing the frontier on datacenters will tend to attract the best talent, not turn them away.

And in talent, it is the quality rather than the quantity that counts.

A 10x algorithmic breakthrough will compound with a 10x scale-out in compute, not hinder it.

I am a big fan of Deepseek, Meta and other open model groups. I also admire what the Grok team is doing, especially their astounding execution velocity.

And it seems like Grok 2 is scheduled to be open-sourced as promised.

replies(2): >>43112355 #>>43112985 #
6. smy20011 ◴[] No.43112355[source]
Not that simple. It could cause a resource curse [1] for developers. Why optimize your algorithm when you have nearly infinite resources? For Deepseek, their constraints were one of the reasons they achieved a breakthrough. One of their contributions, FP8 training, was finding a way to train models on GPUs whose FP32 performance is limited due to export controls.

[1]: https://www.investopedia.com/terms/r/resource-curse.asp#:~:t...

7. OtherShrezzing ◴[] No.43112378[source]
Humans don't grow exponentially indefinitely. But there are only something on the order of 100k AI researchers employed in the big labs right now. Meanwhile, there are around 20 million software engineers globally, and around 200k math graduates per year.

The number of humans who could feasibly work on this problem is pretty high, and the labs could grow an order of magnitude, and still only be tapping into the top 1-2% of engineers & mathematicians. They could grow two orders of magnitude before they've absorbed all of the above-average engineers & mathematicians in the world.

replies(1): >>43112418 #
8. sigmoid10 ◴[] No.43112418{3}[source]
I'd actually say the market is stretched pretty thin by now. I've been an AI researcher for a decade, and what passes as an AI researcher or engineer these days is borderline worthless. You can get a lot of people who can use scripts and middleware like frontend lego sets to build things, but I'd say there are fewer than 1k people in the world right now who can actually meaningfully improve algorithmic design. There are a lot more people out there who do systems design and cloud ops, so it's only when you choose to go for scaling that you'll find plentiful human brainpower.
replies(1): >>43112879 #
9. dogma1138 ◴[] No.43112430[source]
Deepseek didn’t seem to invest in talent as much as it did in smuggling restricted GPUs into China via 3rd countries.

Also, not for nothing, scaling compute 100x or even 1000x is much easier than scaling talent 10x or even 2x, since you don't need workers, you need discovery.

replies(1): >>43113483 #
10. mike_hearn ◴[] No.43112606[source]
We don't actually know how much money DeepSeek spent or how much compute they used. The numbers being thrown around are suspect, the paper they published didn't reveal the costs of all models nor the R&D cost it took to develop them.

In any AI R&D operation the bulk of the compute goes on doing experiments, not on the final training run for whatever models they choose to make available.

replies(2): >>43113071 #>>43113472 #
11. alecco ◴[] No.43112615[source]
Algorithmic improvements in new fields are often bigger than hardware improvements.
12. wordofx ◴[] No.43112625[source]
Deepseek was a crypto mining operation before they pivoted to AI. They have an insane amount of GPUs laying around. So we have no idea how much compute they have compared to xAI.
replies(2): >>43114096 #>>43116339 #
13. llm_trw ◴[] No.43112879{4}[source]
Do you know of any places where people interested in research congregate? Every forum, meetup, or journal gets overwhelmed by bullshit within a year of being good.
replies(1): >>43120770 #
14. mirekrusin ◴[] No.43112895[source]
Deepseek's innovations are applicable to xAI's setup - the results simply multiply with their compute scale.

Deepseek didn’t have option A or B available; they only had the extreme optimisation route to work with.

It’s weird that people present those two approaches as mutually exclusive ones.

15. stpedgwdgfhgdd ◴[] No.43112963[source]
Large teams are very hard to scale.

There is a reason why startups innovate and large companies follow.

16. krainboltgreene ◴[] No.43112985[source]
Have fun hiring any talent after three years of advertising to students that all programming/data jobs are going to be obsolete.
17. wallaBBB ◴[] No.43113071[source]
One thing I (intuitively) don't doubt: that they spent less money developing R1 than OpenAI spent on marketing, lobbying and management compensation.
replies(1): >>43113097 #
18. pertymcpert ◴[] No.43113097{3}[source]
What makes you say that? Do you think Chinese top tier talent is cheap?
replies(4): >>43113116 #>>43113224 #>>43113305 #>>43113503 #
19. victorbjorklund ◴[] No.43113116{4}[source]
I'm sure the salaries at Deepseek in China were lower than the salaries at OpenAI.
replies(1): >>43118334 #
20. amunozo ◴[] No.43113224{4}[source]
Definitely cheaper than American top-tier talent.
replies(1): >>43118181 #
21. anonzzzies ◴[] No.43113305{4}[source]
What is cheap? But compared to the US, yes. Almost everywhere talent is 'cheap' compared to the US unless they move to the US.
replies(1): >>43118195 #
22. tw1984 ◴[] No.43113472[source]
> The numbers being thrown around are suspect, the paper they published didn't reveal the costs of all models nor the R&D cost it took to develop them.

did any lab release such figure? will be interesting to see.

23. tw1984 ◴[] No.43113483[source]
Talent is not something you can just freely pick up from your local Walmart.
24. wallaBBB ◴[] No.43113503{4}[source]
I did not refer to the talent directly contributing to the technical progress.

P.S. - clarification: I mean not referring to talent at OpenAI. And yes I have very little doubt talent at DeepSeek is a lot cheaper than the things I listed above for OpenAI. I would be interested in a breakdown of the cost of OpenAI and seeing if even their technical talent costs more than the things I mentioned.

replies(1): >>43118200 #
25. miki123211 ◴[] No.43114096[source]
Crypto GPUs have nothing to do with AI GPUs.

Crypto mining is an embarrassingly parallel problem, requiring little to no communication between GPUs. To a first approximation, in crypto, 10x-ing the number of "cores" per GPU, 10x-ing the number of GPUs per rig, and 10x-ing the number of rigs you own are all basically equivalent. An infinite number of extremely slow GPUs would do just as well as one infinitely fast GPU. This is why consumer GPUs are great for crypto.

AI is the opposite. In AI, you need extremely fast communication between GPUs. This means getting as much memory per GPU as possible (to make communication less necessary) and putting all the GPUs in one datacenter.

Consumer GPUs, which were used for crypto, don't support the fast communication technologies needed for AI training, and they don't come in the 80GB memory versions that AI labs need. This is Nvidia's price differentiation strategy.
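To illustrate the "embarrassingly parallel" point above, here is a minimal sketch (a toy proof-of-work, not real mining): workers scanning disjoint nonce ranges need zero communication, so their outputs can simply be concatenated, which is exactly why throwing more slow, unconnected GPUs at the problem scales linearly.

```python
import hashlib

# Toy proof-of-work: find nonces whose SHA-256 hash starts with "00".
# Each worker scans its own disjoint nonce range and never needs to
# talk to the others -- the defining property of embarrassingly
# parallel workloads like crypto mining.
def scan(start, stop, difficulty="00"):
    hits = []
    for nonce in range(start, stop):
        digest = hashlib.sha256(str(nonce).encode()).hexdigest()
        if digest.startswith(difficulty):
            hits.append(nonce)
    return hits

# Two "workers" on disjoint ranges produce the same result as one
# worker on the full range: results just concatenate, no all-reduce,
# no shared memory, no fast interconnect required.
assert scan(0, 500) + scan(500, 1000) == scan(0, 1000)
```

Model training, by contrast, requires gradient synchronization across all workers every step, which is where the interconnect bandwidth the comment mentions becomes the bottleneck.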

replies(1): >>43114483 #
26. miohtama ◴[] No.43114483{3}[source]
No relevant crypto has been mined on GPUs for a long time.

But a point was made of making mining less parallel. For example, Ethereum used a DAG, requiring at least 1 GB of RAM, so raw GPU compute alone was not enough.

https://ethereum.stackexchange.com/questions/1993/what-actua...

Also, any such GPUs are now several generations old, so their FLOPS/watt is likely irrelevant.

27. SamPatt ◴[] No.43115065[source]
R1 came out when Grok 3's training was still ongoing. They shared their techniques freely, so you would expect the next round of models to incorporate as many of those techniques as possible. The bump you would get from the extra compute occurs in the next cycle.

If Musk really can get 1 million GPUs and they incorporate some algorithmic improvements, it'll be exciting to see what comes out.

28. oskarkk ◴[] No.43116339[source]
Do you have any sources for that? When I searched "DeepSeek crypto mining" the first result was your comment, the other results were just about the wide tech market selloff after DeepSeek appeared (that also affected crypto). As far as I know, they had many GPUs because their parent company was using AI algorithms for trading for many years.

https://en.wikipedia.org/wiki/High-Flyer

replies(1): >>43118699 #
29. oskarkk ◴[] No.43116618[source]
> While xAI's model isn't a substantial leap beyond Deepseek R1, it utilizes 100 times more compute.

I'm not sure if it's close to 100x more. xAI had 100K Nvidia H100s, while this is what SemiAnalysis writes about DeepSeek:

> We believe they have access to around 50,000 Hopper GPUs, which is not the same as 50,000 H100, as some have claimed. There are different variations of the H100 that Nvidia made in compliance to different regulations (H800, H20), with only the H20 being currently available to Chinese model providers today. Note that H800s have the same computational power as H100s, but lower network bandwidth.

> We believe DeepSeek has access to around 10,000 of these H800s and about 10,000 H100s. Furthermore they have orders for many more H20’s, with Nvidia having produced over 1 million of the China specific GPU in the last 9 months. These GPUs are shared between High-Flyer and DeepSeek and geographically distributed to an extent. They are used for trading, inference, training, and research. For more specific detailed analysis, please refer to our Accelerator Model.

> Our analysis shows that the total server CapEx for DeepSeek is ~$1.6B, with a considerable cost of $944M associated with operating such clusters. Similarly, all AI Labs and Hyperscalers have many more GPUs for various tasks including research and training than they commit to an individual training run due to centralization of resources being a challenge. X.AI is unique as an AI lab with all their GPUs in 1 location.

https://semianalysis.com/2025/01/31/deepseek-debates/

I don't know how much slower these GPUs are, but if they have 50K of them, that doesn't sound like 100x less compute to me. Also, a company that has N GPUs and trains AI on them for 2 months can achieve the same results as a company that has 2N GPUs and trains for 1 month. So DeepSeek could spend a longer time training to offset the fact that they have fewer GPUs than competitors.

replies(1): >>43117987 #
30. cma ◴[] No.43117987[source]
Having 50K of them isn't the same thing as having 50K in one high-bandwidth cluster, right? x.AI has all theirs so far in one connected cluster, all homogeneous H100s, right?
31. pertymcpert ◴[] No.43118181{5}[source]
How much cheaper? I’m curious because I’ve seen the offers that Chinese tech companies pay and it’s in the millions for the top talent.
32. pertymcpert ◴[] No.43118195{5}[source]
How experienced are you with Chinese AI talent compensation?
33. pertymcpert ◴[] No.43118200{5}[source]
Do you think 1.5M a year compensation is cheap? That’s in the range of OpenAI offers.
34. pertymcpert ◴[] No.43118334{5}[source]
How are you sure about that?
replies(1): >>43125212 #
35. wordofx ◴[] No.43118699{3}[source]
You know crypto mining is illegal in China, right? Of course they avoid mentioning it. Discussion boards in China had ex-employees mention doing crypto mining, but it's all been wiped.
replies(1): >>43126521 #
36. sigmoid10 ◴[] No.43120770{5}[source]
Universities (at least certain ones) and startups (more in absolute terms than universities, but there's also a much bigger fraction of swindlers). Most blogs and forums are garbage. If you're not inside these ecosystems, try to find out who the smart/talented people are by reading influential papers. Then you can start following them on X, LinkedIn, etc., and often you'll see what they're up to next. For example, there's a pretty clear research paper and hiring trail of certain people that eventually led to GPT-4, even though OpenAI never published anything on the architecture.
replies(1): >>43120865 #
37. sigmoid10 ◴[] No.43120854{3}[source]
That's not how the economics work. There has been a lot of research that showed how deeper nets are more efficient. So if you spend a ton of compute money on a model, you'll want the best output - even though you could just as well build something shallow that may well be state of the art for its depth, but can't hold up with the competition on real tasks.
replies(1): >>43120921 #
38. llm_trw ◴[] No.43120865{6}[source]
I am in correspondence with a number of worthwhile authors; it's just that there isn't any place where they congregate in the (semi) open, and without the weirdos who do stuff with the models you're missing out on a lot.

My favorite example I can never share in polite company is that the (still SOTA) best image segmentation algorithm I ever saw was done by a guy labeling parts of the vagina for his stable diffusion fine-tune pipeline. I used what he'd done as the basis for a (also SOTA 2 years later) document segmentation model.

Found him on a subreddit about stable diffusion that's now completely overrun by shitesters and he's been banned (of course).

replies(1): >>43125935 #
39. llm_trw ◴[] No.43120921{4}[source]
Which is my point.

You need a ton of specialized knowledge to use compute effectively.

If we had infinite memory and infinite compute we'd just throw every problem of length n at a tensor of size R^(n^n).

The issue is that we don't have enough memory in the world to store that tensor for something as trivial as MNIST (and won't until the 2100s). And as you can imagine, the exponentiated exponential grows a bit faster than the exponential, so we never will.
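A quick back-of-the-envelope check of that claim (assuming MNIST's n = 28*28 = 784 input pixels and one entry per element of a tensor with n^n elements):

```python
import math

# Rough size of a tensor with n^n entries for an MNIST-sized input.
# n = 784 is the flattened 28x28 pixel count; the n^n scaling is the
# hypothetical "throw everything at one giant tensor" approach above.
n = 28 * 28
log10_entries = n * math.log10(n)  # log10(n^n) = n * log10(n)
print(f"~10^{log10_entries:.0f} entries")
# For comparison, the observable universe contains roughly 10^80 atoms,
# so no amount of hardware scaling ever stores this tensor.
```

The result is on the order of 10^2269 entries, which supports the comment's point that brute-force capacity alone can never replace algorithmic structure.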

replies(1): >>43125138 #
40. _giorgio_ ◴[] No.43123381[source]
Deepseek spent at least $1.5 billion on hardware.
41. sigmoid10 ◴[] No.43125138{5}[source]
Then how does this invalidate the bitter lesson? It's like you're saying if aerodynamics were true, we'd have planes flying like insects by now. But that's simply not how it works at large scales - in particular if you want to build something economical.
42. victorbjorklund ◴[] No.43125212{6}[source]
A qualified guess. Do you have something that indicates dev salaries are lower in US vs China?
43. sigmoid10 ◴[] No.43125935{7}[source]
It's pretty easy nowadays to come up with a narrow domain SOTA in image tasks. All you need to do is label some pictures and do a bit of hyperparameter search. This can literally be done by high schoolers on a laptop. And that's exactly what they do in those subreddits where everyone primarily cares about creating explicit content. The real frontier for algorithmic development is large domains (which need a lot more data by default as well). But there actually are some big-game explicit content platforms engaged in research in this area and they have shown somewhat interesting results.