Grok 3: Another win for the bitter lesson

(www.thealgorithmicbridge.com)
129 points by kiyanwang | 2 comments
smy20011:
Did they? Deepseek spent about 17 months achieving SOTA results with a significantly smaller budget. While xAI's model isn't a substantial leap beyond Deepseek R1, it utilizes 100 times more compute.

Given $3 billion, xAI would invest $2.5 billion in GPUs and $0.5 billion in talent. Deepseek would invest $1 billion in GPUs and $2 billion in talent.

I would argue that the latter approach (Deepseek's) is more scalable. It's extremely difficult to increase compute by 100x, but with sufficient investment in talent, achieving a 10x increase in effective compute is much more feasible.
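A rough back-of-envelope version of that split, in Python; the 1x vs 10x "talent multiplier" is an assumed figure for illustration, not a number from the thread:

    # Hypothetical $3B budget splits from the comment above (in $B).
    xai      = {"gpus": 2.5, "talent": 0.5}
    deepseek = {"gpus": 1.0, "talent": 2.0}

    # Assume talent translates into an efficiency multiplier on what each
    # GPU dollar achieves. The 1x and 10x values are made up for the sketch.
    multiplier = {"xai": 1.0, "deepseek": 10.0}

    effective_compute = {
        "xai": xai["gpus"] * multiplier["xai"],                 # 2.5 units
        "deepseek": deepseek["gpus"] * multiplier["deepseek"],  # 10.0 units
    }
    print(effective_compute)  # {'xai': 2.5, 'deepseek': 10.0}

Under that (generous) assumption, the talent-heavy split ends up with more effective compute despite buying far fewer GPUs, which is the scalability argument in a nutshell.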

sigmoid10:
>It's extremely difficult to increase compute by 100x, but with sufficient investment in talent, achieving a 10x increase in effective compute is much more feasible.

The article explains how in reality the opposite is true, especially when you look at it long term: compute power grows exponentially; humans do not.

llm_trw:
If the bitter lesson were true, we'd be getting SOTA results out of two-layer neural networks using tanh as the activation function.

It's a lazy blog post whose argument anyone in the field would throw out after a minute of thought.

sigmoid10:
That's not how the economics work. There has been a lot of research showing that deeper nets are more efficient. So if you spend a ton of compute money on a model, you'll want the best output, even though you could just as well build something shallow that may be state of the art for its depth but can't hold up against the competition on real tasks.

llm_trw:
Which is my point.

You need a ton of specialized knowledge to use compute effectively.

If we had infinite memory and infinite compute, we'd just throw every problem of length n at a tensor of size R^(n^n).

The issue is that we don't have enough memory in the world to store that tensor for something as trivial as MNIST (and won't until the 2100s). And as you can imagine, the exponentiated exponential grows a bit faster than the exponential, so we never will.
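To put rough numbers on that, here's a quick order-of-magnitude check in Python; the 784-pixel MNIST input, 4 bytes per entry, and ~10^23 bytes of global storage are assumptions for the sketch:

    import math

    # The brute-force tensor from the comment: a problem with inputs of
    # length n gets a lookup table with n**n entries.
    n = 784  # MNIST: 28x28 pixels
    entries_log10 = n * math.log10(n)            # log10 of n**n
    bytes_log10 = entries_log10 + math.log10(4)  # assume float32 entries

    world_storage_log10 = 23  # ~100 zettabytes of storage today, give or take

    print(f"~10^{entries_log10:.0f} entries, ~10^{bytes_log10:.0f} bytes")
    print(f"shortfall vs. world storage: ~10^{bytes_log10 - world_storage_log10:.0f}x")

    # Closing a gap of ~10^2247 would take log2(10^2247) ≈ 7,500 doublings
    # of global storage, so no exponential hardware trend ever catches up.

However you tweak those assumptions, the n^n term dominates, which is the point about the double exponential outrunning the exponential.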

sigmoid10:
Then how does this invalidate the bitter lesson? It's like you're saying if aerodynamics were true, we'd have planes flying like insects by now. But that's simply not how it works at large scales - in particular if you want to build something economical.