
S1: A $6 R1 competitor?

(timkellogg.me)
851 points by tkellogg | 1 comment
swiftcoder ◴[] No.42948127[source]
> having 10,000 H100s just means that you can do 625 times more experiments than s1 did

I think the ball is very much in their court to demonstrate they actually are using their massive compute in such a productive fashion. My BigTech experience would tend to suggest that frugality went out the window the day the valuation took off, and they are in fact just burning compute for little gain, because why not...

replies(5): >>42948369 #>>42948616 #>>42948712 #>>42949773 #>>42953287 #
svantana ◴[] No.42948616[source]
Besides that, AI training (aka gradient descent) is not really an "embarrassingly parallel" problem. At some point, there are diminishing returns on adding more GPUs, even though a lot of effort is going into making it as parallel as possible.
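To make that concrete, here is a minimal numpy sketch (toy least-squares problem, all names made up) of synchronous data-parallel SGD: the per-shard gradient work is parallel, but every step ends in a gradient average across workers, and step t+1 cannot start until step t has finished everywhere. That synchronization barrier is where the diminishing returns come from.

    import numpy as np

    rng = np.random.default_rng(0)
    n_workers, n_samples, dim = 4, 1024, 8

    # Toy regression data, sharded evenly across workers (data parallelism).
    X = rng.normal(size=(n_samples, dim))
    true_w = rng.normal(size=dim)
    y = X @ true_w + 0.01 * rng.normal(size=n_samples)
    shards = np.array_split(np.arange(n_samples), n_workers)

    w = np.zeros(dim)
    lr = 0.1
    for step in range(100):
        # Each worker computes a gradient on its own shard -- this part is parallel.
        grads = []
        for s in shards:
            Xs, ys = X[s], y[s]
            grads.append(2 * Xs.T @ (Xs @ w - ys) / len(s))
        # "All-reduce": average gradients across workers. This is a synchronization
        # barrier -- no worker can begin the next step until every worker has
        # finished this one, which is where communication and straggler costs bite.
        g = np.mean(grads, axis=0)
        w -= lr * g  # serial dependency: step t+1 needs the result of step t

    print("error:", np.linalg.norm(w - true_w))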
replies(1): >>42953005 #
janalsncm ◴[] No.42953005[source]
What? It definitely is.

Data parallelism, model parallelism, parameter-server-to-worker architectures, MoE experts that can themselves be split across devices, etc.
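For the data-parallel case specifically, a rough sketch of what that looks like in practice with PyTorch's DistributedDataParallel (the model, sizes, and learning rate are placeholders; assumes one process per GPU launched with torchrun --nproc_per_node=<gpus>):

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        dist.init_process_group("nccl")      # rank/world size come from torchrun env vars
        rank = dist.get_rank()
        torch.cuda.set_device(rank)

        model = torch.nn.Linear(1024, 1024).cuda()   # placeholder model
        model = DDP(model, device_ids=[rank])        # gradients are all-reduced across ranks
        opt = torch.optim.SGD(model.parameters(), lr=1e-3)

        for _ in range(10):
            x = torch.randn(32, 1024, device="cuda")  # each rank sees its own data shard
            loss = model(x).pow(2).mean()
            opt.zero_grad()
            loss.backward()   # DDP overlaps the gradient all-reduce with the backward pass
            opt.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()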

But even if it weren't, you can simply parallelize entire training runs with slight variations in hyperparameters. That is what the article is describing.
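
And that second case really is embarrassingly parallel, because the runs never exchange anything. A hypothetical sketch (a toy objective stands in for a real training run, and the hyperparameter grid is made up), where throughput scales linearly with the number of workers:

    from concurrent.futures import ProcessPoolExecutor
    from itertools import product

    import numpy as np

    def train(cfg):
        # Stand-in for a full training run: fit a toy quadratic with SGD and
        # return (config, final loss). Each run is fully independent.
        lr, steps = cfg
        rng = np.random.default_rng(0)
        w, target = rng.normal(size=4), np.ones(4)
        for _ in range(steps):
            w -= lr * 2 * (w - target)   # gradient of ||w - target||^2
        return cfg, float(np.sum((w - target) ** 2))

    if __name__ == "__main__":
        grid = list(product([0.01, 0.03, 0.1], [100, 300]))  # hypothetical hyperparameter grid
        # No gradients cross run boundaries, so there is no all-reduce and no barrier.
        with ProcessPoolExecutor() as pool:
            for cfg, loss in pool.map(train, grid):
                print(cfg, loss)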