
S1: A $6 R1 competitor?

(timkellogg.me)
851 points by tkellogg | 1 comment
swiftcoder ◴[] No.42948127[source]
> having 10,000 H100s just means that you can do 625 times more experiments than s1 did

I think the ball is very much in their court to demonstrate they actually are using their massive compute in such a productive fashion. My BigTech experience would tend to suggest that frugality went out the window the day the valuation took off, and they are in fact just burning compute for little gain, because why not...

replies(5): >>42948369 #>>42948616 #>>42948712 #>>42949773 #>>42953287 #
svantana ◴[] No.42948616[source]
Besides that, AI training (aka gradient descent) is not really an "embarrassingly parallel" problem. At some point, there are diminishing returns on adding more GPUs, even though a lot of effort is going into making it as parallel as possible.
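To make that concrete, here is a minimal numpy sketch (toy least-squares problem, all names made up) of synchronous data-parallel SGD: the per-shard gradient work is parallel, but every step ends in a gradient average across workers, and step t+1 cannot start until step t has finished everywhere. That synchronization barrier is where the diminishing returns come from.

    import numpy as np

    rng = np.random.default_rng(0)
    n_workers, n_samples, dim = 4, 1024, 8

    # Toy regression data, sharded evenly across workers (data parallelism).
    X = rng.normal(size=(n_samples, dim))
    true_w = rng.normal(size=dim)
    y = X @ true_w + 0.01 * rng.normal(size=n_samples)
    shards = np.array_split(np.arange(n_samples), n_workers)

    w = np.zeros(dim)
    lr = 0.1
    for step in range(100):
        # Each worker computes a gradient on its own shard -- this part is parallel.
        grads = []
        for s in shards:
            Xs, ys = X[s], y[s]
            grads.append(2 * Xs.T @ (Xs @ w - ys) / len(s))
        # "All-reduce": average gradients across workers. This is a synchronization
        # barrier -- no worker can begin the next step until every worker has
        # finished this one, which is where communication and straggler costs bite.
        g = np.mean(grads, axis=0)
        w -= lr * g  # serial dependency: step t+1 needs the result of step t

    print("error:", np.linalg.norm(w - true_w))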
replies(1): >>42953005 #
janalsncm ◴[] No.42953005[source]
What? It definitely is.

Data parallelism, model parallelism, parameter-server-to-worker architectures, MoE experts that can themselves be split across devices, etc.
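For the data-parallel case specifically, a rough sketch of what that looks like in practice with PyTorch's DistributedDataParallel (the model, sizes, and learning rate are placeholders; assumes one process per GPU launched with torchrun --nproc_per_node=<gpus>):

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        dist.init_process_group("nccl")      # rank/world size come from torchrun env vars
        rank = dist.get_rank()
        torch.cuda.set_device(rank)

        model = torch.nn.Linear(1024, 1024).cuda()   # placeholder model
        model = DDP(model, device_ids=[rank])        # gradients are all-reduced across ranks
        opt = torch.optim.SGD(model.parameters(), lr=1e-3)

        for _ in range(10):
            x = torch.randn(32, 1024, device="cuda")  # each rank sees its own data shard
            loss = model(x).pow(2).mean()
            opt.zero_grad()
            loss.backward()   # DDP overlaps the gradient all-reduce with the backward pass
            opt.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()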

But even if it weren't, you can simply parallelize entire training runs with slight variations in hyperparameters. That is what the article is describing.
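
And that second case really is embarrassingly parallel, because the runs never exchange anything. A hypothetical sketch (a toy objective stands in for a real training run, and the hyperparameter grid is made up), where throughput scales linearly with the number of workers:

    from concurrent.futures import ProcessPoolExecutor
    from itertools import product

    import numpy as np

    def train(cfg):
        # Stand-in for a full training run: fit a toy quadratic with SGD and
        # return (config, final loss). Each run is fully independent.
        lr, steps = cfg
        rng = np.random.default_rng(0)
        w, target = rng.normal(size=4), np.ones(4)
        for _ in range(steps):
            w -= lr * 2 * (w - target)   # gradient of ||w - target||^2
        return cfg, float(np.sum((w - target) ** 2))

    if __name__ == "__main__":
        grid = list(product([0.01, 0.03, 0.1], [100, 300]))  # hypothetical hyperparameter grid
        # No gradients cross run boundaries, so there is no all-reduce and no barrier.
        with ProcessPoolExecutor() as pool:
            for cfg, loss in pool.map(train, grid):
                print(cfg, loss)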