
Anthropic raises $13B Series F

(www.anthropic.com)
585 points by meetpateltech | 3 comments
llamasushi ◴[] No.45105325[source]
The compute moat is getting absolutely insane. We're basically at the point where you need a small country's GDP just to stay in the game for one more generation of models.

What gets me is that this isn't even a software moat anymore - it's literally just whoever can get their hands on enough GPUs and power infrastructure. TSMC and the power companies are the real kingmakers here. You can have all the talent in the world but if you can't get 100k H100s and a dedicated power plant, you're out.

Wonder how much of this $13B is just prepaying for compute vs actual opex. If it's mostly compute, we're watching something weird happen - like the privatization of Manhattan Project-scale infrastructure. Except instead of enriching uranium we're computing gradient descents lol

The wildest part is we might look back at this as cheap. GPT-4 training was what, $100M? GPT-5/Opus-4 class probably $1B+? At this rate GPT-7 will need its own sovereign wealth fund
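
Back-of-the-envelope on that last bit (the $100M and $1B figures are guesses, and the 10x-per-generation multiplier is just an assumption):

    # Naive geometric extrapolation of frontier training-run cost.
    # Assumes ~10x per generation starting from a guessed $100M for GPT-4-class.
    cost = 100e6  # hypothetical GPT-4-class training cost, USD
    for gen in range(4, 8):
        print(f"GPT-{gen}-class: ${cost / 1e9:.1f}B")
        cost *= 10  # assumed generation-over-generation multiplier

Even with made-up inputs, the compounding is the point: two more generations at that rate and you're in sovereign-wealth-fund territory.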

replies(48): >>45105396 #>>45105412 #>>45105420 #>>45105480 #>>45105535 #>>45105549 #>>45105604 #>>45105619 #>>45105641 #>>45105679 #>>45105738 #>>45105766 #>>45105797 #>>45105848 #>>45105855 #>>45105915 #>>45105960 #>>45105963 #>>45105985 #>>45106070 #>>45106096 #>>45106150 #>>45106272 #>>45106285 #>>45106679 #>>45106851 #>>45106897 #>>45106940 #>>45107085 #>>45107239 #>>45107242 #>>45107347 #>>45107622 #>>45107915 #>>45108298 #>>45108477 #>>45109495 #>>45110545 #>>45110824 #>>45110882 #>>45111336 #>>45111695 #>>45111885 #>>45111904 #>>45111971 #>>45112441 #>>45112552 #>>45113827 #
powerapple ◴[] No.45107915[source]
Also, not all of that compute was necessary for the final model; a large chunk of it is trial-and-error research. In theory, if you spent $1B training the latest model, a competitor will be able to do the same six months later for $100M.
replies(1): >>45110291 #
SchemaLoad ◴[] No.45110291[source]
Not only are the actual models rapidly devaluing, the hardware is too. Spend $1B on GPUs and next year there's a much better model out that massively devalues your existing datacenter. These companies are building mountains of quicksand that they have to constantly pour more cash onto, or they're rapidly left with no advantage.
replies(2): >>45113911 #>>45120332 #
1. chermi ◴[] No.45120332[source]
Ignoring energy costs(!), I'm interested in the following. Say every server generation from Nvidia is 25% "better at training", by whatever metric (1). Could you not theoretically wire together (1.25 + delta)x as many of the previous generation to get the same compute? The delta accounts for latency/bandwidth from interconnects; I'm guessing delta is fairly large given my impression of how important HBM and networking are.

I don't know the efficiency gains per generation, but let's just say getting the same compute out of this (1.25 + delta) system requires 2x the energy. My impression is that while energy is a substantial cost, the total cost of a training run is still dominated by the actual hardware + infrastructure.

It seems like there must be some break even point where you could use older generation servers and come out ahead. Probably everyone has this figured out and consequently the resale value of previous gen chips is quite high?
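
Rough sketch of that break-even in Python, where every number is a made-up placeholder (prices, power draw, power price), just to show the shape of the calculation:

    # Do (1.25 + delta)x as many previous-gen servers ever beat one new-gen server?
    new_price   = 300_000        # hypothetical new-gen server price, USD
    new_kw      = 10             # hypothetical new-gen draw, kW
    usd_per_kwh = 0.08           # hypothetical power price
    hours       = 4 * 365 * 24   # assumed 4-year useful life at full load

    old_units    = 1.25 + 0.15   # 1.25 + delta old servers for the same compute
    energy_ratio = 2.0           # assume the old-gen setup burns 2x the energy

    new_energy = new_kw * hours * usd_per_kwh
    old_energy = energy_ratio * new_energy

    # Break-even old-gen unit price p: old_units * p + old_energy = new_price + new_energy
    p = (new_price + new_energy - old_energy) / old_units
    print(f"Old gen wins below ~${p:,.0f}/unit (vs ${new_price:,} new)")

With these toy numbers the hardware price dominates, so even the 2x energy penalty only pushes the break-even discount to roughly a third off list, which is why the resale-value question matters so much.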

What's the lifespan of these servers at full load? I think I read CoreWeave depreciates them (somewhat controversially) over 4 years.

Assuming the chips last long enough, even if they're no longer usable for LLM training or serving inference, can't they be reused for scientific loads? I'm not exactly old, but back in my PhD days we were building our own little GPU clusters for MD simulations. I don't think long MD simulations are the best use of compute these days, but there are many similar problems: weather modeling, high-dimensional optimization, materials/radiation studies, and generic simulations like FEA or simply large systems of ODEs.

Are these big clusters being turned into hand-me-downs for other scientific/engineering problems like above, or do they simply burn them out? What's a realistic expected lifespan for a B200? Or maybe it's as simple as they immediately turn their last gen servers over to serve inference?

Lot of questions, but my main question is just how much the hardware is devalued once it becomes previous gen. Any guidance/references appreciated!

Also, for anyone still in the academic computing world: do outfits like D. E. Shaw still exist, trying to run massive MD simulations or similar? Do the big national computing centers use the latest and greatest big Nvidia AI servers, or something a little more modest? Or maybe they're even still just massive CPU servers?

While I have anyone who might know: whatever happened to that fad from 10+ years ago saying a lot of compute/algorithms would shift toward more memory-heavy models (2)? Seems like it kind of happened, in AI at least.

(1) Yes I know it's complicated, especially with memory stuff.

(2) I wanna say it was IBM Almaden championing the idea.

replies(1): >>45121676 #
2. SchemaLoad ◴[] No.45121676[source]
I'm not the one building out datacenters, but I believe power consumption is the reason for the devaluation. It's the same reason we saw bitcoin miners throw all their ASICs in the bin every 6 months: at some point it becomes cheaper to buy new hardware than to keep running the old, inefficient chips, i.e. when the power savings of the new chips exceed the purchase price of the new hardware.
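
That condition, written out as a toy check (the example numbers are invented, and closer to ASIC territory than GPU territory):

    # Replace when the energy saved over your planning horizon exceeds the
    # price of the new unit. Illustrative numbers only.
    def worth_replacing(old_kw, new_kw, usd_per_kwh, horizon_hours, new_unit_price):
        savings = (old_kw - new_kw) * usd_per_kwh * horizon_hours
        return savings > new_unit_price

    # e.g. a 3.5 kW miner vs a 1.5 kW one, two years at $0.08/kWh, $2,500 new unit
    print(worth_replacing(3.5, 1.5, 0.08, 2 * 365 * 24, 2_500))  # True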

These AI datacenters are chewing through unimaginable amounts of power, so if Nvidia releases a new chip that does the same work at half the power consumption, that whole datacenter of GPUs is massively devalued.

The whole AI industry is looking like there won't be a first-mover advantage; if anything there will be a late-mover advantage, where you can buy the better chips and skip burning money on the old generations.

replies(1): >>45143098 #
3. chermi ◴[] No.45143098[source]
The rule of thumb I heard was that over the "useful lifetime" of a chip, the cost of energy + maintenance/infrastructure is about the same as the cost of the chip itself. IIRC energy was something like 15-20% of the overall cost over that "useful lifetime". I'm putting it in quotes because it's kind of a circular definition.

That makes me wonder if it's more of a performance thing than an energy-cost thing. I guess you also have to factor in the finite supply of reliable energy hookups for these things: if you're constrained on total kWh consumption, your only way to more TOPS is to upgrade. Probably ties in with real-estate/permitting difficulty too. I guess what I'm picturing is that if energy availability (not cost) and real-estate availability/permitting timelines weren't issues, that 15-20% of the cost probably wouldn't look too bad. So it's probably those factors combined. Market pricing dynamics are hard :/
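
Toy version of that power-envelope point, with invented numbers for the site budget, power price, and efficiencies:

    # With a fixed power hookup, the electricity bill is the same either way,
    # so the upgrade motive is throughput inside the envelope, not the bill.
    site_mw     = 100                  # fixed, hard-to-permit power budget
    usd_per_mwh = 80                   # hypothetical power price
    hours       = 365 * 24

    power_bill = site_mw * usd_per_mwh * hours   # identical for old and new fleets
    old_eff, new_eff = 1.0, 2.0                  # hypothetical TOPS per watt

    print(f"Annual power bill either way: ${power_bill / 1e6:.0f}M")
    print(f"Compute inside the same envelope: {new_eff / old_eff:.1f}x after upgrading")

i.e. the old chips aren't "worthless" because they cost too much to run, they're worthless because they're squatting on scarce megawatts.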

I didn't know the recycle time on the ASICs was that fast! That's an interesting point about the first mover. I would counter that a large part of the first-mover value in this case is AI-engineer experience and grabbing exceptional talent early. But with Facebook and these guys paying them like athletes, I'd guess that experience build-up and talent retention aren't as robust. OpenAI lost most of its best, but maybe that's an exceptional example.

All of that to say, yeah, first mover advantage seems dulled in this situation.

Who knows, maybe Apple is playing the long game and betting on just that.