
183 points spacebanana7 | 2 comments

I appreciate that developing ROCm into something competitive with CUDA would require a lot of work, both internally within AMD and through external contributions to the relevant open source libraries.

However, the amount of resources at stake is enormous. The delta between NVIDIA's market value and AMD's is bigger than the annual GDP of Spain. Even if they needed to hire a few thousand engineers at a few million dollars in comp each, it'd still be a good investment.

spmurrayzzz No.43548054
CUDA isn't the moat people think it is. NVIDIA absolutely has the best dev ergonomics for machine learning; there's no question about that. Their driver is also far more stable than AMD's. But AMD is improving too; they've made some significant strides over the last 12-18 months.

But I think more importantly, what is often missed in this analysis is that most programmers doing ML work aren't writing their own custom kernels. They're just using pytorch (or maybe something even more abstracted/multi-backend like keras 3.x) and letting the library deal with the implementation details related to their GPU.

That doesn't mean there aren't footguns in that particular land of abstraction, but the delta between the two providers is not nearly as stark as it's often portrayed. At least not for the average programmer working with ML tooling.
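
To make that concrete, here's a minimal sketch of a typical training step. Nothing in it is vendor-specific: the ROCm builds of pytorch reuse the "cuda" device name (via HIP), so the same code runs on either vendor's hardware.

    import torch
    import torch.nn as nn

    # "cuda" maps to an NVIDIA GPU on CUDA builds and to an AMD GPU on ROCm builds
    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = nn.Linear(128, 10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    x = torch.randn(32, 128, device=device)         # dummy batch
    y = torch.randint(0, 10, (32,), device=device)   # dummy labels

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()   # backend-specific kernels are dispatched by the library
    optimizer.step()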

(EDIT: also worth noting that the work being done in the MLIR project has a role to play in closing the gap as well for similar reasons)

replies(1): >>43548642 #
martinpw No.43548642
> But I think more importantly, what is often missed in this analysis is that most programmers doing ML work aren't writing their own custom kernels. They're just using pytorch (or maybe something even more abstracted/multi-backend like keras 3.x) and letting the library deal with the implementation details related to their GPU.

That would imply that AMD could just focus on implementing good PyTorch support on their hardware and they would be able to start taking market share. Which doesn't sound like much work compared with writing a full CUDA competitor. But that does not seem to be the strategy, which implies it is not so simple?

I am not an ML engineer, so I don't have first-hand experience, but those I have talked to say they depend on a lot more than just one or two key libraries. But my sample size is small. Interested in other perspectives...

replies(1): >>43549088 #
spmurrayzzz No.43549088
> But that does not seem to be the strategy, which implies it is not so simple?

That is exactly what has been happening [1], and not just in pytorch. Geohot has been very dedicated to working with AMD to improve their standing in this space [2]. If you hang out in the tinygrad discord, you can see this happening in real time.

> those I have talked to say they depend on a lot more than just one or two key libraries.

There's a ton of libraries out there, yes, but if we're talking about python and the libraries in question are talking to GPUs, it's going to be exceedingly rare that they're not using one of these under the hood: pytorch, tensorflow, jax, keras, et al.

There are of course exceptions to this, particularly if you're not using python for your ML work (which is actually common for many companies that run inference at scale and want better runtime performance; training is a different story). But ultimately the core ecosystem does work just fine with AMD GPUs, provided you're not doing any exotic custom kernel work.
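
If you want to sanity-check which backend you actually landed on, here's a rough sketch (assuming a reasonably recent pytorch build; my understanding is that the ROCm wheels set torch.version.hip while the CUDA wheels leave it as None):

    import torch

    # the torch.cuda.* API surface is shared by the CUDA and ROCm builds
    if torch.cuda.is_available():
        backend = "ROCm/HIP" if torch.version.hip else "CUDA"
        print(f"{torch.cuda.get_device_name(0)} via {backend}")
    else:
        print("no supported GPU found, falling back to CPU")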

(EDIT: just realized my initial comment unintentionally borrowed the "moat" commentary from geohot's blog. A happy accident in this case, but still very much rings true for my day to day ML dev experience)

[1] https://github.com/pytorch/pytorch/pulls?q=is%3Aopen+is%3Apr...

[2] https://geohot.github.io//blog/jekyll/update/2025/03/08/AMD-...

replies(1): >>43550058 #
martinpw No.43550058
Thanks for the additional information. I am still puzzled though. This sounds like it is a third party (maybe just a small group of devs?) doing all the work, and from your link they have had to beg AMD just to send them hardware? If this work was a significant piece of what is required to get ML users onto AMD hardware, wouldn't AMD just invest in doing this themselves, or at least provide much more support to these guys?
replies(1): >>43550151 #
spmurrayzzz No.43550151
> This sounds like it is a third party (maybe just a small group of devs?) doing all the work

Just as a quantitative side note here: tinygrad has almost 400 contributors, and pytorch has almost 4,000. That might seem small, but both projects have a larger contributor footprint than the headcount of most tech companies operating at significant scale.

On top of that, consider that pytorch is a project with its origins at Meta, and Meta has internal teams that spend 100% of their time supporting the project. Coupled with the fact that Meta just purchased nearly 200k units worth of AMD inference gear (MI300X), there is a massive groundswell of tech effort being pushed in AMD's direction.

> wouldn't AMD just invest in doing this themselves, or at least provide much more support to these guys?

That was actually the point of George Hotz's "cultural test" (as he put it). He wanted to see if they were willing to part with some expensive gear in the spirit of enabling him to help them with more velocity. And they came through, so I think that's a win no matter which lens you analyze this through.

Since resources are finite, especially in terms of human capital, there's only so much to go around. As a result, AMD can now focus more on the software closer to the metal, namely the driver. They still have significant stability issues to overcome in that layer, so letting the greater ML community help them shore up the deltas in other areas is great.