1045 points mfiguiere | 19 comments

btown ◴[] No.39345221[source]
Why would this not be AMD’s top priority among priorities? Someone recently likened the situation to an Iron Age where NVIDIA owns all the iron. And this sounds like AMD knowing about a new source of ore and not even being willing to sink a single engineer’s salary into exploration.

My only guess is they have a parallel skunkworks working on the same thing, but in a way that they can keep closed-source - that this was a hedge they think they no longer need, and that they are missing the forest for the trees on the benefits that cross-pollination and an open-source ethos would bring to their business.

replies(14): >>39345241 #>>39345302 #>>39345393 #>>39345400 #>>39345458 #>>39345853 #>>39345857 #>>39345893 #>>39346210 #>>39346792 #>>39346857 #>>39347433 #>>39347900 #>>39347927 #
fariszr ◴[] No.39345241[source]
According to the article, AMD seems to have pulled the plug on this as they think it will hinder ROCm 6 adoption, which, by the way, still only supports two consumer cards out of their entire lineup [1].

1. https://www.phoronix.com/news/AMD-ROCm-6.0-Released

replies(4): >>39345503 #>>39345558 #>>39346200 #>>39346480 #
1. kkielhofner ◴[] No.39345558[source]
With the most recent supported card being their one-year-old flagship ($1k) consumer GPU...

Meanwhile CUDA supports anything with Nvidia stamped on it before it's even released. They'll even go as far as doing things like adding support for new GPUs/compute families to older CUDA versions (see Hopper/Ada and CUDA 11.8).

You can go out and buy any Nvidia GPU the day of release, take it home, plug it in, and everything just works. This is what people expect.

AMD seems to have no clue that this level of usability is what it will take to actually compete with Nvidia and it's a real shame - their hardware is great.

replies(5): >>39345774 #>>39345894 #>>39346438 #>>39346550 #>>39346788 #
2. KingOfCoders ◴[] No.39345774[source]
AMD thinks the reason Nvidia is ahead of them is bad marketing on their part and good marketing ("everything is AI") by Nvidia. They don't see the difference in the software stacks.

For years I've wanted to get off the Nvidia train for AI, but I keep being forced to buy another Nvidia card because AMD's stuff just doesn't work, while all the examples work with Nvidia cards as they should.

replies(1): >>39346187 #
3. roenxi ◴[] No.39345894[source]
You've got to remember that AMD is behind in all aspects of this, including documenting their work in an easily digestible way.

"Support" means that the card is actively tested and presumably has some sort of SLA-style push to fix bugs for. As their stack matures, a bunch of cards that don't have official support will work well [0]. I have an unsupported card. There are horrible bugs. But the evidence I've seen is that the card will work better with time even though it is never going to be officially supported. I don't think any of my hardware is officially supported by the manufacturer, but the kernel drivers still work fine.

> Meanwhile CUDA supports anything with Nvidia stamped on it before it's even released...

A lot of older Nvidia cards don't support CUDA v9 [1]. It isn't like everything supports everything, particularly in the early part of building out capability. The impression I'm getting is that in practice the gap in strategy here is not as large as the current state makes it seem.

[0] If anyone has bought an AMD card for their machine to multiply matrices, they've been gambling on whether the capability is there. This comment is reasonable speculation, but I want to caveat the optimism by saying that I'm not going to put money into AMD compute until there is some actual evidence on the table that GPU lockups are rare (see the quick smoke test sketched below).

[1] https://en.wikipedia.org/wiki/CUDA#GPUs_supported
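
For what it's worth, a quick smoke test for an officially unsupported card looks roughly like the sketch below. This is only an illustration, not anything official: the HSA_OVERRIDE_GFX_VERSION trick is the usual community workaround for consumer RDNA2 cards, the 10.3.0 value is an assumption you'd adjust (or drop) for your own GPU, and the file name is made up.

    /* Build with: hipcc smoke_test.cpp -o smoke_test */
    #include <hip/hip_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        /* Community workaround for cards missing from the official support
         * list; the value assumes an RDNA2 (gfx1030-class) card. It must be
         * set before the first HIP call initializes the runtime. */
        setenv("HSA_OVERRIDE_GFX_VERSION", "10.3.0", 0);

        int count = 0;
        if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
            puts("no usable HIP device found");
            return 1;
        }

        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, 0);
        printf("device 0: %s (%s)\n", prop.name, prop.gcnArchName);

        /* Trivial allocate/clear round trip; if even this hangs or errors,
         * the card isn't usable for compute yet. */
        float *buf = NULL;
        if (hipMalloc((void **)&buf, 1024 * sizeof(float)) == hipSuccess) {
            hipMemset(buf, 0, 1024 * sizeof(float));
            hipDeviceSynchronize();
            hipFree(buf);
            puts("basic HIP calls completed");
        }
        return 0;
    }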

replies(3): >>39346802 #>>39347041 #>>39347408 #
4. fortran77 ◴[] No.39346187[source]
At the risk of sounding like Steve Ballmer, the reason I only use NVIDIA for GPGPU work (our company does a lot of it!) is the developer support. They have compilers, tools, documentation, and tech support for developers who want to do any type of GPGPU computing on their hardware, and that just isn't matched on any other platform.
5. Certhas ◴[] No.39346438[source]
The most recent "card" is their MI300 line.

It's annoying as hell to you and me that they are not catering to the market of people who want to run stuff on their gaming cards.

But it's not clear it's bad strategy to focus on executing in the high-end first. They have been very successful landing MI300s in the HPC space...

Edit: I just looked it up: 25% of the GPU Compute in the current Top500 Supercomputers is AMD

https://www.top500.org/statistics/list/

Even though the list has plenty of V100s and A100s, which came out (much) earlier. I don't have the data at hand, but I wouldn't be surprised if AMD got more of the new Top500 installations than nVidia in the last two years.

replies(2): >>39347800 #>>39351254 #
6. voakbasda ◴[] No.39346550[source]
In the embedded space, Nvidia regularly drops support for older hardware. The last supported kernel for their Jetson TX2 was 4.9. Their newer Jetson Xavier line is stuck on 5.10.

The hardware may be great, but their software ecosystem is utter crap. As long as they stay the unchallenged leader in hardware, I expect Nvidia will continue to produce crap software.

I would push to switch our products in a heartbeat if AMD actually got their act together. If this alternative offers a path to evaluate our current application software stack on an AMD devkit, I would buy one tomorrow.

replies(1): >>39348665 #
7. streb-lo ◴[] No.39346788[source]
I have been using ROCm on my 7800 XT and it seems to be supported just fine.
8. spookie ◴[] No.39346802[source]
To be fair, if anything, that table still shows you'll have compatibility with at least 3 major releases. Either way, I agree their strategy is getting results; it just takes time. I do prefer their open-source commitment; I just hope they keep it up.
9. paulmd ◴[] No.39347041[source]
All versions of CUDA support PTX, which is an intermediate bytecode/compiler representation that can be final-compiled even by CUDA 1.0.

So the contract is: as long as your future program does not touch any intrinsics etc. that do not exist in CUDA 1.0, you can export the new program from CUDA 27.0 as PTX, and the 8800 GTX driver will read the PTX and let your GPU run it as CUDA 1.0 code… so it is quite literally just as they describe: unlimited forward and backward capability/support, as long as you go through PTX in the middle (a sketch of that path follows the links below).

https://docs.nvidia.com/cuda/archive/10.1/parallel-thread-ex...

https://en.wikipedia.org/wiki/Parallel_Thread_Execution
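
To make that concrete, here is a minimal sketch of the PTX path (illustrative only, not taken from the links above): the host hands a tiny PTX module to the driver API, and the installed driver JIT-compiles it for whatever GPU is actually present. The no-op kernel, its name, and the file name are made up for the example.

    /* Build with e.g.: nvcc ptx_jit.cpp -o ptx_jit -lcuda */
    #include <cuda.h>   /* CUDA driver API */
    #include <stdio.h>

    /* A trivial PTX module; real code would come out of `nvcc -ptx`. */
    static const char *kPtx =
        ".version 6.0\n"
        ".target sm_50\n"
        ".address_size 64\n"
        ".visible .entry noop() { ret; }\n";

    int main(void) {
        CUdevice dev;
        CUcontext ctx;
        CUmodule mod;
        CUfunction fn;

        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);

        /* The installed driver compiles the PTX for the present GPU at load time. */
        if (cuModuleLoadData(&mod, kPtx) != CUDA_SUCCESS) {
            puts("PTX JIT failed");
            return 1;
        }
        cuModuleGetFunction(&fn, mod, "noop");
        cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, NULL, NULL, NULL);
        cuCtxSynchronize();
        puts("PTX kernel JIT-compiled and ran on the installed GPU");

        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
        return 0;
    }

This same forward path is what toolchains lean on when they embed PTX alongside precompiled binaries (e.g. nvcc's -gencode arch=compute_70,code=compute_70, which embeds PTX for forward compatibility).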

10. ColonelPhantom ◴[] No.39347408[source]
CUDA dropped Tesla (from 2006!) only as of 7.0, which seems to have been released around 2015. Fermi support lasted from 2010 until 2017, giving it a solid 7 years. Kepler support was dropped around 2020, and the first Kepler cards were released in 2012.

As such, Fermi seems to be the shortest-supported architecture, and it was around for 7 years. GCN4 (Polaris) was introduced in 2016 and seems to have been officially dropped around 2021, just 5 years in. While you could still get it working with various workarounds, I don't see any evidence of Nvidia being even remotely as hasty as AMD in removing support, even for early architectures like Tesla and Fermi.

replies(1): >>39347791 #
11. hedgehog ◴[] No.39347791{3}[source]
On top of this, some Kepler support (for K80s, etc.) is still maintained in CUDA 11, which was last updated in late 2022, and libraries like PyTorch and TensorFlow still support CUDA 11.8 out of the box.
12. latchkey ◴[] No.39347800[source]
I'm building a bare-metal business around the MI300x and top-end Epyc CPUs. We will have them available for rental soon. The goal is to build a public supercomputer that isn't just available to researchers in HPC.
replies(1): >>39348839 #
13. kkielhofner ◴[] No.39348665[source]
In the embedded space, customers develop bespoke solutions to, well, embed in products: they (essentially) bake the firmware image and more or less freeze the entire software stack apart from incremental updates. The next version of your product uses the next fresh Jetson and JetPack release. Repeat. Using the latest and greatest kernel is far from a top consideration in these applications...

I was actually advising an HN user against using Jetson just the other day because it's such an extreme outlier when it comes to Nvidia and software support. Frankly Jetson makes no sense unless you really need the power efficiency and form-factor.

Meanwhile, any seven-year-old Pascal-or-newer card is fully supported in CUDA 12 and the most recent driver releases. That, combined with my initial data points and the others people have chimed in with in this thread, is far from "utter crap".

Use the right tool for the job.

14. beebeepka ◴[] No.39348839{3}[source]
Is it true that the MI300 line is 3-4x cheaper for similar performance than whatever Nvidia is selling in its highest segment?
replies(1): >>39349170 #
15. latchkey ◴[] No.39349170{4}[source]
I probably can't comment on that, but what I can comment on is this:

H100s are hard to get. Nearly impossible. CoreWeave and others have scooped them all up for the foreseeable future. So if you are looking at price as the only factor, it becomes somewhat irrelevant if you can't even buy them [0]. I don't really understand the focus on price because of this fact.

Even if you do manage to score yourself some H100s, you also need to factor in the networking between nodes. InfiniBand (IB), made by Mellanox, is owned by NVIDIA, and lead times on that equipment are 50+ weeks. Again, price becomes irrelevant if you can't even network your boxes together.

As someone building a business around the MI300x (and future products), I don't care that much about price [!]. We know going in that this is a super capital-intensive business and have secured the backing to support that. It is one of those things where "if you have to ask, you can't afford it."

We buy cards by the chassis; it is one price. I actually don't know the exact prices of the cards (but I can infer them). A lot of it is about who you know and what you're doing: you buy more chassis, you get better pricing. Azure is probably paying half of what I'm paying [1]. But I'd also say that, from what I've seen so far, their chassis aren't nearly as nice as mine. I have dual 9754s, 2x bonded 400G, 3TB of RAM, and 122TB of NVMe... plus the 8x MI300x. These are top of the top. They have Intel and I don't know what else inside.

[!] Before you harp on me: of course I care about price... but at the end of the day, it isn't what I'm focused on today so much as investing all of the capex/opex I can get my hands on into building a sustainable business that provides as much value as possible to our customers.

[0] https://www.tomshardware.com/news/tsmc-shortage-of-nvidias-a...

[1] https://www.techradar.com/pro/instincts-are-massively-cheape...

replies(1): >>39351396 #
16. kkielhofner ◴[] No.39351254[source]
Indeed, but this is extremely short-sighted.

You don't win an overall market by focusing on several-hundred-million-dollar bespoke HPC builds where the platform (frankly) doesn't matter at all. I'm working on a project on an AMD platform on the list (won't say which - for now), and needless to say you build what you have to on whatever is there, regardless of what it takes, and the operators/owners and vendor support teams pour in whatever resources are necessary to make it work.

You win a market a generation at a time - supporting low-end cards for tinkerers, the educational market, etc. AMD should focus on the low end because that's where the next generation of AI devs, startups, and innovation is coming from, and for now that's going to continue to be CUDA/Nvidia.

replies(1): >>39356010 #
17. beebeepka ◴[] No.39351396{5}[source]
Pretty sweet. I do envy you. For what it's worth, I would prefer AMD to charge as much as possible for these little beasts.
replies(1): >>39351419 #
18. latchkey ◴[] No.39351419{6}[source]
They definitely aren't cheap.
19. Certhas ◴[] No.39356010{3}[source]
Those people didn't build the CUDA ecosystem. nVidia, Google and Facebook did. I think your hypothesis is pretty self-serving.

nVidia is dominant now. The question is: what's your wedge?