←back to thread

183 points spacebanana7 | 1 comments | | HN request time: 0s | source

I appreciate developing ROCm into something competitive with CUDA would require a lot of work, both internally within AMD and with external contributions to the relevant open source libraries.

However the amount of resources at stake is incredible. The delta between NVIDIA's value and AMD's is bigger than the annual GDP of Spain. Even if they needed to hire a few thousand engineers at a few million in comp each, it'd still be a good investment.

Show context
fancyfredbot ◴[] No.43547461[source]
There is more than one way to answer this.

They have made an alternative to the CUDA language with HIP, which can do most of the things the CUDA language can.

You could say that they haven't released supporting libraries like cuDNN, but they are making progress on this with AiTer for example.

You could say that they have fragmented their efforts across too many different paradigms but I don't think this is it because Nvidia also support a lot of different programming models.

I think the reason is that they have not prioritised support for ROCm across all of their products. There are too many different architectures with varying levels of support. This isn't just historical. There is no ROCm support for their latest AI Max 395 APU. There is no nice cross architecture ISA like PTX. The drivers are buggy. It's just all a pain to use. And for that reason "the community" doesn't really want to use it, and so it's a second class citizen.

This is a management and leadership problem. They need to make using their hardware easy. They need to support all of their hardware. They need to fix their driver bugs.

replies(6): >>43547568 #>>43547675 #>>43547799 #>>43547827 #>>43549724 #>>43558036 #
trod1234 ◴[] No.43547675[source]
It is a little bit more complicated than ROCm simply not having support because ROCm has at a point claimed support, and they've had to walk it back painfully (multiple times). Its not a driver issue, nor a hardware issue on their side.

There has been a long-standing issue between AMD and its mainboard manufacturers. The issue has to do with features required for ROCm, namely PCIe Atomics. AMD has been unable or unwilling to hold the mainboard manufacturers to account for advertising features the mainboard does not support.

The CPU itself must support this feature, but the mainboard must as well (in firmware).

One of the reasons why ROCm hasn't worked in the past is because the mainboard manufacturers have claimed and advertised support for PCIe Atomics, and the support they've claimed has been shown to be false, and the software fails in non-deterministic ways when tested. This is nightmare fuel for the few AMD engineers tasked with ROCm.

PCIe Atomics requires non-translated direct IO to operate correctly, and in order to support the same CPU models from multiple generations they've translated these IO lines in firmware.

This has left most people that query their system to check this showing PCIAtomics is supported, while when actual tests that rely on that support are done they fail, in chaotic ways. There is no technical specification or advertising that the mainboard manufacturers provide showing whether this is supported. Even the boards with multiple x16 slots and the many technologies related to it such as Crossfire/SLI/mGPU brandings these don't necessarily show whether PCIAtomics is properly supported.

In other words, the CPU is supported, the firmware/mainboard fail with no way to differentiate between the two at the upper layers of abstraction.

All in all. You shouldn't be blaming AMD for this. You should be blaming the three mainboard manufacturers who chose to do this. Some of these manufacturers have upper end boards where they actually did do this right they just chose to not do this for any current gen mainboard costing less than ~$300-500.

replies(5): >>43547751 #>>43547777 #>>43547796 #>>43549200 #>>43549341 #
fancyfredbot ◴[] No.43547796[source]
Look, this sounds like a frustrating nightmare, but the way it seems to us consumers is that AMD chose to rely on poorly implemented and supported technology, and Nvidia didn't. I can't blame AMD for the poor support by motherboards manufacturers but I can and will blame AMD for relying on it.
replies(1): >>43548795 #
trod1234 ◴[] No.43548795[source]
While we won't know for sure, unless someone from AMD comments on this; in fairness there may not have been any other way.

Nvidia has a large number of GPU related patents.

The fact that AMD chose to design their system this way, in such a roundabout and brittle manner, which is contrary to how engineer's approach things, may have been a direct result of being unable to design such systems any other way because of broad patents tied to the interface/GPU.

replies(1): >>43549136 #
fancyfredbot ◴[] No.43549136[source]
I feel like this issue is to at least some extent a red herring. Even accepting that ROCm doesn't work on some motherboards, this can't explain why so few of AMD's GPUs have official ROCm support.

I notice that at one point there was a ROCm release which said it didn't require atomics for gfx9 GPUs, but the requirement was reintroduced in a later version of ROCm. Not sure what happened there but this seems to suggest AMD might have had a workaround at some point (though possibly it didn't work).

If this really is due to patent issues AMD can likely afford to licence or cross-license the patent given potential upside.

It would be in line with other decisions taken by AMD if they took this decision because it works well with their datacentre/high-end GPUs, and they don't (or didn't) really care about offering GPGPU to the mass/consumer GPU market.

replies(3): >>43549331 #>>43549410 #>>43553565 #
1. zozbot234 ◴[] No.43549410{3}[source]
> why so few of AMD's GPUs have official ROCm support

Because "official ROCm support" means "you can rely on AMD to make this work on your system for your critical needs". If you want "support" in the "you can goof around with this stuff on your own and don't care if there's any breakage" sense, ROCm "supports" a whole lot of AMD hardware. They should just introduce a new "experimental, unsupported" tier and make this official on their end.