
183 points by spacebanana7 | 2 comments

I appreciate that developing ROCm into something competitive with CUDA would require a lot of work, both internally within AMD and through external contributions to the relevant open source libraries.

However, the amount of resources at stake is incredible. The delta between NVIDIA's market capitalization and AMD's is bigger than the annual GDP of Spain. Even if they needed to hire a few thousand engineers at a few million dollars in comp each, it'd still be a good investment.

fancyfredbot | No.43547461
There is more than one way to answer this.

They have made an alternative to the CUDA language with HIP, which can do most of the things the CUDA language can.
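For anyone who hasn't used it, HIP really is close to a line-for-line port of the CUDA programming model. A minimal sketch of a vector add, assuming a working ROCm install with hipcc (the file and variable names here are mine, purely illustrative):

    // vector_add.hip.cpp -- build with: hipcc vector_add.hip.cpp -o vector_add
    #include <hip/hip_runtime.h>
    #include <cstdio>
    #include <vector>

    // Same __global__ / blockIdx / threadIdx model as CUDA, just hip* runtime calls.
    __global__ void vector_add(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

        float *da, *db, *dc;
        hipMalloc((void**)&da, n * sizeof(float));
        hipMalloc((void**)&db, n * sizeof(float));
        hipMalloc((void**)&dc, n * sizeof(float));
        hipMemcpy(da, a.data(), n * sizeof(float), hipMemcpyHostToDevice);
        hipMemcpy(db, b.data(), n * sizeof(float), hipMemcpyHostToDevice);

        // Triple-chevron launch syntax, as in CUDA.
        vector_add<<<(n + 255) / 256, 256>>>(da, db, dc, n);
        hipMemcpy(c.data(), dc, n * sizeof(float), hipMemcpyDeviceToHost);

        printf("c[0] = %f\n", c[0]); // expect 3.0
        hipFree(da); hipFree(db); hipFree(dc);
        return 0;
    }

AMD also ships hipify tools that mechanically translate most CUDA source into this form.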

You could say that they haven't released supporting libraries like cuDNN, but they are making progress on this with AiTer for example.

You could say that they have fragmented their efforts across too many different paradigms, but I don't think this is it, because Nvidia also support a lot of different programming models.

I think the reason is that they have not prioritised support for ROCm across all of their products. There are too many different architectures with varying levels of support. This isn't just historical: there is no ROCm support for their latest AI Max 395 APU. There is no nice cross-architecture ISA like PTX. The drivers are buggy. It's just all a pain to use. And for that reason "the community" doesn't really want to use it, and so it's a second-class citizen.
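To make the fragmentation point concrete: the usual first step is to ask the runtime which gfx target your card reports and then look that string up in the ROCm support matrix. A rough sketch, assuming a recent ROCm where hipDeviceProp_t exposes gcnArchName:

    // list_gfx.cpp -- build with: hipcc list_gfx.cpp -o list_gfx
    #include <hip/hip_runtime.h>
    #include <cstdio>

    int main() {
        int count = 0;
        if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
            printf("No HIP devices found (or the runtime/driver is unhappy)\n");
            return 1;
        }
        for (int i = 0; i < count; ++i) {
            hipDeviceProp_t prop;
            hipGetDeviceProperties(&prop, i);
            // gcnArchName is the gfx target (e.g. "gfx90a" or "gfx1030") that has to be
            // matched against the official support list -- there is no PTX-style virtual ISA.
            printf("device %d: %s (%s)\n", i, prop.name, prop.gcnArchName);
        }
        return 0;
    }

Because code objects are built for specific gfx targets rather than a portable intermediate ISA, an unlisted target often means no official support at all, which is the fragmentation complained about above.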

This is a management and leadership problem. They need to make using their hardware easy. They need to support all of their hardware. They need to fix their driver bugs.

trod1234 | No.43547675
It is a little bit more complicated than ROCm simply not having support, because ROCm has at points claimed support and then had to walk it back, painfully, multiple times. It's not a driver issue, nor a hardware issue on their side.

There has been a long-standing issue between AMD and its mainboard manufacturers. The issue has to do with a feature required for ROCm, namely PCIe Atomics. AMD has been unable or unwilling to hold the mainboard manufacturers to account for advertising features their boards do not support.

The CPU itself must support this feature, but the mainboard must as well (in firmware).

One of the reasons ROCm hasn't worked in the past is that the mainboard manufacturers have claimed and advertised support for PCIe Atomics, that claimed support has been shown to be false, and the software fails in non-deterministic ways when tested. This is nightmare fuel for the few AMD engineers tasked with ROCm.

PCIe Atomics requires non-translated direct IO to operate correctly, and in order to support the same CPU models from multiple generations they've translated these IO lines in firmware.

This leaves most people who query their system seeing that PCIe Atomics are supported, while actual tests that rely on that support fail in chaotic ways. There is no technical specification or advertising from the mainboard manufacturers showing whether this is supported. Even boards with multiple x16 slots and the many related technologies and brandings such as Crossfire/SLI/mGPU don't necessarily indicate whether PCIe Atomics are properly supported.

In other words, the CPU is supported but the firmware/mainboard fails, with no way to differentiate between the two at the upper layers of abstraction.
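For what it's worth, the advertised capability lives in the Device Capabilities 2 register of each device's PCI Express capability, and on Linux you can read it yourself. A rough sketch (the sysfs path, the need for root to read past the first 64 bytes of config space, and the bit positions from my reading of the PCIe spec are all assumptions to double-check), which of course only shows what the platform claims, not whether the firmware actually honours it:

    // pcie_atomics_check.cpp -- rough sketch; run as root, e.g. ./pcie_atomics_check 0000:03:00.0
    // Walks the PCI capability list in config space, finds the PCI Express capability,
    // and decodes the AtomicOp bits in Device Capabilities 2.
    #include <cstdint>
    #include <cstdio>
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <vector>

    int main(int argc, char** argv) {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <pci-bdf, e.g. 0000:03:00.0>\n", argv[0]);
            return 1;
        }
        std::string path = std::string("/sys/bus/pci/devices/") + argv[1] + "/config";
        std::ifstream f(path, std::ios::binary);
        std::vector<unsigned char> cfg((std::istreambuf_iterator<char>(f)),
                                       std::istreambuf_iterator<char>());
        if (cfg.size() < 0x100) {
            fprintf(stderr, "couldn't read full config space from %s (need root?)\n", path.c_str());
            return 1;
        }

        // Standard capability list starts at the pointer stored at offset 0x34.
        uint8_t ptr = cfg[0x34];
        for (int hops = 0; ptr != 0 && ptr + 1u < cfg.size() && hops < 48; ++hops) {
            if (cfg[ptr] == 0x10 && ptr + 0x28u <= cfg.size()) {  // 0x10 = PCI Express capability
                // Device Capabilities 2 is at offset 0x24 within that capability.
                uint32_t devcap2 = cfg[ptr + 0x24] | (cfg[ptr + 0x25] << 8) |
                                   (cfg[ptr + 0x26] << 16) | (uint32_t(cfg[ptr + 0x27]) << 24);
                printf("DevCap2 = 0x%08x\n", devcap2);
                printf("  AtomicOp routing:          %s\n", (devcap2 >> 6) & 1 ? "yes" : "no");
                printf("  32-bit AtomicOp completer: %s\n", (devcap2 >> 7) & 1 ? "yes" : "no");
                printf("  64-bit AtomicOp completer: %s\n", (devcap2 >> 8) & 1 ? "yes" : "no");
                printf("  128-bit CAS completer:     %s\n", (devcap2 >> 9) & 1 ? "yes" : "no");
                return 0;
            }
            ptr = cfg[ptr + 1];  // next capability pointer
        }
        fprintf(stderr, "no PCI Express capability found\n");
        return 1;
    }

Newer versions of lspci -vvv decode the same register (look for AtomicOpsCap), but either way you are only seeing the claim, which is exactly the problem described above.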

All in all, you shouldn't be blaming AMD for this. You should be blaming the three mainboard manufacturers who chose to do this. Some of these manufacturers have upper-end boards where they actually did do this right; they just chose not to do it on any current-gen mainboard costing less than roughly $300-500.

fancyfredbot | No.43547796
Look, this sounds like a frustrating nightmare, but the way it seems to us consumers is that AMD chose to rely on poorly implemented and poorly supported technology, and Nvidia didn't. I can't blame AMD for the poor support from motherboard manufacturers, but I can and will blame AMD for relying on it.
trod1234 | No.43548795
While we won't know for sure unless someone from AMD comments on this, in fairness there may not have been any other way.

Nvidia has a large number of GPU related patents.

The fact that AMD chose to design their system this way, in such a roundabout and brittle manner, which is contrary to how engineers approach things, may have been a direct result of being unable to design such systems any other way because of broad patents tied to the interface/GPU.

fancyfredbot | No.43549136
I feel like this issue is to at least some extent a red herring. Even accepting that ROCm doesn't work on some motherboards, this can't explain why so few of AMD's GPUs have official ROCm support.

I notice that at one point there was a ROCm release which said it didn't require atomics for gfx9 GPUs, but the requirement was reintroduced in a later version of ROCm. Not sure what happened there but this seems to suggest AMD might have had a workaround at some point (though possibly it didn't work).

If this really is due to patent issues, AMD can likely afford to license or cross-license the patent given the potential upside.

It would be in line with other decisions taken by AMD if they took this decision because it works well with their datacentre/high-end GPUs, and they don't (or didn't) really care about offering GPGPU to the mass/consumer GPU market.

trod1234 | No.43549331
> I feel like this issue is to at least some extent a red herring.

I don't see that; these two issues adequately explain why so few GPUs have official support. They don't want to get hit with a lawsuit as a result of issues outside their sphere of control.

> If this really is due to patent issues AMD can likely afford to license or cross-license the patent given potential upside.

Have you ever known any company willing to cede market dominance and license or cross-license a patent, letting competition into a market that they hold an absolute monopoly over, let alone in an environment where antitrust is non-existent and fangless?

There is no upside for NVIDIA to do that. If you want to do serious AI/ML work you currently need to use NVIDIA hardware, and they can charge whatever they want for that.

The moment you have a competitor, demand is halved at a bare minimum depending on how much the competitor undercuts you by. Any agreement on coordinating prices leads to price-fixing indictments.

fancyfredbot | No.43549484
> I don't see that, these two issues adequately explain why so few GPUs have official support.

I'm sorry I don't follow this. Surely if all AMD GPUs have the same problem with atomics then this can't explain why some GPUs are supported and others aren't?

> There is no upside for NVIDIA to do that.

If NVIDIA felt this patent was actually protecting them from competition, then there would be no upside. But NVIDIA has competition from AMD, Intel, Google, and Amazon. Intel have managed to engineer OneAPI support for their GPUs without licensing this patent or relying on PCIe atomics.

AMD have patents NVIDIA would be interested in, for example multi-chiplet GPUs.