183 points spacebanana7 | 46 comments

I appreciate that developing ROCm into something competitive with CUDA would require a lot of work, both internally within AMD and through external contributions to the relevant open source libraries.

However the amount of resources at stake is incredible. The delta between NVIDIA's value and AMD's is bigger than the annual GDP of Spain. Even if they needed to hire a few thousand engineers at a few million in comp each, it'd still be a good investment.

1. fancyfredbot ◴[] No.43547461[source]
There is more than one way to answer this.

They have made an alternative to the CUDA language with HIP, which can do most of the things the CUDA language can.

You could say that they haven't released supporting libraries like cuDNN, but they are making progress on this with AiTer for example.

You could say that they have fragmented their efforts across too many different paradigms but I don't think this is it because Nvidia also support a lot of different programming models.

I think the reason is that they have not prioritised support for ROCm across all of their products. There are too many different architectures with varying levels of support. This isn't just historical. There is no ROCm support for their latest AI Max 395 APU. There is no nice cross architecture ISA like PTX. The drivers are buggy. It's just all a pain to use. And for that reason "the community" doesn't really want to use it, and so it's a second class citizen.

This is a management and leadership problem. They need to make using their hardware easy. They need to support all of their hardware. They need to fix their driver bugs.

replies(6): >>43547568 #>>43547675 #>>43547799 #>>43547827 #>>43549724 #>>43558036 #
2. thrtythreeforty ◴[] No.43547568[source]
This ticket, finally closed after being open for 2 years, is a pretty good microcosm of this problem:

https://github.com/ROCm/ROCm/issues/1714

Users complaining that the docs don't even specify which cards work.

But it goes deeper - a valid complaint is that "this only supports one or two consumer cards!" A common rebuttal is that it works fine on lots of AMD cards if you set some environment flag to force the GPU architecture selection. The fact that this is so close to working on a wide variety of hardware, and yet doesn't, is exactly the vibe you get with the whole ecosystem.
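
For the curious, the environment flag in question is HSA_OVERRIDE_GFX_VERSION. A minimal sketch of the workaround, assuming a ROCm build of PyTorch and an RDNA2 consumer card (the target value is illustrative, not an official support statement):

    import os

    # Pretend an unsupported RDNA2 card (e.g. gfx1031/gfx1032) is the
    # officially supported gfx1030 target. Must be set before any
    # ROCm-backed library loads.
    os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"

    import torch  # ROCm builds of PyTorch expose HIP devices via torch.cuda

    if torch.cuda.is_available():
        print("Using:", torch.cuda.get_device_name(0))
    else:
        print("No ROCm-visible GPU found")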

replies(6): >>43547700 #>>43547940 #>>43547988 #>>43548203 #>>43549097 #>>43550313 #
3. trod1234 ◴[] No.43547675[source]
It is a little more complicated than ROCm simply not having support: ROCm has at points claimed support, and AMD has had to walk it back painfully (multiple times). It's not a driver issue, nor a hardware issue on their side.

There has been a long-standing issue between AMD and its mainboard manufacturers. The issue has to do with features required for ROCm, namely PCIe Atomics. AMD has been unable or unwilling to hold the mainboard manufacturers to account for advertising features the mainboard does not support.

The CPU itself must support this feature, but the mainboard must as well (in firmware).

One of the reasons ROCm hasn't worked in the past is that mainboard manufacturers have claimed and advertised support for PCIe Atomics, that claimed support has been shown to be false, and the software fails in non-deterministic ways when tested. This is nightmare fuel for the few AMD engineers tasked with ROCm.

PCIe Atomics requires non-translated direct IO to operate correctly, but in order to support the same CPU models across multiple generations, manufacturers have translated these IO lines in firmware.

This has left most people who query their system seeing that PCIe Atomics is supported, while actual tests that rely on that support fail in chaotic ways. Mainboard manufacturers provide no technical specification or advertising showing whether this is supported. Even boards with multiple x16 slots and the related Crossfire/SLI/mGPU brandings don't necessarily show whether PCIe Atomics is properly supported. (A sketch of querying the advertised bits follows at the end of this comment.)

In other words, the CPU is supported but the firmware/mainboard fails, with no way to differentiate between the two at the upper layers of abstraction.

All in all, you shouldn't be blaming AMD for this. You should be blaming the three mainboard manufacturers who chose to do this. Some of these manufacturers have upper-end boards where they actually did do this right; they just chose not to for any current-gen mainboard costing less than ~$300-500.
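
To make the "query says supported" problem concrete: on Linux the advertised capability lives in the PCIe DevCap2 registers that lspci can dump. A hedged sketch that surfaces those bits, with the caveat from above that the advertised values are exactly what can turn out to be false in practice:

    import subprocess

    # Dump the PCIe AtomicOps capability/control bits that the platform
    # advertises (Linux only; run as root for a full capability dump).
    # Per the comment above, "advertised" does not guarantee "works".
    out = subprocess.run(
        ["lspci", "-vv"], capture_output=True, text=True, check=True
    ).stdout

    for line in out.splitlines():
        if "AtomicOpsCap" in line or "AtomicOpsCtl" in line:
            print(line.strip())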

replies(5): >>43547751 #>>43547777 #>>43547796 #>>43549200 #>>43549341 #
4. CoastalCoder ◴[] No.43547700[source]
I had a similar (I think) experience when building LLVM from source a few years ago.

I kept running into some problem with LLVM's support for HIP code, even though I had no interest in that functionality.

I realize this isn't exactly an AMD problem, but IIRC they were the ones who contributed the troublesome code to LLVM, and it remained unfixed.

Apologies if there's something unfair or uninformed in what I wrote, it's been a while.

5. spacebanana7 ◴[] No.43547751[source]
How does NVIDIA manage this issue? I wonder whether they have a very different supply chain or just design software that puts less trust in the reliability of those advertised features.
replies(2): >>43547816 #>>43549279 #
6. pjc50 ◴[] No.43547777[source]
So .. how's Nvidia dealing with this? Or do they benefit from motherboard manufacturers doing preferential integration testing?
7. fancyfredbot ◴[] No.43547796[source]
Look, this sounds like a frustrating nightmare, but the way it seems to us consumers is that AMD chose to rely on poorly implemented and poorly supported technology, and Nvidia didn't. I can't blame AMD for the poor support by motherboard manufacturers, but I can and will blame AMD for relying on it.
replies(1): >>43548795 #
8. sigmoid10 ◴[] No.43547799[source]
>This is a management and leadership problem.

It's easy (and mostly correct) to blame management for this, but it's such a foundational issue that even if everyone up to the CEO pivoted on every topic, it wouldn't change anything. They simply don't have the engineering talent to pull this off, because they somehow concluded that making stuff open source means someone else will magically do the work for you. Nvidia on the other hand has accrued top talent for more than a decade and carefully developed their ecosystem to reach this point. And there are only so many talented engineers on the planet. So even if AMD leadership wakes up tomorrow, they won't go anywhere for a looong time.

replies(2): >>43550127 #>>43554678 #
9. trod1234 ◴[] No.43547816{3}[source]
It's an open question they have never answered, afaik.

I would speculate that their design is self-contained in hardware.

10. pjc50 ◴[] No.43547827[source]
> This is a management and leadership problem. They need to make using their hardware easy. They need to support all of their hardware. They need to fix their driver bugs.

Yes. This kind of thing is unfortunately endemic in hardware companies, which don't "get" software. It's cultural and requires (a) a leader who does Get It and (b) one of those Amazon memos stating "anyone who does not Get With The Program will be fired".

11. tomrod ◴[] No.43547940[source]
Geez. If I were Berkshire Hathaway looking to invest in the GPU market, this would be a major red flag in my fundamentals analysis.
12. Covzire ◴[] No.43547988[source]
That reeks of gross incompetence somewhere in the organization. Like a hosting company whose customer is dealing with very poor performance and greatly overpays to avoid it, while the whole time nobody even thinks to check what the Linux swap file is doing.
replies(1): >>43549801 #
13. mook ◴[] No.43548203[source]
I suspect part of it is also that Nvidia actually does a lot of things in firmware that can be upgraded. The new Nvidia Linux drivers (the "open" ones) support Turing cards from 2018. That means chips that old already do much of the processing in firmware.

AMD keeps having issues because their drivers talk to the hardware directly, so the drivers are massive bloated messes, famous for pages of auto-generated register definitions. Likely it's much more difficult to fix anything.

replies(2): >>43551710 #>>43553189 #
14. trod1234 ◴[] No.43548795{3}[source]
While we won't know for sure unless someone from AMD comments on this, in fairness there may not have been any other way.

Nvidia has a large number of GPU related patents.

The fact that AMD chose to design their system this way, in such a roundabout and brittle manner, contrary to how engineers approach things, may have been a direct result of being unable to design such systems any other way because of broad patents tied to the interface/GPU.

replies(1): >>43549136 #
15. iforgotpassword ◴[] No.43549097[source]
What I don't get is why they don't at least assign a dev or two to make the poster child of this space work: llama.cpp

It's the first thing anyone tries when dabbling in AI or GPU compute, yet it's a clusterfuck to get working. A few blessed cards work, with the proper drivers and kernel; others just crash, run horribly slowly, or output GGGGGGGGGGGGGG to every input (I'm not making this up!). Then you LOL, dump it, and go buy Nvidia, et voilà, stuff works on the first try.
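
For reference, the path usually meant by "getting llama.cpp to work" on AMD is a HIP-enabled build. A sketch via the llama-cpp-python bindings, where the model path and the historical build flag are assumptions, not a recipe guaranteed to work on any given card:

    # Assumes the package was built with the HIP backend enabled
    # (historically CMAKE_ARGS="-DLLAMA_HIPBLAS=on" at pip-install time;
    # the flag name has changed between releases) and that model.gguf is
    # a local quantized model -- both are assumptions here.
    from llama_cpp import Llama

    llm = Llama(model_path="model.gguf", n_gpu_layers=-1)  # offload all layers

    out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
    print(out["choices"][0]["text"])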

replies(1): >>43553556 #
16. fancyfredbot ◴[] No.43549136{4}[source]
I feel like this issue is to at least some extent a red herring. Even accepting that ROCm doesn't work on some motherboards, this can't explain why so few of AMD's GPUs have official ROCm support.

I notice that at one point there was a ROCm release which said it didn't require atomics for gfx9 GPUs, but the requirement was reintroduced in a later version of ROCm. Not sure what happened there but this seems to suggest AMD might have had a workaround at some point (though possibly it didn't work).

If this really is due to patent issues AMD can likely afford to licence or cross-license the patent given potential upside.

It would be in line with other decisions taken by AMD if they took this decision because it works well with their datacentre/high-end GPUs, and they don't (or didn't) really care about offering GPGPU to the mass/consumer GPU market.

replies(3): >>43549331 #>>43549410 #>>43553565 #
17. wongarsu ◴[] No.43549200[source]
There are so many hardware certification programs out there, why doesn't AMD run one to fix this?

Create a "ROCm compatible" logo and a list of criteria. Motherboard manufacturers can send a pre-production sample to AMD along with a check for some token amount (let's say $1000). AMD runs a comprehensive test suite to check actual compatibility, if it passes the mainboard is allowed to be advertised and sold with the previously mentioned logo. Then just tell consumers to look for that logo if they want to use ROCm. If things go wrong on a mainboard without the certification, communicate that it's probably the mainboard's fault.

Maybe add some kind of versioning scheme to allow updating the requirements in the future.

18. bigyabai ◴[] No.43549279{3}[source]
I should point out here, if nobody has already: Nvidia's GPU designs are extremely complicated compared to what AMD and Apple ship. The "standard" is to ship a PCIe card with display-handling drivers and some streaming-multiprocessor hardware to process your framebuffers. Nvidia goes even further by adding additional accelerators (ALUs by way of CUDA cores and tensor cores), onboard RTOS management hardware (what Nvidia calls the GPU System Processor), and more complex userland drivers that very well might be able to manage atomics without any PCIe standards.

This is also one of the reasons AMD and Apple can't simply turn their ship around right now. They've both invested heavily in simplifying their GPU and removing a lot of the creature-comforts people pay Nvidia for. 10 years ago we could at least all standardize on OpenCL, but these days it's all about proprietary frameworks and throwing competitors under the bus.

replies(1): >>43550634 #
19. trod1234 ◴[] No.43549331{5}[source]
> I feel like this issue is to at least some extent a red herring.

I don't see that; these two issues adequately explain why so few GPUs have official support. They don't want to get hit with a lawsuit as a result of issues outside their sphere of control.

> If this really is due to patent issues AMD can likely afford to license or cross-license the patent given potential upside.

Have you ever known any company willing to cede market dominance and license or cross-license a patent letting competition into a market that they hold an absolute monopoly over, let alone in an environment where antitrust is non-existent and fang-less?

There is no upside for NVIDIA to do that. If you want to do serious AI/ML work you currently need to use NVIDIA hardware, and they can charge whatever they want for that.

The moment you have a competitor, demand is halved at a bare minimum depending on how much the competitor undercuts you by. Any agreement on coordinating prices leads to price-fixing indictments.

replies(1): >>43549484 #
20. zozbot234 ◴[] No.43549341[source]
AIUI, AMD documentation claims that the requirement for PCIe Atomics is due to ROCm being based on Heterogeneous System Architecture, https://en.wikipedia.org/wiki/Heterogeneous_System_Architect... which allows for a sort of "unified memory" (strictly speaking, a unified address space) across CPU and GPU RAM. Other compute API's such as CUDA, OpenCL, SYCL or Vulkan Compute don't have HSA as a strict requirement but ROCm apparently does.
21. zozbot234 ◴[] No.43549410{5}[source]
> why so few of AMD's GPUs have official ROCm support

Because "official ROCm support" means "you can rely on AMD to make this work on your system for your critical needs". If you want "support" in the "you can goof around with this stuff on your own and don't care if there's any breakage" sense, ROCm "supports" a whole lot of AMD hardware. They should just introduce a new "experimental, unsupported" tier and make this official on their end.

22. fancyfredbot ◴[] No.43549484{6}[source]
> I don't see that, these two issues adequately explain why so few GPUs have official support.

I'm sorry I don't follow this. Surely if all AMD GPUs have the same problem with atomics then this can't explain why some GPUs are supported and others aren't?

> There is no upside for NVIDIA to do that.

If NVIDIA felt this patent was actually protecting them from competition then there would be no upside. But NVIDIA has competition from AMD, Intel, Google, and Amazon. Intel have managed to engineer OneAPI support for their GPUs without licensing this patent or relying on PCIe atomics.

AMD have patents NVIDIA would be interested in, for example multi-chiplet GPUs.

23. flutetornado ◴[] No.43549724[source]
I was able to compile ollama for AMD Radeon 780M GPUs, and I use it regularly on my AMD mini-PC, which cost me $500. It does require a bit more work. I get pretty decent performance with LLMs - just making a qualitative statement, as I didn't do any formal testing, but I got comparable performance vibes to an NVIDIA 4050 GPU laptop I also use.

https://github.com/likelovewant/ROCmLibs-for-gfx1103-AMD780M...

replies(1): >>43549769 #
24. vkazanov ◴[] No.43549769[source]
Same here on a Lenovo ThinkPad 14s with an AMD Ryzen™ AI 7 PRO 360, which has a Radeon 880M iGPU. Works OK on Ubuntu.

Not saying it works everywhere, but it wasn't even that hard to set up, comparable to CUDA.

Hate the name though.

replies(1): >>43554153 #
25. zombiwoof ◴[] No.43549801{3}[source]
This
26. jlundberg ◴[] No.43550127[source]
I wonder what would happen if they hired John Carmack to lead this effort.

He would probably be able to attract some really good hardware and driver talent.

replies(1): >>43556670 #
27. citizenpaul ◴[] No.43550313[source]
I've thought about this myself and come to a conclusion that your link reinforces. As I understand it, most companies doing (EE) hardware design and production consider (CS) software to be a second-class citizen at the company. It looks like AMD, after all this time competing with NVIDIA, has not learned the lesson. That said, I have never worked in hardware, so I'm going on what I've heard from other people.

NVIDIA, while far from perfect, has easily kept its software quality ahead of AMD's for over 20 years, while AMD repeatedly keeps falling on its face and getting egg all over itself, again and again, as far as software goes.

My guess is NVIDIA internally has found a way to keep the software people from feeling like they are "less than" the people designing the hardware.

Sounds easy, but apparently not. AKA management problems.

replies(1): >>43551785 #
28. kimixa ◴[] No.43550634{4}[source]
FYI AMD also has similar "accelerators", with the 9070 having separate ALU paths for wmma ("tensor") operations much like Nvidia's model - older RDNA2/3 architectures had accelerated instructions but used the "normal" shader ALUs, if a bit beefed up and tweaked to support multiple smaller data types. And CUDA cores are just what Nvidia call their normal shader cores. Pretty much every subunit on a geforce has a direct equivalent on a radeon - they might be faster/slower or more/less capable, but they're there and often at an extremely similar level of the design.

AMD also have on-die microcontrollers (multiple, actually) that do things like scheduling or pipeline management, again just like Nvidia's GSP. Their GPUs have been able to schedule new work on-GPU with zero host-system involvement since the original GCN, something that Nvidia advertise as "new" with the introduction of their GSP (which itself just replaced a slightly older, slightly less capable controller rather than being /completely/ new).

The problem is that AMD are a software follower right now - after decades of under-investment they're behind on the treadmill just trying to keep up, so when the Next Big Thing inevitably pops up they're still busy polishing off the Last Big Thing.

I've always seen AMD as a hardware company, with the "build it and they will come" approach - which seems to have worked for the big supercomputers, who likely find it worth investing in their own modified stack to get that last few %, but it clearly falls down when selling to "mere" professionals. Nvidia, however, support the same software APIs on even the lowest-end hardware; while nobody is likely running much on their laptop's 3050m in anger, it offers a super easy on-ramp for developers. And it's easy to mistake familiarity for superiority - you already know to avoid the warts so you don't get burned by them. And believe me, CUDA has plenty of warts.

And marketing - remember, "Technical Marketing" is still marketing. To this day lots of people believe that the marketing name for something, or the branding of a feature, implies something about the underlying architecture design. Go to an "enthusiast" forum and you'll easily find people claiming that because Nvidia call their accelerator a "core", it's somehow superior/better/"more accelerated" than the direct equivalent on a competitor, or people who actually believe a card doesn't support hardware video encoding because it "Doesn't Have NVENC" (again, GCN with video encoding was released before a Geforce with NVENC). Same with branding - AMD hardware can already read the display block's state and timestamp in-shader, but Everyone Knows Nvidia Introduced "Flip Metering" With Blackwell!

29. bgnn ◴[] No.43551710{3}[source]
Hmm, that is interesting. Can you elaborate on what exactly is different between them?

I'm asking because I think firmware has to talk directly to hardware through the lower HAL (hardware abstraction layer), while customer-facing parts should be fairly isolated in the upper HAL. Some companies like to add direct HW access to the customer interface via more complex functions (often a recipe made out of lower-HAL functions), which I've always disliked. I prefer to isolate lower-level functions and memory space from the user.

In any case, both Nvidia and AMD should have very similar FW capabilities. I don't know what I'm missing here.

replies(2): >>43553244 #>>43554658 #
30. bgnn ◴[] No.43551785{3}[source]
This is correct, but one of the reasons is that the SWEs at HW companies live in their own bubble. They somehow don't follow developments in the rest of the SW world.

I'm a chip design engineer, and I get frustrated with the garbage the SW/FW team comes up with, to the extent that I write my own FW library for my blocks. While doing that I try to learn the best practices and do quite a bit of research.

One other reason is that, until not long ago, SW meant only FW, which existed to serve the HW. So there was almost no input from SW into HW development. This is clearly changing, but some companies, like Nvidia, are ahead of the pack. Even Apple's SoC team is quite HW-centric compared to Nvidia's.

31. Evil_Saint ◴[] No.43553189{3}[source]
Having worked at both Nvidia and AMD I can assure you that they both feature lots of generated header files.
32. Evil_Saint ◴[] No.43553244{4}[source]
I worked on drivers at both companies. The programming models are quite different. Both make GPUs, but they were designed by different groups of people who made different decisions. For example:

Nvidia cards are much easier to program in the user mode driver. You cannot hang a Nvidia GPU with a bad memory access. You can hang the display engine with one though. At least when I was there.

You can hang an AMD GPU with a bad memory access. At least up to the Navi 3x.

33. wkat4242 ◴[] No.43553556{3}[source]
It does work, I have it running on my Radeon VII Pro
replies(1): >>43555944 #
34. wkat4242 ◴[] No.43553565{5}[source]
And why the support is dropped so quickly too.
35. Our_Benefactors ◴[] No.43554153{3}[source]
Nobody will come after you for omitting the tm
replies(1): >>43556359 #
36. raxxorraxor ◴[] No.43554658{4}[source]
Why isolate these functions? That will always cripple capabilities. With well-designed interfaces it doesn't lead to a mess, and you get a more powerful device. Of course these lower-level functions shouldn't be essential, but especially in these times you almost have to provide an interface here or be left behind by other environments.
37. raxxorraxor ◴[] No.43554678[source]
Even top tier engineers can be found eventually. The problem is if you never even start.

Of course the specific disciplines need quite an investment in their workers' knowledge, but it isn't anything insurmountable.

38. Filligree ◴[] No.43555944{4}[source]
It sometimes works.
replies(1): >>43557032 #
39. vkazanov ◴[] No.43556359{4}[source]
You never know
40. sigmoid10 ◴[] No.43556670{3}[source]
Carmack has traditionally been anti-AMD and pro-Nvidia (at least regarding GPUs). I don't know if they could convince him even with all the money in the world unless they fundamentally changed everything first.
41. wkat4242 ◴[] No.43557032{5}[source]
How so? It's rock solid for me. I use ollama but it's based on llama.cpp

It's quite fast also, probably because that card has fast HBM2 memory (it has the same memory bandwidth as a 4090). And it was really cheap as it was on deep sale as an outgoing model.

replies(2): >>43558171 #>>43560454 #
42. bn-l ◴[] No.43558036[source]
> There is no ROCm support for their latest AI Max 395 APU

Fire the CEO

43. Filligree ◴[] No.43558171{6}[source]
"Sometimes" as in "on some cards". You're having luck with yours, but that doesn't mean it's a good place to build a community.
replies(1): >>43560427 #
44. wkat4242 ◴[] No.43560427{7}[source]
Ah I see. Yes, but you pick the card for the purpose of course. I also don't like the way they have such limited support on ROCm. But when it works it works well.

I have Nvidia cards too by the way, a 4090 and a 3060 (the latter I use for AI also, but more for Whisper because faster-whisper doesn't do ROCm right now).

45. halJordan ◴[] No.43560454{6}[source]
Aside from the fact that gfx906 is one of the blessed architectures mentioned (so why would it not work), how do you look at your specific instance and then turn around and say "all of you are lying, it works perfectly"? How do you square that circle in your head?
replies(1): >>43564900 #
46. wkat4242 ◴[] No.43564900{7}[source]
No I was just a bit thrown by the "sometimes". I thought they were referring to a reliability issue. I am aware of the limited card support with ROCm and I complained about this elsewhere in the thread too.

Also I didn't accuse anyone of lying. No need to be so confrontational. And my remark to the original poster at the top was from before they clarified their post.

I just don't really see what AMD can do to make ollama work better, other than porting ROCm to all their cards, which is definitely something they should do.

And no I'm not an AMD fanboi. I have no loyalty to anyone, any company or any country.