183 points spacebanana7 | 20 comments

I appreciate that developing ROCm into something competitive with CUDA would require a lot of work, both internally within AMD and with external contributions to the relevant open source libraries.

However the amount of resources at stake is incredible. The delta between NVIDIA's value and AMD's is bigger than the annual GDP of Spain. Even if they needed to hire a few thousand engineers at a few million in comp each, it'd still be a good investment.

fancyfredbot ◴[] No.43547461[source]
There is more than one way to answer this.

They have made an alternative to the CUDA language with HIP, which can do most of the things the CUDA language can.
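For readers who haven't seen HIP, here's a minimal vector-add sketch (assuming ROCm's hipcc toolchain is installed). The point is that the API is a near one-to-one rename of CUDA's, which is what makes porting largely mechanical:

    #include <hip/hip_runtime.h>
    #include <cstdio>

    // Element-wise add; the kernel syntax is identical to CUDA's.
    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *ha = new float[n], *hb = new float[n], *hc = new float[n];
        for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

        float *da, *db, *dc;
        hipMalloc((void**)&da, bytes);   // mirrors cudaMalloc
        hipMalloc((void**)&db, bytes);
        hipMalloc((void**)&dc, bytes);
        hipMemcpy(da, ha, bytes, hipMemcpyHostToDevice);
        hipMemcpy(db, hb, bytes, hipMemcpyHostToDevice);

        // hipcc also accepts CUDA's <<<grid, block>>> launch syntax.
        hipLaunchKernelGGL(vecAdd, dim3((n + 255) / 256), dim3(256), 0, 0,
                           da, db, dc, n);
        hipMemcpy(hc, dc, bytes, hipMemcpyDeviceToHost);

        printf("c[0] = %f\n", hc[0]);    // expect 3.0
        hipFree(da); hipFree(db); hipFree(dc);
        delete[] ha; delete[] hb; delete[] hc;
        return 0;
    }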

You could say that they haven't released supporting libraries like cuDNN, but they are making progress on this with AiTer for example.

You could say that they have fragmented their efforts across too many different paradigms but I don't think this is it because Nvidia also support a lot of different programming models.

I think the reason is that they have not prioritised support for ROCm across all of their products. There are too many different architectures with varying levels of support. This isn't just historical. There is no ROCm support for their latest AI Max 395 APU. There is no nice cross architecture ISA like PTX. The drivers are buggy. It's just all a pain to use. And for that reason "the community" doesn't really want to use it, and so it's a second class citizen.

This is a management and leadership problem. They need to make using their hardware easy. They need to support all of their hardware. They need to fix their driver bugs.

replies(6): >>43547568 #>>43547675 #>>43547799 #>>43547827 #>>43549724 #>>43558036 #
1. thrtythreeforty ◴[] No.43547568[source]
This ticket, finally closed after being open for 2 years, is a pretty good microcosm of this problem:

https://github.com/ROCm/ROCm/issues/1714

Users complaining that the docs don't even specify which cards work.

But it goes deeper - a valid complaint is that "this only supports one or two consumer cards!" A common rebuttal is that it works fine on lots of AMD cards if you set some environment flag to force the GPU architecture selection. The fact that this is so close to working on a wide variety of hardware, and yet doesn't, is exactly the vibe you get with the whole ecosystem.
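The flag people usually mean here is HSA_OVERRIDE_GFX_VERSION. A minimal sketch (again assuming the hipcc toolchain) for checking which architecture string the runtime actually sees, since that string is what ROCm keys its support on:

    #include <hip/hip_runtime.h>
    #include <cstdio>

    // Prints the "gfxXXXX" architecture string for each visible device.
    // E.g. running with HSA_OVERRIDE_GFX_VERSION=10.3.0 (the commonly
    // cited workaround) makes an otherwise-unsupported RDNA2 card report
    // as gfx1030 so prebuilt kernels will load. Use at your own risk.
    int main() {
        int count = 0;
        if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
            fprintf(stderr, "no HIP devices visible\n");
            return 1;
        }
        for (int i = 0; i < count; ++i) {
            hipDeviceProp_t prop;
            hipGetDeviceProperties(&prop, i);
            printf("device %d: %s (%s)\n", i, prop.name, prop.gcnArchName);
        }
        return 0;
    }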

replies(6): >>43547700 #>>43547940 #>>43547988 #>>43548203 #>>43549097 #>>43550313 #
2. CoastalCoder ◴[] No.43547700[source]
I had a similar (I think) experience when building LLVM from source a few years ago.

I kept running into some problem with LLVM's support for HIP code, even though I had no interest in that functionality.

I realize this isn't exactly an AMD problem, but IIRC it was they who contributed the troublesome code to LLVM, and it remained unfixed.

Apologies if there's something unfair or uninformed in what I wrote, it's been a while.

3. tomrod ◴[] No.43547940[source]
Geez. If I were Berkshire Hathaway looking to invest in the GPU market, this would be a major red flag in my fundamentals analysis.
4. Covzire ◴[] No.43547988[source]
That reeks of gross incompetence somewhere in the organization. Like a hosting company whose customer is suffering very poor performance and greatly overpays to work around it, while the whole time nobody even thinks to check what the Linux swap file is doing.
replies(1): >>43549801 #
5. mook ◴[] No.43548203[source]
I suspect part of it is also that Nvidia actually does a lot of things in firmware that can be upgraded. The new Nvidia Linux drivers (the "open" ones) support Turing cards from 2018. That means chips that old already do much of the processing in firmware.

AMD keeps having issues because their drivers talk to the hardware directly, so the drivers are massive, bloated messes, famous for pages of auto-generated register definitions. Likely that makes it much more difficult to fix anything.

replies(2): >>43551710 #>>43553189 #
6. iforgotpassword ◴[] No.43549097[source]
What I don't get is why they don't at least assign a dev or two to make the poster child of this work: llama.cpp

It's the first thing anyone tries when dabbling in AI or GPU compute, yet it's a clusterfuck to get working. A few blessed cards work, with the proper drivers and kernel; others just crash, perform horribly slowly, or output GGGGGGGGGGGGGG to every input (I'm not making this up!). Then you LOL, dump it, and go buy Nvidia, et voila, stuff works on the first try.

replies(1): >>43553556 #
7. zombiwoof ◴[] No.43549801[source]
This
8. citizenpaul ◴[] No.43550313[source]
I've thought about this myself and come to a conclusion that your link reinforces. As I understand it, most companies doing (EE) hardware design and production treat (CS) software as a second-class citizen at the company. It looks like AMD, after all this time competing with NVIDIA, has not learned the lesson. That said, I have never worked in hardware, so I'm going on what I've heard from other people.

NVIDIA, while far from perfect, has kept its software quality comfortably ahead of AMD's for over 20 years, while AMD keeps falling on its face and getting egg all over itself, again and again, as far as software goes.

My guess is NVIDIA internally has found a way to keep the software people from feeling like they are "less than" the people designing the hardware.

Sounds easy, but apparently not. AKA management problems.

replies(1): >>43551785 #
9. bgnn ◴[] No.43551710[source]
Hmm, that is interesting. Can you elaborate on what exactly is different between them?

I'm asking because I think firmware has to talk directly to hardware through the lower HAL (hardware abstraction layer), while customer-facing parts should be fairly isolated in the upper HAL. Some companies like to add direct HW access to the customer interface via more complex functions (often a recipe made out of lower-HAL functions), which I always disliked. I prefer to isolate the lower-level functions and memory space from the user.
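A hypothetical sketch of that layering (the register map and function names are made up, and the register file is simulated so the sketch is self-contained):

    #include <cstdint>
    #include <cstdio>

    // Lower HAL: owns direct register access; not exported to customers.
    namespace lower_hal {
        static uint32_t regs[256];  // stand-in for real MMIO space
        void write_reg(uint32_t idx, uint32_t v) { regs[idx] = v; }
        uint32_t read_reg(uint32_t idx) { return regs[idx]; }
    }

    // Upper HAL: the only surface users link against; exposes "recipes"
    // composed from lower-HAL calls, never raw register access.
    namespace upper_hal {
        bool enable_clock(uint32_t domain) {
            lower_hal::write_reg(domain, 1u);
            return lower_hal::read_reg(domain) == 1u;
        }
    }

    int main() {
        printf("clock 3 enabled: %d\n", upper_hal::enable_clock(3));
        return 0;
    }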

In any case, both Nvidia and AMD should have very similar FW capabilities. I don't know what I'm missing here.

replies(2): >>43553244 #>>43554658 #
10. bgnn ◴[] No.43551785[source]
This is correct, but one of the reasons is that the SWEs at HW companies live in their own bubble. They somehow don't follow developments in the rest of the SW world.

I'm a chip design engineer and I get frustrated with the garbage the SW/FW team comes up with, to the extent that I write my own FW library for my blocks. While doing that I try to learn best practices and do quite a bit of research.

One other reason is that, until not long ago, SW at these companies was only FW serving the HW, so there was almost no input from SW into HW development. This is clearly changing, but some companies, like Nvidia, are ahead of the pack. Even Apple's SoC team is quite HW-centric compared to Nvidia's.

11. Evil_Saint ◴[] No.43553189[source]
Having worked at both Nvidia and AMD I can assure you that they both feature lots of generated header files.
12. Evil_Saint ◴[] No.43553244{3}[source]
I worked on drivers at both companies. The programming models are quite different. Both make GPUs, but they were designed by different groups of people who made different decisions. For example:

Nvidia cards are much easier to program in the user-mode driver. You cannot hang an Nvidia GPU with a bad memory access. You can hang the display engine with one, though. At least when I was there.

You can hang an AMD GPU with a bad memory access. At least up to the Navi 3x.
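A sketch of the kind of access being described (a deliberately broken, hypothetical kernel; don't run it on a machine you care about, since per the comment above it can hang older AMD parts, while a stack with page-fault recovery just returns an error):

    #include <hip/hip_runtime.h>
    #include <cstdio>

    // Deliberately writes ~1 GiB past the end of a 1 KiB buffer.
    __global__ void outOfBounds(float* buf) {
        buf[threadIdx.x + (1 << 28)] = 1.0f;
    }

    int main() {
        float* d = nullptr;
        hipMalloc((void**)&d, 256 * sizeof(float));
        hipLaunchKernelGGL(outOfBounds, dim3(1), dim3(256), 0, 0, d);
        // On a stack that recovers from GPU page faults this returns an
        // error; on hardware that doesn't, this call may never return.
        hipError_t err = hipDeviceSynchronize();
        printf("sync result: %s\n", hipGetErrorString(err));
        hipFree(d);
        return 0;
    }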

13. wkat4242 ◴[] No.43553556[source]
It does work; I have it running on my Radeon VII Pro.
replies(1): >>43555944 #
14. raxxorraxor ◴[] No.43554658{3}[source]
Why isolate these functions? That will always cripple capabilities. With well-designed interfaces it doesn't lead to a mess, and you get a more powerful device. Of course these lower-level functions shouldn't be essential, but especially in these times you almost have to provide an interface here or be left behind by other environments.
15. Filligree ◴[] No.43555944{3}[source]
It sometimes works.
replies(1): >>43557032 #
16. wkat4242 ◴[] No.43557032{4}[source]
How so? It's rock solid for me. I use Ollama, but it's based on llama.cpp.

It's also quite fast, probably because that card has fast HBM2 memory (the same memory bandwidth as a 4090). And it was really cheap, as it was on deep sale as an outgoing model.

replies(2): >>43558171 #>>43560454 #
17. Filligree ◴[] No.43558171{5}[source]
"Sometimes" as in "on some cards". You're having luck with yours, but that doesn't mean it's a good place to build a community.
replies(1): >>43560427 #
18. wkat4242 ◴[] No.43560427{6}[source]
Ah I see. Yes, but you pick the card for the purpose of course. I also don't like the way they have such limited support on ROCm. But when it works it works well.

I have Nvidia cards too by the way, a 4090 and a 3060 (the latter I use for AI also, but more for Whisper because faster-whisper doesn't do ROCm right now).

19. halJordan ◴[] No.43560454{5}[source]
Aside from the fact that gfx906 is one of the blessed architectures mentioned (so why would it not work): how do you look at your specific instance and then turn around and say "all of you are lying, it works perfectly"? How do you square that circle in your head?
replies(1): >>43564900 #
20. wkat4242 ◴[] No.43564900{6}[source]
No I was just a bit thrown by the "sometimes". I thought they were referring to a reliability issue. I am aware of the limited card support with ROCm and I complained about this elsewhere in the thread too.

Also I didn't accuse anyone of lying. No need to be so confrontational. And my remark to the original poster at the top was from before they clarified their post.

I just don't really see what AMD can do to make Ollama work better, other than porting ROCm to all their cards, which is definitely something they should do.

And no I'm not an AMD fanboi. I have no loyalty to anyone, any company or any country.