
1045 points by mfiguiere | 2 comments
Cu3PO42:
I'm really rooting for AMD to break the CUDA monopoly. To this end, I genuinely don't know whether a translation layer is a good thing or not. On the upside, it instantly makes the hardware much more viable and will boost adoption; on the downside, you run the risk that devs never support ROCm natively, because everyone can just use the translation layer.

I think this is essentially the same situation as Proton+DXVK for Linux gaming. I think that is a net positive for Linux, but I'm less sure about this. Getting good performance out of GPU compute requires much more tuning for the specific architecture, which I'm afraid devs just won't do for AMD GPUs through this layer, always leaving them behind their Nvidia counterparts.

However, AMD desperately needs to do something. Story time:

On the weekend I wanted to play around with Stable Diffusion. Why pay for cloud compute when I have a powerful GPU at home, I thought. Said GPU is a 7900 XTX, i.e. the most powerful consumer card AMD currently sells. Only very few AMD GPUs are supported by ROCm, but mine thankfully is.

So, how hard could it possibly be to get Stable Diffusion running on my GPU? Hard. I don't think my problems were actually caused by AMD: I had ROCm installed and my card recognized by rocminfo in a matter of minutes. But the whole ML world is so focused on Nvidia that it took me ages to get a working installation of PyTorch and friends. The InvokeAI installer, for example, asks whether you want to use CUDA or ROCm, but then installs the CUDA variant no matter what you answer. Ultimately, I did get a model to load, but the software crashed my graphical session before generating a single image.

The whole experience left me frustrated and wanting to buy an Nvidia GPU again...
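
For anyone attempting the same: once a ROCm build of PyTorch is actually installed (the ROCm wheels expose the GPU through the usual torch.cuda API), a quick sanity check along these lines should tell you whether the card is even usable before fighting with any particular UI. A rough sketch:

  import torch

  # ROCm builds of PyTorch expose the HIP backend through the usual
  # torch.cuda namespace, so the same check works on Nvidia and AMD.
  print(torch.__version__, torch.version.hip)  # hip is None on a CUDA build
  print(torch.cuda.is_available())
  if torch.cuda.is_available():
      print(torch.cuda.get_device_name(0))     # should report the 7900 XTX
      x = torch.randn(1024, 1024, device="cuda")
      print((x @ x).sum().item())              # tiny GEMM to confirm the card actually computes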

westurner:
> Proton+DXVK for Linux gaming

"Building the DirectX shader compiler better than Microsoft?" (2024) https://news.ycombinator.com/item?id=39324800

E.g. llama.cpp already supports hipBLAS; is there an advantage to this ROCm CUDA-compatibility layer - ZLUDA on Radeon (and not yet Intel OneAPI) - used instead of or in addition to it? https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#hi... https://news.ycombinator.com/item?id=38588573

What parts of CUDA's unportability can't WebGPU abstract away? https://news.ycombinator.com/item?id=38527552

HarHarVeryFunny:
BLAS will only get you so far. Its highest-level operation is essentially matmul, which you can use to build convolution (im2col, matmul, col2im), but that won't be as performant as a hand-optimized cuDNN convolution kernel. The same goes for any other high-level neural-net building block: trying to build it on top of BLAS will not get you remotely close to the performance of a custom kernel.
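
To make the im2col trick concrete, here's a rough NumPy sketch (stride 1, no padding; my own illustration, not anything cuDNN actually does):

  import numpy as np

  def im2col(x, r, s):
      # Unroll every r x s patch of x (shape C, H, W) into a column,
      # producing a (C*r*s, out_h*out_w) matrix.
      c, h, w = x.shape
      out_h, out_w = h - r + 1, w - s + 1
      cols = np.empty((c * r * s, out_h * out_w), dtype=x.dtype)
      idx = 0
      for ci in range(c):
          for i in range(r):
              for j in range(s):
                  cols[idx] = x[ci, i:i + out_h, j:j + out_w].reshape(-1)
                  idx += 1
      return cols

  def conv2d_via_matmul(x, w):
      # x: (C, H, W), w: (K, C, R, S) -> (K, out_h, out_w); cross-correlation,
      # which is what deep-learning frameworks call "convolution".
      k, c, r, s = w.shape
      out_h, out_w = x.shape[1] - r + 1, x.shape[2] - s + 1
      out = w.reshape(k, -1) @ im2col(x, r, s)  # one big GEMM does the work
      return out.reshape(k, out_h, out_w)

The single GEMM is the part BLAS accelerates; all the data shuffling around it (and the extra memory traffic it implies) is exactly what a fused, hand-tuned cuDNN/MIOpen kernel avoids.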

What's nice about BLAS is that there are optimized implementations for CPUs (Intel MKL) as well as NVIDIA (cuBLAS) and AMD (hipBLAS), so while it's very much limited in what it can do, you can at least write portable code around it.

westurner:
"CUDNN API supported by HIP" has a coverage table: https://rocm.docs.amd.com/projects/HIPIFY/en/amd-staging/tab...

ROCm/hipDNN wraps cuDNN on Nvidia and MIOpen on AMD, but hasn't been updated in a while: https://github.com/ROCm/hipDNN

https://news.ycombinator.com/item?id=37808036 : conda-forge has various BLAS implementations, including MKL-optimized BLAS, and compatible NumPy and SciPy builds.

BLAS: Basic Linear Algebra Subprograms: https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprogra...

"Using CuPy on AMD GPU (experimental)" https://docs.cupy.dev/en/v13.0.0/install.html#using-cupy-on-... :

  $ sudo apt install hipblas hipsparse rocsparse rocrand rocthrust rocsolver rocfft hipcub rocprim rccl
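
With those installed and a CuPy build targeting ROCm, the same array code should run unmodified on either vendor's GPU, with the GEMM dispatched to cuBLAS or the ROCm BLAS underneath. A small sketch (not something I've benchmarked on the ROCm build):

  import numpy as np
  import cupy as cp

  # Identical code for a CUDA build (cuBLAS) and a ROCm build (hipBLAS/rocBLAS)
  # of CuPy; only the installed package differs.
  a = cp.asarray(np.random.rand(2048, 2048).astype(np.float32))
  b = cp.asarray(np.random.rand(2048, 2048).astype(np.float32))
  c = a @ b                           # GEMM on the GPU via the vendor BLAS
  cp.cuda.Stream.null.synchronize()   # wait for the kernel before reading back
  print(float(c.sum()))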
HarHarVeryFunny:
I guess I misunderstood you.

You were asking if this CUDA compatibility layer might hold any advantage over HIP (e.g. for use by llama.cpp)?

I think the answer is no, since HIP includes pretty full-featured support for many of the higher-level CUDA-based APIs (cuDNN, cuBLAS, etc.), while per the Phoronix article ZLUDA currently has only minimal support for them.

I wouldn't expect ZLUDA to provide any performance benefit over HIP either, since on AMD hardware HIP is just a pass-through to MIOpen (AMD's equivalent of cuDNN), rocBLAS, etc.