1045 points mfiguiere | 40 comments
1. Cu3PO42 ◴[] No.39346489[source]
I'm really rooting for AMD to break the CUDA monopoly. To this end, I genuinely don't know whether a translation layer is a good thing or not. On the upside it makes the hardware much more viable instantly and will boost adoption, on the downside you run the risk that devs will never support ROCm, because you can just use the translation layer.

I think this is essentially the same situation as Proton+DXVK for Linux gaming. I consider those a net positive for Linux, but I'm less sure about this case. Getting good performance out of GPU compute requires much more tuning to the concrete architecture, which I'm afraid devs just won't do for AMD GPUs through this layer, always leaving them behind their Nvidia counterparts.

However, AMD desperately needs to do something. Story time:

On the weekend I wanted to play around with Stable Diffusion. Why pay for cloud compute, when I have a powerful GPU at home, I thought. Said GPU is a 7900 XTX, i.e. the most powerful consumer card from AMD at this time. Only very few AMD GPUs are supported by ROCm at this time, but mine is, thankfully.

So, how hard could it possibly be to get Stable Diffusion running on my GPU? Hard. I don't think my problems were actually caused by AMD: I had ROCm installed and my card recognized by rocminfo in a matter of minutes. But the whole ML world is so focused on Nvidia that it took me ages to get a working installation of PyTorch and friends. The InvokeAI installer, for example, asks if you want to use CUDA or ROCm, but then always installs the CUDA variant no matter what you answer. Ultimately, I did get a model to load, but the software crashed my graphical session before generating a single image.

The whole experience left me frustrated and wanting to buy an Nvidia GPU again...

replies(10): >>39346714 #>>39347956 #>>39348258 #>>39349464 #>>39349658 #>>39350019 #>>39350273 #>>39351237 #>>39354496 #>>39433413 #
2. Certhas ◴[] No.39346714[source]
They are focusing on HPC first, which seems reasonable if your software stack is lacking. Look for sophisticated customers that can help build an ecosystem.

As I mentioned elsewhere, 25% of GPU compute on the Top 500 supercomputer list is AMD. This is all on the back of a card that came out only three years ago. We are very rapidly moving towards a situation where there are many, many high-performance developers who will target ROCm.

replies(1): >>39347423 #
3. ametrau ◴[] No.39347423[source]
Is the Top 500 supercomputer list a good way of measuring future relevance?
replies(2): >>39347862 #>>39352508 #
4. latchkey ◴[] No.39347862{3}[source]
No, it isn't. What is a better measure is to look at businesses like what I'm building (and others), where we take on the capex/opex risk around top end AMD products and bring them to the masses through bare metal rentals. Previously, these sorts of cards were only available to the Top 500.
5. whywhywhywhy ◴[] No.39347956[source]
> I'm really rooting for AMD to break the CUDA monopoly

Personally, I want Nvidia to break the x86-64 monopoly. Given how amazing properly spec'd Nvidia cards are to work with, I can only dream of a world where Nvidia is my CPU too.

replies(4): >>39348006 #>>39348977 #>>39351000 #>>39352323 #
6. kuschkufan ◴[] No.39348006[source]
apt username
7. bntyhntr ◴[] No.39348258[source]
I would love to have a native Stable Diffusion experience; my RX 580 takes 30s to generate a single image. But it does work after following https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki...

I got this up and running on my Windows machine in short order, and I don't even know what Stable Diffusion is.

But again, it would be nice to have first class support to locally participate in the fun.

replies(1): >>39349523 #
8. smcleod ◴[] No.39348977[source]
That’s already been done with ARM.
9. westurner ◴[] No.39349464[source]
> Proton+DXVK for Linux gaming

"Building the DirectX shader compiler better than Microsoft?" (2024) https://news.ycombinator.com/item?id=39324800

E.g. llama.cpp already supports hipBLAS; is there an advantage to this ROCm CUDA-compatibility layer - ZLUDA on Radeon (and not yet Intel OneAPI) - instead or in addition? https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#hi... https://news.ycombinator.com/item?id=38588573
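For what it's worth, the hipBLAS path in that README is, if I remember the flag right, just a build switch along the lines of:

  $ make LLAMA_HIPBLAS=1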

What can't WebGPU abstract away from CUDA unportability? https://news.ycombinator.com/item?id=38527552

replies(1): >>39360972 #
10. Cu3PO42 ◴[] No.39349523[source]
I have heard that DirectML was a somewhat easier story, but allegedly has worse performance (and obviously it's Windows-only...). But I'm not entirely surprised that setup is somewhat easier on Windows, where bundling everything is an accepted approach.

With AMD's official 15GB(!) Docker image, I was now able to get the A1111 UI running. With SD 1.5 and 30 sample iterations, generating an image takes under 2s. I'm still struggling to get InvokeAI running.

replies(1): >>39352803 #
11. nocombination ◴[] No.39349658[source]
As other folks have commented, CUDA not being an open standard is a large part of the problem. That and the developers who target CUDA directly when writing Stable Diffusion algorithms—they are forcing the monopoly. Even at the cost of not being able to squeeze every ounce out of the GPU, portability greatly improves software access when people target Vulkan et al.
12. formerly_proven ◴[] No.39350019[source]
> I'm really rooting for AMD to break the CUDA monopoly. To this end, I genuinely don't know whether a translation layer is a good thing or not. On the upside it makes the hardware much more viable instantly and will boost adoption, on the downside you run the risk that devs will never support ROCm, because you can just use the translation layer.

On the other hand:

> The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. [source]

Even worse, not even the driver is backwards-compatible:

> There are some known limitations though like currently only targeting the ROCm 5.x API and not the newly-released ROCm 6.x releases.. In turn having to stick to ROCm 5.7 series as the latest means that using the ROCm DKMS modules don't build against the Linux 6.5 kernel now shipped by Ubuntu 22.04 LTS HWE stacks, for example. Hopefully there will be enough community support to see ZLUDA ported to ROCM 6 so at least it can be maintained with current software releases.

13. nialv7 ◴[] No.39350273[source]
I am surprised that everybody seems to have forgotten the (in)famous Embrace, Extend and Extinguish strategy.

It's time for Open Source to be on the extinguishing side for once.

14. weebull ◴[] No.39351000[source]
> Personally I want Nvidia to break the x86-64 monopoly

The one supplied by two companies?

replies(2): >>39351725 #>>39353087 #
15. sophrocyne ◴[] No.39351237[source]
Hey there -

I'm a maintainer (and CEO) of Invoke.

It's something we're monitoring as well.

ROCm has been challenging to work with. We're actively talking to AMD to stay apprised of ways we can mitigate some of the more troublesome experiences that users have with getting Invoke running on AMD (and we're hoping to expand official support to Windows AMD).

The problem is that a lot of the solutions proposed involve significant/unsustainable dev effort (i.e., supporting an entirely different inference paradigm), rather than "drop in" for the existing Torch/diffusers pipelines.

While I don't know enough about your setup to offer immediate solutions, if you join the Discord, I'm sure folks would be happy to try walking through some manual troubleshooting/experimentation to get you up and running - discord.gg/invoke-ai

replies(2): >>39351457 #>>39352272 #
16. latchkey ◴[] No.39351457[source]
Invoke is awesome. Let me know if you guys want some MI300x to develop/test on. =) We've also got some good contacts at AMD if you need help there as well.
17. Keyframe ◴[] No.39351725{3}[source]
Maybe he meant homogeneity, which Nvidia did try and still tries with Arm. But on the other hand, how wild would it be for Nvidia to enter x86-64 as well? It's probably never going to happen, due to licensing if nothing else - lest we forget the nForce chipset ordeal with Intel legal.
replies(2): >>39353105 #>>39375351 #
18. Cu3PO42 ◴[] No.39352272[source]
Hi! I really appreciate you taking the time to reply.

I have since gotten Invoke to run and was already able to get some results I'm really quite happy with, so thank you for your time and commitment working on Invoke!

I understand that ROCm is still challenging, but it seems my problems were less related to ROCm or Invoke itself and more to Python dependency management. It really boiled down to getting the correct (ROCm) versions of packages installed. Installing Invoke from PyPi always removed my Torch and installed CUDA-enabled Torch (as well as cuBLAS, cuDNN, ...). Once I had the correct versions of packages, everything just worked.
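It came down to something like this, with the exact ROCm tag depending on the versions in play:

  $ pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm5.6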

To me, your pyproject.toml looks perfectly sane, so I wasn't sure how to go about fixing the problem.

What ended up working for me was to use one of AMD's ROCm OCI base images, manually installing all dependencies, forgoing a virtual environment, cloning your repo (and building the frontend), and then installing from there.

The majority of my struggle would have been solved by a recent working Docker image containing a working setup. (The one on Docker Hub is 9 months old.) Trying to build the Dockerfile from your repo, I also ended up with a CUDA-enabled Torch. It did install the correct one first, but in a later step removed the ROCm-enabled Torch to switch it for the CUDA-enabled one.

I hope you'll consider investing some resources into publishing newer, working builds of your Docker image.

replies(3): >>39353370 #>>39353550 #>>39364975 #
19. mickael-kerjean ◴[] No.39352323[source]
How would this be a good idea? I am not very familiar with GPU programming, but the small amount I've tried was nothing but pain a few years ago on Linux; it was so bad that Torvalds publicly used the f-word at a very public event. That aside, CUDA seems like a great way to lock people in even further, like AWS does with absolutely everything.
replies(2): >>39354180 #>>39356819 #
20. llm_trw ◴[] No.39352508{3}[source]
Yes it is; it's how CUDA got its dominance 10 years ago. Businesses don't release their source code, while supercomputers are attached to labs and universities, which have much better software licenses and publish papers about their work.
21. washadjeffmad ◴[] No.39352803{3}[source]
That has to include the model(s), no?

Also, nothing is easier on Windows. It's a wonder that anything works there, except for the power of recalcitrance.

Not dogging Windows users, but once your brain heals, it just can't go back.

replies(1): >>39355220 #
22. paulmd ◴[] No.39353087{3}[source]
"minor spelling/terminology mistake, activate the post-o-tron"
23. paulmd ◴[] No.39353105{4}[source]
https://en.wikipedia.org/wiki/Project_Denver#History
24. sophrocyne ◴[] No.39353370{3}[source]
You bet - Thanks for the feedback. Glad you're enjoying Invoke!

We do have Docker packages hosted on GH, but I'll be the first to admit that we haven't prioritized ROCm. Contributors who have AMDs are a scant few, but maybe we'll find some help in wrangling that problem now that we know there's an avenue to do so.

replies(2): >>39355543 #>>39362959 #
25. doctorpangloss ◴[] No.39353550{3}[source]
> Installing Invoke from PyPi... To me, your pyproject.toml looks perfectly sane, so I wasn't sure how to go about fixing the problem.

You can't install the PyTorch that's best for the currently running platform using a pyproject.toml with a setuptools backend, for starters. Invoke would have to author a setup.py that deals with all the issues, in a way that is compatible with build isolation.

> The majority of my struggle would have been solved by a recent working Docker image containing a working setup. (The one on Docker Hub is 9 months old.)

Why? Given the state of the ecosystem, what guarantee is there really that the documentation for Docker Desktop with AMD ROCm device binding is going to actually work for your device? (https://rocm.docs.amd.com/projects/MIVisionX/en/latest/docke...)

There is a lot of ad-hoc reinvention of tooling in this space.

replies(1): >>39355527 #
26. Qwertious ◴[] No.39354180{3}[source]
>I am not very familiar with GPU programming but the small amount I've tried was nothing but pain a few years ago on linux, it was so bad that Torvald publicly used the f word in a very public event.

I'm pretty sure Torvalds was giving the finger over the subject of GPU drivers (which run on the CPU), not programming on the Nvidia GPU itself. Particularly, they namedropped Bumblebee (and maybe Optimus?) which was more about power-management and making Nvidia cooperate with a non-Nvidia integrated GPU than it was about the Nvidia GPU itself.

27. bavell ◴[] No.39354496[source]
Try with ComfyUI... works great and easy setup on my 6750XT. I've had it working for about a year now with SD, LlamaCpp and WhisperCpp.
28. Cu3PO42 ◴[] No.39355220{4}[source]
It actually doesn't include the models! The image is Ubuntu with ROCm and a number of ML libraries, such as Torch, preinstalled.

> Also, nothing is easier on Windows.

As much as I, too, dislike Windows, I still have to disagree. I have encountered (proprietary) software which was much easier to get working on Windows. For example, Cisco AnyConnect with SmartCard authentication has been a nightmare for me on Linux.

29. Cu3PO42 ◴[] No.39355527{4}[source]
> You can't install the PyTorch that's best for the currently running platform using a pyproject.toml with a setuptools backend, for starters.

I see. I do know Python, but my knowledge of setuptools, pip, Poetry, and whatever else have you is limited. To get my working setup, I specified an --index-url for my Torch installation. Does that not work while using their current setup?

> Why? Given the state of the ecosystem, what guarantee is there really that the documentation for Docker Desktop with AMD ROCm device binding is going to actually work for your device?

Well, they did work for me. Though I think only passing /dev/{dri,kfd} and setting seccomp=unconfined was sufficient. So for my particular case, getting a working image was the only missing step.
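For reference, that amounts to roughly this, with rocm/pytorch standing in for whichever image matches your host driver:

  $ docker run -it --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined rocm/pytorch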

From a more general POV: it might not make sense to invest in a ROCm OCI image from a short-term business perspective, but in the long term, and purely on principle, I do think the ecosystem should strive to be less reliant on CUDA and only CUDA.

30. Cu3PO42 ◴[] No.39355543{4}[source]
I hate maintaining my own build instructions as much as the next guy, so I'll try to get your Dockerfile working for me and then send a PR.
31. whywhywhywhy ◴[] No.39356819{3}[source]
>CUDA seems like a great way to lock people in even further, like AWS does with absolutely everything

Lock people into something that, before CUDA, didn't exist in a form any user could actually use? I get that people hate CUDA's dominance, but no one else was pushing this before CUDA, and Apple+AMD completely fumbled OpenCL.

Can't hate on something good just because it's successful, and I can't be angry at the talent behind the success for wanting to profit.

32. HarHarVeryFunny ◴[] No.39360972[source]
BLAS will only get you so far. About the highest-level operation it has is matmul, which you can use to build convolution (im2col, matmul, col2im), but that won't be as performant as a hand-optimized cuDNN convolution kernel. The same goes for any other high-level neural-net building block: trying to build them on top of BLAS will not get you remotely close to the performance of a custom kernel.
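To make the im2col trick concrete, here is a minimal NumPy sketch (shapes and names are purely illustrative; col2im would only be needed for the backward pass):

  import numpy as np

  def im2col(x, kh, kw):
      # x: (C, H, W) -> every kh*kw patch flattened into a column, shape (C*kh*kw, out_h*out_w)
      C, H, W = x.shape
      out_h, out_w = H - kh + 1, W - kw + 1
      cols = np.empty((C * kh * kw, out_h * out_w), dtype=x.dtype)
      row = 0
      for c in range(C):
          for i in range(kh):
              for j in range(kw):
                  cols[row] = x[c, i:i + out_h, j:j + out_w].reshape(-1)
                  row += 1
      return cols, out_h, out_w

  def conv2d(x, w):
      # w: (K, C, kh, kw) filters; the whole convolution collapses into one matmul (a BLAS gemm)
      K, C, kh, kw = w.shape
      cols, out_h, out_w = im2col(x, kh, kw)
      return (w.reshape(K, -1) @ cols).reshape(K, out_h, out_w)

  x = np.random.rand(3, 8, 8).astype(np.float32)
  w = np.random.rand(4, 3, 3, 3).astype(np.float32)
  print(conv2d(x, w).shape)  # (4, 6, 6)

The gemm is where the time goes, but materializing the im2col matrix costs extra memory and bandwidth, which is exactly what a fused, hand-tuned cuDNN/MIOpen kernel avoids.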

What's nice about BLAS is that there are optimized implementations for CPUs (Intel MKL) as well as NVIDIA (cuBLAS) and AMD (hipBLAS), so while it's very much limited in what it can do, you can at least write portable code around it.

replies(1): >>39363745 #
33. Cu3PO42 ◴[] No.39362959{4}[source]
As promised in my other comment, I did send a PR! https://github.com/invoke-ai/InvokeAI/pull/5714
34. westurner ◴[] No.39363745{3}[source]
"CUDNN API supported by HIP" has a coverage table: https://rocm.docs.amd.com/projects/HIPIFY/en/amd-staging/tab...

ROCm/hipDNN wraps cuDNN on Nvidia and MIOpen on AMD, but hasn't been updated in a while: https://github.com/ROCm/hipDNN

https://news.ycombinator.com/item?id=37808036 : conda-forge has various BLAS implementations, including MKL-optimized BLAS, and compatible NumPy and SciPy builds.

BLAS: Basic Linear Algebra Subprograms: https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprogra...

"Using CuPy on AMD GPU (experimental)" https://docs.cupy.dev/en/v13.0.0/install.html#using-cupy-on-... :

  $ sudo apt install hipblas hipsparse rocsparse rocrand rocthrust rocsolver rocfft hipcub rocprim rccl
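The same page also covers building CuPy against ROCm via environment variables; roughly the following, with the gfx target set to whatever your card reports:

  $ export CUPY_INSTALL_USE_HIP=1 ROCM_HOME=/opt/rocm HCC_AMDGPU_TARGET=gfx906
  $ pip install cupy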
replies(1): >>39364731 #
35. HarHarVeryFunny ◴[] No.39364731{4}[source]
I guess I misunderstood you.

You were asking if this CUDA compatibility layer might hold any advantage over HIP (e.g. for use by llama.cpp)?

I think the answer is no, since HIP includes pretty full-featured support for many of the higher level CUDA-based APIs (cuDNN, cuBLAS, etc), while per the Phoronix article ZLUDA only (currently) has minimal support for them.

I wouldn't expect ZLUDA to provide any performance benefit over HIP either, since on AMD hardware HIP is just a pass-thru to MIOpen (AMD's equivalent to cuDNN), rocBLAS, etc.
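For illustration, the porting path on the AMD side is mechanical - something like this, with made-up file names:

  $ hipify-perl my_kernel.cu > my_kernel.hip.cpp
  $ hipcc my_kernel.hip.cpp -o my_kernel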

36. westurner ◴[] No.39364975{3}[source]
> AMD's ROCm OCI base images,

ROCm docs > "Install ROCm Docker containers" > Base Image: https://rocm.docs.amd.com/projects/install-on-linux/en/lates... links to ROCm/ROCm-docker: https://github.com/ROCm/ROCm-docker which is the source of docker.io/rocm/rocm-terminal: https://hub.docker.com/r/rocm/rocm-terminal :

  docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/rocm-terminal
ROCm docs > "Docker image support matrix": https://rocm.docs.amd.com/projects/install-on-linux/en/lates...

ROCm/ROCm-docker//dev/Dockerfile-centos-7-complete: https://github.com/ROCm/ROCm-docker/blob/master/dev/Dockerfi...

Bazzite is a ublue (Universal Blue) fork of the Fedora Kinoite (KDE) or Fedora Silverblue (Gnome) rpm-ostree Linux distributions; ublue-os/bazzite//Containerfile : https://github.com/ublue-os/bazzite/blob/main/Containerfile#... has, in addition to fan and power controls, automatic updates on desktop, supergfxctl, system76-scheduler, and an fsync kernel:

  rpm-ostree install rocm-hip \
        rocm-opencl \
        rocm-clinfo
But it's not `rpm-ostree install --apply-live`, because it's a Containerfile.

To install a ublue-os distro, you install any of the Fedora ostree distros: {Silverblue, Kinoite, Sway Atomic, or Budgie Atomic} from e.g. a USB stick and then `rpm-ostree rebase <OCI_host_image_url>`:

  rpm-ostree rebase ostree-unverified-registry:ghcr.io/ublue-os/bazzite:stable
  rpm-ostree rebase ostree-unverified-registry:ghcr.io/ublue-os/bazzite-nvidia:stable
  rpm-ostree rebase ostree-image-signed:
ublue-os/config//build/ublue-os-just/40-nvidia.just defines the `ujust configure-nvidia` and `ujust toggle-nvk` commands: https://github.com/ublue-os/config/blob/main/build/ublue-os-...

There's a default `distrobox` with pytorch in ublue-os/config//build/ublue-os-just/etc-distrobox/apps.ini: https://github.com/ublue-os/config/blob/main/build/ublue-os-...

  [mlbox]
  image=nvcr.io/nvidia/pytorch:23.08-py3
  additional_packages="nano git htop"
  init_hooks="pip3 install huggingface_hub tokenizers transformers accelerate datasets wandb peft bitsandbytes fastcore fastprogress watermark torchmetrics deepspeed"
  pre-init-hooks="/init_script.sh"
  nvidia=true
  pull=true
  root=false
  replace=false
docker.io/rocm/pytorch: https://hub.docker.com/r/rocm/pytorch

pytorch/builder//manywheel/Dockerfile: https://github.com/pytorch/builder/blob/main/manywheel/Docke...

ROCm/pytorch//Dockerfile: https://github.com/ROCm/pytorch/blob/main/Dockerfile

The ublue-os (and so also bazzite) OCI host image Containerfile has Sunshine installed; which is a 4k HDR 120fps remote desktop solution for gaming.

There's a `ujust remove-sunshine` command in system_files/desktop/shared/usr/share/ublue-os/just/80-bazzite.just : https://github.com/ublue-os/bazzite/blob/main/system_files/d... and also kernel args for AMD:

  pstate-force-enable:
    rpm-ostree kargs --append-if-missing=amd_pstate=active
ublue-os/config//Containerfile: https://github.com/ublue-os/config/blob/main/Containerfile

LizardByte/Sunshine: https://github.com/LizardByte/Sunshine

moonlight-stream https://github.com/moonlight-stream

Anyways, hopefully this PR fixes the immediate issue: https://github.com/invoke-ai/InvokeAI/pull/5714/files

conda-forge/pytorch-cpu-feedstock > "Add ROCm variant?": https://github.com/conda-forge/pytorch-cpu-feedstock/issues/...

And Fedora supports OCI containers as host images and also podman container images with just systemd to respawn one or a pod of containers.

replies(1): >>39367834 #
37. Cu3PO42 ◴[] No.39367834{4}[source]
I actually used the rocm/pytorch image you also linked.

I'm not sure what you're pointing to with your reference to the Fedora-based images. I'm quite happy with my NixOS install and really don't want to switch to anything else. And as long as I have the correct kernel module, my host OS really shouldn't matter to run any of the images.

And I'm sure it can be made to work with many base images, my point was just that the dependency management around pytorch was in a bad state, where it is extremely easy to break.

> Anyways, hopefully this PR fixes the immediate issue: https://github.com/invoke-ai/InvokeAI/pull/5714/files

It does! At least for me. It is my PR after all ;)

replies(1): >>39370007 #
38. westurner ◴[] No.39370007{5}[source]
Unfortunately, NixOS (and Debian and Ubuntu) lack SELinux policies or other LSM implementations out of the box, and container-selinux covers more than just docker.

Is there a way to `restorecon --like / /nix/os/root72` to apply SELinux extended filesystem attribute labels just to NixOS prefixes?

Some research is done with RPM-based distros, which have become quite advanced with rpm-ostree support.

FWICS Bazzite has NixOS support too, in addition to distrobox containers.

Bazzite has a lot of other stuff installed that's not necessary when attempting to isolate sources of variance in the interest of reproducible research; but, being aimed at gaming, it has various optimizations.

InvokeAI might be faster to install and to compute with using conda-forge builds.

39. weebull ◴[] No.39375351{4}[source]
Indeed, but I think people forget that the reason AMD has a license in the first place is that Intel's customers in the early days required a second source for its processors.

Who owns the Cyrix x86 license these days?

40. mtrower ◴[] No.39433413[source]
This is the exact reason* I bought a 4090 for my recent rebuild instead of the RDNA card I actually wanted. I really wanted to go with AMD for the driver integration with the Linux graphics stack; I'm so, so tired of shenanigans when it comes to decades-old features of X not working or working poorly due to some Nvidia bug/non-integration.

But being able to leverage my graphics card for GPGPU was a top priority for me, and like you, I was appalled with the ROCm situation. Not necessarily the tech itself (though I did not enjoy the docker approach), but more the developer situation surrounding it.

* well, that and some vague notions about RTX