
548 points | nsagent | 2 comments

benreesman No.44566742
I wonder how much this is a result of Strix Halo. I had a fairly standard stipend for a work computer that I didn't end up using for a while, so I recently cashed it in on the EVO-X2 and, fuck me sideways: that thing is easily competitive with the mid-range znver5 EPYC machines I run substitors on. It mops the floor with any mere-mortal EC2 or GCE instance; maybe some r1337.xxxxlarge.metal.metal or something has an edge, but it blows away the z1d.metal and c6.2xlarge type stuff (fast cores, good NIC, table stakes). And those things are $3-10K a month with heavy provisioned IOPS. This thing has real NVMe and it cost $1,800.

I haven't done much local inference on it, but various YouTubers are starting to call the DGX Spark overkill / overpriced next to Strix Halo. The catch, of course, is that ROCm isn't there yet (they seem serious about it now, though; matter of time).

Flawless CUDA on Apple gear would make it really tempting, in a way that it isn't while Strix Halo is this cheap and good.

hamandcheese No.44566921
For the uninitiated, Strix Halo is the same as the AMD Ryzen AI Max+ 395 which will be in the Framework Desktop and is starting to show up in some mini PCs as well.

The memory bandwidth on that thing is 200 GB/s. That's great compared to most other consumer-level x86 platforms, but quite far off an Nvidia GPU (a 5090 has 1792 GB/s; dunno about the pro-level cards) or even Apple's best (the M3 Ultra has 800 GB/s).

It certainly seems like a great value. But for memory bandwidth intensive applications like LLMs, it is just barely entering the realm of "good enough".
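
For a rough sense of what those numbers mean for LLMs: during single-stream token generation, essentially all of the weights get streamed from memory for every token, so bandwidth divided by weight size gives a ceiling on tokens per second. A back-of-envelope sketch (the model size and quantization below are assumptions for illustration, not benchmarks):

    # Decode-speed ceiling: each generated token streams ~all weights once, so
    #   tokens/s  <=  memory_bandwidth / bytes_of_weights
    # Bandwidth figures from the comment above; the model is an assumption.
    bandwidths_gb_s = {
        "Strix Halo (Ryzen AI Max+ 395)": 200,
        "M3 Ultra": 800,
        "RTX 5090": 1792,
    }

    def decode_ceiling_tok_s(params_billion, bytes_per_param, bw_gb_s):
        bytes_of_weights = params_billion * 1e9 * bytes_per_param
        return bw_gb_s * 1e9 / bytes_of_weights

    for name, bw in bandwidths_gb_s.items():
        # Example workload: a 70B dense model at 4-bit (~0.5 bytes per parameter)
        print(f"{name}: ~{decode_ceiling_tok_s(70, 0.5, bw):.0f} tok/s ceiling")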

Rohansi No.44567169
You're comparing theoretical maximum memory bandwidth. It's not enough to look only at memory bandwidth, because with lots of bandwidth available you're much more likely to be compute limited. For example, the M1 had so much bandwidth that it couldn't make use of it all even when fully loaded.

zargon No.44568061
GPUs have both the bandwidth and the compute. During token generation, no compute is needed. But both Apple silicon and Strix Halo fall on their face during prompt ingestion, due to lack of compute.

supermatt No.44569026
Compute (and lots of it) is absolutely needed for generation - tens of billions of FLOPs per token even on the smaller models (7B) - with the compute for larger models scaling proportionally.

Each token requires a forward pass through all transformer layers, involving large matrix multiplications at every step, followed by a final projection to the vocabulary.
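
A quick sanity check on the order of magnitude, using the common rule of thumb of ~2 FLOPs per parameter per generated token for a dense transformer's forward pass (attention over the KV cache adds more on top, so this is a floor, and a sketch rather than an exact count):

    # ~2 FLOPs (one multiply + one add) per weight per generated token.
    def flops_per_token(n_params):
        return 2 * n_params

    for n_billion in (7, 70, 104):
        gflops = flops_per_token(n_billion * 1e9) / 1e9
        print(f"{n_billion}B params: ~{gflops:.0f} GFLOPs per token")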

zargon No.44569146
Obviously I don't mean literally zero compute. The amount of compute needed scales with the number of parameters, but I have yet to use a model with so many parameters that token generation becomes compute bound (up to 104B for dense models). During token generation most of the time is spent idle, waiting for weights to transfer from memory. The processor is bored out of its mind waiting for more data. Memory bandwidth is the bottleneck.
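
A sketch of why that is: at batch size 1, compare the time to stream the weights against the time to do the math for one token. The hardware figures below are assumed, illustrative numbers (roughly Strix Halo-class bandwidth), not measurements:

    # Per token at batch size 1: read ~all weights once, do ~2 FLOPs per param.
    params = 70e9                 # assumed 70B dense model
    bytes_per_param = 2           # fp16/bf16 weights
    bandwidth = 200e9             # bytes/s (assumed)
    peak_flops = 50e12            # FLOP/s (assumed)

    t_memory = params * bytes_per_param / bandwidth   # ~0.70 s per token
    t_compute = 2 * params / peak_flops               # ~0.003 s per token

    print(f"memory: {t_memory*1e3:.0f} ms/token, compute: {t_compute*1e3:.1f} ms/token")
    # Weight transfer dominates by a couple hundred x, so the ALUs mostly sit idle.
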
supermatt No.44570549
It sounds like you aren’t batching efficiently if you are being bound by memory bandwidth.

zargon No.44573346
That's right; in the context of Apple silicon and Strix Halo, these use cases don't involve much batching.
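
The reason batching changes the picture: with B sequences decoded concurrently, each weight streamed from memory is reused for B tokens' worth of math, so aggregate throughput climbs until compute rather than bandwidth becomes the limit. Single-user local chat is effectively B = 1. A roofline-style sketch, reusing the same assumed figures as above:

    # Aggregate tokens/s vs batch size for a dense model (assumed numbers).
    params, bytes_per_param = 70e9, 2
    bandwidth, peak_flops = 200e9, 50e12

    def tokens_per_s(batch):
        t_mem = params * bytes_per_param / bandwidth   # weights read once per step
        t_cmp = 2 * params * batch / peak_flops        # math scales with batch size
        return batch / max(t_mem, t_cmp)

    for b in (1, 8, 64, 512):
        print(f"batch {b:>3}: ~{tokens_per_s(b):.0f} tok/s aggregate")
    # Throughput rises with batch size until it hits the compute roofline.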