548 points nsagent | 75 comments
1. lukev ◴[] No.44567263[source]
So to make sure I understand, this would mean:

1. Programs built against MLX -> Can take advantage of CUDA-enabled chips

but not:

2. CUDA programs -> Can now run on Apple Silicon.

Because #2 would be a copyright violation (specifically with respect to NVidia's famous moat).

Is this correct?

replies(9): >>44567309 #>>44567350 #>>44567355 #>>44567600 #>>44567699 #>>44568060 #>>44568194 #>>44570427 #>>44577999 #
2. saagarjha ◴[] No.44567309[source]
No, it's because doing 2 would be substantially harder.
replies(2): >>44567356 #>>44567414 #
3. ls612 ◴[] No.44567350[source]
#2 would be Google v. Oracle wouldn’t it?
4. quitit ◴[] No.44567355[source]
It's 1.

It means that a developer can use their relatively low-powered Apple device (with UMA) to develop for deployment on nvidia's relatively high-powered systems.

That's nice to have for a range of reasons.

replies(5): >>44568550 #>>44568740 #>>44569683 #>>44570543 #>>44571119 #
5. lukev ◴[] No.44567356[source]
There's a massive financial incentive (billions) to allow existing CUDA code to run on non-NVidia hardware. Not saying it's easy, but is implementation difficulty really the blocker?
replies(5): >>44567393 #>>44567539 #>>44568123 #>>44573767 #>>44574809 #
6. saagarjha ◴[] No.44567393{3}[source]
Yes. See: AMD
replies(1): >>44567420 #
7. hangonhn ◴[] No.44567414[source]
Is CUDA tied very closely to the Nvidia hardware and architecture so that all the abstraction would not make sense on other platforms? I know very little about hardware and low level software.

Thanks

replies(4): >>44567469 #>>44567535 #>>44568191 #>>44568597 #
8. lukev ◴[] No.44567420{4}[source]
AMD has never implemented the CUDA API. And not for technical reasons.
replies(1): >>44567444 #
9. gpm ◴[] No.44567444{5}[source]
They did, or at least they paid someone else to.

https://www.techpowerup.com/319016/amd-develops-rocm-based-s...

replies(2): >>44568534 #>>44579527 #
10. saagarjha ◴[] No.44567469{3}[source]
Yes, also it's a moving target where people don't just want compatibility but also good performance.
11. dagmx ◴[] No.44567535{3}[source]
CUDA isn't really that hyper-specific to NVIDIA hardware as an API.

But a lot of the most useful libraries are closed source and available on NVIDIA hardware only.

You could probably get most open-source CUDA to run on other vendors' hardware without crazy work. But you'd spend a ton more work getting to parity on the ecosystem, plus lawyer fees when NVIDIA comes at you.
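For a concrete sense of that, a minimal sketch (illustrative only): everyday CUDA like this uses nothing NVIDIA-exotic, which is why mechanical porting mostly works:

    // Plain CUDA C++ vector add: nothing here is tied to NVIDIA silicon.
    __global__ void vadd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }
    // Launched as: vadd<<<(n + 255) / 256, 256>>>(a, b, c, n);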

12. lmm ◴[] No.44567539{3}[source]
I think it's ultimately a project management problem, like all hard problems. Yes it's a task that needs skilled programmers, but if an entity was willing to pay what programmers of that caliber cost and give them the conditions to let them succeed they could get it done.
13. dagmx ◴[] No.44567600[source]
2 also further cements CUDA as the de facto API to target, and nobody would write MLX targeted code instead.

This way, you're more incentivized to write MLX and have it run everywhere. It's a situation where everyone wins, especially Apple, because they can optimize it further for their platforms.

14. tekawade ◴[] No.44567699[source]
I want #3: being able to connect an NVIDIA GPU to Apple Silicon and run CUDA, taking advantage of Apple silicon + unified memory + GPU + CUDA with PyTorch, JAX, or TensorFlow.

Haven’t really explored MLX so can’t speak about it.

replies(1): >>44578022 #
15. tho234j2344 ◴[] No.44568060[source]
I don't think #2 is really true - AMD's HIP is doing this exact thing after giving up on OpenCL way back in ~'17/'18.
replies(1): >>44568122 #
16. NekkoDroid ◴[] No.44568122[source]
I haven't looked into it, but doesn't HIP need everything to be recompiled against it? To my understanding it's mostly source-code translation, effectively.
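Something like this, if I understand correctly (a hedged sketch of the hipify-style renames; same structure, new spelling):

    #include <cuda_runtime.h>
    __global__ void twice(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }
    void run(const float* h_a, int n) {
        size_t bytes = n * sizeof(float);
        float* d_a;
        cudaMalloc(&d_a, bytes);                              // HIP: hipMalloc
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // HIP: hipMemcpy
        twice<<<(n + 255) / 256, 256>>>(d_a, n);              // same launch syntax
        cudaDeviceSynchronize();                              // HIP: hipDeviceSynchronize
        cudaFree(d_a);                                        // HIP: hipFree
    }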
replies(1): >>44568527 #
17. fooker ◴[] No.44568123{3}[source]
Existing high-performance CUDA code is almost all first-party libraries, written by NVIDIA, using weird internal flags and inline PTX.

You can get 90% of the way there with a small team of compiler devs. The remaining 10% would take hundreds of people working ten years. The cost of this is suspiciously close to the billions in financial incentive you mentioned; funny how efficient markets work.

replies(2): >>44568168 #>>44568589 #
18. lcnielsen ◴[] No.44568168{4}[source]
> funny how efficient markets work.

Can one really speak of efficient markets when there are multiple near-monopolies at various steps of the production chain, massive integration, and infinite amounts of state spending in the process?

replies(2): >>44568219 #>>44568327 #
19. lcnielsen ◴[] No.44568191{3}[source]
The kind of CUDA you or I would write is not very hardware-specific (a few constants here and there), but the kind of CUDA behind cuBLAS, with a million magic flags, inline PTX ("GPU assembly"), and exploitation of driver/firmware hacks, is. It's like the difference between numerics code in C and numerics code in C with tons of inline assembly for each of a number of specific processors.

You can see similar things if you buy datacenter-grade CPUs from AMD or Intel and compare their per-model optimized BLAS builds and compilers to using OpenBLAS or swapping them around. The difference is not world-ending, but you can see maybe 50% in some cases.
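To make the contrast concrete, a minimal illustrative sketch (not from cuBLAS, just the flavor):

    // Portable CUDA: the compiler picks the instructions.
    __global__ void axpy(float* y, const float* x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }
    // Hand-tuned flavor: pin down the exact instruction (rounding mode and all)
    // with inline PTX, the way library code often does.
    __global__ void axpy_ptx(float* y, const float* x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float r;
            asm("fma.rn.f32 %0, %1, %2, %3;" : "=f"(r) : "f"(a), "f"(x[i]), "f"(y[i]));
            y[i] = r;
        }
    }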

20. sitkack ◴[] No.44568194[source]
#2 is not a copyright violation. You can reimplement APIs.
replies(2): >>44568364 #>>44568387 #
21. bigyabai ◴[] No.44568219{5}[source]
Sure they can. CUDA used to have a competitor, sponsored by Apple. Its name is OpenCL.
replies(2): >>44569174 #>>44579523 #
22. fooker ◴[] No.44568327{5}[source]
Yes, free markets and monopolies are not incompatible.

When a monopoly uses its status in an attempt to gain another monopoly, that's a problem, and governments eventually strike that behavior down.

Sometimes it takes time, because you'd rather not go on an ideological power trip and break something that's useful to the country/world.

replies(1): >>44568796 #
23. 7734128 ◴[] No.44568364[source]
The famous Android Java fight is probably the most important case in that discussion.
replies(1): >>44568806 #
24. adastra22 ◴[] No.44568387[source]
CUDA is not an API, it is a set of libraries written by NVIDIA. You'd have to reimplement those libraries, and for people to care at all you'd have to reimplement the optimizations in those libraries. That does get into various IP issues.
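To put it concretely, a sketch: most "CUDA code" in the wild is calls like this, with the heavy lifting inside NVIDIA's closed library:

    #include <cublas_v2.h>
    // C = A * B on device buffers (column-major; dA is m x k, dB is k x n, dC is m x n).
    void sgemm(const float* dA, const float* dB, float* dC, int m, int n, int k) {
        cublasHandle_t h;
        cublasCreate(&h);
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, dA, m, dB, k, &beta, dC, m);
        cublasDestroy(h);
    }

A compatible runtime could accept this call tomorrow; matching the tuned performance behind it is the hard part.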
replies(3): >>44568506 #>>44568575 #>>44570953 #
25. Imustaskforhelp ◴[] No.44568506{3}[source]
Even if its not as optimized, it would still be nice to see a CUDA alternative really

Also I do wonder what the difference b/w a API and a set of libraries are, couldn't an API be exposed from that set of libraries which could be used? Its a little confusing I guess

replies(1): >>44568626 #
26. pjmlp ◴[] No.44568527{3}[source]
For CUDA C++, not CUDA the ecosystem.
27. Imustaskforhelp ◴[] No.44568534{6}[source]
But I think there was then some lawsuit, or the threat of one; the ROCm guy/team had gotten really far ahead, but AMD dropped it out of fear of a lawsuit or because of one outright.

Now they have had to stop working on some part of the source code and rewrite a lot of things; they are still not as far along as they were before AMD's lawyer shenanigans.

28. _zoltan_ ◴[] No.44568550[source]
"relatively high powered"? there's nothing faster out there.
replies(4): >>44568714 #>>44568716 #>>44568748 #>>44569262 #
29. pjmlp ◴[] No.44568575{3}[source]
CUDA is neither an API nor a set of libraries; people get this wrong all the time.

CUDA is an ecosystem of programming languages, libraries, and developer tools.

It is composed of compilers for C, C++, Fortran, and Python JIT DSLs, provided by NVidia, plus several others targeting either PTX or NVVM IR.

The libraries, which you correctly point out.

And then the IDE integrations, the GPU debugger on par with Visual Studio-style debugging, the profiler, ...

Hence why everyone that focuses on copying only CUDA C or CUDA C++, without everything else that makes CUDA relevant, keeps failing.

replies(1): >>44572506 #
30. pjmlp ◴[] No.44568589{4}[source]
And the tooling; people keep forgetting about CUDA tooling.
31. pjmlp ◴[] No.44568597{3}[source]
CUDA is an ecosystem; many keep failing to understand that, trying to copy only the C++ compiler.
32. adastra22 ◴[] No.44568626{4}[source]
> couldn't an API be exposed from that set of libraries which could be used

And now you've entered that copyright violation territory.

replies(1): >>44568978 #
33. chvid ◴[] No.44568714{3}[source]
Relative to what you can get in the cloud or on a desktop machine.
34. MangoToupe ◴[] No.44568716{3}[source]
Is this true per watt?
replies(1): >>44569017 #
35. chvid ◴[] No.44568740[source]
If Apple cannot do their own implementation of CUDA due to copyright, the second best is this: getting developers to build for MLX (which is on their laptops) and still get NVIDIA hardware support.

Apple should do a similar thing for AMD.

replies(2): >>44569645 #>>44570359 #
36. sgt101 ◴[] No.44568748{3}[source]
I wonder what Apple would have to do to make Metal + its processors run faster than NVIDIA? I guess it's all about the interconnects, really.
replies(1): >>44569316 #
37. Perseids ◴[] No.44568796{6}[source]
> > Can one really speak of efficient markets

> Yes, free markets and monopolies are not incompatible.

How did you get from "efficient markets" to "free markets"? The first could be accepted as inherently valuable, while the latter clearly is not, if this kind of freedom degrades to: "Sure you can start your business, it's a free country. For certain, you will fail, though, because there are monopolies already in place who have all the power in the market."

Also, monopolies are regularly used to squeeze exorbitant shares of the added value from the other market participants, see e.g. Apple's App Store cut. Accepting that as "efficient" would be a really unusual usage of the term in regard to markets.

replies(2): >>44572181 #>>44572354 #
38. hnfong ◴[] No.44568806{3}[source]
Indeed.

Unfortunately when that case went to the Supreme Court they basically just said "yeah for this case it's fair use, but we're not going to comment on whether APIs in general are copyrightable"...

39. Someone ◴[] No.44568978{5}[source]
IP infringement, not copyright violation.

A clean room reimplementation of CUDA would avoid any copyright claims, but would not necessarily avoid patent infringement.

https://en.wikipedia.org/wiki/Clean-room_design:

“Clean-room design is useful as a defense against copyright infringement because it relies on independent creation. However, because independent invention is not a defense against patents, clean-room designs typically cannot be used to circumvent patent restrictions.”

replies(2): >>44573410 #>>44578961 #
40. spookie ◴[] No.44569017{4}[source]
It doesn't matter for a lot of applications. But fair, for a big part of them it is either essential or nice to have. It's completely off the point, though, if we are chasing the fastest compute no matter what.
replies(1): >>44570777 #
41. dannyw ◴[] No.44569174{6}[source]
And after Apple dropped NVIDIA, they stopped caring about OpenCL performance on their GPUs.
42. quitit ◴[] No.44569262{3}[source]
Relative to the Apple hardware, the NVIDIA hardware is high-powered.

I appreciate that English is your second language after your Hungarian mother tongue. My comment reflects on the low- and high-powered compute of the Apple vs. NVIDIA hardware.

43. summarity ◴[] No.44569316{4}[source]
Right now, for LLMs, the only limiting factor on Apple Silicon is memory bandwidth. There hasn’t been progress on this since the original M1 Ultra. And since abandoning UltraFusion, we won’t see progress here anytime soon either.
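Back-of-envelope for why (numbers illustrative, assuming decode is bandwidth-bound and streams the full weight set per token):

    \text{tokens/s} \lesssim \frac{\text{memory bandwidth}}{\text{bytes of weights}} \approx \frac{800\ \text{GB/s}}{35\ \text{GB}} \approx 23

for an M-class Ultra (~800 GB/s) running a 70B-parameter model at ~4-bit.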
replies(3): >>44569480 #>>44569623 #>>44569854 #
44. glhaynes ◴[] No.44569623{5}[source]
Have they abandoned UltraFusion? Last I’d heard, they’d just said something like “not all generations will get an Ultra chip” around the time the M4 showed up (the first M chip lacking an Ultra variation), which makes me think the M5 or M6 is fairly likely to get an Ultra.
45. ◴[] No.44569645{3}[source]
46. ◴[] No.44569683[source]
47. librasteve ◴[] No.44569854{5}[source]
this is like saying the only limiting factor on computers is the von Neumann bottleneck
48. xd1936 ◴[] No.44570359{3}[source]
I thought that the US Supreme Court decision in Google v. Oracle and the Java reimplementation provided enough case precedent to allow companies to re-implement something like CUDA APIs?

https://www.theverge.com/2021/4/5/22367851/google-oracle-sup...

https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....

replies(1): >>44574685 #
49. pxc ◴[] No.44570427[source]
Copyright can't prohibit compatible implementations that are developed independently through reverse engineering, if the implementers are very careful about the way they work.

(Software patents can, though.)

50. randomNumber7 ◴[] No.44570543[source]
What is the performance penalty compared to a program in native CUDA?
51. johnboiles ◴[] No.44570777{5}[source]
...fastest compute no matter watt
52. vFunct ◴[] No.44570953{3}[source]
So if people aren't aware, you can have AI reimplement CUDA libraries for any hardware, as well as develop new ones.

You wouldn't believe me if you didn't try it and see for yourself, so try it.

NVidia's CUDA moat is no more.

replies(1): >>44578964 #
53. karmakaze ◴[] No.44571119[source]
It would be great for Apple if enough developers took this path and Apple could later release datacenter GPUs that support MLX without CUDA.
replies(1): >>44574044 #
54. privatelypublic ◴[] No.44572181{7}[source]
You scuttled your argument by using Apple's App Store as an example.
55. ameliaquining ◴[] No.44572354{7}[source]
The term "efficient markets" tends to confuse and mislead people. It refers to a particular narrow form of "efficiency", which is definitely not the same thing as "socially optimal". It's more like "inexploitability"; the idea is that in a big enough world, any limited opportunities to easily extract value will be taken (up to the opportunity cost of the labor of the people who can take them), so you shouldn't expect to find any unless you have an edge. The standard metaphor is, if I told you that there's a $20 bill on the sidewalk in Times Square and it's been there all week, you shouldn't believe me, because if it were there, someone would have picked it up.

(The terminology is especially unfortunate because people tend to view it as praise for free markets, and since that's an ideological claim people respond with opposing ideological claims, and now the conversation is about ideology instead of about understanding a specific phenomenon in economics.)

This is fully compatible with Apple's App Store revenue share existing and not creating value (i.e., being rent). What the efficient markets principle tells us is that, if it were possible for someone else to start their own app store with a smaller revenue share and steal Apple's customers that way, then their revenue share would already be much lower, to account for that. Since this isn't the case, we can conclude that there's some reason why starting your own competing app store wouldn't work. Of course, we already separately know what that reason is: an app store needs to be on people's existing devices to succeed, and your competing one wouldn't be.

Similarly, if it were possible to spend $10 million to create an API-compatible clone of CUDA, and then save more than $10 million by not having to pay huge margins to Nvidia, then someone would have already done it. So we can conclude that either it can't be done for $10 million, or it wouldn't create $10 million of value. In this case, the first seems more likely, and the comment above hypothesizes why: because an incomplete clone wouldn't produce $10 million of value, and a complete one would cost much more than $10 million. Alternatively, if Nvidia could enforce intellectual property rights against someone creating such a clone, that would also explain it.

(Technically it's possible that this could instead be explained by a free-rider problem; i.e., such a clone would create more value than it would cost, but no company wants to sponsor it because they're all waiting for some other company to do it and then save the $10 million it would cost to do it themselves. But this seems unlikely; big tech companies often spend more than $10 million on open source projects of strategic significance, which a CUDA clone would have.)

56. CamperBob2 ◴[] No.44572506{4}[source]
Only the runtime components matter, though. Nobody cares about the dev tools beyond the core compiler. What people want is to be able to recompile and run on competitive hardware, and I don't understand why that's such an intractable problem.
replies(4): >>44573430 #>>44573665 #>>44573734 #>>44579484 #
57. dragonwriter ◴[] No.44573410{6}[source]
> A clean room reimplementation of CUDA would avoid any copyright claims,

Assuming APIs are either not copyrightable or that API reimplementation is always fair use of the API, neither of which there is sufficient precedent to justify as a conclusion; Oracle v. Google ended with "well, it would be fair use in the exact factual circumstances of this case, so we don't have to reach the thornier general questions".

58. outworlder ◴[] No.44573430{5}[source]
It is not.

However, companies may still be hoping to get their own solutions in place instead of CUDA. If they do implement CUDA, that cements its position forever. That ship has probably already sailed, of course.

59. StillBored ◴[] No.44573665{5}[source]
Because literally the entire rest of the ecosystem is immature demoware. Rather than each vendor buying into OpenCL+SPIR-V and building a robust stack around it, they are all doing their own half-baked tech demos, hoping to lock up some portion of the market to duplicate NVIDIA's success, or at least carve out a niche, while NVIDIA continues to extend and mature its ecosystem. Intel has oneAPI, AMD has ROCm, Arm has ACL/Kleidi/etc., and there's a pile of other stacks like MLX and Windows ML, combined with a confusing mix of pure software plays like PyTorch.

A lot of people talk about 'tooling' quality and no one hears them. I just spent a couple of weeks porting a fairly small library to some fairly common personal hardware and hit all the same problems you see everywhere. Bugs aren't handled gracefully: instead of returning "you messed up here", the hardware locks up, and power cycling is the only solution. Not a problem when you're writing hello world, but trawling through tens of thousands of lines of GPU kernel code to find the error is going to burn engineer time without anything to show for it. Then, when it's running, spending weeks in an open feedback loop trying to figure out why the GPU utilization metrics report 50% utilization (if you're lucky enough to even have them) and the kernel runs at 1/4 the expected performance is again going to burn weeks. All because there isn't a functional profiler.

And the vendors can't even get this stuff working. People rant about the ROCm support list not covering the hardware people actually have. And it is such a mess that in some cases it actually works but AMD says it doesn't. And of course, the only reason you hear people complaining about AMD is that they are literally the only company whose hardware ecosystem in theory spans the same breadth of devices that NVIDIA's does, from small embedded systems to giant datacenter-grade products. Everyone else wants a slice of the market; take Apple here: they have nothing in the embedded/edge space that isn't a fixed-function device (e.g. a watch, or an Apple TV), and their GPUs, while interesting, are nowhere near the level of the datacenter-grade stuff, much less even top-of-the-line AIC boards for gamers.

And it's all gotten to be such an industry-wide pile of trash that people can't even keep track of basic feature capabilities. A huge pile of hardware actually 'supports' OpenCL, but it's buried to the point where actual engineers working on, say, ROCm are unaware it's actually part of the ROCm stack (imagine my surprise!). And it's been the same for NVIDIA: they have at times supported OpenCL, but the support is like a .dll they install with the GPU driver stack, without even bothering to document that it's there. Or TensorFlow, which seems to have succumbed to the immense gravitational black hole it had become, where just building it on something that wasn't the blessed platform could take days.

60. int_19h ◴[] No.44573734{5}[source]
It's the same essential problem as with e.g. Wine - if you're trying to reimplement someone else's constantly evolving API with a closed-source implementation, it takes a lot of effort just to barely keep up.

As far as portability, people who care about that already have the option of using higher-level APIs that have CUDA backend among several others. The main reason why you'd want to do CUDA directly is to squeeze that last bit of performance out of the hardware, but that is also precisely the area where deviation in small details starts to matter a lot.

61. int_19h ◴[] No.44573767{3}[source]
From the market perspective, it's down to whether the amount of money needed to get there and stay there (keeping in mind that this would have to be an ongoing effort given that CUDA is not a static target) is more or less than the amount of money needed to just buy NVIDIA GPUs.
62. nightski ◴[] No.44574044{3}[source]
It's the other way around. If Apple released data center GPUs then developers might take that path. Apple has shown time and again they don't care for developers, so it's on them.
63. timhigins ◴[] No.44574685{4}[source]
Exactly, and see also ROCm/HIP, which is AMD's reimplementation of CUDA for their GPUs.
replies(2): >>44579461 #>>44593560 #
64. ivell ◴[] No.44574809{3}[source]
Modular is trying with its Mojo + MAX offering. It has taken quite a bit of effort to target NVidia and reach parity. They are now focusing on other hardware.
65. ◴[] No.44577999[source]
66. musicale ◴[] No.44578022[source]
Nvidia already has unified memory on Grace Blackwell etc.

I guess M5 Blackwell could be better, but there are business and technical barriers to making that happen.
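For reference, a minimal sketch of what CUDA's managed (unified) memory already looks like; the barrier is getting Apple silicon's UMA behind the same pointer semantics:

    #include <cuda_runtime.h>
    __global__ void scale(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }
    int main() {
        int n = 1 << 20;
        float* x;
        cudaMallocManaged(&x, n * sizeof(float)); // one pointer, visible to CPU and GPU
        for (int i = 0; i < n; i++) x[i] = 1.0f;  // host writes directly
        scale<<<(n + 255) / 256, 256>>>(x, n);    // device uses the same pointer
        cudaDeviceSynchronize();
        cudaFree(x);
        return 0;
    }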

replies(1): >>44600346 #
67. adastra22 ◴[] No.44578961{6}[source]
I didn't read the GP post as talking about clean-room reimplementation, but rather just serving the same NVIDIA-written libraries on top of AMD hardware.
68. adastra22 ◴[] No.44578964{4}[source]
If it is so easy, please go do so.
69. pjmlp ◴[] No.44579461{5}[source]
Reimplementation of CUDA C++, not CUDA.

CUDA is a set of four compilers (C, C++, Fortran, and Python JIT DSLs), a bytecode and two compiler backend libraries, a collection of compute libraries for the languages listed above, plugins for Eclipse and Visual Studio, and a GPU graphical debugger and profiler.

70. pjmlp ◴[] No.44579484{5}[source]
CUDA's adoption proves otherwise.
71. pjmlp ◴[] No.44579523{6}[source]
It was never really a competitor, as the other two sponsors, Intel and AMD, never delivered anything great with it.

Additionally, the tooling was horrendous: plain old C, with the same compilation model as OpenGL.

It took a hard beating from CUDA to finally add a bytecode format (SPIR) and at least support C++ as well.

Additionally, the other mobile-OS big name never cared about OpenCL, and pushed their own thing instead, Renderscript.

replies(1): >>44584811 #
72. pjmlp ◴[] No.44579527{6}[source]
Partially, the CUDA C++ API, not CUDA APIs.
73. bigyabai ◴[] No.44584811{7}[source]
So, to reiterate: Apple used to have a CUDA competitor that was so bad you guys get mad when I call it competition.
74. qalmakka ◴[] No.44593560{5}[source]
There's ZLUDA for AMD, which actually implements CUDA, but it's still quite immature.
75. tekawade ◴[] No.44600346{3}[source]
I meant that the Apple ecosystem does not support NVIDIA GPUs anymore. NVIDIA itself, with CUDA, supports DMA.