
311 points melodyogonna | 43 comments
MontyCarloHall ◴[] No.45138920[source]
The reason why Python dominates is that modern ML applications don't exist in a vacuum. They aren't the standalone C/FORTRAN/MATLAB scripts of yore that load in some simple, homogeneous data, crunch some numbers, and spit out a single result. Rather, they are complex applications with functionality extending far beyond the number crunching, which requires a robust preexisting software ecosystem.

For example, a modern ML application might need an ETL pipeline to load and harmonize data of various types (text, images, video, etc., all in different formats) from various sources (local filesystem, cloud storage, HTTP, etc.) The actual computation then must leverage many different high-level functionalities, e.g. signal/image processing, optimization, statistics, etc. All of this computation might be too big for one machine, and so the application must dispatch jobs to a compute cluster or cloud. Finally, the end results might require sophisticated visualization and organization, with a GUI and database.

No language besides Python has an ecosystem rich enough to provide literally all of the aforementioned functionality. Python's numerical computing libraries (NumPy, PyTorch, JAX, etc.) all call out to C/C++/FORTRAN under the hood and are thus extremely high-performance. For functionality they don't implement, Python's C/C++ FFIs (e.g. Python.h, NumPy's C API, PyTorch/Boost C++ integration) are not perfect, but they are good enough that implementing the performance-critical portions in C/C++ is much easier than re-implementing entire ecosystems of packages in another language like Julia.

replies(8): >>45139364 #>>45140601 #>>45141802 #>>45143317 #>>45144664 #>>45146179 #>>45146608 #>>45146905 #
1. Hizonner ◴[] No.45139364[source]
This guy is worried about GPU kernels, which are never, ever written in Python. As you point out, Python is a glue language for ML.

> There is no single language with a rich enough ecosystem that can provide literally all of the aforementioned functionality besides Python.

That may be true, but some of us are still bitter that all that grew up around an at-least-averagely-annoying language rather than something nicer.

replies(5): >>45139454 #>>45140625 #>>45141909 #>>45142782 #>>45147478 #
2. MontyCarloHall ◴[] No.45139454[source]
>This guy is worried about GPU kernels

Then the title should be "why GPU kernel programming needs a new programming language." I can get behind that; I've written CUDA C and it was not fun (though this was over a decade ago and things may have since improved, not to mention that the code I wrote then could today be replaced by a couple lines of PyTorch). That said, GPU kernel programming is fairly niche: for the vast majority of ML applications, the high-level API functions in PyTorch/TensorFlow/JAX/etc. provide optimal GPU performance. It's pretty rare that one would need to implement custom kernels.
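
For a concrete sense of that replacement, a minimal sketch (assuming a CUDA-capable PyTorch build; the shapes are arbitrary):

    import torch

    # A tiled matrix multiply that once meant pages of hand-written
    # CUDA C is now a single call that dispatches to cuBLAS.
    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")
    c = a @ b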

>which are never, ever written in Python.

Not true! Triton is a Python API for writing kernels, which are JIT compiled.
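
For a flavor of it, here is a minimal element-wise add in the style of Triton's tutorials (a sketch, assuming a CUDA machine with triton and torch installed):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
        pid = tl.program_id(axis=0)               # which block this program handles
        offs = pid * BLOCK + tl.arange(0, BLOCK)  # per-element offsets
        mask = offs < n                           # guard the ragged tail
        x = tl.load(x_ptr + offs, mask=mask)
        y = tl.load(y_ptr + offs, mask=mask)
        tl.store(out_ptr + offs, x + y, mask=mask)

    x = torch.randn(4096, device="cuda")
    y = torch.randn(4096, device="cuda")
    out = torch.empty_like(x)
    add_kernel[(triton.cdiv(x.numel(), 1024),)](x, y, out, x.numel(), BLOCK=1024)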

replies(3): >>45140116 #>>45143238 #>>45147017 #
3. catgary ◴[] No.45140116[source]
I agree with you that writing kernels isn’t necessarily the most important thing for most ML devs. I think an MLIR-first workflow with robust support for the StableHLO and LinAlg dialects is the best path forward for ML/array programming, so on one hand I do applaud what Mojo is doing.

But I’m much more interested in how MLIR opens the door to “JAX in <x>”. I think Julia is moving in that direction with Reactant.jl, and I think there’s a Rust project doing something similar (I think burn.dev may be using ONNX as an even higher-level IR). In my ideal world, I would be able to write an ML model and training loop in some highly verified language and call it from Python/Rust for training.
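
To make the MLIR hand-off concrete, a small sketch of how JAX already exposes its StableHLO lowering (assuming a recent JAX):

    import jax
    import jax.numpy as jnp

    def f(x):
        return jnp.tanh(x @ x)

    # jax.jit traces f and lowers it to StableHLO, the MLIR dialect
    # that a "JAX in <x>" frontend or backend could target.
    print(jax.jit(f).lower(jnp.ones((4, 4))).as_text())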

4. ModernMech ◴[] No.45140625[source]
> This guy is worried about GPU kernels, which are never, ever written in Python. As you point out, Python is a glue language for ML.

That's kind of the point of Mojo: they're trying to solve the so-called "two-language problem" in this space. Why should you need two languages to write your glue code and your kernel code? Why can't there be a language that is as easy to write as Python but can still express GPU kernels for ML applications? That's what Mojo is trying to be, through clever use of LLVM MLIR.

replies(5): >>45141705 #>>45143663 #>>45144593 #>>45145100 #>>45145290 #
5. nostrademons ◴[] No.45141705[source]
It's interesting, people have been trying to solve the "two language problem" since before I started professionally programming 25 years ago, and in that time period two-language solutions have just gotten even more common. Back in the 90s they were usually spoken about only in reference to games and shell programming; now the pattern of "scripting language calls out to highly-optimized C or CUDA for compute-intensive tasks" is common for webapps, ML, cryptocurrency, drones, embedded, robotics, etc.

I think this is because many, many problem domains have a structure that lends itself well to two-language solutions. They have a small, homogeneous computation structure over lots of data that needs to run extremely fast. And they also have a lot of configuration and data-munging that is basically quick one-time setup but has to be specified somewhere; the more concisely you can specify it, the less human time development takes. The requirements on a language designed to run extremely fast are very different from those on one designed to be as flexible and easy to write as possible. You usually achieve quick execution by eschewing flexibility and picking a programming model fairly close to the machine model, but you achieve flexibility by building lots of convenience features into the language, most of which carry some cost in memory or indirection.
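
The canonical shape of the pattern, sketched with Python's ctypes calling into the C math library (purely illustrative; find_library may need help locating the library on some systems):

    import ctypes
    import ctypes.util

    # Flexible scripting layer: cheap one-time setup and configuration.
    libm = ctypes.CDLL(ctypes.util.find_library("m"))
    libm.cos.restype = ctypes.c_double
    libm.cos.argtypes = [ctypes.c_double]

    # Hot path: the actual number crunching runs in compiled C.
    print(libm.cos(0.0))  # 1.0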

There've been a number of attempts at "one language to rule them all", notably PL/I, C++, Julia (in the mathematical programming subdomain), and Common Lisp, but it often feels like the "flexible" subset is shoehorned in to fit the need for zero-cost abstractions, and/or the "compute-optimized" subset is almost a whole separate language that is bolted on with similar but more verbose syntax.

replies(3): >>45142735 #>>45144777 #>>45147626 #
6. jimbokun ◴[] No.45141909[source]
> That may be true, but some of us are still bitter that all that grew up around an at-least-averagely-annoying language rather than something nicer.

Don't worry. If you stick around this industry long enough you'll see this happen several more times.

replies(1): >>45142238 #
7. Hizonner ◴[] No.45142238[source]
I'm basically retired. But I'm still bitter about each of the times...
replies(1): >>45143106 #
8. soVeryTired ◴[] No.45142735{3}[source]
There's a very interesting video about the "1.5 language problem" in Julia [0]. The point being that when you write high-performance Julia it ends up looking nothing like "standard" Julia.

It seems like it's just extremely difficult to give fine-grained control over the metal while having an easy, ergonomic language that lets you just get on with your tasks.

[0] https://www.youtube.com/watch?v=RUJFd-rEa0k

9. almostgotcaught ◴[] No.45142782[source]
> which are never, ever written in Python

nah never ever ever ever ever ... except

https://github.com/FlagOpen/FlagGems

https://github.com/linkedin/Liger-Kernel

https://github.com/meta-pytorch/applied-ai

https://github.com/gpu-mode/triton-index

https://github.com/AlibabaPAI/FLASHNN

https://github.com/unslothai/unsloth

the number of people commenting on this stuff that don't know what they're actually talking about grows by leaps and bounds every day...

replies(1): >>45143363 #
10. anakaine ◴[] No.45143106{3}[source]
I'm not.

Move on with life and be happy. What we have is functional, easy to develop, and well supported.

replies(1): >>45143241 #
11. pama ◴[] No.45143238[source]
Although they are incredibly useful, ML applications, MLOps, ML DevOps, or whatever other production-related tech-stack terminology comes to mind provide critical scaffolding, infrastructure, and glue, but they are not strictly what the term "machine learning" refers to on its own. The key to machine learning is the massively parallel ability to train large neural network models efficiently, and the key to reaping the benefits of those trained models is the ability to evaluate them rapidly. Yes, you need data for the training and complex infrastructure for the applications, but the conferences, papers, and studies on machine learning mention these only in passing, in part because they are solvable and in part because they are largely orthogonal to ML.

Another way to think about it: it is NVIDIA's price that has gone to infinity in this recent run, not that of the database or disk-drive providers. So if someone finds a way to make the core ML part better with a programming-language solution, that is certainly very welcome, and the title is appropriate. (The fact that GPU programming is considered niche in the current state of ML is a strong argument for keeping the title as is.)
12. Hizonner ◴[] No.45143241{4}[source]
My bitterness is the only thing that keeps me happy.
13. Hizonner ◴[] No.45143363[source]
I stand corrected. I should have known people would be doing that in Python.

How many of the world's total FLOPs are going through those?

replies(2): >>45144301 #>>45147790 #
14. bobajeff ◴[] No.45143663[source]
I don't think Mojo can solve the two-language problem. Maybe if it were going to be a superset of Python? Anyway, I think that was actually Julia's goal, not Mojo's.
replies(1): >>45147057 #
15. almostgotcaught ◴[] No.45144301{3}[source]
Triton is a backend for PyTorch. Lately it is the backend. So it's definitely a double-digit percentage, if not over 50%.
replies(2): >>45145514 #>>45147142 #
17. Karrot_Kream ◴[] No.45144777{3}[source]
From what I can tell, gaming has mostly just embraced two language solutions. The big engines Unity, Unreal, and Godot have tight cores written in C/C++ and then scripting languages that are written on top. Hobby engines like Love2D often also have a tight, small core and are extensible with languages like Lua or Fennel.

Modern Common Lisp also seems to have given up its "one language to rule them all" mindset and is pretty okay with just dropping into CFFI to call into C libraries as needed. Over the years I've come to see that mindset as mostly a dead end. Python, web browsers, game engines, Emacs: these are all prominent living examples of two-language solutions that have come to dominate their problem spaces.

One aspect of the "two-language problem" that I find troubling, though, is that modern environments often ossify around the exact solution. For example, it's very difficult to have something like PyTorch in, say, Common Lisp, even though libcuda and libdnn should be fairly straightforward to wrap in Common Lisp (see [1] for Common Lisp CUDA bindings.) JS/TS/WASM that runs in the browser often depends on Chrome's behavior. Emacs continues to be tied to its ancient, tech-debt-ridden C runtime. There seems to be a lot of value tied up in the glue between the two chosen languages, and it's hard to recreate that value with other HLLs even if the "metal" language/runtime stays the same.

[1]: https://github.com/takagi/cl-cuda

replies(1): >>45144991 #
18. nostrademons ◴[] No.45144991{4}[source]
This may be because, while the computational core is small, much of the code and the value of the overall solution are actually in the HLL. That's the reason for using an HLL in the first place.

PyTorch is actually quite illustrative as a counterexample that proves the rule. It was based on Torch, which had very similar if not identical BLAS routines but used Lua as the scripting language. But now everybody uses PyTorch, because development of the Lua-based Torch stopped in 2017, so all the extra goodies that people rely on now are in the Python wrapper.

The only exception seems to be when multiple scripting languages are supported at roughly equal levels of development. So for example: SQLite continues to have most of its value in the C substrate, and is relatively easy to port to other languages, because it has so many language bindings that there's a strong incentive to write new functionality in C and keep the API simple. Ditto client libraries for things like MySQL, Postgres, MongoDB, Redis, etc. ZeroMQ has a bunch of bindings that are largely dumb passthroughs to the underlying C++ substrate.

But even a small imbalance can lead to one language being favored heavily in supporting tooling and documentation. Pola.rs is a Rust substrate and ships with bindings for Python, R, and Node.js, but all the examples on the website are in Python or Rust, and I rarely hear of a non-Python user picking it up.

replies(1): >>45145207 #
19. adsharma ◴[] No.45145100[source]
Python -> Mojo -> MLIR ... <target hardware>

Yes, you can write Mojo with Python syntax and transpile. You'd end up with something similar to Julia's 1.5-language problem.

Since the Mojo language is not fully specified, it's hard to know which language constructs can't be efficiently expressed in Python syntax.

Love MLIR and Mojo as debuggable/performant intermediate languages.

20. Karrot_Kream ◴[] No.45145207{5}[source]
Very interesting observation on SQLite vs Pola.rs. Also, how could I forget that Torch was originally a Lua library when I used it forever ago.

I also wonder how much of the ossification comes from the embodied logic in the HLL. SQLite wrappers tend to be very simple and let the C core do most of the work. Something like PyTorch, on the other hand, layers a lot of logic onto the underlying CUDA/BLAS; that essential complexity lives solely in Python, the HLL. This is also probably why libcurl has so many great wrappers in HLLs: libcurl does the heavy lifting.

The pain point I see repeatedly in putting most of the logic into the performant core is asynchrony. Every HLL seems to have its own way to do async execution (Python with asyncio, Node with its async runtime, Go with lightweight green threads (goroutines), Common Lisp with native threads, etc.) This means that the C core needs to be careful as to what to expose and how to accommodate various asynchrony patterns.
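
A sketch of the usual accommodation on the Python side; blocking_c_call is a hypothetical stand-in for a synchronous FFI call into the C core:

    import asyncio
    from concurrent.futures import ThreadPoolExecutor

    def blocking_c_call(x):
        return x * 2  # imagine this blocks inside the C core

    async def main():
        loop = asyncio.get_running_loop()
        # Adapt the synchronous core to asyncio by pushing the blocking
        # call onto a thread pool; every HLL needs its own such shim.
        with ThreadPoolExecutor() as pool:
            print(await loop.run_in_executor(pool, blocking_c_call, 21))

    asyncio.run(main())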

21. bjourne ◴[] No.45145290[source]
> Why can't there be a language which is both as easy to write as Python, but can still express GPU kernels for ML applications? That's what Mojo is trying to be through clever use of LLVM MLIR.

It already exists. It is called PyTorch/JAX/TensorFlow. These frameworks already contain sophisticated compilers for turning computational graphs into optimized GPU code. I dare say that they don't leave enough performance on the table for a completely new language to be viable.
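
A small sketch of what I mean (assuming a CUDA build of PyTorch 2.x):

    import torch

    def pointwise_chain(x):
        # several element-wise ops the graph compiler can fuse
        return torch.relu(x * 2.0 + 1.0).sum()

    # torch.compile captures the graph and, on CUDA, lowers it to
    # generated Triton kernels instead of one library call per op.
    fused = torch.compile(pointwise_chain)
    print(fused(torch.randn(1024, 1024, device="cuda")))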

replies(2): >>45147094 #>>45147780 #
22. lairv ◴[] No.45145514{4}[source]
It's the backend for torch.compile; PyTorch eager mode will still use cuBLAS/cuDNN/custom CUDA kernels. Not sure what the usage of torch.compile is.
replies(1): >>45145971 #
23. almostgotcaught ◴[] No.45145971{5}[source]
> not sure what's the usage of torch.compile

consider that at minimum both FB and OAI themselves definitely make heavy use of the Triton backend in PyTorch.

24. seanmcdirmid ◴[] No.45147017[source]
There are a lot of tricks for writing GPU code in a high-level language and using some sort of metaprogramming to make it work out. (I think Conal Elliott first did this in Haskell, where he also does symbolic differentiation in the same paper!)
25. davidatbu ◴[] No.45147057{3}[source]
Being a Python superset is literally a goal of Mojo mentioned in the podcast.

Edit: from other posts on this page, I've realized that being a superset of Python is now regarded as a nice-to-have by Modular, not a must-have. They realized it's harder than they initially thought, basically.

26. davidatbu ◴[] No.45147094{3}[source]
Last I checked, all of PyTorch, TensorFlow, and JAX sit at a layer of abstraction above GPU kernels. They expose GPU kernels (basically as nodes in the computational graph you mention), but they don't let you write GPU kernels.

Triton, CUDA, etc, let one write GPU kernels.

replies(2): >>45148597 #>>45148632 #
27. cavisne ◴[] No.45147142{4}[source]
Doesn’t Triton use its own intermediate language that then compiles to PTX?
replies(2): >>45147752 #>>45147783 #
28. anothernewdude ◴[] No.45147478[source]
> rather than something nicer.

Python was the something nicer. A lot of the other options were so much worse.

29. imtringued ◴[] No.45147626{3}[source]
You say this is some ideal outcome, but I want to get as far away from Python and C++ as possible.

Also, no. I can't use Python for inference, because it is too slow, so I have to export to TensorFlow Lite and run the model in C++, which essentially required me to rewrite half the code in C++ again.

30. almostgotcaught ◴[] No.45147752{5}[source]
Yes and?
31. saagarjha ◴[] No.45147780{3}[source]
There's plenty of performance on the table but I don't think it will be captured by a new language.
32. saagarjha ◴[] No.45147783{5}[source]
It has a fairly standard MLIR pipeline
33. saagarjha ◴[] No.45147790{3}[source]
A lot. OpenAI uses Triton for their critical kernels. Meta has torch.compile using it too. I know Anthropic is not using Triton but I think their homegrown compiler is also Python. xAI is using CUTLASS which is C++ but I wouldn't be surprised if they start using the Python API moving forward.
replies(1): >>45148453 #
34. almostgotcaught ◴[] No.45148453{4}[source]
Anthropic is a Jax shop
replies(1): >>45151753 #
35. bjourne ◴[] No.45148597{4}[source]
Yes, they kinda do. The computational graph you specify is completely different from the execution schedule it is compiled into. Whether it's 1, 2, or N kernels is irrelevant as long as it runs fast. Mojo being an HLL makes it conceptually no different from Python. Whether it will become better for DNNs in the future, time will tell.
replies(1): >>45148792 #
36. boroboro4 ◴[] No.45148632{4}[source]
torch.compile sits at both the level of the computation graph and of GPU kernels, and can fuse your operations by using the Triton compiler. I think something similar applies to JAX and TensorFlow by way of XLA, but I'm not 100% sure.
replies(1): >>45148807 #
37. davidatbu ◴[] No.45148792{5}[source]
I assume HLL = high-level language? Mojo definitely exposes lower-level facilities than Python. Chris has even described Mojo as "syntactic sugar over MLIR". (For example, the native integer type is defined in library code as a struct.)

> Whether it's 1, 2, or N kernels is irrelevant.

Not sure what you mean here. But new kernels are written all the time (flash-attn is a great example), and one can't write those in plain Python. E.g., flash-attn was originally written in CUDA C++, and now in Triton.

replies(1): >>45152799 #
38. davidatbu ◴[] No.45148807{5}[source]
Good point. But the overall point about Mojo exposing a different level of abstraction than Python still stands: I imagine that no amount of magic/operator fusion/etc. in `torch.compile()` would let one get reasonable performance for an implementation of, say, flash-attn. One would have to use CUDA/Triton/Mojo/etc.
replies(1): >>45149868 #
39. boroboro4 ◴[] No.45149868{6}[source]
But Python already operates fully at a different level of abstraction - you mention Triton yourself, and there is a new Python CUDA API too (the one similar to Triton). More to the point: FlashAttention 4 is actually written in Python.

Somehow Python has managed to be both a high-level and a low-level language for GPUs…

replies(1): >>45150630 #
40. davidatbu ◴[] No.45150630{7}[source]
IIUC, Triton uses Python syntax, but it has a separate compiler (which is kinda what Mojo is doing, except Mojo's syntax is a superset of Python's instead of a subset, like Triton's). I think it's fair to describe it as a different language (otherwise we'd also have to describe Mojo as "Python"). Triton's website and repo describe it as "the Triton language and compiler" (as opposed to, I dunno, "write GPU kernels in Python").

Also, flash attention is at v3-beta right now? [0] And it requires one of CUDA/Triton/ROCm?

[0] https://github.com/Dao-AILab/flash-attention

But maybe I'm out of the loop? Where do you see that flash attention 4 is written in Python?

replies(1): >>45150751 #
41. boroboro4 ◴[] No.45150751{8}[source]
From this perspective, PyTorch is a separate language, at least as soon as you start using torch.compile (only a subset of PyTorch Python will be compilable). That's the strength of Python: it's great for describing things and later for analyzing them (and compiling them, for example).

Just to be clear here: you use Triton from plain Python; it runs compilation inside.

Just as I'm pretty sure not all Mojo can be used to write kernels? I might be wrong here, but it would be very hard to fit general-purpose code into kernels (and, to be frank, pointless; constraints bring speed).

As for flash attention there was a leak: https://www.reddit.com/r/LocalLLaMA/comments/1mt9htu/flashat...

42. saagarjha ◴[] No.45151753{5}[source]
Surprised they got it working for Trainium
43. bjourne ◴[] No.45152799{6}[source]
Well, Mojo hasn't been released, so we can't say precisely what it can and can't do. If it can emit CUDA code, then it does so by transpiling Mojo into CUDA, and there is no reason why Python can't also be transpiled to CUDA.

What I mean is that DNN code is written at a much higher level than kernels. Kernels are just building blocks you use to instantiate your dataflow.