
311 points by melodyogonna | 5 comments
MontyCarloHall No.45138920
The reason why Python dominates is that modern ML applications don't exist in a vacuum. They aren't the standalone C/FORTRAN/MATLAB scripts of yore that load in some simple, homogeneous data, crunch some numbers, and spit out a single result. Rather, they are complex applications with functionality extending far beyond the number crunching, which requires a robust preexisting software ecosystem.

For example, a modern ML application might need an ETL pipeline to load and harmonize data of various types (text, images, video, etc., all in different formats) from various sources (local filesystem, cloud storage, HTTP, etc.). The actual computation must then leverage many different high-level functionalities, e.g. signal/image processing, optimization, statistics, etc. All of this computation might be too big for one machine, so the application must dispatch jobs to a compute cluster or cloud. Finally, the end results might require sophisticated visualization and organization, with a GUI and database.

There is no single language besides Python with an ecosystem rich enough to provide all of the aforementioned functionality. Python's numerical computing libraries (NumPy/PyTorch/JAX, etc.) all call out to C/C++/FORTRAN under the hood and are thus extremely high-performance. For functionality they don't implement, Python's C/C++ FFIs (e.g. the CPython C API via Python.h, NumPy's C integration, PyTorch/Boost C++ integration) are not perfect, but they are good enough that implementing the performance-critical portions in C/C++ is much easier than re-implementing entire ecosystems of packages in another language like Julia.
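(A minimal sketch of that FFI glue, calling the system C math library via ctypes on a Unix-y machine; real libraries like NumPy use the CPython C API or pybind11 rather than ctypes, so this is purely illustrative:)

```python
import ctypes
import ctypes.util

# Load the system C math library and call its cos() directly from Python.
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.cos.argtypes = [ctypes.c_double]  # declare the C signature
libm.cos.restype = ctypes.c_double

print(libm.cos(0.0))  # 1.0, computed entirely in C
```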

Hizonner No.45139364
This guy is worried about GPU kernels, which are never, ever written in Python. As you point out, Python is a glue language for ML.

> There is no single language with a rich enough ecosystem that can provide literally all of the aforementioned functionality besides Python.

That may be true, but some of us are still bitter that all that grew up around an at-least-averagely-annoying language rather than something nicer.

ModernMech No.45140625
> This guy is worried about GPU kernels, which are never, ever written in Python. As you point out, Python is a glue language for ML.

That's kind of the point of Mojo: they're trying to solve the so-called "two-language problem" in this space. Why should you need two languages to write your glue code and kernel code? Why can't there be a language that is as easy to write as Python but can still express GPU kernels for ML applications? That's what Mojo is trying to be, through clever use of LLVM MLIR.

bjourne No.45145290
> Why can't there be a language that is as easy to write as Python but can still express GPU kernels for ML applications? That's what Mojo is trying to be, through clever use of LLVM MLIR.

It already exists. It is called PyTorch/JAX/TensorFlow. These frameworks already contain sophisticated compilers for turning computational graphs into optimized GPU code. I dare say that they don't leave enough performance on the table for a completely new language to be viable.
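For instance, a few lines of JAX are enough to have XLA trace a whole expression and emit fused, optimized device code (a minimal sketch, assuming a working JAX install; the function name is just illustrative):

```python
import jax
import jax.numpy as jnp

@jax.jit  # XLA traces the computational graph and compiles it to fused device code
def scaled_softmax(x, temperature):
    z = x / temperature
    return jnp.exp(z) / jnp.sum(jnp.exp(z), axis=-1, keepdims=True)

y = scaled_softmax(jnp.ones((8, 128)), 0.5)  # first call compiles; later calls reuse it
```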

davidatbu No.45147094
Last I checked, all of PyTorch, TensorFlow, and JAX sit at a layer of abstraction above GPU kernels. They expose GPU kernels (basically as nodes in the computational graph you mention), but they don't let you write GPU kernels.

Triton, CUDA, etc., let one write GPU kernels.
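(For a sense of the difference, here is roughly the canonical vector-add kernel from the Triton tutorial; note you're writing per-block loads and stores, not graph nodes. A sketch only, assuming a CUDA machine with Triton installed:)

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)                  # which block this instance handles
    offsets = pid * BLOCK + tl.arange(0, BLOCK)  # element indices for this block
    mask = offsets < n_elements                  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(4096, 1024),)](x, y, out, 4096, BLOCK=1024)
```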

boroboro4 No.45148632
torch.compile sits at both the computation-graph level and the GPU-kernel level, and can fuse your operations by using the Triton compiler. I think something similar applies to JAX and TensorFlow by way of XLA, but I'm not 100% sure.
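A sketch of what I mean (with a CUDA build of PyTorch 2.x, the three elementwise ops below would typically get fused into a single Triton kernel; the function name is just illustrative):

```python
import torch

def gelu_ish(x):
    # Three elementwise ops that torch.compile can fuse into one GPU kernel
    return 0.5 * x * (1.0 + torch.tanh(x))

compiled = torch.compile(gelu_ish)  # lowers to Triton on CUDA backends
x = torch.randn(1024, 1024, device="cuda" if torch.cuda.is_available() else "cpu")
y = compiled(x)  # first call triggers compilation
```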
davidatbu No.45148807
Good point. But the overall point about Mojo offering a different level of abstraction than Python still stands: I imagine that no amount of magic/operator fusion/etc. in `torch.compile()` would let one get reasonable performance for an implementation of, say, flash-attn. One would have to use CUDA/Triton/Mojo/etc.
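(To make that concrete: a naive attention like the sketch below materializes the full seq-by-seq score matrix in GPU memory, and flash-attn's whole trick is kernel-level tiling so that matrix never exists; that restructuring happens below the level at which operator fusion works. Illustrative code, not any library's actual implementation:)

```python
import torch

def naive_attention(q, k, v):
    # Materializes an O(seq^2) score matrix; FlashAttention's kernel-level
    # tiling avoids ever allocating it, which fusion alone can't derive.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v
```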
boroboro4 No.45149868
But Python is already operating at both levels of abstraction: you mention Triton yourself, and there is a new Python CUDA API too (one similar to Triton). More to the point, FlashAttention 4 is actually written in Python.

Somehow Python has managed to be both a high-level and a low-level language for GPUs…

davidatbu No.45150630
IIUC, Triton uses Python syntax, but it has a separate compiler (which is kind of what Mojo is doing, except Mojo's syntax is a superset of Python's rather than a subset, like Triton's). I think it's fair to describe it as a different language (otherwise we'd also have to describe Mojo as "Python"). Triton's website and repo describe it as "the Triton language and compiler" (as opposed to, I dunno, "write GPU kernels in Python").

Also, FlashAttention is at v3-beta right now? [0] And it requires one of CUDA/Triton/ROCm?

[0] https://github.com/Dao-AILab/flash-attention

But maybe I'm out of the loop? Where do you see that FlashAttention 4 is written in Python?

boroboro4 No.45150751
From this perspective, PyTorch is a separate language too, at least once you start using torch.compile (only a subset of PyTorch Python is compilable). That's a strength of Python: it's great for describing things and later analyzing them (and compiling them, for example).
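(A sketch of the "subset" point: data-dependent Python control flow like the branch below causes a graph break in torch.compile; with fullgraph=True it refuses to compile outright. Illustrative function, assuming PyTorch 2.x:)

```python
import torch

def f(x):
    if x.sum() > 0:   # data-dependent Python branch: can't live inside one traced graph
        return x * 2
    return x - 1

compiled = torch.compile(f, fullgraph=True)  # raises on the graph break at first call
```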

Just to be clear here: you use Triton from plain Python; it runs the compilation internally.

Just like I’m pretty sure not all mojo can be used to write kernels? I might be wrong here, but it would be very hard to fit general purpose code into kernels (and to be frank pointless, constrains bring speed).

As for FlashAttention, there was a leak: https://www.reddit.com/r/LocalLLaMA/comments/1mt9htu/flashat...