311 points melodyogonna | 71 comments
1. MontyCarloHall ◴[] No.45138920[source]
The reason why Python dominates is that modern ML applications don't exist in a vacuum. They aren't the standalone C/FORTRAN/MATLAB scripts of yore that load in some simple, homogeneous data, crunch some numbers, and spit out a single result. Rather, they are complex applications with functionality extending far beyond the number crunching, which requires a robust preexisting software ecosystem.

For example, a modern ML application might need an ETL pipeline to load and harmonize data of various types (text, images, video, etc., all in different formats) from various sources (local filesystem, cloud storage, HTTP, etc.) The actual computation then must leverage many different high-level functionalities, e.g. signal/image processing, optimization, statistics, etc. All of this computation might be too big for one machine, and so the application must dispatch jobs to a compute cluster or cloud. Finally, the end results might require sophisticated visualization and organization, with a GUI and database.

There is no single language with a rich enough ecosystem that can provide literally all of the aforementioned functionality besides Python. Python's numerical computing libraries (NumPy/PyTorch/JAX etc.) all call out to C/C++/FORTRAN under the hood and are thus extremely high-performance, and for functionality they don't implement, Python's C/C++ FFIs (e.g. Python.h, NumPy C integration, PyTorch/Boost C++ integration) are not perfect, but are good enough that implementing the performance-critical portions of code in C/C++ is much easier compared to re-implementing entire ecosystems of packages in another language like Julia.
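
To make the FFI point concrete, here is a minimal sketch of the kind of glue involved, assuming a hypothetical one-function C library compiled to libsaxpy.so (names are illustrative only, not from any real project):

    # saxpy.c exposes: void saxpy(int n, float a, const float *x, float *y);  /* y[i] += a*x[i] */
    # built with: cc -O2 -shared -fPIC saxpy.c -o libsaxpy.so
    import ctypes
    import numpy as np

    lib = ctypes.CDLL("./libsaxpy.so")
    lib.saxpy.argtypes = [ctypes.c_int, ctypes.c_float,
                          np.ctypeslib.ndpointer(dtype=np.float32),
                          np.ctypeslib.ndpointer(dtype=np.float32)]
    lib.saxpy.restype = None

    x = np.arange(4, dtype=np.float32)
    y = np.ones(4, dtype=np.float32)
    lib.saxpy(len(x), 2.0, x, y)  # y <- 2*x + y, computed in the C library
    print(y)                      # [1. 3. 5. 7.]

The ten lines of ctypes are the easy part; the point is that everything around them (ETL, plotting, serving) is already a pip install away.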

replies(8): >>45139364 #>>45140601 #>>45141802 #>>45143317 #>>45144664 #>>45146179 #>>45146608 #>>45146905 #
2. Hizonner ◴[] No.45139364[source]
This guy is worried about GPU kernels, which are never, ever written in Python. As you point out, Python is a glue language for ML.

> There is no single language with a rich enough ecosystem that can provide literally all of the aforementioned functionality besides Python.

That may be true, but some of us are still bitter that all that grew up around an at-least-averagely-annoying language rather than something nicer.

replies(5): >>45139454 #>>45140625 #>>45141909 #>>45142782 #>>45147478 #
3. MontyCarloHall ◴[] No.45139454[source]
>This guy is worried about GPU kernels

Then the title should be "why GPU kernel programming needs a new programming language." I can get behind that; I've written CUDA C and it was not fun (though this was over a decade ago and things may have since improved, not to mention that the code I wrote then could today be replaced by a couple lines of PyTorch). That said, GPU kernel programming is fairly niche: for the vast majority of ML applications, the high-level API functions in PyTorch/TensorFlow/JAX/etc. provide optimal GPU performance. It's pretty rare that one would need to implement custom kernels.

>which are never, ever written in Python.

Not true! Triton is a Python API for writing kernels, which are JIT compiled.
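
For a sense of what that looks like, here is a minimal element-wise kernel in the style of Triton's introductory examples (a sketch; exact APIs can drift between Triton versions):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
        pid = tl.program_id(axis=0)               # which program instance am I
        offs = pid * BLOCK + tl.arange(0, BLOCK)  # indices this instance handles
        mask = offs < n                           # guard the ragged tail
        x = tl.load(x_ptr + offs, mask=mask)
        y = tl.load(y_ptr + offs, mask=mask)
        tl.store(out_ptr + offs, x + y, mask=mask)

    x = torch.rand(4096, device="cuda")
    y = torch.rand(4096, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)

You write Python, the @triton.jit decorator compiles it down to PTX, and you launch it from ordinary PyTorch code.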

replies(3): >>45140116 #>>45143238 #>>45147017 #
4. catgary ◴[] No.45140116{3}[source]
I agree with you that writing kernels isn’t necessarily the most important thing for most ML devs. I think an MLIR-first workflow with robust support for the StableHLO and LinAlg dialects is the best path forward for ML/array programming, so on one hand I do applaud what Mojo is doing.

But I’m much more interested in how MLIR opens the door to “JAX in <x>”. I think Julia is moving in that direction with Reactant.jl, and I think there’s a Rust project doing something similar (I think burn.dev may be using ONNX as an even higher-level IR). In my ideal world, I would be able to write an ML model and training loop in some highly verified language and call it from Python/Rust for training.

5. benzible ◴[] No.45140601[source]
Python's ecosystem is hard to beat, but Elixir/Nx already does a lot of what Mojo promises. EXLA gives you GPU/TPU compilation through XLA with similar performance to Mojo's demos, Explorer handles dataframes via Polars, and now Pythonx lets you embed Python when you need those specialized libraries.

The real difference is that Elixir was built for distributed systems from day one. OTP/BEAM gives you the ability to handle millions of concurrent requests as well as to coordinate across GPU nodes. If you're building actual ML services (not just optimizing kernels), having everything from Phoenix/LiveView to Nx in one stack built for extreme fault tolerance might matter more than getting the last bit of performance out of your hardware.

replies(2): >>45140671 #>>45148164 #
6. ModernMech ◴[] No.45140625[source]
> This guy is worried about GPU kernels, which are never, ever written in Python. As you point out, Python is a glue language for ML.

That's kind of the point of Mojo, they're trying to solve the so-called "two language problem" in this space. Why should you need two languages to write your glue code and kernel code? Why can't there be a language which is both as easy to write as Python, but can still express GPU kernels for ML applications? That's what Mojo is trying to be through clever use of LLVM MLIR.

replies(5): >>45141705 #>>45143663 #>>45144593 #>>45145100 #>>45145290 #
7. melodyogonna ◴[] No.45140671[source]
Who uses this Exla in production?
replies(1): >>45144825 #
8. nostrademons ◴[] No.45141705{3}[source]
It's interesting, people have been trying to solve the "two language problem" since before I started professionally programming 25 years ago, and in that time period two-language solutions have just gotten even more common. Back in the 90s they were usually spoken about only in reference to games and shell programming; now the pattern of "scripting language calls out to highly-optimized C or CUDA for compute-intensive tasks" is common for webapps, ML, cryptocurrency, drones, embedded, robotics, etc.

I think this is because many, many problem domains have a structure that lends themselves well to two-language solutions. They have a small homogenous computation structure on lots of data that needs to run extremely fast. And they also have a lot of configuration and data-munging that is basically quick one-time setup but has to be specified somewhere, and the more concisely you can specify it, the less human time development takes. The requirements on a language designed to run extremely fast are going to be very different from one that is designed to be as flexible and easy to write as possible. You usually achieve quick execution by eschewing flexibility and picking a programming model that is fairly close to the machine model, but you achieve flexibility by having lots of convenience features built into the language, most of which will have some cost in memory or indirections.

There've been a number of attempts at "one language to rule them all", notably PL/1, C++, Julia (in the mathematical programming subdomain), and Common Lisp, but it often feels like the "flexible" subset is shoehorned in to fit the need for zero-cost abstractions, and/or the "compute-optimized" subset is almost a whole separate language that is bolted on with similar but more verbose syntax.

replies(3): >>45142735 #>>45144777 #>>45147626 #
9. goatlover ◴[] No.45141802[source]
> There is no single language with a rich enough ecosystem that can provide literally all of the aforementioned functionality besides Python.

Have a hard time believing C++ and Java don't have rich enough ecosystems. Not saying they make for good glue languages, but everything was being written in those languages before Python became this popular.

replies(2): >>45142107 #>>45144959 #
10. jimbokun ◴[] No.45141909[source]
> That may be true, but some of us are still bitter that all that grew up around an at-least-averagely-annoying language rather than something nicer.

Don't worry. If you stick around this industry long enough you'll see this happen several more times.

replies(1): >>45142238 #
11. j2kun ◴[] No.45142107[source]
Yeah the OP here listed a bunch of Python stuff that all ends up shelling out to C++. C++ is rich enough, period, but people find it unpleasant to work in (which I agree with).

It's not about "richness," it's about providing a language ecosystem for people who don't really want to do the messy, low-level parts of software, one that can encapsulate the performance-critical parts with easy glue.

replies(2): >>45143014 #>>45145614 #
12. Hizonner ◴[] No.45142238{3}[source]
I'm basically retired. But I'm still bitter about each of the times...
replies(1): >>45143106 #
13. soVeryTired ◴[] No.45142735{4}[source]
There's a very interesting video about the "1.5 language problem" in Julia [0]. The point being that when you write high-performance Julia it ends up looking nothing like "standard" Julia.

It seems like it's just extremely difficult to give fine-grained control over the metal while having an easy, ergonomic language that lets you just get on with your tasks.

[0] https://www.youtube.com/watch?v=RUJFd-rEa0k

14. almostgotcaught ◴[] No.45142782[source]
> which are never, ever written in Python

nah never ever ever ever ever ... except

https://github.com/FlagOpen/FlagGems

https://github.com/linkedin/Liger-Kernel

https://github.com/meta-pytorch/applied-ai

https://github.com/gpu-mode/triton-index

https://github.com/AlibabaPAI/FLASHNN

https://github.com/unslothai/unsloth

the number of people commenting on this stuff that don't know what they're actually talking about grows by leaps and bounds every day...

replies(1): >>45143363 #
15. FuckButtons ◴[] No.45143014{3}[source]
I mean, you’ve basically described why people use Python: it’s a way to use C/C++ without having to write it.
replies(1): >>45143132 #
16. anakaine ◴[] No.45143106{4}[source]
I'm not.

Move on with life and be happy. What we have is functional, easy to develop, and well supported.

replies(1): >>45143241 #
17. anakaine ◴[] No.45143132{4}[source]
And I'll take that reason every single day. I could spend days or more working out particular issues in C++, or I could use a much nicer glue language with a great ecosystem and a huge community driving it and get the same task done in minutes to hours.
18. pama ◴[] No.45143238{3}[source]
Although they are incredibly useful, ML applications, MLOps, ML DevOps, or whatever other production-related tech stack terminology may come to mind provide critical scaffolding infrastructure and glue, but they are not strictly what comes to mind when you only use the term “machine learning”. The key to machine learning is the massively parallel ability to efficiently train large neural network models, and the key to using the benefits of these trained models is the ability to rapidly evaluate them.

Yes, you need to get data for the training and you need complex infrastructure for the applications, but the conferences, papers, and studies on machine learning don't refer to these (other than in passing), in part because they are solvable, and in part because they are largely orthogonal to ML. Another way to think about it is that it is the Nvidia price that goes to infinity in this recent run, not that of the database or disk drive providers.

So if someone finds a way to make the core ML part better with a programming language solution, that is certainly very welcome, and the title is appropriate. (The fact that GPU programming is considered niche in the current state of ML is a strong argument for keeping the title as is.)
19. Hizonner ◴[] No.45143241{5}[source]
My bitterness is the only thing that keeps me happy.
20. halayli ◴[] No.45143317[source]
Fortunately, Chris knows what he's doing. https://docs.modular.com/mojo/manual/python/
21. Hizonner ◴[] No.45143363{3}[source]
I stand corrected. I should have known people would be doing that in Python.

How many of the world's total FLOPs are going through those?

replies(2): >>45144301 #>>45147790 #
22. bobajeff ◴[] No.45143663{3}[source]
I don't think Mojo can solve the two language problem. Maybe if it was going to be a superset of Python? Anyway, I think that was actually Julia's goal, not Mojo's.
replies(1): >>45147057 #
23. almostgotcaught ◴[] No.45144301{4}[source]
Triton is a backend for PyTorch. Lately it is the backend. So it's definitely a double-digit percentage, if not over 50%.
replies(2): >>45145514 #>>45147142 #
24. ◴[] No.45144593{3}[source]
25. benreesman ◴[] No.45144664[source]
I'm in kind of a different place with it on the inference side.

I've got these crazy tuned up CUDA kernels that are relatively straightforward to build in isolation and really where all the magic happens, and there's this new CUTLASS 3 stuff and modern C++ can call it all trivially.

And then there's this increasingly thin film of torch crap that's just this side of unbuildable and drags in this reference counting and broken setup.py and it's a bunch of up and down projections to the real hacker shit.

I'm thinking I'm about one squeeze of the toothpaste tube from just squeezing that junk out and having a nice, clean, well-groomed C++ program that can link anything and link into anything.

replies(1): >>45146909 #
26. Karrot_Kream ◴[] No.45144777{4}[source]
From what I can tell, gaming has mostly just embraced two language solutions. The big engines Unity, Unreal, and Godot have tight cores written in C/C++ and then scripting languages that are written on top. Hobby engines like Love2D often also have a tight, small core and are extensible with languages like Lua or Fennel.

Modern Common Lisp also seems to have given up its "one language to rule them all" mindset and is pretty okay with just dropping into CFFI to call into C libraries as needed. Over the years I've come to see that mindset as mostly a dead-end. Python, web browsers, game engines, emacs, these are all prominent living examples of two-language solutions that have come to dominate in their problem spaces.

One aspect of the "two language problem" that I find troubling though is that modern environments often ossify around the exact solution. For example, it's very difficult to have something like PyTorch in say Common Lisp even though libcuda and libdnn should be fairly straightforward to wrap in Common Lisp (see [1] for Common Lisp CUDA bindings.) JS/TS/WASM that runs in the browser often is dependent on Chrome's behavior. Emacs continues to be tied to its ancient, tech-debt ridden C runtime. There seems to be a lot of value tied into the glue between the two chosen languages and it's hard to recreate that value with other HLLs even if the "metal" language/runtime stays the same.

[1]: https://github.com/takagi/cl-cuda

replies(1): >>45144991 #
27. benzible ◴[] No.45144825{3}[source]
These guys, for one: https://www.amplified.ai

See: https://www.youtube.com/watch?v=5FlZHkc4Mq4

28. flourpower471 ◴[] No.45144959[source]
Ever tried to write a web scraper in c++?
29. nostrademons ◴[] No.45144991{5}[source]
This may be because while the computational core is small, much of the code and the value of the overall solution are actually in the HLL. That's the reason for the use of a HLL in the first place.

PyTorch is actually quite illustrative as a counterexample that proves the rule. It was based on Torch, which had very similar if not identical BLAS routines but used Lua as the scripting language. But now everybody uses PyTorch because development of the Lua version stopped in 2017, so all the extra goodies that people rely on now are in the Python wrapper.

The only exception seems to be when multiple scripting languages are supported, and at roughly equal points of development. So for example - SQLite continues to have most of its value in the C substrate, and is relatively easy to port to other languages, because it has so many language bindings that there's a strong incentive to write new functionality in C and keep the API simple. Ditto client libraries for things like MySQL, PostGres, MongoDB, Redis, etc. ZeroMQ has a bunch of bindings that are largely dumb passthroughs to the underlying C++ substrate.

But even a small imbalance can lead to that one language being preferenced heavily in supporting tooling and documentation. Pola.rs is a Rust substrate and ships with bindings for Python, R, and Node.js, but all the examples on the website are in Python or Rust, and I rarely hear of a non-Python user picking it up.

replies(1): >>45145207 #
30. adsharma ◴[] No.45145100{3}[source]
Python -> Mojo -> MLIR ... <target hardware>

Yes, you can write mojo with python syntax and transpile. You'd end up with something similar to Julia's 1.5 language problem.

Since the mojo language is not fully specified, it's hard to understand what language constructs can't be efficiently expressed in the python syntax.

Love MLIR and Mojo as debuggable/performant intermediate languages.

31. Karrot_Kream ◴[] No.45145207{6}[source]
Very interesting observation on SQLite vs Pola.rs. Also, how could I forget that Torch was originally a Lua library when I used it forever ago.

I also wonder how much of the ossification comes from the embodied logic in the HLL. SQLite wrappers tend to be very simple and let the C core do most of the work. Something like PyTorch, on the other hand, layers a lot of logic onto the underlying CUDA/BLAS, essential complexity that lives solely in Python, the HLL. This is also probably why libcurl has so many great wrappers in HLLs: libcurl does the heavy lifting.

The pain point I see repeatedly in putting most of the logic into the performant core is asynchrony. Every HLL seems to have its own way to do async execution (Python with asyncio, Node with its async runtime, Go with lightweight green threads (goroutines), Common Lisp with native threads, etc.) This means that the C core needs to be careful as to what to expose and how to accommodate various asynchrony patterns.
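
Concretely, on the Python side that usually means pushing the blocking call into the C core onto a worker thread so asyncio's event loop stays responsive. A minimal sketch (using libc's usleep as a stand-in for any blocking C call; POSIX only):

    import asyncio
    import ctypes

    libc = ctypes.CDLL(None)  # handle to the already-loaded C library (POSIX)

    async def main():
        loop = asyncio.get_running_loop()
        # a blocking C call; run it in a thread so the event loop keeps serving other tasks
        await loop.run_in_executor(None, libc.usleep, 100_000)

    asyncio.run(main())

Other runtimes (goroutines, BEAM schedulers, etc.) each need their own version of this dance, which is exactly the coupling being described.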

32. bjourne ◴[] No.45145290{3}[source]
> Why can't there be a language which is both as easy to write as Python, but can still express GPU kernels for ML applications? That's what Mojo is trying to be through clever use of LLVM MLIR.

It already exists. It is called PyTorch/JAX/TensorFlow. These frameworks already contain sophisticated compilers for turning computational graphs into optimized GPU code. I dare say that they don't leave enough performance on the table for a completely new language to be viable.
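
As a rough illustration (a sketch; assumes a working JAX install, GPU optional), jax.jit traces the Python function into a computational graph and hands it to XLA, which fuses and compiles it:

    import jax
    import jax.numpy as jnp

    def gelu(x):
        # a chain of elementwise ops; XLA fuses these instead of launching one kernel per op
        return 0.5 * x * (1.0 + jnp.tanh(0.7978845608 * (x + 0.044715 * x**3)))

    gelu_fast = jax.jit(gelu)   # trace + compile via XLA
    x = jnp.ones((1024, 1024))
    y = gelu_fast(x)            # first call compiles; later calls reuse the compiled executable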

replies(2): >>45147094 #>>45147780 #
33. lairv ◴[] No.45145514{5}[source]
It's the backend for torch.compile; PyTorch eager mode will still use cuBLAS/cuDNN/custom CUDA kernels. Not sure what the usage of torch.compile is.
replies(1): >>45145971 #
34. lairv ◴[] No.45145614{3}[source]
I tried to statically link DuckDB to one of my C++ project earlier this year and it took me 3 days to have something working on Windows/Linux/MacOS (just to be able to use the dependency)

While I'm not a C++ expert, doing the same in Python is just one pip install away, so yeah, both "richness" and "ease of use" of the ecosystem matter.

35. almostgotcaught ◴[] No.45145971{6}[source]
> not sure what's the usage of torch.compile

consider that at minimum both FB and OAI themselves definitely make heavy use of the Triton backend in PyTorch.

36. fellowmartian ◴[] No.45146179[source]
Ironically Python is the worst language for everything you’ve described. Packaging is pain, wheels are pain, everything breaks all the time. It’s only great for those standalone scripts. Nobody in their right mind would design Python the way it turned out if the goal was to be the main ML language.
37. nialv7 ◴[] No.45146608[source]
Your argument is circular. Python has all this ecosystem _because_ it has been the language of choice for ML for a decade. At this point it's difficult to beat, but that doesn't explain why it was chosen all those years ago.
replies(2): >>45146697 #>>45147882 #
38. chickenzzzzu ◴[] No.45146697[source]
Not only is their argument circular but it is wrong. There is no need to use 50 million lines of Python, Pytorch, Numpy, Linux, Cmake, CUDA, and god knows how many other layers of madness to do inference.

It is literally on the order of tens of thousands of lines of code, instead of tens of millions, to do Vulkan ML, especially if you strip out the parts of the kernel you don't need.

39. pjmlp ◴[] No.45146905[source]
I was there when Perl and Tcl were the main actors, that is why VTK used Tcl originally.

Python dominates because 25 years ago places like CERN started to adopt Python as their main scripting language, and it eventually got used for more tasks than empowered shell scripts.

It is like arguing why C dominates and nothing else can ever replace it.

replies(1): >>45147181 #
40. pjmlp ◴[] No.45146909[source]
CUTLASS 4 has first class support for Python.
replies(1): >>45147770 #
41. seanmcdirmid ◴[] No.45147017{3}[source]
There are a lot of tricks for writing GPU code in a high level language and using some sort of meta programming to make it work out (I think Conal Elliott first did this in Haskell, where he also does symbolic differentiation in the same paper!).
42. davidatbu ◴[] No.45147057{4}[source]
Being a Python superset is literally a goal of Mojo mentioned in the podcast.

Edit: from other posts on this page, I've realized that being a superset of Python is now regarded a nice-to-have by Modular, not a must-have. They realized it's harder than they thought initially, basically.

43. davidatbu ◴[] No.45147094{4}[source]
Last I checked, all of PyTorch, TensorFlow, and JAX sit at a layer of abstraction that is above GPU kernels. They avail GPU kernels (as basically nodes in the computational graph you mention), but they don't let you write GPU kernels.

Triton, CUDA, etc, let one write GPU kernels.

replies(2): >>45148597 #>>45148632 #
44. cavisne ◴[] No.45147142{5}[source]
Doesn’t Triton have its own intermediate language that then compiles to PTX?
replies(2): >>45147752 #>>45147783 #
45. cdavid ◴[] No.45147181[source]
I agree the ability to use Python to "script HPC" was a key factor, but by itself it would not have been enough. What really made it dominate is numpy/scipy/matplotlib becoming good enough to replace matlab 20 years ago, which enabled an explosion of tools on top of it: pandas, scikit-learn, and the DL stuff ofc.

This is what differentiates python from other "morally equivalent" scripting languages.

46. anothernewdude ◴[] No.45147478[source]
> rather than something nicer.

Python was the something nicer. A lot of the other options were so much worse.

47. imtringued ◴[] No.45147626{4}[source]
You say this is some ideal outcome, but I want to get as far away from python and C++ as possible.

Also, no. I can't use Python for inference, because it is too slow, so I have to export to tensorflow lite and run the model in C++, which essentially required me to rewrite half the code in C++ again.

48. almostgotcaught ◴[] No.45147752{6}[source]
Yes and?
49. saagarjha ◴[] No.45147770{3}[source]
In fact I doubt the C++ API will be getting much love moving forward
replies(1): >>45147855 #
50. saagarjha ◴[] No.45147780{4}[source]
There's plenty of performance on the table but I don't think it will be captured by a new language.
51. saagarjha ◴[] No.45147783{6}[source]
It has a fairly standard MLIR pipeline
52. saagarjha ◴[] No.45147790{4}[source]
A lot. OpenAI uses Triton for their critical kernels. Meta has torch.compile using it too. I know Anthropic is not using Triton but I think their homegrown compiler is also Python. xAI is using CUTLASS which is C++ but I wouldn't be surprised if they start using the Python API moving forward.
replies(1): >>45148453 #
53. pjmlp ◴[] No.45147855{4}[source]
At GTC 2025, Nvidia introduced two major changes in the CUDA ecosystem.

First-class support for Python JIT/DSLs across the whole ecosystem.

A change in how C++ is used and taught, focused more on standard C++ support and libraries than on low-level CUDA extensions.

So in a way, I think you're kind of right.

replies(1): >>45149470 #
54. 317070 ◴[] No.45147882[source]
I was there when it was chosen all those years ago.

At the time (2007-2009), Matlab was the application of choice for what would become "deep" learning research, though it had its warts and licensing issues. It was easy for students to get started with and to use, especially as a lot of them were not from computer science backgrounds, but from statistics, engineering, or neuroscience.

When autograd came (this was even before GPUs), people needed something more powerful than matlab, yet familiar. Numpy already existed, and python+numpy+matplotlib give you an environment and a language very similar to matlab. The biggest hurdle was that python is zero-indexed.

If things went slightly different, I reckon we might have ended up using Octave or lua. I reckon Octave was too restrictive and poorly documented for autograd. On the other hand, lua was too dissimilar to matlab. I think it was Theano, the first widely used python autograd, and then later PyTorch, that really sealed the deal for python.

replies(2): >>45148368 #>>45148960 #
55. devbug ◴[] No.45148164[source]
I recently built out training and inference at a FinTech (for fraud and risk) using Elixir and tried this very approach…

We’re now using Python for training and forking Ortex (ONNX) for inference.

The ecosystem just isn’t there, especially for training. It’s a little better for inference but still has significant gaps. I will eventually have time to push contributions upstream but Python has so much momentum behind it.

Livebooks are amazing though, and a better experience than anything Python offers, sans libraries.

56. nickpeterson ◴[] No.45148368{3}[source]
You were there 30 years ago, when the strength of men failed?
57. almostgotcaught ◴[] No.45148453{5}[source]
Anthropic is a Jax shop
replies(1): >>45151753 #
58. bjourne ◴[] No.45148597{5}[source]
Yes, they kinda do. The computational graph you specify is completely different from the execution schedule it is compiled into. Whether it's 1, 2, or N kernels is irrelevant as long as it runs fast. Mojo being an HLL is conceptually no different from Python. Whether it will, in the future, become better for DNNs, time will tell.
replies(1): >>45148792 #
59. boroboro4 ◴[] No.45148632{5}[source]
Torch.compile sits at both the level of computation graph and GPU kernels and can fuse your operations by using triton compiler. I think something similar applies to Jax and tensorflow by the way of XLA, but I’m not 100% sure.
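
A small sketch of what that looks like (assuming a CUDA build of PyTorch 2.x; the Inductor backend typically generates a Triton kernel for an elementwise chain like this):

    import torch

    def bias_gelu(x, bias):
        # add + gelu: eager mode runs these as separate kernels,
        # torch.compile can fuse them into one generated Triton kernel on CUDA
        return torch.nn.functional.gelu(x + bias)

    compiled = torch.compile(bias_gelu)
    x = torch.randn(4096, 4096, device="cuda")
    bias = torch.randn(4096, device="cuda")
    out = compiled(x, bias)  # first call traces the graph and triggers Triton codegen
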
replies(1): >>45148807 #
60. davidatbu ◴[] No.45148792{6}[source]
I assume HLL=Higher Level Language? Mojo definitely avails lower-level facilities than Python. Chris has even described Mojo as "syntactic sugar over MLIR". (For example, the native integer type is defined in library code as a struct).

> Whether it's 1, 2, or N kernels is irrelevant.

Not sure what you mean here. But new kernels are written all the time (flash-attn is a great example). One can't do that in plain Python. E.g., flash-attn was originally written in C++ CUDA, and now in Triton.

replies(1): >>45152799 #
61. davidatbu ◴[] No.45148807{6}[source]
Good point. But the overall point about Mojo availing a different level of abstraction as compared to Python still stands: I imagine that no amount of magic/operator-fusion/etc in `torch.compile()` would let one get reasonable performance for an implementation of, say, flash-attn. One would have to use CUDA/Triton/Mojo/etc.
replies(1): >>45149868 #
62. breuleux ◴[] No.45148960{3}[source]
We chose Python for Theano because Python was already the language of choice for our research lab. If it had been my choice, I would probably have picked Scheme (I was really into macros at that time) or Ruby (I think it's better designed than Python). But if we had done it in another language than Python, frankly, I'm not sure it would have taken off in the first place. Python already had quite a bit of inertia, likely thanks to numpy and matplotlib.
63. benreesman ◴[] No.45149470{5}[source]
Nah, their people are way involved in mdarray, and ROCm is looking to have the "oh no it's broken again" bit flipped off in the RDNA 4/5 cycle.

NVIDIA wants Python and C++ people, they want a new thing to moat up on, and they know it has to be legitimately good to defy economic gravity on chips a lot of companies can design and fab now.

replies(1): >>45149634 #
64. pjmlp ◴[] No.45149634{6}[source]
Intel and AMD don't have anyone but themselves to blame.

And Khronos is always expecting the community to do the work for its standards, filling in the missing pieces needed for actual excellence in developer experience.

replies(2): >>45150267 #>>45154804 #
65. boroboro4 ◴[] No.45149868{7}[source]
But Python is already operating fully on a different level of abstraction - you mention Triton yourself, and there is a new Python CUDA API too (the one similar to Triton). More to this - flash attention 4 is actually written in Python.

Somehow Python managed to be both a high-level and a low-level language for GPUs…

replies(1): >>45150630 #
66. benreesman ◴[] No.45150267{7}[source]
Well, I blame our legislators, our regulators, and a public that tolerates low-integrity, low-competence leadership. A society gets what it pays for with the standards it sets for its leaders.

Out of a great many outcomes, a very topical one today is semiconductors, and the outcomes rival any others for being corrupt, incompetent, and entirely consistent with our speed run down the road to irrelevance.

67. davidatbu ◴[] No.45150630{8}[source]
IIUC, triton uses Python syntax, but it has a separate compiler (which is kinda what Mojo is doing, except Mojo's syntax is a superset of Python's, instead of a subset, like Triton). I think it's fair to describe it as a different language (otherwise, we'd also have to describe Mojo also as "Python"). Triton's website and repo describes itself as "the Triton language and compiler" (as opposed to, I dunno, "Write GPU kernels in Python").

Also, flash attention is at v3-beta right now? [0] And it requires one of CUDA/Triton/ROCm?

[0] https://github.com/Dao-AILab/flash-attention

But maybe I'm out of the loop? Where do you see that flash attention 4 is written in Python?

replies(1): >>45150751 #
68. boroboro4 ◴[] No.45150751{9}[source]
From this perspective, PyTorch is a separate language, at least as soon as you start using torch.compile (only a subset of PyTorch Python will be compilable). That’s the strength of Python - it’s great for describing things and later for analyzing them (and compiling, for example).

Just to be clear here - you use Triton from plain Python; it runs compilation inside.

Just like I’m pretty sure not all Mojo can be used to write kernels? I might be wrong here, but it would be very hard to fit general-purpose code into kernels (and to be frank pointless, constraints bring speed).

As for flash attention there was a leak: https://www.reddit.com/r/LocalLLaMA/comments/1mt9htu/flashat...

69. saagarjha ◴[] No.45151753{6}[source]
Surprised they got it working for Trainium
70. bjourne ◴[] No.45152799{7}[source]
Well, Mojo hasn't been released so we can't precisely say what it can and can't do. If it can emit CUDA code then it does it by transpiling Mojo into CUDA. And there is no reason why Python can't also be transpiled to CUDA.

What I mean here is that DNN code is written on a much higher level than kernels. They are just building blocks you use to instantiate your dataflow.

71. bigyabai ◴[] No.45154804{7}[source]
You don't seem to know what Khronos even is, at this point. It is a nonprofit consortium; joining the group for the community effort is the only reason it exists.

By blaming Khronos you're really just reiterating blame on Intel and AMD with the tacit inclusion of Apple. I suppose you could blame Nvidia for not giving away their IP, but they were a staunch OpenCL supporter from the start.