https://cuda.juliagpu.org/stable/tutorials/introduction/#Wri...
With KernelAbstractions.jl you can actually target CUDA and ROCm:
https://juliagpu.github.io/KernelAbstractions.jl/stable/kern...
For Python (or rather Python-like code), there is also Triton (and probably others):
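To make the "Python-like" part concrete, a minimal Triton vector-add kernel looks roughly like this. It's a sketch in the style of the official tutorial; the 1024-element block size is just an arbitrary choice:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program instance handles one contiguous block of the vectors.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements          # guard against the ragged last block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x, y):
        # x, y: CUDA tensors of the same shape
        out = torch.empty_like(x)
        n = out.numel()
        grid = (triton.cdiv(n, 1024),)       # one program per 1024-element block
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out

You write what looks like Python, but the body of the @triton.jit function is really the Triton DSL being compiled to GPU code.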
Mojo is effectively an internal tool that Modular have released publicly.
I'd be surprised to see any serious adoption until a 1.0 state is reached.
But as the other commenter said, it's not really competing with PyTorch, it's competing with CUDA.
Although I have my doubts that Julia is actually willing to make the compromises that would allow it to go that low level, i.e. semantic guarantees about allocations and inference, guarantees about certain optimizations, and more.
First of all, some people really like Julia; regardless of how it gets discussed on HN, its commercial use has been steadily growing, and it has GPGPU support.
On the other hand, regardless of the sorry state of JIT compilers on the CPU side for Python, at least Nvidia and Intel are quite serious about Python DSLs for GPGPU programming on CUDA and oneAPI, so one gets close enough to C++ performance while staying in Python.
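Numba's CUDA target is one long-standing example of that style (not necessarily the specific DSLs meant above): the kernel is plain Python syntax, JIT-compiled to PTX. A minimal sketch, assuming a CUDA-capable GPU and the numba package:

    import numpy as np
    from numba import cuda

    @cuda.jit
    def saxpy(a, x, y, out):
        # cuda.grid(1) is this thread's global index in a 1-D launch.
        i = cuda.grid(1)
        if i < out.size:
            out[i] = a * x[i] + y[i]

    n = 1 << 20
    x = np.random.rand(n).astype(np.float32)
    y = np.random.rand(n).astype(np.float32)
    out = np.zeros_like(x)

    threads = 256
    blocks = (n + threads - 1) // threads
    # Host arrays are copied to the GPU and back automatically here.
    saxpy[blocks, threads](np.float32(2.0), x, y, out)

It's not hand-tuned CUDA C++, but for a lot of workloads it gets close enough without leaving Python.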
So Mojo isn't that appealing in the end.
1. Easy packaging into one executable. Then, making sure that can be reproducible across versions. Getting code from prior AI papers to run can be hard.
2. Predictability vs. the Python runtime. Think concurrent, low-latency GCs or low/zero-overhead abstractions.
3. Metaprogramming. There have been macro proposals for Python. Mojo could borrow from D or Rust here.
4. Extensibility in a way where extensions don't get too tied into the internal state of Mojo like they do with Python. I've considered Python-to-C++, Python-to-Rust, or parallelized-Python schemes many times. The extension interplay is harder to deal with than either Python or C++ itself.
5. Write once, run anywhere, to effortlessly move code across different accelerators. Several frameworks are doing this (see the sketch after this list).
6. Heterogeneous, hot-swappable, vendor-neutral acceleration. That's what I'm calling it when you can use the same code in a cluster with a combination of Nvidia GPUs, AMD GPUs, Gaudi3s, NPUs, SIMD chips, etc.
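On point 5, JAX is one framework that already works this way: the same jitted function is compiled through XLA to whichever backend is installed (CPU, CUDA GPU, TPU), so the user code never names the accelerator. A purely illustrative sketch:

    import jax
    import jax.numpy as jnp

    @jax.jit
    def step(w, x, y, lr=0.1):
        # One gradient-descent step on a least-squares loss; the same code
        # runs on CPU, GPU, or TPU depending on which jax backend is installed.
        grad = jax.grad(lambda w: jnp.mean((x @ w - y) ** 2))(w)
        return w - lr * grad

    key = jax.random.PRNGKey(0)
    x = jax.random.normal(key, (128, 4))
    w_true = jnp.array([1.0, -2.0, 0.5, 3.0])
    y = x @ w_true
    w = jnp.zeros(4)
    for _ in range(200):
        w = step(w, x, y)
    print(jax.devices(), w)   # which backend actually ran it, and the fitted weights

Point 6 (mixing vendors inside one cluster and hot-swapping them) is the part none of these frameworks really deliver yet.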
As per the roadmap[1], I expect to start seeing more adoption once phase 1 is completed.
If you're interested, they think the language will be ready for open source after completing phase 1 of the roadmap[2].
Most people that know this kind of thing don't get much value out of using a high level language to do it, and it's a huge risk because if the language fails to generate something that you want, you're stuck until a compiler team fixes and ships a patch which could take weeks or months. Even extremely fast bug fixes are still extremely slow on the timescales people want to work on.
I've spent a lot of my career trying to make high level languages for performance work well, and I've basically decided that the sweet spot for me is C++ templates: I can get the compiler to generate a lot of good code concisely, and when it fails the escape hatch of just writing some architecture specific intrinsics is right there whenever it is needed.
Got any sources on that? I've been interested in learning Julia for a while but haven't, because it feels useless compared to Python, especially now with 3.13.
C++ just seems like a safer bet but I'd love something better and more ergonomic.
Optimizing Julia is much harder than optimizing Fortran or C.
https://info.juliahub.com/industries/case-studies-1/author/j...