
365 points | tanelpoder | 3 comments
alkh ◴[] No.44978054[source]
I enjoy using uv a lot, but I'm getting afraid that it is becoming bloated for no reason. For example, the number of niche flags that many subcommands support is very high, and some of them seemingly achieve the same result (uv run --no-project and uv run --active). I'd rather they worked on improving existing tools and documentation than add new (redundant) functionality.
replies(4): >>44978481 #>>44978543 #>>44981333 #>>44986265 #
benreesman ◴[] No.44978543[source]
It's really difficult to do Python projects in a sound, reproducible, reasonably portable way. In general, uv sync can only build you a package set that it can promise to build again.

But it can't in general build torch-tensorrt or flash-attn because it has no way of knowing if Mercury was in retrograde when you ran pip. They are trying to thread a delicate and economically pivotal needle: the Python community prizes privatizing the profits and socializing the costs of "works on my box".

The cost of making the software deployable, secure, repeatable, reliable didn't go away! It just became someone else's problem at a later time in a different place with far fewer degrees of freedom.

Doing this in a way that satisfies serious operations people without alienating the "works on my box...sometimes" crowd is The Lord's Work.

replies(1): >>44980418 #
kouteiheika ◴[] No.44980418[source]
> But it can't in general build torch-tensorrt or flash-attn because it has no way of knowing if Mercury was in retrograde when you ran pip.

This is a self-inflicted wound, since Flash Attention insists on building a native C++ extension, which is completely unnecessary in this case.

What you can do is the following:

1) Compile your CUDA kernels offline.

2) Include those compiled kernels in a package you push to PyPI.

3) Call into the kernels from pure Python, without going through a C++ extension.

I do this for the CUDA kernels I maintain and it works great.
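For what it's worth, a minimal sketch of step 3 could look something like the following. The loader, package layout, and kernel name are my assumptions, not necessarily what the parent does; I'm using CuPy's RawModule just to keep the example short (with torch tensors you'd more likely go through the CUDA driver API or cuda-python):

    # Sketch of step 3 (names are illustrative, not from the flash-attn repo):
    # the wheel ships a cubin compiled offline with nvcc (steps 1 and 2), and
    # we launch it from pure Python -- no C++ extension, no coupling to a
    # specific Python/PyTorch ABI.
    import os

    import cupy as cp

    # Precompiled kernel binary shipped alongside this module inside the wheel
    # (assumes a regular, non-zipped install).
    _CUBIN = os.path.join(os.path.dirname(__file__), "scale_add.cubin")

    _module = cp.RawModule(path=_CUBIN)          # load the binary; nothing is compiled here
    _kernel = _module.get_function("scale_add")  # hypothetical kernel name

    def scale_add(x: cp.ndarray, y: cp.ndarray, alpha: float) -> cp.ndarray:
        """Compute alpha * x + y using the precompiled kernel."""
        out = cp.empty_like(x)
        n = x.size
        threads = 256
        blocks = (n + threads - 1) // threads
        # RawKernel launch signature: (grid, block, args)
        _kernel((blocks,), (threads,), (x, y, out, cp.float32(alpha), cp.int32(n)))
        return out

Since nothing in a package like this links against libtorch or the CPython C API, the wheel itself can stay pure Python and be reused unchanged across interpreter and PyTorch versions.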

Flash Attention currently publishes 48 (!) different packages[1] for different combinations of PyTorch version and C++ ABI. With this approach it would only have to publish one, and it would work for every combination of Python and PyTorch.

[1] - https://github.com/Dao-AILab/flash-attention/releases/tag/v2...

replies(1): >>44981179 #
twothreeone ◴[] No.44981179[source]
While shipping binary kernels may be a workaround for some users, it goes against what many people would consider "good etiquette" for various valid reasons, such as hackability, security, or providing free (as in liberty) software.
replies(2): >>44982899 #>>44983619 #
kouteiheika ◴[] No.44983619[source]
It's not a workaround; it's the sanest way of shipping such software. As long as the builds are reproducible, there's nothing wrong with shipping binaries by default, especially when those binaries require non-trivial dependencies (the whole CUDA toolchain) to build.

There's a reason why even among the most diehard Linux users very few run Gentoo and compile their whole system from scratch.

replies(1): >>44983973 #
1. benreesman ◴[] No.44983973{3}[source]
I agree with you that binary distribution is a perfectly reasonable adjunct to source distribution and sometimes even the more sensible one (toolchain size, etc).

In this instance the build is way nastier than building the NVIDIA toolchain (which Nix can do with a single line of configuration in most cases), and the binary artifacts are almost as broken as the source artifact because of NVIDIA tensor core generation shenanigans.

The real answer here is to fucking fork flash-attn and fix it. And it's on my list, but I'm working my way down the major C++ packages that all that stuff links to first. `libmodern-cpp` should be ready for GitHub in two or three months. `hypermodern-ai` is still mostly a domain name and some scripts, but they're the scripts I use in production, so it's coming.

replies(1): >>44986051 #
2. kouteiheika ◴[] No.44986051[source]
I thought about fixing Flash Attention too, so that I don't have to recompile it every time I update Python or PyTorch (it's the only special-snowflake dependency that I need to handle manually), but at the end of the day it's not enough of a pain to justify the time investment.

If I'm going to invest time here, then I'd rather just write my own attention kernels and also do things which Flash Attention currently doesn't (8-bit and 4-bit attention variants similar to Sage Attention), and focus on supporting/optimizing primarily for GeForce and RTX Pro GPUs instead of datacenter GPUs, which are unobtainium for normal people.

replies(1): >>44986168 #
3. benreesman ◴[] No.44986168[source]
I usually think the same way, and I bet a lot of people do, which is why it's still broken. But I've finally decided it's never going away completely, and it's time to just fix it.