268 points | dipampaul17 | 1 comment

I discovered that in LLM inference, keys and values in the KV cache have very different quantization sensitivities. Keys need higher precision than values to maintain quality.

I patched llama.cpp to enable different bit-widths for keys vs. values on Apple Silicon. The results are surprising:

- K8V4 (8-bit keys, 4-bit values): 59% memory reduction with only 0.86% perplexity loss
- K4V8 (4-bit keys, 8-bit values): 59% memory reduction but 6.06% perplexity loss
- The configurations use the same number of bits, but K8V4 is 7× better for quality
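The 59% figure and the 7× ratio can be checked with a little arithmetic. This is a minimal sketch, assuming the 8-bit and 4-bit cache types are ggml's `q8_0` and `q4_0` block formats (32 values plus one 16-bit scale per block) and that the baseline is an FP16 cache; the post does not say which types were used, so treat both as assumptions. The function names are made up for illustration.

```python
# Assumption: q8_0 / q4_0 style blocks, i.e. 32 quantized values sharing
# one 16-bit scale. The post does not confirm which cache types it used.
BLOCK_SIZE = 32
FP16_BITS = 16.0


def bits_per_element(value_bits, scale_bits=16, block=BLOCK_SIZE):
    """Effective storage cost per element, including the per-block scale."""
    return (value_bits * block + scale_bits) / block


def reduction_vs_fp16(key_bits, val_bits):
    """Fraction of KV cache memory saved relative to FP16 keys and values."""
    return 1 - (key_bits + val_bits) / (2 * FP16_BITS)


q8 = bits_per_element(8)  # 8 value bits + 16/32 scale bits
q4 = bits_per_element(4)  # 4 value bits + 16/32 scale bits

k8v4 = reduction_vs_fp16(q8, q4)
k4v8 = reduction_vs_fp16(q4, q8)

print(f"K8V4 saves {k8v4:.1%}, K4V8 saves {k4v8:.1%}")
# Ratio of the two perplexity losses quoted in the post.
print(f"perplexity loss ratio: {6.06 / 0.86:.2f}x")
```

Under these assumptions both configurations cost 13 bits per key/value pair instead of 32, which lines up with the quoted 59%; only the perplexity numbers differ.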

This means you can run LLMs with 2-3× longer context on the same Mac. KV cache memory scales linearly with sequence length, so the absolute savings grow as context grows.
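To make the scaling concrete, here is a rough sketch of KV cache size as a function of context length. The model shape (22 layers, 4 KV heads, head dimension 64) is an assumption meant to resemble TinyLlama-1.1B, and the 8.5 / 4.5 bits per element assume `q8_0` keys and `q4_0` values; none of these numbers come from the post.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, key_bits, val_bits):
    """Approximate KV cache size: one K and one V tensor per layer."""
    elems_per_tensor = n_layers * n_kv_heads * head_dim * seq_len
    return elems_per_tensor * (key_bits + val_bits) / 8


# Assumed model shape (TinyLlama-like); adjust for the model you actually run.
SHAPE = dict(n_layers=22, n_kv_heads=4, head_dim=64)

MIB = 1024 * 1024
for ctx in (2048, 8192, 32768):
    fp16 = kv_cache_bytes(seq_len=ctx, key_bits=16, val_bits=16, **SHAPE)
    k8v4 = kv_cache_bytes(seq_len=ctx, key_bits=8.5, val_bits=4.5, **SHAPE)
    print(f"ctx={ctx:6d}  fp16={fp16 / MIB:7.1f} MiB  "
          f"k8v4={k8v4 / MIB:7.1f} MiB  saved={(fp16 - k8v4) / MIB:7.1f} MiB")

# With a fixed cache budget, the context that fits grows by the inverse of
# the per-element cost ratio.
context_gain = (16 + 16) / (8.5 + 4.5)
print(f"context gain at equal memory: {context_gain:.2f}x")
```

The percentage saved stays constant across context lengths; what grows is the number of bytes saved.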

Implementation was straightforward:
1. Added --kvq-key and --kvq-val flags to llama.cpp
2. Applied existing quantization logic separately to K and V tensors
3. Validated with perplexity metrics across context lengths
4. Used Metal for acceleration (with -mlong-calls flag to avoid vectorization issues)

Benchmarked on an M4 MacBook Pro running TinyLlama with 8K context windows. Compatible with Metal/MPS and optimized for Apple Silicon.

GitHub: https://github.com/dipampaul17/KVSplit

Aurornis No.44012200
I finally had time to read the code. The patch is unnecessary because this functionality has been in llama.cpp since 2023 if I understand this PR correctly: https://github.com/ggml-org/llama.cpp/pull/4312

Instead of offering a forked llama.cpp with the changes applied as commits, the repo wants you to run an `install.sh` script which checks out the master branch of llama.cpp without specifying a revision, then applies a short patch to it. This alone should be a warning flag that something is amiss.
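For contrast, a pinned checkout looks roughly like the sketch below. It only builds the git commands; the commit value is a placeholder you would replace with a revision you have reviewed, and the helper names are hypothetical.

```python
import shlex
import subprocess


def pinned_checkout_commands(repo_url, commit, dest):
    """Return git commands that clone a repo and check out one exact commit."""
    return [
        ["git", "clone", repo_url, dest],
        # Detached HEAD at a fixed commit, so later pushes to master
        # cannot change what the patch is applied to.
        ["git", "-C", dest, "checkout", "--detach", commit],
    ]


def run_all(commands):
    """Run each command, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = pinned_checkout_commands(
    "https://github.com/ggml-org/llama.cpp",
    "<commit-sha>",  # placeholder, not a real revision
    "llama.cpp",
)
for cmd in commands:
    print(shlex.join(cmd))
# Call run_all(commands) only after replacing the placeholder commit.
```

Pinning the revision means a patch either applies to known code or fails loudly, instead of depending on whatever master happens to contain that day.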

There are 4 different patch files in the repo, plus 1 extra version of the patch embedded as a heredoc in the install script for some reason. The script also has two different versions of the code that clones the repo and attempts the patch.

The install.sh script overwrites one of the patch files with another patch file with this line:

> cp patch/split_kv_quant.diff patch/fixed_kv_patch.diff

So the `fixed_kv_patch.diff` that is checked into the repo gets overwritten before being applied.

As far as I can tell, this is therefore the patch it's supposed to use: https://github.com/dipampaul17/KVSplit/blob/main/patch/split... (EDIT: I think it's actually this one, see comment at the end: https://github.com/dipampaul17/KVSplit/blob/main/patch/fixed... )

The only thing it adds is a "--kvq" argument which is supposed to let you set K and V quantization at the same time, but immediately above it are the already built-in arguments for setting the K and V quantization separately. Surely the author must have noticed the functionality already existed at some point while shuffling these patches around?
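For reference, this is roughly how the built-in arguments are used on an unpatched build. It is a sketch that only assembles a command line: to my knowledge the flags are `--cache-type-k` and `--cache-type-v`, but check `--help` on your own build, since option names and requirements change between versions (a quantized V cache may also need flash attention enabled). The model path is a placeholder and the allowed-type set is a deliberately small subset.

```python
import shlex

# Small subset of cache types, for illustration only; a given build may
# accept more or fewer.
KNOWN_CACHE_TYPES = {"f16", "q8_0", "q4_0"}


def llama_cli_args(model_path, key_type="q8_0", val_type="q4_0"):
    """Build an argv list that sets K and V cache types separately."""
    for name, value in (("key_type", key_type), ("val_type", val_type)):
        if value not in KNOWN_CACHE_TYPES:
            raise ValueError(f"{name}={value!r} is not in {sorted(KNOWN_CACHE_TYPES)}")
    return [
        "llama-cli",
        "-m", model_path,
        "--cache-type-k", key_type,  # precision of cached keys
        "--cache-type-v", val_type,  # precision of cached values
    ]


# Hypothetical model path; equivalent of the post's "K8V4" configuration.
print(shlex.join(llama_cli_args("models/model.gguf")))
```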

I strongly recommend that people do not run shell scripts from new repos like this, especially when the shell script is so convoluted.

The HN post has 200+ upvotes and the GitHub repo has collected 200+ stars and is still climbing, but I think the content is misleading. The flagged-to-death comment in this thread calling out the problem was actually correct. It's also concerning that the author continues to respond to this thread but is avoiding any questions about the functionality already existing.

EDIT: I misread the shell script. I think it actually applies this patch: https://github.com/dipampaul17/KVSplit/blob/main/patch/fixed... After applying the patch it mysteriously overwrites the fixed_kv_patch.diff patch with the split_kv_quant.diff file but then does nothing with it. I don't know if this is the result of vibecoding or just someone carelessly editing code, but I'll reiterate that nobody should run shell scripts like this from unknown repos.

EDIT 2: I'm even more confused now. The install.sh script references the old URL for the llama.cpp repo ( https://github.com/ggerganov/llama.cpp ) which now redirects because it was changed some time ago. The patches attempt to modify arg parsing in common.cpp, but that code was moved to arg.cpp 8 months ago ( https://github.com/ggml-org/llama.cpp/commit/bfe76d4a17228bf... ). So this install script and repo appear to be based on code from ~2024 using options added to llama.cpp in ~2023. What is going on here?
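A stale patch like this can be caught before anything is applied by checking whether the files it targets still exist in the checkout. The sketch below parses unified diff headers; the sample diff is hypothetical and only mimics the shape of a patch against common.cpp, and the helper names are made up. It is a simplification: a patch that creates a new file would also be reported as missing, and `git apply --check` is the more thorough tool.

```python
from pathlib import Path


def patch_targets(diff_text):
    """Collect the files a unified diff wants to modify ('+++ b/<path>' lines)."""
    targets = []
    for line in diff_text.splitlines():
        if not line.startswith("+++ "):
            continue
        path = line[4:].split("\t")[0].strip()
        if path == "/dev/null":  # the file is deleted by the patch
            continue
        if path.startswith("b/"):  # strip git's default prefix
            path = path[2:]
        targets.append(path)
    return targets


def missing_targets(diff_text, repo_root):
    """Return the target paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in patch_targets(diff_text) if not (root / p).is_file()]


# Hypothetical patch header, not copied from the repo under discussion.
SAMPLE_DIFF = """\
--- a/common/common.cpp
+++ b/common/common.cpp
@@ -1,3 +1,4 @@
 // existing line
+// added line
"""

print(patch_targets(SAMPLE_DIFF))
print(missing_targets(SAMPLE_DIFF, "llama.cpp"))
```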

replies(2): >>44013055 #>>44013769 #
1. imiric No.44013055
Finally someone making sense. The fact that this project works by applying patches instead of forking the original project and committing changes should alone be reason for concern.

But OP's entire GitHub presence is suspicious. On May 12th they fired off LLM slop PRs to a bunch of popular projects, and only the JAX ones were rejected. Nevertheless, this allowed them to pin these popular projects to their profile, as if they were a contributor.

I can't put into words how despicable this all is. Anyone working in the AI field is complicit in the corruption of information, the ramifications of which we can't even predict yet. Dead internet and the flood of AI slop are just the beginning.