
268 points dipampaul17 | 4 comments

I discovered that in LLM inference, keys and values in the KV cache have very different quantization sensitivities. Keys need higher precision than values to maintain quality.
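
Why are keys so much more sensitive? Part of it is structural: key error feeds the query-key dot product, where it gets amplified across the head dimension and then through the softmax, while value error is only blended by attention weights that sum to 1. Here's a contrived numpy toy that shows the asymmetry (made-up numbers, not evidence about any real model; the real sensitivity is what the perplexity results below measure):

    # Contrived toy: the same 3% relative error hurts far more on K than on V
    # here, because keys decide how attention mass is distributed while values
    # are only averaged. All numbers are made up for illustration.
    import numpy as np

    d = 64
    q = np.ones(d)
    K = np.stack([np.ones(d), np.full(d, 0.99)])        # two nearly tied keys
    V = np.stack([np.full(d, 1.0), np.full(d, -1.0)])   # very different values

    def attend(q, K, V):
        w = np.exp(q @ K.T / np.sqrt(d))
        w /= w.sum()
        return w @ V

    ref = attend(q, K, V)
    k_err = attend(q, K * np.array([[0.97], [1.0]]), V)  # 3% error on one key
    v_err = attend(q, K, V * 0.97)                        # 3% error on all values
    print("output shift from key error:  ", np.abs(k_err - ref).max())
    print("output shift from value error:", np.abs(v_err - ref).max())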

I patched llama.cpp to enable different bit-widths for keys vs. values on Apple Silicon. The results are surprising:

- K8V4 (8-bit keys, 4-bit values): 59% memory reduction with only 0.86% perplexity loss
- K4V8 (4-bit keys, 8-bit values): 59% memory reduction but 6.06% perplexity loss
- The configurations use the same number of bits, but K8V4 is 7× better for quality

This means you can run LLMs with 2-3× longer context on the same Mac. Memory usage scales with sequence length, so savings compound as context grows.
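
For a sense of scale, here's the back-of-the-envelope math (ignoring the quantized formats' scale metadata, and assuming I have TinyLlama-1.1B's shapes right: 22 layers, 4 KV heads under GQA, head dim 64):

    # Rough KV-cache sizing; ignores per-block scale overhead.
    # Shapes assume TinyLlama-1.1B: 22 layers, 4 KV heads (GQA), head dim 64.
    N_LAYERS, N_KV_HEADS, HEAD_DIM = 22, 4, 64

    def kv_bytes(seq_len, k_bits, v_bits):
        return seq_len * N_LAYERS * N_KV_HEADS * HEAD_DIM * (k_bits + v_bits) // 8

    for ctx in (8_192, 32_768, 131_072):
        fp16, k8v4 = kv_bytes(ctx, 16, 16), kv_bytes(ctx, 8, 4)
        print(f"{ctx:>7} tokens: {fp16 / 2**20:6.0f} MiB fp16 -> "
              f"{k8v4 / 2**20:5.0f} MiB K8V4")

At a fixed memory budget that works out to roughly 32/12 ≈ 2.7× as many tokens in the cache, which is consistent with the 2-3× longer context figure.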

Implementation was straightforward:

1. Added --kvq-key and --kvq-val flags to llama.cpp
2. Applied existing quantization logic separately to K and V tensors (sketch below)
3. Validated with perplexity metrics across context lengths
4. Used Metal for acceleration (with the -mlong-calls flag to avoid vectorization issues)
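
For anyone curious what step 2 looks like in spirit, here's a simplified numpy sketch (not llama.cpp's actual quantization kernels): the same group-wise quantizer is applied to K and V, just with independently chosen bit-widths.

    # Simplified sketch of step 2, not llama.cpp's actual cache code: one
    # group-wise symmetric quantizer, applied to K and V at different widths.
    import numpy as np

    def quantize_groupwise(x, bits, block=32):
        """Quantize in blocks of `block` values, one scale per block."""
        qmax = 2 ** (bits - 1) - 1
        blocks = x.reshape(-1, block)
        scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
        scales[scales == 0] = 1.0
        q = np.clip(np.round(blocks / scales), -qmax, qmax)
        return (q * scales).reshape(x.shape)   # dequantized, for comparison

    rng = np.random.default_rng(0)
    K = rng.standard_normal((8, 4, 64)).astype(np.float32)  # (tokens, kv_heads, head_dim)
    V = rng.standard_normal((8, 4, 64)).astype(np.float32)

    K_q = quantize_groupwise(K, bits=8)   # --kvq-key 8
    V_q = quantize_groupwise(V, bits=4)   # --kvq-val 4
    print("mean |error|  K:", np.abs(K - K_q).mean(), " V:", np.abs(V - V_q).mean())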

Benchmarked on an M4 MacBook Pro running TinyLlama with 8K context windows. Compatible with Metal/MPS and optimized for Apple Silicon.
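
For clarity on the metric: "perplexity loss" is the relative change in perplexity (exp of mean negative log-likelihood) between the fp16 cache and the quantized cache on the same text. Mechanically it's just this (log-probs below are made up):

    # What "X% perplexity loss" means mechanically (log-probs are made up).
    import math

    def perplexity(logprobs):
        return math.exp(-sum(logprobs) / len(logprobs))

    fp16_cache = [-2.10, -0.45, -3.20, -1.05, -0.80]   # per-token log-probs
    k8v4_cache = [-2.12, -0.46, -3.22, -1.06, -0.81]

    delta = perplexity(k8v4_cache) / perplexity(fp16_cache) - 1
    print(f"relative perplexity change: {delta:+.2%}")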

GitHub: https://github.com/dipampaul17/KVSplit

1. smcleod No.44009960
+0.86% perplexity is quite a bit at such a small context size though, isn't it? How is it at more reasonable context sizes like 64-128k?
replies(1): >>44010430 #
2. nomel No.44010430
> This means you can run LLMs with 2-3× longer context on the same Mac. Memory usage scales with sequence length, so savings compound as context grows.

The point seems to be that this reduces memory footprint. This makes it possible to run longer context, for the same limited memory, if you couldn't before. Or, you can use that free memory to do something else, like an IDE.

replies(1): >>44010906 #
3. smcleod No.44010906
Yeah, I get that; that's what we use k/v cache quantisation for now, which has a lower impact on PPL than this, unless I'm missing something?
replies(1): >>44011526 #
4. dipampaul17 No.44011526
You're right to question the perplexity impact - 0.86% isn't negligible. Our extended testing shows this impact remains fairly consistent across context lengths up to 16K, which was our test limit.

We haven't benchmarked at 64-128K contexts yet, but theoretically the relative perplexity impact should remain stable. The absolute impact could potentially compound with very long contexts, though.

The key difference from standard KV quantization is the asymmetric approach. Most implementations use K8V8 (8-bit for both) which has a 0.03% perplexity impact but only 47% memory savings. K8V4 pushes this to 59% savings with the 0.86% quality tradeoff.
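
The arithmetic behind those savings figures, assuming llama.cpp-style block quantization (32 elements per fp16 scale, so roughly half a bit of overhead per element; that's my reading of the storage format, not a measurement):

    # Savings vs. an fp16 KV cache. "Effective bits" assumes 32-element blocks
    # with one fp16 scale each, i.e. roughly bits + 0.5 per element.
    def saving(k_bits, v_bits, scale_overhead=16 / 32):
        eff = (k_bits + scale_overhead) + (v_bits + scale_overhead)
        return 1 - eff / (16 + 16)

    for name, kb, vb in [("K8V8", 8, 8), ("K8V4", 8, 4), ("K4V8", 4, 8)]:
        print(f"{name}: ~{saving(kb, vb):.0%} memory saving")

That lands right at the measured 47% and 59%, and it underlines that K8V4 and K4V8 cost exactly the same bits; the 7× quality gap is purely about where the precision goes.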

For reference, the quality impact is still well below the typical 5% threshold where differences become noticeable in generated text. It's a reasonable tradeoff for the additional memory savings, especially at long contexts.

@smcleod - We're using the same underlying quantization methods, just applying them asymmetrically between keys and values. If your existing approach already uses lower precision for values than keys, you're likely getting similar benefits.