
269 points | dipampaul17 | 2 comments

I discovered that in LLM inference, keys and values in the KV cache have very different quantization sensitivities. Keys need higher precision than values to maintain quality.

I patched llama.cpp to enable different bit-widths for keys vs. values on Apple Silicon. The results are surprising:

- K8V4 (8-bit keys, 4-bit values): 59% memory reduction with only 0.86% perplexity loss
- K4V8 (4-bit keys, 8-bit values): 59% memory reduction but 6.06% perplexity loss
- The configurations use the same number of bits, but K8V4 is 7× better for quality

This means you can run LLMs with 2-3× longer context on the same Mac. Memory usage scales with sequence length, so savings compound as context grows.
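To make the 59% figure concrete, here is a back-of-the-envelope sizing sketch (my own arithmetic, not output from the patch; the TinyLlama-1.1B shape numbers and the effective bits-per-element for ggml's Q8_0/Q4_0 block formats are assumptions):

```cpp
// Back-of-the-envelope KV cache sizing. Assumed model shape is roughly
// TinyLlama-1.1B (22 layers, 4 KV heads via GQA, head_dim 64), and the
// effective bits per element for ggml's block formats (~8.5 for Q8_0,
// ~4.5 for Q4_0, including block scales) are approximate.
#include <cstdio>

int main() {
    const double n_layers   = 22;
    const double n_kv_heads = 4;
    const double head_dim   = 64;
    const double n_ctx      = 8192;

    // elements in the K tensor across all layers at full context (V is the same size)
    const double elems = n_layers * n_kv_heads * head_dim * n_ctx;

    const double fp16_bits = 16.0 + 16.0;  // K + V at FP16
    const double k8v4_bits = 8.5  + 4.5;   // K at Q8_0 + V at Q4_0

    printf("FP16 KV cache: %6.1f MiB\n", elems * fp16_bits / 8.0 / (1 << 20));
    printf("K8V4 KV cache: %6.1f MiB\n", elems * k8v4_bits / 8.0 / (1 << 20));
    printf("reduction:     %6.1f %%\n", 100.0 * (1.0 - k8v4_bits / fp16_bits));
    return 0;
}
```

Under those assumptions this works out to roughly 176 MiB of KV cache at FP16 versus about 72 MiB for K8V4 at 8K context, i.e. a ~59% cut, and since the element count is linear in context length the absolute savings keep growing with longer prompts.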

Implementation was straightforward (a minimal API-level sketch of the same setting follows the list):

1. Added --kvq-key and --kvq-val flags to llama.cpp
2. Applied the existing quantization logic separately to the K and V tensors
3. Validated with perplexity metrics across context lengths
4. Used Metal for acceleration (with the -mlong-calls flag to avoid vectorization issues)
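
For readers who prefer the API over CLI flags: upstream llama.cpp already exposes per-cache types through `llama_context_params` (`type_k` / `type_v`, which as far as I understand back the `--cache-type-k` / `--cache-type-v` flags). A minimal sketch of a K8V4 setup, assuming a reasonably recent llama.cpp and a placeholder model path:

```cpp
// Sketch: asymmetric K8V4 cache via llama.cpp's C API.
// Assumptions: "model.gguf" is a placeholder path, and flash attention is
// enabled because quantized V caches generally require it (the exact field
// name may differ across llama.cpp versions).
#include "llama.h"

int main() {
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == NULL) return 1;

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx      = 8192;
    cparams.type_k     = GGML_TYPE_Q8_0;  // 8-bit keys: precision-sensitive
    cparams.type_v     = GGML_TYPE_Q4_0;  // 4-bit values: tolerate lower precision
    cparams.flash_attn = true;

    llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (ctx == NULL) { llama_free_model(model); return 1; }

    // ... tokenize, llama_decode(), and sample as usual ...

    llama_free(ctx);
    llama_free_model(model);
    return 0;
}
```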

Benchmarked on an M4 MacBook Pro running TinyLlama with 8K context windows. Compatible with Metal/MPS and optimized for Apple Silicon.

GitHub: https://github.com/dipampaul17/KVSplit

1. 3abiton (No.44010545)
This is a brilliant idea and initiative. Does this also apply to GPUs? And I assume it should be compatible with other quantization techniques, although they'd probably require their own patches?
replies(1): >>44011539
2. dipampaul17 (No.44011539)
Yup, this approach would likely work on NVIDIA/AMD GPUs as well - the underlying principle that keys require higher precision than values is hardware-independent.

The CUDA backend in llama.cpp already supports separate cache type settings with the `--cache-type-k` and `--cache-type-v` flags. Our particular patch is focused on Metal-specific optimizations, but the core technique transfers directly.

Regarding compatibility with other quantization methods - absolutely. This KV cache optimization is complementary to model weight quantization (Q4_K_M, GPTQ, AWQ, etc.). You can combine asymmetric KV cache precision with any model weight format.

Since KV cache quantization happens at runtime while processing tokens (separate from model weights), it doesn't conflict with how the model itself is quantized. They operate on different parts of the inference pipeline.

What would require additional work is integrating with specialized inference engines that have custom KV cache handling, like vLLM or TensorRT-LLM. Each would need its own implementation of asymmetric KV precision.

The most immediate GPU benefit would likely come from integrating these insights into the FlashAttention implementation directly, where the memory bandwidth savings could translate to even greater speedups on CUDA hardware.