(github.com)

272 points dipampaul17 | 1 comments | 16 May 25 20:04 UTC | HN request time: 0.192s | source

I discovered that in LLM inference, keys and values in the KV cache have very different quantization sensitivities. Keys need higher precision than values to maintain quality.

I patched llama.cpp to enable different bit-widths for keys vs. values on Apple Silicon. The results are surprising:

- K8V4 (8-bit keys, 4-bit values): 59% memory reduction with only 0.86% perplexity loss - K4V8 (4-bit keys, 8-bit values): 59% memory reduction but 6.06% perplexity loss - The configurations use the same number of bits, but K8V4 is 7× better for quality

This means you can run LLMs with 2-3× longer context on the same Mac. Memory usage scales with sequence length, so savings compound as context grows.

Implementation was straightforward: 1. Added --kvq-key and --kvq-val flags to llama.cpp 2. Applied existing quantization logic separately to K and V tensors 3. Validated with perplexity metrics across context lengths 4. Used Metal for acceleration (with -mlong-calls flag to avoid vectorization issues)

Benchmarked on an M4 MacBook Pro running TinyLlama with 8K context windows. Compatible with Metal/MPS and optimized for Apple Silicon.

GitHub: https://github.com/dipampaul17/KVSplit

Show context

badmonster ◴[16 May 25 20:40 UTC] No.44009632[source]▶

>>44009321 (OP) #

I'm curious: is it possible to apply differentiated KV quantization (like K8V4) to models after they're already converted to .gguf format, or does this require rebuilding the model with special support? If it's compatible with any .gguf file, are there any limitations on model types (e.g. Mistral, Phi-3, etc.) or tokenizer configs?

replies(1): >>44009655 #

dipampaul17 ◴[16 May 25 20:43 UTC] No.44009655[source]▶

>>44009632 #

Yes, that's one of the key benefits - KVSplit works with any existing .gguf model without requiring reconstruction or special conversion. The quantization happens at runtime on the KV cache, not during model loading or conversion.

This works because the KV cache is created during inference as tokens are processed, completely separate from the model weights themselves. The --kvq-key and --kvq-val flags simply tell llama.cpp how to store these intermediate tensors in memory.

I've tested it successfully with:

- Llama-3 models - Mistral models - Phi-2/Phi-3 - TinyLlama - Qwen variants

The only limitation is that it requires llama.cpp's Metal backend, and you need to disable Flash Attention with -fa 0 since the current FA implementation in llama.cpp bypasses the custom KV cache format. The technique itself should work with any transformer architecture that uses a standard attention mechanism.

replies(1): >>44011299 #

1. fennecbutt ◴[17 May 25 01:18 UTC] No.44011299[source]▶

>>44009655 #

I thought flash attention was required for quantised KV?

↑

Show HN: KVSplit – Run 2-3x longer contexts on Apple Silicon