
268 points | dipampaul17 | 1 comment

I discovered that in LLM inference, keys and values in the KV cache have very different quantization sensitivities. Keys need higher precision than values to maintain quality.

I patched llama.cpp to enable different bit-widths for keys vs. values on Apple Silicon. The results are surprising:

- K8V4 (8-bit keys, 4-bit values): 59% memory reduction with only a 0.86% perplexity increase
- K4V8 (4-bit keys, 8-bit values): the same 59% memory reduction, but a 6.06% perplexity increase
- Both configurations use the same number of bits, yet K8V4 degrades quality roughly 7× less
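
To see where the 59% figure and the "same bits, different quality" point come from, here is a back-of-the-envelope sizing sketch (an illustration only, not code from the repo; it assumes llama.cpp's Q8_0/Q4_0 block layouts and a TinyLlama-1.1B-like KV shape):

    # Back-of-the-envelope KV-cache sizing (illustration only, not code from the repo).
    # Assumed llama.cpp block formats: FP16 = 64 bytes, Q8_0 = 34 bytes,
    # Q4_0 = 18 bytes per 32 values (the quantized blocks carry a 2-byte scale).
    BYTES_PER_32_VALUES = {"f16": 64, "q8_0": 34, "q4_0": 18}

    def kv_cache_mib(seq_len, n_layers, n_kv_heads, head_dim, k_type, v_type):
        elems = seq_len * n_layers * n_kv_heads * head_dim     # per tensor (K or V)
        k_bytes = elems / 32 * BYTES_PER_32_VALUES[k_type]
        v_bytes = elems / 32 * BYTES_PER_32_VALUES[v_type]
        return (k_bytes + v_bytes) / 2**20

    # TinyLlama-1.1B-like KV shape (assumed): 22 layers, 4 KV heads, head_dim 64
    shape = dict(seq_len=8192, n_layers=22, n_kv_heads=4, head_dim=64)
    fp16 = kv_cache_mib(**shape, k_type="f16", v_type="f16")    # ~176 MiB
    k8v4 = kv_cache_mib(**shape, k_type="q8_0", v_type="q4_0")  # ~72 MiB
    k4v8 = kv_cache_mib(**shape, k_type="q4_0", v_type="q8_0")  # ~72 MiB, same size
    print(f"FP16 {fp16:.0f} MiB | K8V4 {k8v4:.0f} MiB "
          f"({1 - k8v4 / fp16:.0%} smaller) | K4V8 {k4v8:.0f} MiB")

K8V4 and K4V8 come out byte-for-byte identical in size, which is exactly why the 7× quality gap between them is the interesting part.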

This means you can run LLMs with 2-3× longer context on the same Mac. KV cache memory scales linearly with sequence length, so the absolute savings grow as your context grows.

Implementation was straightforward:

1. Added --kvq-key and --kvq-val flags to llama.cpp
2. Applied llama.cpp's existing quantization logic separately to the K and V tensors (a simplified sketch of the idea is shown after this list)
3. Validated with perplexity metrics across context lengths
4. Used Metal for acceleration (with the -mlong-calls flag to avoid vectorization issues)
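
Step 2 is conceptually simple: the same blockwise quantizer runs over both tensors, just with a different bit-width for K than for V. Here is a simplified NumPy illustration of that idea (a sketch only, not the actual llama.cpp/Metal code path):

    import numpy as np

    def quantize_blocks(x, bits, block=32):
        # Symmetric blockwise quantization with one scale per block -
        # the same spirit as llama.cpp's Q8_0/Q4_0, heavily simplified.
        x = x.reshape(-1, block)
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(x).max(axis=1, keepdims=True) / qmax
        scale[scale == 0] = 1.0
        q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale

    def dequantize_blocks(q, scale):
        return (q * scale).reshape(-1)

    # The whole change: K and V go through the same code path,
    # just with different bit-widths (8 for keys, 4 for values here).
    rng = np.random.default_rng(0)
    K = rng.standard_normal(32 * 1024).astype(np.float32)
    V = rng.standard_normal(32 * 1024).astype(np.float32)
    K_hat = dequantize_blocks(*quantize_blocks(K, bits=8))
    V_hat = dequantize_blocks(*quantize_blocks(V, bits=4))
    print("K rmse:", float(np.sqrt(np.mean((K_hat - K) ** 2))))
    print("V rmse:", float(np.sqrt(np.mean((V_hat - V) ** 2))))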

Benchmarked on an M4 MacBook Pro running TinyLlama with 8K context windows. Compatible with Metal/MPS and optimized for Apple Silicon.

GitHub: https://github.com/dipampaul17/KVSplit

nico (No.44009571)
Great work. This seems very interesting, but I need something slightly more high-level to relate to it.

Will it just allow me to run let’s say a model with a 2048 token context window with a 4-6k context window? Or a 128k model (like gemma3) with a 256k+ context window?

What’s the ideal use case for local models?

Thank you

dipampaul17 (No.44009665)
With the K8V4 configuration providing 59% memory savings, you can effectively run contexts about 2.4× longer on the same hardware: a model with a 2,048-token context window can now handle roughly 5,000 tokens, while an 8K-context model can reach approximately 19.5K tokens.
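
For anyone sanity-checking the arithmetic, the factor is just the reciprocal of what's left of the cache (this assumes the KV cache is what caps your context and ignores other per-token overheads):

    # 59% smaller KV cache -> the same cache budget fits ~2.44x as many tokens.
    factor = 1 / (1 - 0.59)      # ~2.44
    print(round(2048 * factor))  # ~5000
    print(round(8192 * factor))  # ~19980; quoted conservatively above as ~19.5K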

In practical terms, this means processing entire books at once on a MacBook, analyzing large codebases without splitting files, or maintaining comprehensive conversation history in chat applications.

The memory savings scale linearly with context length - the longer your context window, the more absolute memory you save. On my M4 MacBook with 8K context, I reduced the KV cache from 176 MB to 72 MB. At 128K context, that same percentage saving would free up gigabytes.
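
Scaling the measured 8K numbers linearly gives a rough projection (it assumes the per-token cache cost stays constant, which it does for a fixed model):

    # Linear extrapolation of the measured 8K-context numbers to 128K context.
    fp16_mb_per_token = 176 / 8192   # ~0.0215 MB per token at FP16
    k8v4_mb_per_token = 72 / 8192    # ~0.0088 MB per token with K8V4
    ctx = 128 * 1024
    saved_gb = (fp16_mb_per_token - k8v4_mb_per_token) * ctx / 1024
    print(f"{fp16_mb_per_token * ctx / 1024:.2f} GB -> "
          f"{k8v4_mb_per_token * ctx / 1024:.2f} GB "
          f"(~{saved_gb:.1f} GB freed)")  # 2.75 GB -> 1.12 GB, ~1.6 GB freed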

This optimization is most valuable when you're context-window limited rather than model-parameter limited. If you're hitting OOM errors due to long inputs rather than large model weights, KVSplit directly addresses your bottleneck.