
268 points | dipampaul17 | 1 comment

I discovered that in LLM inference, keys and values in the KV cache have very different quantization sensitivities. Keys need higher precision than values to maintain quality.

I patched llama.cpp to enable different bit-widths for keys vs. values on Apple Silicon. The results are surprising:

- K8V4 (8-bit keys, 4-bit values): 59% memory reduction with only 0.86% perplexity loss
- K4V8 (4-bit keys, 8-bit values): 59% memory reduction but 6.06% perplexity loss
- The configurations use the same number of bits, but K8V4 is 7× better for quality

This means you can run LLMs with 2-3× longer context on the same Mac. Memory usage scales with sequence length, so savings compound as context grows.
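
To put rough numbers on that, here's a back-of-the-envelope sketch (the dimensions are illustrative GQA-style values rather than an exact model config, and it ignores llama.cpp's per-block scale overhead):

    # Approximate KV cache size: K and V each hold layers * kv_heads * head_dim elements per token
    layers, kv_heads, head_dim, seq_len = 22, 4, 64, 8192    # illustrative small GQA model at 8K context
    elems = layers * kv_heads * head_dim * seq_len            # elements in K (same again for V)

    fp16 = 2 * elems * 2.0               # 2 bytes per element for both K and V
    k8v4 = elems * 1.0 + elems * 0.5     # ~1 byte/elem for 8-bit keys, ~0.5 byte/elem for 4-bit values

    print(f"FP16 KV cache: {fp16 / 2**20:.0f} MiB")
    print(f"K8V4 KV cache: {k8v4 / 2**20:.0f} MiB ({1 - k8v4 / fp16:.0%} smaller)")

The naive arithmetic gives roughly 62% savings; the measured 59% is slightly lower because the quantized formats carry per-block scale metadata. Either way, the absolute savings grow linearly with sequence length.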

Implementation was straightforward:

1. Added --kvq-key and --kvq-val flags to llama.cpp
2. Applied existing quantization logic separately to K and V tensors
3. Validated with perplexity metrics across context lengths
4. Used Metal for acceleration (with -mlong-calls flag to avoid vectorization issues)
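
Roughly, usage looks like this (an illustrative sketch; see the repo for the exact invocation, and the model path is just a placeholder):

    # patched llama.cpp build; --kvq-key / --kvq-val are the new flags, the rest are standard options
    ./main -m tinyllama.gguf -c 8192 -ngl 99 \
        --kvq-key 8 \
        --kvq-val 4

Swapping the two values gives the K4V8 configuration from the comparison above.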

Benchmarked on an M4 MacBook Pro running TinyLlama with 8K context windows. Compatible with Metal/MPS and optimized for Apple Silicon.

GitHub: https://github.com/dipampaul17/KVSplit

matheist No.44009741
Looks interesting! Is there any intuition for why this should be the case? Did you discover it via that intuition, or just random experimentation?

A note: your install script appears to still have a placeholder at the "apply patch" step. A suggestion: it might be more user-friendly to fork llama.cpp and include that as a git submodule, rather than making it a "git clone and apply patch" step.

A further note: everyone and their dog has a different local Python setup, so it might be nice to let people separate the llama.cpp stuff from the Python stuff rather than bake in a dependency on Homebrew Python.

replies(2): >>44011484 #>>44012275 #
dipampaul17 No.44011484
Great question about the intuition! The difference comes from the core roles these components play in attention.

Keys determine which tokens to attend to - they create the actual attention pattern through similarity calculations. Values only store what information gets passed forward once attention is decided.

When a key vector is quantized too aggressively, it distorts the similarity calculations for every token interaction. A small error in keys can completely redirect attention to the wrong tokens.

Values, however, are much more forgiving. When a value vector is quantized, any error only affects the specific information content of that single token after the attention pattern is already established.

It's like a library catalog system vs. the books themselves. If catalog numbers (keys) are corrupted, you'll look in completely wrong sections. If some words in books (values) are smudged, you're still reading the right book - just with occasional noise.

Mathematically, keys participate in softmax calculations where small errors get exponentially amplified through the normalization process. Values just undergo linear weighted averaging, where errors tend to cancel out.
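
In symbols, for a single query under standard scaled dot-product attention (a first-order sketch):

    % attention for one query q over cached keys k_i and values v_i
    \ell_i = \frac{q \cdot k_i}{\sqrt{d}}, \qquad w = \mathrm{softmax}(\ell), \qquad o = \sum_i w_i v_i

    % a value error passes through a convex combination, so it stays bounded and tends to average out
    \delta o_{\mathrm{val}} = \sum_i w_i \, \delta v_i, \qquad
    \lVert \delta o_{\mathrm{val}} \rVert \le \max_i \lVert \delta v_i \rVert

    % a key error enters the softmax exponent, re-weighting which values get mixed
    \delta \ell_i = \frac{q \cdot \delta k_i}{\sqrt{d}}, \qquad
    \delta w_i \approx w_i \Bigl( \delta \ell_i - \sum_j w_j \, \delta \ell_j \Bigr), \qquad
    \delta o_{\mathrm{key}} \approx \sum_i \delta w_i \, v_i

The w_i factor in the key case means the error lands exactly where attention is concentrated, and since the v_i being re-mixed can differ arbitrarily, even a small redistribution of attention mass shifts the output by a fraction of the gap between the right and wrong value vectors.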

I first encountered this asymmetry in papers like "More for Keys, Less for Values" and "KV-AdaQuant," but wanted to quantify exactly how it impacts Apple Silicon inference. The 7× quality difference between K8V4 and K4V8 using identical memory was striking.

Thanks for the installation feedback too! I'll fix the placeholder and make the Python dependencies more flexible.

replies(2): >>44011553 #>>44012010 #
vlovich123 No.44012010
My understanding is that the roles of the K, V, and Q tensors aren't actually well understood, and that while they're called key/value/query tensors, it's not quite straightforward to tease out what they mean or the role they play.