
268 points dipampaul17 | 3 comments

I discovered that in LLM inference, keys and values in the KV cache have very different quantization sensitivities. Keys need higher precision than values to maintain quality.

I patched llama.cpp to enable different bit-widths for keys vs. values on Apple Silicon. The results are surprising:

- K8V4 (8-bit keys, 4-bit values): 59% memory reduction with only 0.86% perplexity loss
- K4V8 (4-bit keys, 8-bit values): 59% memory reduction but 6.06% perplexity loss
- Both configurations use the same number of bits, but K8V4's perplexity hit is roughly 7× smaller (back-of-envelope memory math below)
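Where the 59% figure comes from, assuming llama.cpp's Q8_0/Q4_0 block formats (one fp16 scale per 32 values) and TinyLlama-1.1B's shape; a sketch, not exact accounting:

```python
# Back-of-envelope KV cache sizing. Assumes llama.cpp block formats:
# Q8_0 stores 32 values in 34 bytes (8.5 bits/elem), Q4_0 in 18 bytes
# (4.5 bits/elem), versus 16 bits/elem for an FP16 cache.
# Model shape is TinyLlama-1.1B-ish (22 layers, 4 KV heads, head_dim 64);
# treat the numbers as illustrative.

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, k_bits, v_bits):
    elems = n_layers * n_kv_heads * head_dim      # elements per token, per K or per V
    return elems * (k_bits + v_bits) / 8

FP16, Q8_0, Q4_0 = 16.0, 34 * 8 / 32, 18 * 8 / 32  # 16.0, 8.5, 4.5 bits per element

base = kv_bytes_per_token(22, 4, 64, FP16, FP16)
k8v4 = kv_bytes_per_token(22, 4, 64, Q8_0, Q4_0)

ctx = 8192
print(f"FP16 KV cache @ {ctx} tokens: {base * ctx / 2**20:6.1f} MiB")
print(f"K8V4 KV cache @ {ctx} tokens: {k8v4 * ctx / 2**20:6.1f} MiB "
      f"({100 * (1 - k8v4 / base):.1f}% smaller)")
```

With the per-block scales included, K8V4 works out to (8.5 + 4.5) / 32 ≈ 41% of the FP16 footprint, i.e. roughly a 59% reduction.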

This means you can run LLMs with 2-3× longer context on the same Mac. The KV cache grows linearly with sequence length, so the absolute savings increase as context grows.

Implementation was straightforward:

1. Added --kvq-key and --kvq-val flags to llama.cpp
2. Applied the existing quantization logic separately to the K and V tensors (toy sketch below)
3. Validated with perplexity metrics across context lengths
4. Used Metal for acceleration (with the -mlong-calls flag to avoid vectorization issues)
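To make step 2 concrete, here's a toy numpy sketch of block-wise absmax quantization applied with a different bit-width per tensor. It only illustrates the idea; the real path reuses llama.cpp's existing Q8_0/Q4_0 kernels:

```python
import numpy as np

def quantize_blockwise(x, bits, block=32):
    """Toy absmax quantizer: each block of `block` values shares one fp16 scale,
    and values are rounded to signed integers of width `bits`."""
    x = x.reshape(-1, block)
    qmax = 2 ** (bits - 1) - 1                    # 127 for 8-bit, 7 for 4-bit
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                       # guard all-zero blocks
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize(q, scale):
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(-1)

# Same routine, different precision per tensor: 8-bit keys, 4-bit values.
rng = np.random.default_rng(0)
k, v = rng.standard_normal(4096), rng.standard_normal(4096)
k_err = np.abs(dequantize(*quantize_blockwise(k, 8)) - k).mean()
v_err = np.abs(dequantize(*quantize_blockwise(v, 4)) - v).mean()
print(f"mean abs error  K @ 8-bit: {k_err:.4f}   V @ 4-bit: {v_err:.4f}")
```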

Benchmarked on an M4 MacBook Pro running TinyLlama with 8K context windows. Compatible with Metal/MPS and optimized for Apple Silicon.

GitHub: https://github.com/dipampaul17/KVSplit

entrepy123 No.44009707
Are these significantly faster/better on 64GB or 128GB Apple silicon (over 36GB or 48GB)?

I've been reading that large contexts and large models are just painfully slow, even on the fastest and largest Apple silicon that money can buy.

So I wonder if this helps make more use of greater memory, or if really smallish models are still where it's at for Apple silicon, practically speaking.

replies(1): >>44011509 #
1. dipampaul17 No.44011509
The memory savings from KVSplit scale proportionally with context length, so higher-RAM Macs (64GB/128GB) benefit even more in absolute terms. On a 128GB Mac Studio, you could potentially handle context windows in the hundreds of thousands of tokens.
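Rough sizing under an assumed 8B-class model shape (32 layers, 8 KV heads, head_dim 128) and the same block-format overheads as in the post; it budgets the KV cache only, not weights or runtime overhead:

```python
# How many tokens of K8V4 cache fit in a given RAM budget, assuming an
# 8B-class model shape (32 layers, 8 KV heads, head_dim 128) and llama.cpp
# block formats (Q8_0 ~ 8.5 bits/elem, Q4_0 ~ 4.5 bits/elem). KV cache only;
# weights, activations and the OS need their own share of RAM.
def tokens_that_fit(budget_gib, n_layers=32, n_kv_heads=8, head_dim=128,
                    k_bits=8.5, v_bits=4.5):
    elems = n_layers * n_kv_heads * head_dim          # per token, per K or per V
    bytes_per_token = elems * (k_bits + v_bits) / 8
    return int(budget_gib * 2**30 / bytes_per_token)

for budget in (16, 48, 96):   # GiB carved out for the KV cache on 36/64/128GB machines
    print(f"{budget:3d} GiB KV budget -> ~{tokens_that_fit(budget):,} tokens at K8V4")
```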

However, KVSplit doesn't fundamentally change computation speed - just memory efficiency. Our benchmarks show a 14.5% throughput improvement with K8V4, but this comes from better memory locality, not reduced computation.

The "painfully slow" issue with large models on Apple Silicon stems primarily from the compute limitations, not memory constraints. A 70B parameter model will still run at similar token generation speeds regardless of available RAM or KV cache optimizations.

What KVSplit does is make better use of whatever memory you have available. It's particularly valuable when your bottleneck is context length rather than model size.

For practical Apple Silicon usage, the sweet spot remains smaller models (7B-13B) with now-expanded context windows. This lets you process significantly more text while maintaining reasonable generation speeds.

If your workflow needs both massive contexts AND large models, you'd still want to consider server-grade GPUs, but KVSplit helps push the boundary of what's feasible on Apple hardware.

replies(2): >>44011524 #>>44011537 #
2. hiatus No.44011524
Is this any different from using --cache-type-k and --cache-type-v?
3. andrewmcwatters No.44011537
Thank you for these insights!