
272 points dipampaul17 | 1 comments | 16 May 25 20:04 UTC

I discovered that in LLM inference, keys and values in the KV cache have very different quantization sensitivities. Keys need higher precision than values to maintain quality.

I patched llama.cpp to enable different bit-widths for keys vs. values on Apple Silicon. The results are surprising:

- K8V4 (8-bit keys, 4-bit values): 59% memory reduction with only 0.86% perplexity loss
- K4V8 (4-bit keys, 8-bit values): 59% memory reduction but 6.06% perplexity loss
- The configurations use the same number of bits, but K8V4 is 7× better for quality
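For anyone wanting to sanity-check the 59% figure, here is a rough sketch of the arithmetic. It assumes the cache uses block formats like llama.cpp's Q8_0 and Q4_0 (32 elements sharing one 16-bit scale) and that the baseline is an FP16 cache. The post does not state either, and the function names are made up for illustration.

```python
# Assumption: blocks of 32 elements share one 16-bit scale,
# as in llama.cpp's Q8_0 and Q4_0 formats.
BLOCK_SIZE = 32
SCALE_BITS = 16


def bits_per_element(bits: int) -> float:
    """Effective storage per element, including the per-block scale."""
    if bits == 16:
        return 16.0  # FP16 baseline has no per-block scale
    return bits + SCALE_BITS / BLOCK_SIZE


def kv_reduction(key_bits: int, val_bits: int) -> float:
    """Fraction of memory saved relative to an FP16 key and value cache."""
    baseline = 2 * 16.0
    quantized = bits_per_element(key_bits) + bits_per_element(val_bits)
    return 1.0 - quantized / baseline


for name, (k, v) in {"K8V4": (8, 4), "K4V8": (4, 8)}.items():
    print(f"{name}: {kv_reduction(k, v):.1%} smaller than FP16")
```

Under those assumptions both configurations come out at about 59.4%, which lines up with the numbers above: the memory cost is symmetric, only the quality cost is not.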

This means you can run LLMs with 2-3× longer context on the same Mac. Memory usage scales with sequence length, so savings compound as context grows.
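A back-of-the-envelope sketch of how that translates into context length. The model shape below (22 layers, 4 KV heads, head dimension 64) is an illustrative assumption rather than something taken from the benchmark, and the 8.5 and 4.5 effective bit-widths assume one 16-bit scale per 32-element block.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, key_bits, val_bits):
    """Total KV cache size in bytes; grows linearly with n_tokens."""
    elements_per_token = n_layers * n_kv_heads * head_dim  # for K, and again for V
    return n_tokens * elements_per_token * (key_bits + val_bits) / 8


def max_context(budget_bytes, n_layers, n_kv_heads, head_dim, key_bits, val_bits):
    """Largest token count whose KV cache fits in budget_bytes."""
    per_token = kv_cache_bytes(1, n_layers, n_kv_heads, head_dim, key_bits, val_bits)
    return int(budget_bytes // per_token)


# Hypothetical model shape, for illustration only.
shape = dict(n_layers=22, n_kv_heads=4, head_dim=64)

# Memory an FP16 cache would need for an 8K context.
budget = kv_cache_bytes(8192, key_bits=16, val_bits=16, **shape)

# Context that fits in the same memory with 8-bit keys and 4-bit values
# (8.5 and 4.5 effective bits once the per-block scale is counted).
k8v4_tokens = max_context(budget, key_bits=8.5, val_bits=4.5, **shape)

print(f"FP16 budget: {budget / 2**20:.0f} MiB for 8192 tokens")
print(f"K8V4 fits {k8v4_tokens} tokens ({k8v4_tokens / 8192:.2f}x)")
```

This only counts the KV cache. Weights and activations are a fixed overhead on top, so the end-to-end gain depends on how much of total memory the cache occupies at a given context length.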

Implementation was straightforward:
1. Added --kvq-key and --kvq-val flags to llama.cpp
2. Applied existing quantization logic separately to K and V tensors
3. Validated with perplexity metrics across context lengths
4. Used Metal for acceleration (with -mlong-calls flag to avoid vectorization issues)
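Step 2 can be pictured with a toy example. This is a simplified symmetric block quantizer in plain Python, not the llama.cpp kernels or the actual patch; `quantize_blockwise` and `quantize_kv` are hypothetical names. It only shows the idea of giving K and V independent bit-widths.

```python
import random


def quantize_blockwise(values, bits, block_size=32):
    """Round each block to a signed `bits`-bit grid with one scale per block.

    Returns the dequantized floats so the error can be inspected directly.
    """
    qmax = 2 ** (bits - 1) - 1
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(x) for x in block)
        scale = amax / qmax if amax > 0 else 1.0
        out.extend(round(x / scale) * scale for x in block)
    return out


def quantize_kv(keys, vals, key_bits, val_bits):
    """Apply the same quantizer to K and V, each with its own bit-width."""
    return quantize_blockwise(keys, key_bits), quantize_blockwise(vals, val_bits)


def rms_error(a, b):
    """Root-mean-square difference between two equal-length sequences."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5


random.seed(0)
keys = [random.gauss(0, 1) for _ in range(1024)]
vals = [random.gauss(0, 1) for _ in range(1024)]

k_q, v_q = quantize_kv(keys, vals, key_bits=8, val_bits=4)
print("key RMS error:  ", rms_error(keys, k_q))
print("value RMS error:", rms_error(vals, v_q))
```

On random data this only shows that 8 bits reconstruct more accurately than 4. Whether that precision matters more for keys than for values is what the perplexity runs in step 3 measure.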

Benchmarked on an M4 MacBook Pro running TinyLlama with 8K context windows. Compatible with Metal/MPS and optimized for Apple Silicon.

GitHub: https://github.com/dipampaul17/KVSplit


behnamoh ◴[16 May 25 20:52 UTC] No.44009732[source]▶

>>44009321 (OP) #

Is this patch possible to do on MLX? I'm getting better speeds on MLX. That, combined with your approach, would finally let Mac users have long conversations at usable speeds.

replies(1): >>44011964 #

1. landl0rd ◴[17 May 25 04:05 UTC] No.44011964[source]▶

>>44009732 #

Probably, but I am currently deep in the MLX weeds and finding that, though it's a well-designed framework, it's much less mature in terms of example code you can steal, where someone has already benchmarked the "best way" to do something.

My biggest hope for it is actually Haskell bindings, believe it or not. Someone pointed out the other day that its laziness makes it fit really well with that paradigm, and the more or less pure-function approach to the compile graph helps too. ML in Haskell would be fun.


Show HN: KVSplit – Run 2-3x longer contexts on Apple Silicon