
268 points by dipampaul17 | 4 comments

I discovered that in LLM inference, keys and values in the KV cache have very different quantization sensitivities. Keys need higher precision than values to maintain quality.

I patched llama.cpp to enable different bit-widths for keys vs. values on Apple Silicon. The results are surprising:

- K8V4 (8-bit keys, 4-bit values): 59% memory reduction with only 0.86% perplexity loss
- K4V8 (4-bit keys, 8-bit values): 59% memory reduction but 6.06% perplexity loss
- The two configurations spend the same number of bits, but K8V4 is ~7× better for quality (bit accounting sketched just below)
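
As a sanity check on those numbers, here is a rough bit-accounting sketch. It assumes K8V4/K4V8 map onto llama.cpp's block-quantized q8_0/q4_0 layouts, where each 32-element block carries an fp16 scale (~8.5 and ~4.5 effective bits per element); that mapping is my assumption, not something stated above.

```python
# Effective bits per element, assuming llama.cpp's block formats
# (an assumption: q8_0 = 34 bytes / 32 elems, q4_0 = 18 bytes / 32, f16 = 2 bytes each).
BITS = {"f16": 16.0, "q8_0": 34 * 8 / 32, "q4_0": 18 * 8 / 32}

def kv_reduction(k_type: str, v_type: str, baseline: str = "f16") -> float:
    """Memory saved vs. an f16/f16 KV cache for a given key/value type pair."""
    return 1 - (BITS[k_type] + BITS[v_type]) / (2 * BITS[baseline])

print(f"K8V4 saves {kv_reduction('q8_0', 'q4_0'):.0%}")  # ~59%
print(f"K4V8 saves {kv_reduction('q4_0', 'q8_0'):.0%}")  # same footprint, worse quality
```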

This means you can run LLMs with 2-3× longer context on the same Mac. Memory usage scales with sequence length, so savings compound as context grows.
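
To make "savings compound" concrete, here is a back-of-the-envelope scaling sketch using a TinyLlama-1.1B-style geometry (22 layers, 4 KV heads, head dim 64). The geometry and per-element costs are illustrative assumptions, not measurements from the benchmark below.

```python
# Hypothetical TinyLlama-1.1B-style geometry; illustrative only, not measured.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 22, 4, 64

def kv_cache_mib(n_ctx: int, avg_bits_per_elem: float) -> float:
    """Total KV cache size in MiB for K + V across all layers at full context."""
    elems = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * n_ctx  # K and V elements
    return elems * avg_bits_per_elem / 8 / 2**20

for n_ctx in (4096, 8192, 16384):
    fp16 = kv_cache_mib(n_ctx, 16.0)
    k8v4 = kv_cache_mib(n_ctx, (8.5 + 4.5) / 2)  # mean of key and value bit widths
    print(f"{n_ctx:>6} tokens: f16 {fp16:6.1f} MiB -> K8V4 {k8v4:6.1f} MiB")
```

Both columns grow linearly with context length, so the absolute savings grow with it; fitting roughly 1/0.41 ≈ 2.4× more tokens into the same KV budget is where the 2-3× figure comes from.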

Implementation was straightforward:

1. Added --kvq-key and --kvq-val flags to llama.cpp
2. Applied the existing quantization logic separately to the K and V tensors (sketched just below)
3. Validated with perplexity metrics across context lengths
4. Used Metal for acceleration (with the -mlong-calls flag to avoid vectorization issues)
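
Step 2 is the core of it. Below is a minimal toy sketch of what quantizing K and V at different widths means: a symmetric per-block quantizer in NumPy, purely illustrative and unrelated to the repo's actual Metal path.

```python
import numpy as np

def quantize_dequantize(x: np.ndarray, bits: int, block: int = 32) -> np.ndarray:
    """Toy symmetric per-block quantizer: round x to `bits` and reconstruct."""
    qmax = 2 ** (bits - 1) - 1
    blocks = x.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                      # avoid division by zero
    q = np.clip(np.round(blocks / scale), -qmax, qmax)
    return (q * scale).reshape(x.shape)

rng = np.random.default_rng(0)
k = rng.standard_normal(4096).astype(np.float32)  # stand-in for a key tensor
v = rng.standard_normal(4096).astype(np.float32)  # stand-in for a value tensor

# K8V4: keys kept at 8 bits, values pushed down to 4 bits.
k_err = np.abs(quantize_dequantize(k, 8) - k).mean()
v_err = np.abs(quantize_dequantize(v, 4) - v).mean()
print(f"mean abs error  K@8bit: {k_err:.4f}   V@4bit: {v_err:.4f}")
```

On random data the two error numbers just reflect bit width; the finding above is that real attention tensors tolerate the 4-bit error far better on V than on K, which is why K8V4 and K4V8 diverge so much in perplexity.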

Benchmarked on an M4 MacBook Pro running TinyLlama with 8K context windows. Compatible with Metal/MPS and optimized for Apple Silicon.

GitHub: https://github.com/dipampaul17/KVSplit

1. ondra
Is this any different from using --cache-type-k and --cache-type-v?
2. azinman2
That’s what I want to know!
3. landl0rd
I'm guessing it's a bit different, since MLX/MPS doesn't have native 4-bit support (or even 8-bit, if I remember correctly?). It didn't even launch with bf16 support. So I think the lowest you could go with the old type_k/type_v options on Apple GPUs was 16-bit f16/bf16, but I'm not a llama.cpp internals expert, so I may be wrong.
4. Aurornis
No, it appears to be an LLM-generated attempt to gain GitHub stars.

See my other comment for a sampling of the other oddities in the repo.