
268 points dipampaul17 | 4 comments

I discovered that in LLM inference, keys and values in the KV cache have very different quantization sensitivities. Keys need higher precision than values to maintain quality.
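
Why are keys so much more sensitive? Part of it is structural: key error feeds the query-key dot product, where it gets amplified across the head dimension and then through the softmax, while value error is only blended by attention weights that sum to 1. Here's a contrived numpy toy that shows the asymmetry (made-up numbers, not evidence about any real model; the real sensitivity is what the perplexity results below measure):

    # Contrived toy: the same 3% relative error hurts far more on K than on V
    # here, because keys decide how attention mass is distributed while values
    # are only averaged. All numbers are made up for illustration.
    import numpy as np

    d = 64
    q = np.ones(d)
    K = np.stack([np.ones(d), np.full(d, 0.99)])        # two nearly tied keys
    V = np.stack([np.full(d, 1.0), np.full(d, -1.0)])   # very different values

    def attend(q, K, V):
        w = np.exp(q @ K.T / np.sqrt(d))
        w /= w.sum()
        return w @ V

    ref = attend(q, K, V)
    k_err = attend(q, K * np.array([[0.97], [1.0]]), V)  # 3% error on one key
    v_err = attend(q, K, V * 0.97)                        # 3% error on all values
    print("output shift from key error:  ", np.abs(k_err - ref).max())
    print("output shift from value error:", np.abs(v_err - ref).max())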

I patched llama.cpp to enable different bit-widths for keys vs. values on Apple Silicon. The results are surprising:

- K8V4 (8-bit keys, 4-bit values): 59% memory reduction with only 0.86% perplexity loss
- K4V8 (4-bit keys, 8-bit values): 59% memory reduction but 6.06% perplexity loss
- The configurations use the same number of bits, but K8V4 is 7× better for quality

This means you can run LLMs with 2-3× longer context on the same Mac. Memory usage scales with sequence length, so savings compound as context grows.
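
For a sense of scale, here's the back-of-the-envelope math (ignoring the quantized formats' scale metadata, and assuming I have TinyLlama-1.1B's shapes right: 22 layers, 4 KV heads under GQA, head dim 64):

    # Rough KV-cache sizing; ignores per-block scale overhead.
    # Shapes assume TinyLlama-1.1B: 22 layers, 4 KV heads (GQA), head dim 64.
    N_LAYERS, N_KV_HEADS, HEAD_DIM = 22, 4, 64

    def kv_bytes(seq_len, k_bits, v_bits):
        return seq_len * N_LAYERS * N_KV_HEADS * HEAD_DIM * (k_bits + v_bits) // 8

    for ctx in (8_192, 32_768, 131_072):
        fp16, k8v4 = kv_bytes(ctx, 16, 16), kv_bytes(ctx, 8, 4)
        print(f"{ctx:>7} tokens: {fp16 / 2**20:6.0f} MiB fp16 -> "
              f"{k8v4 / 2**20:5.0f} MiB K8V4")

At a fixed memory budget that works out to roughly 32/12 ≈ 2.7× as many tokens in the cache, which is consistent with the 2-3× longer context figure.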

Implementation was straightforward:

1. Added --kvq-key and --kvq-val flags to llama.cpp
2. Applied existing quantization logic separately to K and V tensors (sketch below)
3. Validated with perplexity metrics across context lengths
4. Used Metal for acceleration (with the -mlong-calls flag to avoid vectorization issues)
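
For anyone curious what step 2 looks like in spirit, here's a simplified numpy sketch (not llama.cpp's actual quantization kernels): the same group-wise quantizer is applied to K and V, just with independently chosen bit-widths.

    # Simplified sketch of step 2, not llama.cpp's actual cache code: one
    # group-wise symmetric quantizer, applied to K and V at different widths.
    import numpy as np

    def quantize_groupwise(x, bits, block=32):
        """Quantize in blocks of `block` values, one scale per block."""
        qmax = 2 ** (bits - 1) - 1
        blocks = x.reshape(-1, block)
        scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
        scales[scales == 0] = 1.0
        q = np.clip(np.round(blocks / scales), -qmax, qmax)
        return (q * scales).reshape(x.shape)   # dequantized, for comparison

    rng = np.random.default_rng(0)
    K = rng.standard_normal((8, 4, 64)).astype(np.float32)  # (tokens, kv_heads, head_dim)
    V = rng.standard_normal((8, 4, 64)).astype(np.float32)

    K_q = quantize_groupwise(K, bits=8)   # --kvq-key 8
    V_q = quantize_groupwise(V, bits=4)   # --kvq-val 4
    print("mean |error|  K:", np.abs(K - K_q).mean(), " V:", np.abs(V - V_q).mean())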

Benchmarked on an M4 MacBook Pro running TinyLlama with 8K context windows. Compatible with Metal/MPS and optimized for Apple Silicon.
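
For clarity on the metric: "perplexity loss" is the relative change in perplexity (exp of mean negative log-likelihood) between the fp16 cache and the quantized cache on the same text. Mechanically it's just this (log-probs below are made up):

    # What "X% perplexity loss" means mechanically (log-probs are made up).
    import math

    def perplexity(logprobs):
        return math.exp(-sum(logprobs) / len(logprobs))

    fp16_cache = [-2.10, -0.45, -3.20, -1.05, -0.80]   # per-token log-probs
    k8v4_cache = [-2.12, -0.46, -3.22, -1.06, -0.81]

    delta = perplexity(k8v4_cache) / perplexity(fp16_cache) - 1
    print(f"relative perplexity change: {delta:+.2%}")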

GitHub: https://github.com/dipampaul17/KVSplit

1. smcleod No.44009960
+0.86% perplexity is quite a bit at such a small context size though, isn't it? How is it at more reasonable context sizes like 64-128k?
replies(1): >>44010430 #
2. nomel No.44010430
> This means you can run LLMs with 2-3× longer context on the same Mac. Memory usage scales with sequence length, so savings compound as context grows.

The point seems to be that this reduces memory footprint. This makes it possible to run longer context, for the same limited memory, if you couldn't before. Or, you can use that free memory to do something else, like an IDE.

replies(1): >>44010906 #
3. smcleod No.44010906
Yeah, I get that; that's what we use k/v cache quantisation for now, which has a lower impact on PPL than this, unless I'm missing something?
replies(1): >>44011526 #
4. dipampaul17 No.44011526
You're right to question the perplexity impact - 0.86% isn't negligible. Our extended testing shows this impact remains fairly consistent across context lengths up to 16K, which was our test limit.

We haven't benchmarked at 64-128K contexts yet, but theoretically the relative perplexity impact should remain stable. The absolute impact could potentially compound with very long contexts, though.

The key difference from standard KV quantization is the asymmetric approach. Most implementations use K8V8 (8-bit for both) which has a 0.03% perplexity impact but only 47% memory savings. K8V4 pushes this to 59% savings with the 0.86% quality tradeoff.
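
The arithmetic behind those savings figures, assuming llama.cpp-style block quantization (32 elements per fp16 scale, so roughly half a bit of overhead per element; that's my reading of the storage format, not a measurement):

    # Savings vs. an fp16 KV cache. "Effective bits" assumes 32-element blocks
    # with one fp16 scale each, i.e. roughly bits + 0.5 per element.
    def saving(k_bits, v_bits, scale_overhead=16 / 32):
        eff = (k_bits + scale_overhead) + (v_bits + scale_overhead)
        return 1 - eff / (16 + 16)

    for name, kb, vb in [("K8V8", 8, 8), ("K8V4", 8, 4), ("K4V8", 4, 8)]:
        print(f"{name}: ~{saving(kb, vb):.0%} memory saving")

That lands right at the measured 47% and 59%, and it underlines that K8V4 and K4V8 cost exactly the same bits; the 7× quality gap is purely about where the precision goes.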

For reference, the quality impact is still well below the typical 5% threshold where differences become noticeable in generated text. It's a reasonable tradeoff for the additional memory savings, especially at long contexts.

@smcleod - We're using the same underlying quantization methods, just applying them asymmetrically between keys and values. If your existing approach already uses lower precision for values than keys, you're likely getting similar benefits.