1. mutkach No.45053490
A full KV cache is quite big compared to the weights of the model (depending on the context length), so that should be a factor too, and you basically need to maintain a separate KV cache for each request, I think. Also, the tokens/s rate is not uniform across a request; generation gets slower with each subsequent token.
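
For a rough sense of scale, here's a back-of-the-envelope sketch. The model dimensions, context length, and batch size below are my assumptions (Llama-2-7B-ish, no GQA), not numbers from the thread:

```python
# Back-of-the-envelope KV-cache size (all dims are assumptions).
n_layers = 32        # transformer layers (assumed)
n_kv_heads = 32      # KV heads; no grouped-query attention assumed
head_dim = 128       # per-head dimension (assumed)
bytes_per_elem = 2   # fp16/bf16 cache

# K and V each store n_kv_heads * head_dim values per layer per token.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

context_len = 32_768  # assumed long context
batch = 8             # separate cache per concurrent request

per_request_gb = kv_bytes_per_token * context_len / 1e9
print(f"{per_request_gb:.1f} GB per request")            # ~17.2 GB
print(f"{per_request_gb * batch:.1f} GB for the batch")  # ~137 GB, vs ~14 GB of fp16 weights
```

Under these assumptions a single long-context request's cache already exceeds the fp16 weights, and it scales linearly with concurrent requests.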

On the other hand, speculative decoding is an insane booster: it can push decoding toward a semi-prefill rate. But the memory pressure is still a factor.
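
A toy sketch of the idea, assuming a draft/target split (the stand-in models, the fixed acceptance rate, and the function names here are all hypothetical; real implementations verify drafts against the target model's distribution via rejection sampling):

```python
import random

def draft_model(prefix, k):
    # Hypothetical cheap model: proposes k candidate tokens at once.
    return [random.randint(0, 99) for _ in range(k)]

def target_accepts(prefix, token):
    # Hypothetical stand-in for the target model's verification;
    # a flat ~70% acceptance rate is assumed for illustration.
    return random.random() < 0.7

def generate(prefix, n_tokens, k=4):
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        draft = draft_model(out, k)
        # One target-model forward pass can score all k drafts in
        # parallel, so accepted tokens cost roughly one decode step
        # instead of k sequential steps.
        for tok in draft:
            if target_accepts(out, tok):
                out.append(tok)
            else:
                # On rejection, fall back to the target's own sample
                # and restart drafting from there.
                out.append(random.randint(0, 99))
                break
    return out[len(prefix):len(prefix) + n_tokens]

print(generate([1, 2, 3], n_tokens=16))
```

The win comes from verifying several draft tokens per target-model pass; the cost is that you now hold weights (and possibly KV caches) for two models, which is why memory pressure doesn't go away.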

I would be happy to be corrected regarding both factors.