
507 points martinald | 2 comments
gpjanik ◴[] No.45052085[source]
"Here's the key insight: each forward pass processes ALL tokens in ALL sequences simultaneously."

This sounds incorrect: you only process all the tokens at once the first time, and then incrementally after that. It's an auto-regressive model, after all.

replies(1): >>45052604 #
1. Voloskaya ◴[] No.45052604[source]
Not during prefill, i.e. the forward pass that produces the very first token of a new conversation. During that pass, all tokens in the context are processed at the same time and the attention keys and values (KV) are cached. You still generate a single token, but you have to compute attention from all tokens to all tokens.
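
A minimal numpy sketch of what that prefill pass looks like (single head, no batching; the function name, shapes, and weight matrices are illustrative assumptions, not anything from the article):

    import numpy as np

    def prefill(x, Wq, Wk, Wv):
        # x: (seq_len, d_model) -- the whole prompt processed at once
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])    # (seq_len, seq_len): all tokens to all tokens, O(N^2)
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)   # causal mask: a token can't attend to later tokens
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        return w @ V, (K, V)                       # attention output + KV cache kept for the decode phase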

From that point on, every subsequent token is processed sequentially in an autoregressive way, but because we have the KV cache, each step becomes O(N) (one token's query against all cached tokens) rather than O(N^2).
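
Continuing that sketch, a decode step (again with purely illustrative names and shapes) only computes a query for the new token and appends its key/value to the cache:

    import numpy as np

    def decode_step(x_new, Wq, Wk, Wv, kv_cache):
        # x_new: (1, d_model) -- just the latest token
        K_cache, V_cache = kv_cache
        q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv
        K = np.concatenate([K_cache, k])           # append to the cache instead of recomputing
        V = np.concatenate([V_cache, v])
        scores = q @ K.T / np.sqrt(K.shape[-1])    # (1, N): one query against all cached keys, O(N)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        return w @ V, (K, V)                       # output + updated cache for the next step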

replies(1): >>45053351 #
2. gpjanik ◴[] No.45053351[source]
I somehow missed the "decode phase" paragraph and hence was confused - that's essentially the separation I meant; you're obviously correct.