
507 points martinald | 2 comments
gpjanik ◴[] No.45052085[source]
"Here's the key insight: each forward pass processes ALL tokens in ALL sequences simultaneously."

This sounds incorrect: you only process all the tokens at once the first time, and then incrementally after that. It's an auto-regressive model, after all.

replies(1): >>45052604 #
1. Voloskaya ◴[] No.45052604[source]
Not during prefill, i.e. the forward pass that produces the very first token of a new conversation. During that pass, all tokens in the context are processed at the same time and the attention keys and values (KV) are cached. You still generate a single token, but you have to compute attention from all tokens to all tokens.
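
A minimal numpy sketch of what that prefill pass looks like (single head, no batching; the function name, shapes, and weight matrices are illustrative assumptions, not anything from the article):

    import numpy as np

    def prefill(x, Wq, Wk, Wv):
        # x: (seq_len, d_model) -- the whole prompt processed at once
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])    # (seq_len, seq_len): all tokens to all tokens, O(N^2)
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)   # causal mask: a token can't attend to later tokens
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        return w @ V, (K, V)                       # attention output + KV cache kept for the decode phase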

From that point on, every subsequent token is processed sequentially in an autoregressive way, but because we have the KV cache, each step becomes O(N) (one token's query against all cached tokens) rather than O(N^2).
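
Continuing that sketch, a decode step (again with purely illustrative names and shapes) only computes a query for the new token and appends its key/value to the cache:

    import numpy as np

    def decode_step(x_new, Wq, Wk, Wv, kv_cache):
        # x_new: (1, d_model) -- just the latest token
        K_cache, V_cache = kv_cache
        q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv
        K = np.concatenate([K_cache, k])           # append to the cache instead of recomputing
        V = np.concatenate([V_cache, v])
        scores = q @ K.T / np.sqrt(K.shape[-1])    # (1, N): one query against all cached keys, O(N)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        return w @ V, (K, V)                       # output + updated cache for the next step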

replies(1): >>45053351 #
2. gpjanik ◴[] No.45053351[source]
I somehow missed the "decode phase" paragraph and hence was confused - that's essentially the separation I meant; you're obviously correct.