
724 points simonw | 3 comments
xnx No.44527256
> It’s worth noting that LLMs are non-deterministic,

This is probably better phrased as "LLMs may not provide consistent answers due to changing data and built-in randomness."

Barring rare(?) GPU race conditions, LLMs produce the same output given the same inputs (same prompt, sampling parameters, and seed).

msgodel No.44527264
I run my local LLMs with a seed of one. If I re-run my "ai" command (which starts a conversation with its parameters as a prompt), I get exactly the same output every single time.
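
For reference, a minimal sketch of fixed-seed, non-sampling local generation (this assumes Hugging Face transformers and gpt2 as a stand-in model; the "ai" wrapper mentioned above isn't shown, so this is purely illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(1)  # fixed seed, analogous to "a seed of one"; only matters if sampling is enabled

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The capital of France is", return_tensors="pt")
# Greedy decoding (do_sample=False) removes sampling randomness entirely,
# so repeated runs on the same hardware and software stack produce identical output.
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```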
xnx No.44527284
Yes. This is what I was trying to say. Saying "It’s worth noting that LLMs are non-deterministic" is wrong and should be changed in the blog post.
boroboro4 No.44527462
You're correct for batch size 1 (which is what local inference uses), but not for the production use case, where multiple requests get batched together (and that's how all the providers run inference).

With batching, the matrix shapes and a request's position within the batch aren't deterministic, and this leads to non-deterministic results regardless of sampling temperature/seed.

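A toy illustration of the batching effect (a sketch in PyTorch, not any provider's actual serving stack; whether and by how much the results differ depends on the hardware and on the matmul kernels selected for each shape, and it is most visible on GPUs):

```python
import torch

torch.manual_seed(0)
W = torch.randn(4096, 4096)
x = torch.randn(1, 4096)                         # one "request"
others = torch.randn(7, 4096)                    # unrelated requests sharing the batch
batch = torch.cat([x, others], dim=0)

y_alone = x @ W                                  # request served on its own
y_batched = (batch @ W)[0:1]                     # the same request inside a larger batch

print(torch.equal(y_alone, y_batched))           # may be False: different shapes can pick different kernels/tilings
print((y_alone - y_batched).abs().max().item())  # typically zero or a tiny last-bit difference
```
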
unsnap_biceps No.44527524
Isn't that true only if the batches are different? If you run exactly the same batch, you're back to a deterministic result.

With a black-box API, just because you don't know how the result is computed doesn't mean it's non-deterministic. It's the underlying algorithm that determines that, and an LLM is deterministic.

boroboro4 No.44527543
Providers never run the same batches, because they mix requests from different clients; otherwise the GPUs would be severely underutilized.

It's inherently non-deterministic because it reflects the reality of different requests arriving at the servers at the same time. And I don't believe there are any realistic workarounds if you want to keep costs reasonable.

Edit: there might be workarounds if matmul algorithms gave stronger guarantees than they do today (invariance under row/column swaps). I'm not expert enough to say how feasible that is, especially in the quantized scenario.
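
A tiny sketch of why such guarantees are hard to get: floating-point addition isn't associative, so summing the same values in a different order (which is effectively what a different batch layout does inside a matmul reduction) can already change the low bits. This is a generic illustration, not a claim about any specific kernel:

```python
import torch

torch.manual_seed(0)
v = torch.randn(100_000, dtype=torch.float32)

s1 = v.sum()                               # one accumulation order
s2 = v[torch.randperm(v.numel())].sum()    # same values, different order

print(s1.item(), s2.item())                # usually differ slightly in the last digits
print(torch.equal(s1, s2))                 # often False
```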