
724 points simonw | 9 comments
xnx ◴[] No.44527256[source]
> It’s worth noting that LLMs are non-deterministic,

This is probably better phrased as "LLMs may not provide consistent answers due to changing data and built-in randomness."

Barring rare(?) GPU race conditions, LLMs produce the same output given the same inputs.

replies(7): >>44527264 #>>44527395 #>>44527458 #>>44528870 #>>44530104 #>>44533038 #>>44536027 #
simonw ◴[] No.44527395[source]
I don't think those race conditions are rare. None of the big hosted LLMs provide a temperature=0 plus fixed seed feature which they guarantee won't return different results, despite clear demand for that from developers.
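For what it's worth, a minimal sketch of the closest thing currently on offer, assuming the OpenAI Python client (the model name and prompt are placeholders; the seed parameter is documented as best-effort, not a guarantee):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Best-effort reproducibility: fixed seed plus temperature=0.
    # OpenAI documents seed as "best effort"; the system_fingerprint field
    # is meant to surface backend changes that can still alter the output.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": "Say hello in five words."}],
        temperature=0,
        seed=42,
    )
    print(resp.system_fingerprint)
    print(resp.choices[0].message.content)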
replies(3): >>44527634 #>>44529574 #>>44529823 #
1. toolslive ◴[] No.44529574[source]
I naively assumed (an uninformed guess) that the non-determinism (multiple results possible, even with temperature=0 and a fixed seed) stems from floating point rounding errors propagating through the calculations. How wrong am I?
replies(4): >>44529754 #>>44529801 #>>44529836 #>>44531008 #
2. bmicraft ◴[] No.44529754[source]
They're gonna round the same way each time, as long as you're running it on the same hardware.
replies(1): >>44530559 #
3. williamdclt ◴[] No.44529801[source]
Also uninformed, but I can't see how that would be true: floating point rounding errors are entirely deterministic.
replies(1): >>44531897 #
4. impossiblefork ◴[] No.44529836[source]
With a fixed seed there will be the same floating point rounding errors.

A fixed seed is enough for determinism. You don't need to set temperature=0. Setting temperature=0 also means you aren't sampling at all: you're doing greedy one-step probability maximization, which can make the generated text come out strange for that reason.
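To illustrate the distinction, a toy sketch with hypothetical logits (numpy, not any particular model's decoding code): greedy argmax versus seeded sampling.

    import numpy as np

    # Hypothetical next-token logits for a 5-token vocabulary.
    logits = np.array([2.0, 1.9, 0.5, -1.0, -3.0])

    def softmax(x, temperature=1.0):
        z = x / temperature
        z = z - z.max()  # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    # temperature=0 in practice means greedy decoding: always take the argmax.
    greedy_token = int(np.argmax(logits))

    # With temperature>0 you sample; a fixed seed makes the sampling reproducible.
    rng = np.random.default_rng(seed=42)
    sampled_token = int(rng.choice(len(logits), p=softmax(logits, temperature=0.8)))

    print(greedy_token, sampled_token)  # greedy is fixed; sampled is fixed for this seed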

5. toolslive ◴[] No.44530559[source]
But they're not on the same hardware: requests are scheduled onto whatever infrastructure is available in the cloud. So the code version might be slightly different, the compiler (settings) might differ, and the actual hardware might differ.
6. zahlman ◴[] No.44531008[source]
You may be interested in https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm... .

> The non-determinism at temperature zero, we guess, is caused by floating point errors during forward propagation. Possibly the “not knowing what to do” leads to maximum uncertainty, so that logits for multiple completions are maximally close and hence these errors (which, despite a lack of documentation, GPT insiders inform us are a known, but rare, phenomenon) are more reliably produced.
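A toy illustration of that point, with made-up numbers: when two logits are essentially tied, a perturbation on the scale of float32 rounding error is enough to flip the greedy choice.

    import numpy as np

    # Two nearly tied logits for two candidate tokens (made-up values).
    logits = np.array([5.0000010, 5.0000000], dtype=np.float32)

    # A perturbation of a few float32 ulps at this magnitude is enough to
    # change which token "wins" the argmax at temperature zero.
    noise = np.array([0.0, 2e-6], dtype=np.float32)

    print(int(np.argmax(logits)))          # 0
    print(int(np.argmax(logits + noise)))  # 1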

7. saagarjha ◴[] No.44531897[source]
Not if your scheduler causes accumulation in a different order.
replies(1): >>44533285 #
8. williamdclt ◴[] No.44533285{3}[source]
Are you talking about a DAG of FP calculations, where parallel steps might finish in a different order across different executions? That's getting out of my area of knowledge, but I'd believe it's possible.
replies(1): >>44546301 #
9. saagarjha ◴[] No.44546301{4}[source]
Well, a very simple example: if you run a parallel reduce using atomics, the result will depend on which workers acquire the accumulator first.
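A minimal sketch of why that order matters, simulated on the CPU with numpy rather than real GPU atomics: float addition isn't associative, so adding the same partial results into an accumulator in a different order generally gives a slightly different total.

    import numpy as np

    # Float addition is not associative, so the order in which parallel workers
    # happen to add their partial results into a shared accumulator changes the
    # result. Simulated here by summing the same values in two different orders.
    rng = np.random.default_rng(0)
    values = rng.standard_normal(100_000).astype(np.float32)

    def sequential_sum(xs):
        acc = np.float32(0.0)
        for x in xs:
            acc += x
        return acc

    in_order = sequential_sum(values)
    shuffled = sequential_sum(rng.permutation(values))

    print(in_order, shuffled, in_order == shuffled)  # totals typically differ in the last bits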