    724 points simonw | 17 comments
    xnx ◴[] No.44527256[source]
    > It’s worth noting that LLMs are non-deterministic,

    This is probably better phrased as "LLMs may not provide consistent answers due to changing data and built-in randomness."

    Barring rare(?) GPU race conditions, LLMs produce the same output given the same inputs.

    replies(7): >>44527264 #>>44527395 #>>44527458 #>>44528870 #>>44530104 #>>44533038 #>>44536027 #
    1. simonw ◴[] No.44527395[source]
    I don't think those race conditions are rare. None of the big hosted LLMs provide a temperature=0 plus fixed seed feature which they guarantee won't return different results, despite clear demand for that from developers.
    replies(3): >>44527634 #>>44529574 #>>44529823 #
    2. xnx ◴[] No.44527634[source]
    Fair. I dislike "non-deterministic" as a blanket descriptor for all LLMs, since it implies some type of magic or quantum effect.
    replies(4): >>44527956 #>>44528597 #>>44528690 #>>44529070 #
    3. dekhn ◴[] No.44527956[source]
    I see LLM inference as sampling from a distribution. Multiple details go into that sampling - everything from parameters like temperature, to numerical imprecision, to batch mixing effects, to the next-token-selection approach (always pick max, sample from the posterior distribution, etc.). But ultimately, if it were truly important to get stable outputs, everything I listed above can be engineered away (temp=0, very good numerical control, no batching, and always picking the max-probability next token).

    dekhn from a decade ago cared a lot about stable outputs. dekhn today thinks sampling from a distribution is a far more practical approach for nearly all use cases. I could see it mattering when the false negative rate of a medical diagnostic exceeded a reasonable threshold.
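The knobs listed above are easy to make concrete. A minimal NumPy sketch of next-token selection, with made-up logits standing in for a real model's output (not any particular model's decoder):

```python
import numpy as np

def pick_next_token(logits, temperature=1.0, rng=None):
    """Choose a next-token id from raw logits.

    temperature=0 degenerates to greedy argmax (deterministic);
    otherwise we sample from softmax(logits / temperature).
    """
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0:
        return int(np.argmax(logits))  # always pick the max-probability token
    rng = rng if rng is not None else np.random.default_rng()
    scaled = logits / temperature
    scaled -= scaled.max()             # subtract max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.1]
print(pick_next_token(logits, temperature=0))  # greedy: 0
```

With temperature=0 the result is a pure argmax; any remaining run-to-run variation would have to come from the logits themselves (numerics, batching), not from the sampler.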

    4. basch ◴[] No.44528597[source]
    I agree it's phrased poorly.

    Better said would be: LLMs are designed to act as if they were non-deterministic.

    replies(1): >>44528792 #
    5. tanewishly ◴[] No.44528690[source]
    Errr... that word implies some type of non-deterministic effect, like using a randomizer without specifying the seed (i.e. sampling from a distribution). I mean, stuff like NFAs (non-deterministic finite automata) isn't magic.
    6. ◴[] No.44528792{3}[source]
    7. EdiX ◴[] No.44529070[source]
    Interesting, but in general it does not imply that. For example: https://en.wikipedia.org/wiki/Nondeterministic_finite_automa...
    8. toolslive ◴[] No.44529574[source]
    I naively assumed (an uninformed guess) that the non-determinism (multiple results possible, even with temperature=0 and a fixed seed) stems from floating point rounding errors propagating through the calculations. How wrong am I?
    replies(4): >>44529754 #>>44529801 #>>44529836 #>>44531008 #
    9. bmicraft ◴[] No.44529754[source]
    They're gonna round the same each time you're running it on the same hardware.
    replies(1): >>44530559 #
    10. williamdclt ◴[] No.44529801[source]
    Also uninformed but I can't see how that would be true, floating point rounding errors are entirely deterministic
    replies(1): >>44531897 #
    11. diggan ◴[] No.44529823[source]
    > despite clear demand for that from developers

    Theorizing about why that is: could it be that they can't do deterministic inference and batching at the same time? If so, the reason we see them avoiding it is that offering determinism would require giving up batching, which would shoot up costs.
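A toy illustration of why batch composition could matter: regrouping the same float32 values changes which small terms get absorbed by a large one. This is a stand-in for how batching can change reduction order, not a claim about any actual serving stack:

```python
import numpy as np

# The same 101 numbers, summed with two different groupings.
values = np.array([1e8] + [1.0] * 100, dtype=np.float32)

# One at a time: each 1.0 is absorbed into 1e8
# (1.0 is below half a float32 ulp at 1e8, which is 8).
one_at_a_time = np.float32(0.0)
for v in values:
    one_at_a_time = np.float32(one_at_a_time + v)

# Small values grouped first, then added to the large one.
ones_first = np.float32(0.0)
for v in values[1:]:
    ones_first = np.float32(ones_first + v)
ones_first = np.float32(values[0] + ones_first)

print(one_at_a_time == ones_first)  # False: the groupings disagree
```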

    12. impossiblefork ◴[] No.44529836[source]
    With a fixed seed there will be the same floating point rounding errors.

    A fixed seed is enough for determinism. You don't need to set temperature=0. Setting temperature=0 also means you aren't sampling: you're doing greedy one-step probability maximization, which can make the text end up strange for that reason.
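In a toy sampler the claim is easy to check: reseeding the generator reproduces the whole sampled token sequence even at temperature > 0. A sketch with fixed, made-up logits standing in for a model (assuming the model side is itself bitwise-stable):

```python
import numpy as np

def sample_sequence(seed, steps=8, temperature=0.8):
    # Fixed per-step logits stand in for a real model's outputs.
    logits = np.array([1.5, 1.2, 0.3, 0.1])
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    rng = np.random.default_rng(seed)
    return [int(rng.choice(len(probs), p=probs)) for _ in range(steps)]

# Same seed, same sequence, even though temperature > 0.
print(sample_sequence(123) == sample_sequence(123))  # True
```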

    13. toolslive ◴[] No.44530559{3}[source]
    but they're not: they are scheduled on some infrastructure in the cloud. So the code version might be slightly different, the compiler (settings) might differ, and the actual hardware might differ.
    14. zahlman ◴[] No.44531008[source]
    You may be interested in https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm... .

    > The non-determinism at temperature zero, we guess, is caused by floating point errors during forward propagation. Possibly the “not knowing what to do” leads to maximum uncertainty, so that logits for multiple completions are maximally close and hence these errors (which, despite a lack of documentation, GPT insiders inform us are a known, but rare, phenomenon) are more reliably produced.

    15. saagarjha ◴[] No.44531897{3}[source]
    Not if your scheduler causes accumulation in a different order.
    replies(1): >>44533285 #
    16. williamdclt ◴[] No.44533285{4}[source]
    Are you talking about a DAG of FP calculations, where parallel steps might finish in different order across different executions? That's getting out of my area of knowledge, but I'd believe it's possible
    replies(1): >>44546301 #
    17. saagarjha ◴[] No.44546301{5}[source]
    Well, a very simple example: if you run a parallel reduce using atomics, the result will depend on which workers acquire the accumulator first.
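The order dependence is just floating point non-associativity; no concurrency is needed to see it. The classic 2^53 example in pure Python, with the two orderings standing in for which worker's partial sum reaches the accumulator first:

```python
big = 2.0 ** 53  # smallest double whose ulp is 2: adding 1.0 is a round-to-even tie

# "Worker A" adds its value to the accumulator first: the 1.0 is absorbed.
order_a = (big + 1.0) - big
# "Worker B"'s partial sum lands first: the 1.0 survives.
order_b = big + (1.0 - big)

print(order_a, order_b)  # 0.0 1.0
```

Same three operands, different grouping, different answer - which is exactly what a scheduler-dependent reduction order produces at scale.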