Most active commenters
  • troupo(8)
  • xnx(3)
  • msgodel(3)
  • simonw(3)
  • boroboro4(3)
  • DemocracyFTW2(3)
  • saagarjha(3)
  • llm_nerd(3)

724 points simonw | 60 comments
1. xnx ◴[] No.44527256[source]
> It’s worth noting that LLMs are non-deterministic,

This is probably better phrased as "LLMs may not provide consistent answers due to changing data and built-in randomness."

Barring rare(?) GPU race conditions, LLMs produce the same output given the same inputs.

replies(7): >>44527264 #>>44527395 #>>44527458 #>>44528870 #>>44530104 #>>44533038 #>>44536027 #
2. msgodel ◴[] No.44527264[source]
I run my local LLMs with a seed of one. If I re-run my "ai" command (which starts a conversation with its parameters as a prompt) I get exactly the same output every single time.
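
For illustration, a minimal sketch of that kind of pinned-seed local call, assuming an Ollama-style server on the default port (endpoint and option names are Ollama's; adapt for whatever runner backs the "ai" command):

  # hedged sketch: pin seed and temperature so the same prompt
  # reproduces the same completion on a local Ollama server
  import json, urllib.request

  payload = {
      "model": "llama3",  # example model name
      "prompt": "Explain determinism in one sentence.",
      "stream": False,
      "options": {"seed": 1, "temperature": 0},
  }
  req = urllib.request.Request(
      "http://localhost:11434/api/generate",
      data=json.dumps(payload).encode(),
      headers={"Content-Type": "application/json"},
  )
  print(json.loads(urllib.request.urlopen(req).read())["response"])
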
replies(2): >>44527284 #>>44527453 #
3. xnx ◴[] No.44527284[source]
Yes. This is what I was trying to say. Saying "It’s worth noting that LLMs are non-deterministic" is wrong and should be changed in the blog post.
replies(3): >>44527462 #>>44528765 #>>44529031 #
4. simonw ◴[] No.44527395[source]
I don't think those race conditions are rare. None of the big hosted LLMs provide a temperature=0 plus fixed seed feature which they guarantee won't return different results, despite clear demand for that from developers.
replies(3): >>44527634 #>>44529574 #>>44529823 #
5. lgessler ◴[] No.44527453[source]
In my (poor) understanding, this can depend on hardware details. What are you running your models on? I haven't paid close attention to this with LLMs, but I've tried very hard to get deterministic behavior out of my training runs for other kinds of transformer models and was never able to on my 2080, 4090, or an A100. PyTorch docs have a note saying that in general it's impossible: https://docs.pytorch.org/docs/stable/notes/randomness.html

Inference on a generic LLM may not be subject to these non-determinisms even on a GPU though, idk

replies(1): >>44533405 #
6. kcb ◴[] No.44527458[source]
FP multiplication is non-associative.
replies(2): >>44527482 #>>44528992 #
7. boroboro4 ◴[] No.44527462{3}[source]
You're correct for batch size 1 (local is batch size 1), but not in the production use case, where multiple requests get batched together (and that's how all the providers do it).

With batching, the matrix shapes and a request's position within the batch aren't deterministic, and that leads to non-deterministic results regardless of sampling temperature/seed.

replies(1): >>44527524 #
8. boroboro4 ◴[] No.44527482[source]
That doesn't make it non-deterministic on its own, though.

But it does when coupled with non-deterministic request batching, which is what happens in practice.
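
A tiny demo of the distinction (order sensitivity vs. non-determinism), using float32 addition and numpy purely for illustration:

  # hedged demo: float32 accumulation is order-sensitive, yet any
  # *fixed* evaluation order is perfectly repeatable
  import numpy as np

  xs = np.random.default_rng(0).standard_normal(1_000_000).astype(np.float32)
  print(xs.sum() == xs.sum())        # True: same order, same bits
  print(xs.sum() == xs[::-1].sum())  # may be False: different order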

9. unsnap_biceps ◴[] No.44527524{4}[source]
Isn't that true only if the batches are different? If you run exactly the same batch, you're back to a deterministic result.

Even with a black-box API, just because you don't know how the output is calculated doesn't mean it's non-deterministic. The underlying algorithm determines that, and an LLM is deterministic.

replies(1): >>44527543 #
10. boroboro4 ◴[] No.44527543{5}[source]
Providers never run the same batches, because they mix requests from different clients; otherwise the GPUs would be severely underutilized.

It's inherently non-deterministic because it reflects the reality of different requests arriving at the servers at the same time. And I don't believe there are any realistic workarounds if you want to keep costs reasonable.

Edit: there might be workarounds if matmul algorithms gave stronger guarantees than they do today (invariance under row/column swaps). I'm not enough of an expert to say how feasible that is, especially in the quantized scenario.
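
A hedged sketch of the batch-position effect in plain PyTorch; whether the bits actually differ depends on the backend and kernel selection:

  # the same input row, multiplied alone vs. packed into a larger batch,
  # may come back with different low-order bits on some backends
  import torch

  torch.manual_seed(0)
  W = torch.randn(4096, 4096)
  x = torch.randn(1, 4096)
  alone = x @ W
  batched = torch.cat([x, torch.randn(7, 4096)]) @ W
  print(torch.equal(alone[0], batched[0]))  # True or False, backend-dependent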

11. xnx ◴[] No.44527634[source]
Fair. I dislike "non-deterministic" as a blanket descriptor for all LLMs since it implies some type of magic or quantum effect.
replies(4): >>44527956 #>>44528597 #>>44528690 #>>44529070 #
12. dekhn ◴[] No.44527956{3}[source]
I see LLM inference as sampling from a distribution. Multiple details go into that sampling - everything from parameters like temperature to numerical imprecision to batch mixing effects as well as the next-token-selection approach (always pick max, sample from the posterior distribution, etc). But ultimately, if it was truly important to get stable outputs, everything I listed above can be engineered (temp=0, very good numerical control, not batching, and always picking the max probability next token).

dekhn from a decade ago cared a lot about stable outputs. dekhn today thinks sampling from a distribution is a far more practical approach for nearly all use cases. I could see it mattering when the false negative rate of a medical diagnostic exceeded a reasonable threshold.
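
A minimal sketch of the two next-token-selection approaches mentioned above (names and shapes are illustrative, numpy only):

  # greedy argmax is deterministic; sampling adds the controlled randomness
  import numpy as np

  def next_token(logits, temperature=1.0, rng=None):
      logits = np.asarray(logits, dtype=np.float64)
      if temperature == 0:
          return int(np.argmax(logits))             # always pick max
      probs = np.exp((logits - logits.max()) / temperature)
      probs /= probs.sum()
      rng = rng if rng is not None else np.random.default_rng()
      return int(rng.choice(len(logits), p=probs))  # sample from the distribution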
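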

13. basch ◴[] No.44528597{3}[source]
I agree it's phrased poorly.

Better said: LLMs are designed to act as if they were non-deterministic.

replies(1): >>44528792 #
14. tanewishly ◴[] No.44528690{3}[source]
Errr... that word implies some type of non-deterministic effect, like using a randomizer without specifying the seed (i.e. sampling from a distribution). I mean, stuff like NFAs (non-deterministic finite automata) isn't magic.
15. TheDong ◴[] No.44528765{3}[source]
> Saying "It’s worth noting that LLMs are non-deterministic" is wrong and should be changed in the blog post.

Every person in this thread understood that Simon meant "Grok, ChatGPT, and other common LLM interfaces run with a temperature>0 by default, and thus non-deterministically produce different outputs for the same query".

Sure, he wrote a shorter version of that, and because of that y'all can split hairs on the details ("yes it's correct for how most people interact with LLMs and for grok, but _technically_ it's not correct").

The point of English blog posts is not to be a long wall of logical propositions; it's to convey ideas and information. The current wording seems fine to me.

The point of what he was saying was to caution readers "you might not get this if you try to repro it", and that is 100% correct.

replies(2): >>44529058 #>>44530499 #
16. ◴[] No.44528792{4}[source]
17. troupo ◴[] No.44528870[source]
> Barring rare(?) GPU race conditions, LLMs produce the same output given the same inputs.

Are these LLMs in the room with us?

Not a single LLM available as a SaaS is deterministic.

As for other models: I've only run ollama locally, and it, too, provided different answers for the same question five minutes apart.

Edit/update: the output of every LLM available as a SaaS is non-deterministic, especially when used from a UI. Pointing out that you could probably run a tightly controlled model in a tightly controlled environment to achieve deterministic output is entirely irrelevant when describing the output of Grok in situations where the user has no control over it.

replies(5): >>44528884 #>>44528892 #>>44528898 #>>44528952 #>>44528971 #
18. fooker ◴[] No.44528884[source]
> Not a single LLM available as a SaaS is deterministic.

Lower the temperature parameter.

replies(2): >>44528930 #>>44529115 #
19. eightysixfour ◴[] No.44528892[source]
The models themselves are mathematically deterministic. We add randomness during the sampling phase, which you can turn off when running the models locally.

The SaaS APIs are sometimes non-deterministic due to caching strategies and load balancing between experts on MoE models. However, if you took that model and executed it in a single-user environment, it could be run deterministically.
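
For the local single-user case, a hedged sketch with Hugging Face transformers (gpt2 stands in for any causal LM):

  # do_sample=False means greedy decoding: repeated runs return the same text
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")
  ids = tok("The capital of France is", return_tensors="pt").input_ids
  out = model.generate(ids, do_sample=False, max_new_tokens=8)
  print(tok.decode(out[0], skip_special_tokens=True))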

replies(1): >>44528944 #
20. moralestapia ◴[] No.44528898[source]
True.

I'm now wondering, would it be desirable to have deterministic outputs on an LLM?

21. troupo ◴[] No.44528930{3}[source]
So, how does one do it outside of APIs in the context we're discussing? In the UI or when invoking @grok in X?

How do we also turn off all the intermediate layers in between that we don't know about, like "always rant about white genocide in South Africa" or "crash when the user mentions David Meyer"?

replies(1): >>44530946 #
22. troupo ◴[] No.44528944{3}[source]
> However, if you took that model and executed it in single user environment,

Again, are those environments in the room with us?

In the context of the article, is the model executed in such an environment? Do we even know anything about the environment, randomness, sampling, and anything in between, or have any control over it (see e.g. https://news.ycombinator.com/item?id=44528930)?

replies(1): >>44531825 #
23. DemocracyFTW2 ◴[] No.44528952[source]
Akchally... Strictly speaking, and to the best of my understanding, LLMs are deterministic in the sense that a dice roll is deterministic; the randomness comes from insufficient knowledge about its internal state. Use a constant seed and run the model with the same sequence of questions, and you will get the same answers. It's possible that interactions with other users who use the model in parallel could influence the outcome, but given that the state-of-the-art technique for providing memory and context is to re-submit the entirety of the current chat, I'd doubt that.

One hint that what I surmise is in fact true can be gleaned from those text-to-image generators that allow seeds to be set; you still don't get a 'linear', predictable (but hopefully somewhat sensible) relation between prompt and output, but each (seed, prompt) pair will always give the same sequence of images.
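
The dice-roll analogy in miniature, with an explicit seed standing in for full knowledge of the internal state:

  # same seed, same sequence of "rolls" -- the randomness is only apparent
  import random

  random.seed(42); a = [random.randint(1, 6) for _ in range(5)]
  random.seed(42); b = [random.randint(1, 6) for _ in range(5)]
  print(a == b)  # True
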
24. orbital-decay ◴[] No.44528971[source]
> Not a single LLM available as a SaaS is deterministic.

Gemini Flash has deterministic outputs, assuming you're referring to temperature 0 (obviously). Gemini Pro seems to be deterministic within the same kernel (?) but is likely switching between a few different kernels back and forth, depending on the batch or some other internal grouping.

replies(1): >>44529041 #
25. DemocracyFTW2 ◴[] No.44528992[source]
That's like how you can't deduce the input t from a cryptographic hash h, but the same input always gives you the same hash, so t->h is deterministic. h->t is, in practice, not a path you can or want to walk (because it's so expensive), and because there may be, indeed must be, collisions (given that a typical hash is much smaller than the typical input), the inverse is not h->t with a single input but h->{t1,t2,...}, a practically open set of possible inputs, which is still deterministic.
26. DemocracyFTW2 ◴[] No.44529031{3}[source]
"Non-deterministic" in the sense that a dice roll is when you don't know every parameter with ultimate precision. On one hand I find insistence on the wrongness on the phrase a bit too OCD, on the other I must agree that a very simple re-phrasing like "appears {non-deterministic|random|unpredictable} to an outside observer" would've maybe even added value even for less technically-inclined folks, so yeah.
27. troupo ◴[] No.44529041{3}[source]
And is the author of the original article running Gemini Flash/Gemini Pro through an API where he can control the temperature? Can kernels be controlled by the user? Can any of those be controlled through the UIs/APIs that most of these LLMs are invoked from?

> but is likely switching between a few different kernels back and forth, depending on the batch or some other internal grouping.

So you're literally saying it's non-deterministic

replies(1): >>44529068 #
28. root_axis ◴[] No.44529058{4}[source]
Still, the statement that LLMs are non-deterministic is incorrect and could mislead some people who simply aren't familiar with how they work.

Better phrasing would be something like "It's worth noting that LLM products are typically operated in a manner that produces non-deterministic output for the user"

replies(2): >>44529211 #>>44529618 #
29. orbital-decay ◴[] No.44529068{4}[source]
The only thing I'm saying is that there is a SaaS model that would give you the same output for the same input, over and over. You just seem to be arguing for the sake of arguing, especially considering that non-determinism is a red herring to begin with, and not a thing to care about for practical use (that's why providers usually don't bother with guaranteeing it). The only reason it was mentioned in the article is because the author is basically reverse engineering a particular model.
replies(1): >>44532061 #
30. EdiX ◴[] No.44529070{3}[source]
Interesting, but in general it does not imply that. For example: https://en.wikipedia.org/wiki/Nondeterministic_finite_automa...
31. pydry ◴[] No.44529115{3}[source]
It's not enough. I've done this and still often gotten different results for the same question.
32. Veen ◴[] No.44529211{5}[source]
Simon would be less engaging if he caveated every generalisation in that way. It’s one of the main reasons academic writing is often tedious to read.
33. toolslive ◴[] No.44529574[source]
I naively (an uninformed guess) assumed the non-determinism (multiple results possible, even with temperature=0 and a fixed seed) stems from floating point rounding errors propagating through the calculations. How wrong am I?
replies(4): >>44529754 #>>44529801 #>>44529836 #>>44531008 #
34. antonvs ◴[] No.44529618{5}[source]
> It's worth noting that LLM products are typically operated in a manner that produces non-deterministic output for the user

Or you could abbreviate this by saying “LLMs are non-deterministic.” Yes, it requires some shared context with the audience to interpret correctly, but so does every text.

35. bmicraft ◴[] No.44529754{3}[source]
They're gonna round the same each time you're running it on the same hardware.
replies(1): >>44530559 #
36. williamdclt ◴[] No.44529801{3}[source]
Also uninformed, but I can't see how that would be true; floating point rounding errors are entirely deterministic.
replies(1): >>44531897 #
37. diggan ◴[] No.44529823[source]
> despite clear demand for that from developers

Theorizing about why that is: could it be that they can't do deterministic inference and batching at the same time, so the reason we see them avoid it is that it would require them to stop batching, which would shoot up costs?

38. impossiblefork ◴[] No.44529836{3}[source]
With a fixed seed there will be the same floating point rounding errors.

A fixed seed is enough for determinism. You don't need to set temperature=0. Setting temperature=0 also means that you aren't sampling, which means that you're doing greedy one-step probability maximization which might mean that the text ends up strange for that reason.
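
A hedged PyTorch sketch of that point: with a seeded generator, sampling at temperature > 0 is still exactly repeatable:

  # a seeded generator makes sampling repeatable without temperature=0
  import torch

  def sample(logits, temperature, seed):
      g = torch.Generator().manual_seed(seed)
      probs = torch.softmax(logits / temperature, dim=-1)
      return torch.multinomial(probs, 1, generator=g).item()

  logits = torch.tensor([1.0, 2.0, 0.5])
  print(sample(logits, 0.8, seed=1) == sample(logits, 0.8, seed=1))  # True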

39. TOMDM ◴[] No.44530104[source]
I think the better statement is likely "LLMs are typically not executed in a deterministic manner", since you're right: there are no non-deterministic properties inherent to the models themselves that I'm aware of.
40. msgodel ◴[] No.44530499{4}[source]
My temperature is set higher than zero as well. That doesn't make them nondeterministic.
replies(1): >>44531909 #
41. toolslive ◴[] No.44530559{4}[source]
but they're not: they are scheduled on some infrastructure in the cloud. So the code version might be slightly different, the compiler (settings) might differ, and the actual hardware might differ.
42. marcinzm ◴[] No.44530946{4}[source]
Grok is not deterministic would then be the correct statement.
replies(1): >>44532080 #
43. zahlman ◴[] No.44531008{3}[source]
You may be interested in https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm... .

> The non-determinism at temperature zero, we guess, is caused by floating point errors during forward propagation. Possibly the “not knowing what to do” leads to maximum uncertainty, so that logits for multiple completions are maximally close and hence these errors (which, despite a lack of documentation, GPT insiders inform us are a known, but rare, phenomenon) are more reliably produced.

44. mathiaspoint ◴[] No.44531825{4}[source]
It's very poor communication. They absolutely do not have to be non-deterministic.
replies(1): >>44532052 #
45. saagarjha ◴[] No.44531897{4}[source]
Not if your scheduler causes accumulation in a different order.
replies(1): >>44533285 #
46. saagarjha ◴[] No.44531909{5}[source]
I would hope that your temperature is set higher than zero.
47. troupo ◴[] No.44532052{5}[source]
The output of all these systems, when used by people through anything other than the API, is non-deterministic.
replies(1): >>44537068 #
48. troupo ◴[] No.44532061{5}[source]
> especially considering that non-determinism is a red herring to begin with, and not a thing to care about for practical use

On the contrary, it really is important in practical use, because it's impossible to talk about the things in the original article without being able to consistently reproduce results.

Also, in almost all situations you really do want deterministic output (remember how "do what I want and what is expected" was an important property of computer systems? Good times)

> The only reason it was mentioned in the article is because the author is basically reverse engineering a particular model.

The author is attempting to reverse engineer the model, the randomness and the temperature, the system prompts and the training set, and all the possible layers added by xAI in between, and is still getting non-deterministic output.

HN: no-no-no, you don't understand, it's 100% deterministic and it doesn't matter

49. troupo ◴[] No.44532080{5}[source]
When used through the UI, like the author does, Grok isn't. OpenAI isn't. Gemini isn't.
50. llm_nerd ◴[] No.44533038[source]
That non-determinism claim, along with the rather ludicrous claim that this is all just some accidental self-awareness of the model or something (rather than Elon clearly and obviously sticking his fat fingers into the machine), makes the linked piece technically dubious.

A baked LLM is 100% deterministic. It is a straightforward set of matrix algebra with a perfectly deterministic output at a base state. There is no magic quantum mystery machine happening in the model. We add randomization -- the seed or temperature -- as a value-add, to randomize the outputs with the intention of giving creativity. So while it might be true that "in the customer-facing default state an LLM gives non-deterministic output", this is not some base truth about LLMs.

replies(1): >>44533633 #
51. williamdclt ◴[] No.44533285{5}[source]
Are you talking about a DAG of FP calculations, where parallel steps might finish in different order across different executions? That's getting out of my area of knowledge, but I'd believe it's possible
replies(1): >>44546301 #
52. msgodel ◴[] No.44533405{3}[source]
Ah. I've typically avoided CUDA except for a couple of really big jobs so I haven't noticed this.
53. simonw ◴[] No.44533633[source]
LLMs work using huge amounts of matrix multiplication.

Floating point multiplication is non-associative:

  a = 0.1, b = 0.2, c = 0.3
  a * (b * c) = 0.006
  (a * b) * c = 0.006000000000000001
Almost all serious LLMs are deployed across multiple GPUs and have operations executed in batches for efficiency.

As such, the order in which those multiplications are run depends on all sorts of factors. There are no guarantees of operation order, which means non-associative floating point operations play a role in the final result.

This means that, in practice, most deployed LLMs are non-deterministic even with a fixed seed.

That's why vendors don't offer seed parameters accompanied by a promise that it will result in deterministic results - because that's a promise they cannot keep.

Here's an example: https://cookbook.openai.com/examples/reproducible_outputs_wi...

> Developers can now specify seed parameter in the Chat Completion request to receive (mostly) consistent outputs. [...] There is a small chance that responses differ even when request parameters and system_fingerprint match, due to the inherent non-determinism of our models.
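
For completeness, a hedged sketch of how that seed parameter is used with the official Python SDK (model name is just an example; the "mostly" in OpenAI's wording is the point):

  # seed + temperature=0 makes repeat runs *mostly* identical; compare
  # system_fingerprint across runs to detect backend changes
  from openai import OpenAI

  client = OpenAI()
  resp = client.chat.completions.create(
      model="gpt-4o-mini",  # example model
      messages=[{"role": "user", "content": "Say hi"}],
      seed=12345,
      temperature=0,
  )
  print(resp.system_fingerprint, resp.choices[0].message.content)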

replies(2): >>44534555 #>>44536746 #
54. ◴[] No.44534555{3}[source]
55. spindump8930 ◴[] No.44536027[source]
The many sources of stochastic/non-deterministic behavior have been mentioned in other replies but I wanted to point out this paper: https://arxiv.org/abs/2506.09501 which analyzes the issues around GPU non determinism (once sampling and batching related effects are removed).

One important take-away is that these issues are more likely in longer generations so reasoning models can suffer more.

56. llm_nerd ◴[] No.44536746{3}[source]
>That's why vendors don't offer seed parameters accompanied by a promise that it will result in deterministic results - because that's a promise they cannot keep.

They absolutely can keep such a promise, which anyone who has worked with LLMs could confirm. I can run a sequence of tokens through a large LLM thousands of times and get identical results every time (and have done precisely this! In fact, in one situation it was a QA test I built). I could run it millions of times and get exactly the same final layer every single time.

They don't want to keep such a promise because it limits flexibility and optimizations available when doing things at a very large scale. This is not an LLM thing, and saying "LLMs are non-deterministic" is simply wrong, even if you can find an LLM purveyor who decided to make choices where they no longer have any interest in such an outcome. And FWIW, non-associative floating point arithmetic is usually not the reason.

It's like claiming that a chef cannot do something that McDonalds and Burger King don't do, using those purveyors as an example of what is possible when cooking. Nothing works like that.
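
A hedged sketch of a QA check along those lines; `generate` is a placeholder for however you invoke the model:

  # run the same prompt N times and assert every output hashes identically
  import hashlib

  def output_hash(generate, prompt):
      return hashlib.sha256(generate(prompt).encode()).hexdigest()

  def assert_deterministic(generate, prompt, runs=5):
      hashes = {output_hash(generate, prompt) for _ in range(runs)}
      assert len(hashes) == 1, f"got {len(hashes)} distinct outputs"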

replies(1): >>44536756 #
57. simonw ◴[] No.44536756{4}[source]
If not non-associative floating point, what's the reason?
replies(1): >>44542213 #
58. troupo ◴[] No.44537068{6}[source]
I would also assume that in the vast majority of cases people don't set temperature to zero even with API calls.

And even if you do set it to zero, you never know what changes to the layers and layers of wrappers and system prompts you will run into on any given day resulting in "on this day we crash for certain input, and on other days we don't": https://www.techdirt.com/2024/12/03/the-curious-case-of-chat...

59. llm_nerd ◴[] No.44542213{5}[source]
There are a huge number of reasons in large-scale systems. Batch sizes when hitting MoE systems (which are basically all LLMs now) lead to routing variations. Consecutive submissions could be routed to entirely different hardware, software, and even quantization levels! Repeated resubmissions could even hit different variations of a model.

No one targets determinism, because randomness/"creativity" in LLMs is considered a prime feature, so there is zero reason to avoid variation; but that isn't some core property of LLMs.
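
A toy illustration of the batch-dependent routing effect, assuming capacity-limited top-1 routing (purely illustrative, not any provider's implementation):

  # whether a token reaches its preferred expert can depend on which other
  # tokens share the batch and fill that expert's capacity first
  import torch

  def route_top1(scores, capacity):  # scores: [tokens, experts]
      preferred = scores.argmax(dim=-1)
      load = torch.zeros(scores.shape[1], dtype=torch.long)
      routed = []
      for expert in preferred.tolist():
          if load[expert] < capacity:
              routed.append(expert)
              load[expert] += 1
          else:
              routed.append(-1)  # overflow: dropped or rerouted elsewhere
      return routed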

60. saagarjha ◴[] No.44546301{6}[source]
Well, a very simple example: if you run a parallel reduce using atomics, the result will depend on which workers acquire the accumulator first.