This is probably better phrased as "LLMs may not provide consistent answers due to changing data and built-in randomness."
Barring rare(?) GPU race conditions, LLMs produce the same output given the same inputs.
Inference on a generic LLM may not be subject to these non-determinisms even on a GPU, though I'm not sure.
With batching, matrix shapes and a request's position within the batch aren't deterministic, and that leads to non-deterministic results regardless of sampling temperature/seed.
Even with a black box API, just because you don't know how the output is calculated doesn't mean it's non-deterministic. It's the underlying algorithm that determines that, and an LLM is deterministic.
It's inherently non-deterministic because it reflects the reality of different requests arriving at the servers at the same time. And I don't believe there are any realistic workarounds if you want to keep costs reasonable.
Edit: there might be workarounds if matmul algorithms gave stronger guarantees than they do today (invariance under row/column swaps). I'm not an expert on how feasible that is, especially in quantized scenarios.
dekhn from a decade ago cared a lot about stable outputs. dekhn today thinks sampling from a distribution is a far more practical approach for nearly all use cases. I could see it mattering when the false negative rate of a medical diagnostic exceeded a reasonable threshold.
Every person in this thread understood that Simon meant "Grok, ChatGPT, and other common LLM interfaces run with a temperature>0 by default, and thus non-deterministically produce different outputs for the same query".
Sure, he wrote a shorter version of that, and because of that y'all can split hairs on the details ("yes it's correct for how most people interact with LLMs and for grok, but _technically_ it's not correct").
The point of English blog posts is not to be a long wall of logical propositions; it's to convey ideas and information. The current wording seems fine to me.
The point of what he was saying was to caution readers "you might not get this if you try to repro it", and that is 100% correct.
Are these LLMs in the room with us?
Not a single LLM available as a SaaS is deterministic.
As for other models: I've only run ollama locally, and it, too, provided different answers to the same question five minutes apart.
Edit/update: not a single LLM available as a SaaS produces deterministic output, especially when used from a UI. Pointing out that you could probably run a tightly controlled model in a tightly controlled environment to achieve deterministic output is entirely irrelevant when describing the output of grok in situations where the user has no control over it.
The SaaS APIs are sometimes nondeterministic due to caching strategies and load balancing between experts on MoE models. However, if you took that model and executed it in a single-user environment, it could also be run deterministically.
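For illustration, a rough sketch of what that single-user setup could look like with a local runtime such as ollama (the model name and endpoint are just common local defaults, my assumption rather than anything from the thread). With a fixed seed, temperature 0, and no other requests batched in, repeated runs should return identical text:

    # Minimal sketch: deterministic local inference against an ollama server.
    # Assumes ollama is running locally and a model (e.g. "llama3") is pulled.
    import requests

    def generate(prompt: str) -> str:
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "llama3",  # example model name
                "prompt": prompt,
                "stream": False,
                "options": {"seed": 42, "temperature": 0},
            },
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["response"]

    outputs = {generate("Why is the sky blue?") for _ in range(3)}
    print(len(outputs))  # expected: 1, i.e. every run produced the same text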
I'm now wondering, would it be desirable to have deterministic outputs on an LLM?
How do we also turn off all the intermediate layers in between that we don't know about, like "always rant about white genocide in South Africa" or "crash when user mentions David Meyer"?
Again, are those environments in the room with us?
In the context of the article, is the model executed in such an environment? Do we even know anything about the environment, randomness, sampling and anything in between or have any control over it (see e.g https://news.ycombinator.com/item?id=44528930)?
Gemini Flash has deterministic outputs, assuming you're referring to temperature 0 (obviously). Gemini Pro seems to be deterministic within the same kernel (?) but is likely switching between a few different kernels back and forth, depending on the batch or some other internal grouping.
> but is likely switching between a few different kernels back and forth, depending on the batch or some other internal grouping.
So you're literally saying it's non-deterministic
Better phrasing would be something like "It's worth noting that LLM products are typically operated in a manner that produces non-deterministic output for the user"
Or you could abbreviate this by saying “LLMs are non-deterministic.” Yes, it requires some shared context with the audience to interpret correctly, but so does every text.
Theorizing about why that is: could it be that they can't do deterministic inference and batching at the same time, and the reason we see them avoiding determinism is that it would require them to stop batching, which would shoot up costs?
A fixed seed is enough for determinism. You don't need to set temperature=0. Setting temperature=0 also means that you aren't sampling: you're doing greedy one-step probability maximization, which can make the text end up strange for that reason.
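A toy sketch of the distinction (plain PyTorch over a fixed logits vector, not any particular vendor's API): re-seeding the generator makes sampling reproducible, while temperature going to 0 collapses into argmax:

    # Toy example: seeded sampling is reproducible; temperature=0 is just argmax.
    import torch

    logits = torch.tensor([2.0, 1.5, 0.5, -1.0])

    def sample(seed: int, temperature: float = 1.0) -> int:
        gen = torch.Generator().manual_seed(seed)
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1, generator=gen).item()

    print([sample(seed=42) for _ in range(5)])  # same token every time, temperature still 1.0
    print(logits.argmax().item())               # what temperature -> 0 (greedy) collapses to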
> The non-determinism at temperature zero, we guess, is caused by floating point errors during forward propagation. Possibly the “not knowing what to do” leads to maximum uncertainty, so that logits for multiple completions are maximally close and hence these errors (which, despite a lack of documentation, GPT insiders inform us are a known, but rare, phenomenon) are more reliably produced.
That is, it really is important in practical use, because it's impossible to talk about stuff like what's in the original article without being able to consistently reproduce results.
Also, in almost all situations you really do want deterministic output (remember how "do what I want and what is expected" was an important property of computer systems? Good times)
> The only reason it was mentioned in the article is because the author is basically reverse engineering a particular model.
The author is attempting to reverse engineer the model, the randomness and the temperature, the system prompts and the training set, and all the possible layers added by xAI in between, and is still getting non-deterministic output.
HN: no-no-no, you don't understand, it's 100% deterministic and it doesn't matter
A baked LLM is 100% deterministic. It is a straightforward set of matrix algebra with a perfectly deterministic output at a base state. There is no magic quantum mystery machine happening in the model. We add randomization -- the seed or temperature -- as a value-add, to randomize the outputs with the intention of giving creativity. So while it might be true that "in the customer-facing default state an LLM gives non-deterministic output", this is not some base truth about LLMs.
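A toy illustration of that point (plain numpy with made-up weights, nothing like a real model): the forward pass is a pure function of its inputs, and variation only appears in the sampling step bolted on top:

    # Toy "model": frozen weights + softmax. The forward pass itself is a pure
    # function; only the optional sampling step introduces run-to-run variation.
    import numpy as np

    W = np.array([[ 0.2, -0.1,  0.4,  0.0],   # stand-in for baked/frozen weights
                  [ 0.1,  0.3, -0.2,  0.2],
                  [ 0.0,  0.1,  0.1, -0.3],
                  [-0.2,  0.4,  0.0,  0.1]])
    x = np.array([1.0, 0.5, -0.5, 2.0])

    def next_token_probs(x):
        z = W @ x
        e = np.exp(z - z.max())
        return e / e.sum()

    probs = next_token_probs(x)
    print(int(probs.argmax()))                   # greedy pick: identical on every run
    rng = np.random.default_rng()                # unseeded randomness we add on top
    print(int(rng.choice(len(probs), p=probs)))  # sampled pick: can differ run to run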
Floating point multiplication is non-associative:
    a = 0.1, b = 0.2, c = 0.3
    a * (b * c) = 0.006
    (a * b) * c = 0.006000000000000001
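You can check that in a Python REPL; an extra summation example (my addition, not from the comment) shows how the order of a floating point reduction changes the result:

    # Verifying the example above, plus an order-dependent reduction.
    a, b, c = 0.1, 0.2, 0.3
    print(a * (b * c))                  # 0.006
    print((a * b) * c)                  # 0.006000000000000001
    print(a * (b * c) == (a * b) * c)   # False

    print(sum([1e16, 1.0, -1e16]))      # 0.0  (the 1.0 is lost before it can matter)
    print(sum([1e16, -1e16, 1.0]))      # 1.0  (same numbers, different order)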
Almost all serious LLMs are deployed across multiple GPUs and have operations executed in batches for efficiency. As such, the order in which those multiplications are run depends on all sorts of factors. There are no guarantees of operation order, which means non-associative floating point operations play a role in the final result.
This means that, in practice, most deployed LLMs are non-deterministic even with a fixed seed.
That's why vendors don't offer seed parameters accompanied by a promise of deterministic results - because that's a promise they cannot keep.
Here's an example: https://cookbook.openai.com/examples/reproducible_outputs_wi...
> Developers can now specify seed parameter in the Chat Completion request to receive (mostly) consistent outputs. [...] There is a small chance that responses differ even when request parameters and system_fingerprint match, due to the inherent non-determinism of our models.
One important take-away is that these issues are more likely in longer generations, so reasoning models can suffer more.
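For reference, a minimal sketch of what using that seed parameter looks like (based on the cookbook linked above; the model name is just an example):

    # Sketch of the seed parameter from the OpenAI cookbook linked above.
    # Assumes the openai Python package and an API key in the environment.
    from openai import OpenAI

    client = OpenAI()
    results = []
    for _ in range(3):
        completion = client.chat.completions.create(
            model="gpt-4o-mini",  # example model name
            messages=[{"role": "user", "content": "Name one prime number."}],
            seed=12345,
            temperature=0,
        )
        results.append((completion.system_fingerprint,
                        completion.choices[0].message.content))

    # Responses are *mostly* identical when system_fingerprint matches,
    # but, per the docs quoted above, that is not guaranteed.
    print(results)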
They absolutely can keep such a promise, which anyone who has worked with LLMs could confirm. I can run a sequence of tokens through a large LLM thousands of times and get identical results every time (and have done precisely this! In fact, in one situation it was a QA test I built). I could run it millions of times and get exactly the same final layer every single time.
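A sketch of that kind of repeatability check (my reconstruction, using a small local model such as gpt2 rather than whatever was actually tested): greedy decoding on fixed hardware and kernels should yield the identical token sequence on every run:

    # Repeatability check: run the same prompt many times with greedy decoding
    # and confirm the generated token sequences are all identical.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")      # small example model
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    inputs = tok("The capital of France is", return_tensors="pt")
    with torch.no_grad():
        runs = {
            tuple(model.generate(**inputs, do_sample=False, max_new_tokens=20)[0].tolist())
            for _ in range(10)
        }
    print(len(runs))  # expected: 1 (every run produced the same sequence)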
They don't want to keep such a promise because it limits the flexibility and optimizations available when doing things at a very large scale. This is not an LLM thing, and saying "LLMs are non-deterministic" is simply wrong, even if you can find an LLM purveyor who has made choices that mean they no longer have any interest in such an outcome. And FWIW, non-associative floating point arithmetic is usually not the reason.
It's like claiming that a chef cannot do something that McDonalds and Burger King don't do, using those purveyors as an example of what is possible when cooking. Nothing works like that.
And even if you do set it to zero, you never know what changes to the layers and layers of wrappers and system prompts you will run into on any given day resulting in "on this day we crash for certain input, and on other days we don't": https://www.techdirt.com/2024/12/03/the-curious-case-of-chat...
No one targets determinism because randomness/"creativity" in LLMs is considered a prime feature, so there is zero reason to avoid variation, but that isn't some core function of LLMs.