
3337 points keepamovin | 1 comment
jll29 ◴[] No.46216933[source]
AI professor here. I know this page is a joke, but in the interest of accuracy, a terminological comment: we don't call it a "hallucination" if a model complies exactly with what a prompt asked for and produces a prediction exactly as requested.

Rather, "hallucinations" are spurious replacements of factual knowledge with fictional material, caused by the use of a statistical process (the pseudo-random number generator used with the "temperature" parameter of neural transformers): token prediction without meaning representation.

[typo fixed]

replies(12): >>46217033 #>>46217061 #>>46217166 #>>46217410 #>>46217456 #>>46217758 #>>46218070 #>>46218282 #>>46218393 #>>46218588 #>>46219018 #>>46219935 #
articlepan ◴[] No.46217166[source]
I agree with your first paragraph, but not your second. Models can still hallucinate when temperature is set to zero (i.e., when we always choose the highest-probability token from the model's output distribution).
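
For concreteness, a toy sketch of what I mean (plain NumPy with made-up logits, not any particular model's API); temperature 0 is the argmax branch:

    import numpy as np

    def pick_token(logits, temperature):
        if temperature == 0:
            return int(np.argmax(logits))        # greedy: always the top token
        probs = np.exp(logits / temperature)     # temperature-scaled softmax
        probs /= probs.sum()
        return int(np.random.choice(len(logits), p=probs))

    logits = np.array([2.0, 1.9, 0.5])
    print(pick_token(logits, 0.0))   # deterministic: always index 0
    print(pick_token(logits, 1.0))   # stochastic: index 1 comes up ~40% of the time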

In my mind, hallucination is when some aspect of the model's response should be consistent with reality but is not, and the reality-inconsistent information is not directly attributable to, or deducible from, (mis)information in the pre-training corpus.

While hallucination can be triggered by setting the temperature high, it can also be the result of many possible deficiencies in model pre- and post-training that leave the model outputting bad token probability distributions.

replies(3): >>46217642 #>>46217654 #>>46219141 #
julienreszka ◴[] No.46217654[source]
That's because of rounding errors
replies(1): >>46218631 #
leecarraher ◴[] No.46218631[source]
I agree; it's not just the multinomial sampling that causes hallucinations. If that were the case, setting temp to 0 and just taking the argmax over the logits would "solve" hallucinations. While round-off error causes some stochasticity, it's unlikely to be the primary cause; rather, it's lossy compression across the layers that causes it.
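
To make the round-off point concrete (a generic NumPy example, nothing model-specific): floating-point addition isn't associative, so summing the same numbers in a different order, as parallel kernels and different batch shapes do, gives slightly different answers, and a near-tie between two logits is enough for that gap to flip the argmax:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10_000).astype(np.float32)

    seq = np.float32(0.0)
    for v in x:                      # sequential accumulation
        seq += v
    pair = np.sum(x)                 # NumPy's pairwise summation

    # same numbers, different summation order, (usually) a slightly different result
    print(float(seq), float(pair), float(seq) - float(pair))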

First compression: you create embeddings that need to differentiate N tokens, and the JL lemma gives a bound on the dimension needed, which modern architectures sit well above. At face value, the embeddings could encode the tokens and discriminate between them deterministically. But words aren't monolithic: they mean many things and get contextualized by other words. So despite being above the JL bound, the model is still forced into a lossy compression.
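
Rough numbers, using the textbook JL bound k >= 4 ln(n) / (eps^2/2 - eps^3/3) (just arithmetic, not tied to any specific architecture):

    import math

    def jl_min_dim(n_points, eps):
        # minimum embedding dimension that preserves pairwise distances
        # within a (1 +/- eps) factor, per the Johnson-Lindenstrauss lemma
        return math.ceil(4 * math.log(n_points) / (eps**2 / 2 - eps**3 / 3))

    vocab = 50_000                       # ballpark tokenizer vocabulary size
    for eps in (0.5, 0.25, 0.1):
        print(eps, jl_min_dim(vocab, eps))
    # prints roughly 520, 1660, and 9280: for moderate distortion the bound
    # sits well below the ~4096-dim embeddings typical of current large models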

Next compression: each layer of the transformer blows the input up into K, V, and Q, then compresses it back down to the inter-layer dimension.
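
Roughly this shape of computation, as a single-head toy in NumPy (arbitrary dimensions, ignoring multi-head splits, biases, and the MLP):

    import numpy as np

    d_model, d_head, seq_len = 8, 16, 4      # toy sizes, not real model dims
    rng = np.random.default_rng(0)
    x = rng.standard_normal((seq_len, d_model))

    Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
    Wo = rng.standard_normal((d_head, d_model))

    Q, K, V = x @ Wq, x @ Wk, x @ Wv         # blow the input up into Q/K/V
    scores = Q @ K.T / np.sqrt(d_head)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True) # softmax over keys
    out = (attn @ V) @ Wo                    # project back down to d_model
    print(out.shape)                         # (4, 8): same width the layer started with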

Finally, there is the output layer, which at temperature 0 is deterministic, but reaching a given token is heavily path-dependent. The space of possible paths is combinatorial, so any non-deterministic behavior elsewhere, including things like round-off, inflates the likelihood of non-deterministic output. Heck, most models are quantized down to 4 or even 2 bits these days, which is wild!
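
And for a feel of what low-bit quantization throws away, a naive uniform-rounding sketch (random weights and plain rounding, not any real quantization scheme):

    import numpy as np

    def quantize(w, bits):
        # naive symmetric uniform quantization to 2**bits - 1 levels
        levels = 2 ** bits - 1
        scale = np.abs(w).max() / (levels / 2)
        return np.round(w / scale) * scale

    rng = np.random.default_rng(0)
    w = rng.standard_normal(100_000).astype(np.float32)

    for bits in (8, 4, 2):
        err = float(np.abs(w - quantize(w, bits)).mean())
        print(bits, round(err, 4))
    # the mean reconstruction error roughly doubles for every bit removed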