
What Is Entropy?

(jasonfantl.com)
287 points jfantl | 13 comments
glial ◴[] No.43685469[source]
One thing that helped me was the realization that, at least as used in the context of information theory, entropy is a property of an individual (typically the person receiving a message) and NOT purely of the system or message itself.

> entropy quantifies uncertainty

This sums it up. Uncertainty is a property of a person, not of a system/message. That uncertainty is a function of both the person's model of the system/message and their prior observations.

You and I may have different entropies about the content of the same message. If we're calculating the entropy of dice rolls (where the outcome is the 'message'), and I know the dice are loaded but you don't, my entropy will be lower than yours.
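To make that concrete, here's a minimal sketch in Python (the loading weights are made up):

    import math

    def entropy(p):
        # Shannon entropy in bits: H = -sum(p_i * log2(p_i))
        return -sum(x * math.log2(x) for x in p if x > 0)

    fair = [1/6] * 6                         # your model of the dice
    loaded = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]  # my model: I know they're loaded

    print(entropy(fair))    # ~2.58 bits of uncertainty for you
    print(entropy(loaded))  # ~2.16 bits for me

Same dice, same "message", but because our models differ, so do our entropies.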

replies(4): >>43685585 #>>43686121 #>>43687411 #>>43688999 #
ninetyninenine ◴[] No.43685585[source]
Not true. The uncertainty of the dice rolls is not controlled by you. It is a property of the loaded dice themselves.

Here's a better way to put it: if I roll the dice infinitely many times, the uncertainty of the outcome will become evident in the distribution of the outcomes. Whether you or another person happens to be certain or uncertain about it indicates nothing.
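You can see this by simulation; a sketch (with illustrative weights for the loaded dice):

    import random
    from collections import Counter

    faces = [1, 2, 3, 4, 5, 6]
    weights = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]  # the dice's own property

    rolls = random.choices(faces, weights=weights, k=100_000)
    counts = Counter(rolls)
    for f in faces:
        print(f, counts[f] / len(rolls))  # converges to the true weights

No observer's beliefs enter anywhere: the frequencies converge to the weights regardless of who is watching.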

Once you realize this, you run into the frequentist vs. Bayesian debate in probability, and you see that entropy is just a consequence of probability. The philosophical debate about probability applies to entropy as well, because they are one and the same.

I think the word "entropy" confuses people into thinking it's some other thing when really it's just probability at work.

replies(3): >>43685604 #>>43686183 #>>43691399 #
1. bloppe ◴[] No.43686183[source]
Probability is subjective though, because macrostates are subjective.

The notion of probability relies on the notion of repeatability: if you repeat a coin flip infinitely many times, what proportion of outcomes will be heads, and so on. But if you actually repeated the toss exactly the same way every time, say with a finely tuned coin-flipping machine in a perfectly still environment, you would always get the same result.

We say that a regular human flipping a coin is a single macrostate that represents infinitely many microstates (the distribution of trajectories and spins you could potentially impart on the coin). But who decides that? Some subjective observer. Another finely tuned machine could conceivably detect the exact trajectory and spin of the coin as it leaves your thumb and predict the outcome with perfect accuracy. According to that machine, you're not repeating anything; you're doing a new thing every time.
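A toy version of the machine's view (the mechanism and numbers are hypothetical, just to illustrate):

    # Pretend the outcome is fully determined by the number of
    # half-turns the coin makes in the air.
    def flip(half_turns):
        return "heads" if half_turns % 2 == 0 else "tails"

    # Observer A knows only the macrostate "a human flipped it", so
    # half_turns could be any of many values and A assigns 50/50.
    # Observer B (the machine) measures the microstate exactly:
    print(flip(38417))  # for B the outcome is certain: "tails"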

replies(1): >>43687296 #
2. canjobear ◴[] No.43687296[source]
Probability is a bunch of numbers that add to 1. Sometimes you can use this to represent subjective beliefs. Sometimes you can use it to represent objectively existing probability distributions. For example, an LLM is a probability distribution over the next token given the previous tokens. If two "observers" disagree about the probability an LLM assigns to some token, then at most one of them can be correct. So the probability is objective.
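A sketch with hypothetical logits (numpy only, not a real model):

    import numpy as np

    # Whatever the model's weights are, its next-token distribution is
    # a fixed array of numbers, the same for every observer who reads it.
    logits = np.array([2.1, 0.3, -1.0, 0.5])

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()          # softmax
    print(probs, probs.sum())     # nonnegative, sums to 1

Two observers can disagree about what these numbers are, but the array itself settles who is right.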
replies(2): >>43688108 #>>43689218 #
3. bloppe ◴[] No.43688108[source]
We're talking about 2 different things. I agree that probability is objective as long as you've already decided on the definition of the macrostate, but that definition is subjective.

From an LLM's perspective, the macrostate is all the tokens in the context window and nothing more. A different observer may be able to take into account other information, such as the identity and mental state of the author, giving rise to a different distribution. Both of these models can be objectively valid even though they're different, because they rely on different definitions of the macrostate.

It can be hard to wrap your head around this, but try taking it to the extreme. Let's say there's an omniscient being that knows absolutely everything there is to know about every single atom within a system. To that observer, probability does not exist, because every macrostate represents a single microstate. In order for something to be repeated (which is core to the definition of probability), it must start from the exact same microstate, and thus always have the same outcome.

You might think that true randomness exists at the quantum level and that means true omniscience is impossible (and thus irrelevant), but that's not provable and, even if it were true, would not invalidate the general point that probabilities are determined by macrostate definition.
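A toy version of the point, with a die (hypothetical setup: one observer's macrostate also includes the parity of the roll):

    from fractions import Fraction

    microstates = range(1, 7)  # six equally likely die faces

    def prob_of_six(cell):
        # Probability of rolling a 6, given the macrostate `cell`
        # (the set of microstates the observer cannot distinguish).
        return Fraction(sum(1 for s in cell if s == 6), len(cell))

    coarse = list(microstates)                      # knows only "a roll happened"
    finer = [s for s in microstates if s % 2 == 0]  # also observed "it's even"

    print(prob_of_six(coarse))  # 1/6
    print(prob_of_six(finer))   # 1/3 -- same event, different macrostate

Both numbers are objectively correct given their macrostate definition; neither is "the" probability of the event.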

replies(1): >>43688531 #
4. canjobear ◴[] No.43688531{3}[source]
Suppose you're training a language model by minimizing cross entropy, and the omniscient being is watching. In each step, your model instantiates some probability distribution, whose gradients are computed. That distribution exists, and is not deterministic to the omniscient entity.
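In miniature (hypothetical logits, numpy only):

    import numpy as np

    logits = np.array([1.2, -0.4, 0.7])  # the model's current outputs
    target = 2                           # the observed training token

    p = np.exp(logits - logits.max())
    p /= p.sum()                         # the instantiated distribution
    loss = -np.log(p[target])            # cross entropy for this step

    grad = p.copy()
    grad[target] -= 1.0                  # d(loss)/d(logits) = p - one_hot
    print(p, loss, grad)

The distribution p is a concrete object in the training process; the omniscient being can watch it exist and change, but that doesn't make it go away.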
replies(2): >>43689911 #>>43689949 #
5. kgwgk ◴[] No.43689218[source]
> If two "observers" disagree about an LLM's probability assigned to some token, then only at most one of them can be correct.

The observer who knows the implementation in detail and the state of the pseudo-random number generator can predict the next token with certainty. (Or near certainty, if we consider bit-flipping cosmic rays, etc.)
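A sketch of why (Python's PRNG standing in for the sampler):

    import random

    vocab, weights = ["the", "a", "cat", "sat"], [5, 2, 2, 1]

    random.seed(1234)                    # the generator's internal state
    first = random.choices(vocab, weights=weights)[0]

    random.seed(1234)                    # same state again
    second = random.choices(vocab, weights=weights)[0]

    assert first == second               # the "sample" was never uncertain

Given the state, the "random" draw is a deterministic function call.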

replies(1): >>43689432 #
6. canjobear ◴[] No.43689432{3}[source]
That's the probability of observing a token given the prompt and the seed. The probability assigned to a token given the prompt alone is a separate thing, which is objectively defined independently of any observer and can be found by reading out the model logits.
replies(1): >>43689453 #
7. kgwgk ◴[] No.43689453{4}[source]
Yes, that’s a purely mathematical abstract concept that exists outside of space and time. The labels “objective” and “subjective” are usually used to talk about probabilities in relation to the physical world.
replies(2): >>43689575 #>>43689589 #
8. ◴[] No.43689575{5}[source]
9. canjobear ◴[] No.43689589{5}[source]
An LLM distribution exists in the physical world, just as much as this comment does. It didn’t exist before the model was trained. It has relation to the physical world: it assigns probabilities to subword units of text. It has commercial value that it wouldn’t have if its objective probability values were different.
replies(1): >>43689760 #
10. kgwgk ◴[] No.43689760{6}[source]
> It has relation to the physical world: it assigns probabilities to subword units of text.

How exactly is that probability assignment linked to the physical world? In the physical world the computer will produce a token, and you rejected earlier the idea that this is about predicting the token that will be produced.

replies(1): >>43689822 #
11. kgwgk ◴[] No.43689822{7}[source]
Or maybe you mean that the probability assignments are not about the output of a particular LLM implementation in the real world but about subword units of text in the wild.

In that case, how could two different LLMs make different assignments to the same physical world without being wrong? Would they be “objective” but unrelated to the “object”?

12. bloppe ◴[] No.43689911{4}[source]
An LLM is given a definition of the macrostate which creates the probability distribution, but a different definition of the macrostate (such as would be known to the omniscient being) would create a different distribution. According to the omniscient entity, the vast majority of long combinations of tokens would have zero probability because nobody will ever write them down in that order. The infinite monkey theorem is misleading in this regard. The odds of producing Shakespeare's works completely randomly before the heat death of the universe are practically zero, even if all the computing power in the world were dedicated to the cause.
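Back-of-the-envelope, with rough made-up magnitudes:

    import math

    # ~5 million characters of Shakespeare, a 27-symbol alphabet,
    # and a very generous 10^100 random attempts before heat death.
    chars, alphabet, attempts_log10 = 5_000_000, 27, 100

    log10_p_one = -chars * math.log10(alphabet)   # one attempt succeeding
    log10_p_any = log10_p_one + attempts_log10    # union bound over attempts

    print(log10_p_one)  # about -7.2 million
    print(log10_p_any)  # still about -7.2 million: practically zero

Even 10^100 attempts barely dents an exponent of minus seven million.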
13. kgwgk ◴[] No.43689949{4}[source]
What’s non-deterministic there?

That “probability distribution” is just a mathematical function assigning numbers to tokens: a set of deterministic mathematical functions applied to a sequence of observed inputs. The person creating the model and the omniscient entity both know the model, and both know the inputs.