What Is Entropy?

(jasonfantl.com)
287 points by jfantl | 25 comments
1. glial ◴[] No.43685469[source]
One thing that helped me was the realization that, at least as used in the context of information theory, entropy is a property of an individual (typically the person receiving a message) and NOT purely of the system or message itself.

> entropy quantifies uncertainty

This sums it up. Uncertainty is a property of a person, not of a system or message. That uncertainty is a function of both the person's model of the system/message and their prior observations.

You and I may have different entropies about the content of the same message. If we're calculating the entropy of dice rolls (where the outcome is the 'message'), and I know the dice are loaded but you don't, my entropy will be lower than yours.
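
A quick numeric sketch of that point (the loaded-die probabilities below are made up purely for illustration):

    from math import log2

    def entropy(p):
        # Shannon entropy in bits: H = -sum(p_i * log2(p_i))
        return -sum(q * log2(q) for q in p if q > 0)

    fair   = [1/6] * 6                       # your model: you don't know the die is loaded
    loaded = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]  # my model: I know one face comes up half the time

    print(entropy(fair))    # ~2.585 bits
    print(entropy(loaded))  # ~2.161 bits -- lower, because I know more about the same die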

replies(4): >>43685585 #>>43686121 #>>43687411 #>>43688999 #
2. ninetyninenine ◴[] No.43685585[source]
Not true. The uncertainty of the dice rolls is not controlled by you. It is the property of the loaded dice itself.

Here's a better way to put it: if I roll the dice infinitely many times, the uncertainty of the outcome will become evident in the distribution of the outcomes. Whether you or another person is certain or uncertain of this has no bearing on it.

Once you realize this, you'll start thinking about the frequentist vs. Bayesian debate in probability, and you'll see that entropy is nothing but a consequence of probability, and that the philosophical debate in probability applies to entropy as well, because they are one and the same.

I think the word "entropy" confuses people into thinking it's some other thing when really it's just probability at work.
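
To make the frequentist reading concrete, here is a minimal simulation (the loading below is invented; only the convergence matters): roll the loaded die many times and the entropy computed from the observed frequencies settles toward the same value no matter who is watching.

    import random
    from collections import Counter
    from math import log2

    weights = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]   # a hypothetical loaded die

    def empirical_entropy(n):
        rolls = random.choices(range(6), weights=weights, k=n)
        return -sum((c / n) * log2(c / n) for c in Counter(rolls).values())

    for n in (100, 10_000, 1_000_000):
        print(n, empirical_entropy(n))   # approaches ~2.161 bits as n grows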

replies(3): >>43685604 #>>43686183 #>>43691399 #
3. glial ◴[] No.43685604[source]
I concede that my framing was explicitly Bayesian, but with that caveat, it absolutely is true: your uncertainty is a function of your knowledge, which is a model of the world, but is not equivalent to the world itself.

Suppose I have a coin that only ever lands heads. You don't know this, and you flip the coin. According to your argument, your entropy about the outcome of the first flip is zero. However, you wouldn't be able to tell me which way the coin will land, making your entropy nonzero. This is a contradiction.

replies(1): >>43686906 #
4. empath75 ◴[] No.43686121[source]
> If we're calculating the entropy of dice rolls (where the outcome is the 'message'), and I know the dice are loaded but you don't, my entropy will be lower than yours.

That's got nothing to do with entropy being subjective. If 2 people are calculating any property and one of them is making a false assumption, they'll end up with a different (false) conclusion.

replies(3): >>43686265 #>>43686365 #>>43687307 #
5. bloppe ◴[] No.43686183[source]
Probability is subjective though, because macrostates are subjective.

The notion of probability relies on the notion of repeatability: if you repeat a coin flip infinite times, what proportion of outcomes will be heads, etc. But if you actually repeated the toss exactly the same way every time, say with a finely-tuned coin-flipping machine in a perfectly still environment, you would always get the same result.

We say that a regular human flipping a coin is a single macrostate that represents infinite microstates (the distribution of trajectories and spins you could potentially impart on the coin). But who decides that? Some subjective observer. Another finely tuned machine could conceivably detect the exact trajectory and spin of the coin as it leaves your thumb and predict with perfect accuracy what the outcome will be. According to that machine, you're not repeating anything. You're doing a new thing every time.

replies(1): >>43687296 #
6. glial ◴[] No.43686265[source]
Entropy is based on your model of the world and every model, being a simplification and an estimate, is false.
7. mitthrowaway2 ◴[] No.43686365[source]
What if I told you the dice were loaded, but I didn't tell you which face they were loaded in favor of?

Then you (presumably) assign a uniform probability over one true assumption and five false assumptions. Which is the sort of situation where subjective entropy seems quite appropriate.
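
A sketch of that situation, with made-up loading numbers: put a uniform prior over which face is favored and your predictive distribution is the average of the six hypotheses, so your entropy is back up to that of a fair die even though each individual hypothesis has lower entropy.

    from math import log2

    def entropy(p):
        return -sum(q * log2(q) for q in p if q > 0)

    def loaded(face, bias=0.5):
        # hypothetical die that favors `face` with probability `bias`
        rest = (1 - bias) / 5
        return [bias if i == face else rest for i in range(6)]

    # uniform prior over the six "which face is favored" hypotheses
    predictive = [sum(loaded(f)[i] for f in range(6)) / 6 for i in range(6)]

    print(entropy(predictive))  # ~2.585 bits, same as a fair die
    print(entropy(loaded(0)))   # ~2.161 bits once you learn which face is favored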

8. nyrikki ◴[] No.43686906{3}[source]
To add to this.

Both the Bayesian and frequentist interpretations make understanding the problem challenging: both are powerful tools for finding the needle in the haystack, when the real problem is finding the hay in the haystack.

A better lens is that a recursively enumerable binary sequence (coin flips) is an algorithmically random sequence if and only if it is a Chaitin Omega number.[1]

Chaitin's number is normal, which is probably easier to understand with decimal digits: for any window size, every block of digits of that length appears with the same limiting frequency (each digit 0-9 a tenth of the time, each pair a hundredth, and so on).

This is why HALT ≈ open frame ≈ system identification ≈ symbol grounding problems.

Probabilities are very powerful for problems like the dining philosophers problem or the Byzantine generals problem, but they are still grabbing needles every time they reach into the haystack.

Pretty much any "almost all" statement is a hay-in-the-haystack problem. For example, almost all real numbers are normal, but we have only exhibited a few.

We can construct some of them, Champernowne-style: 0.123456789101112... in base 10, the analogous concatenation of 1, 10, 11, 100, ... in base 2, and so on. But the typical ones we can't access.

Given access to the true reals, you have a 0% chance of picking a computable number, a rational, etc., but a 100% chance of getting a normal number and a 100% chance of getting an uncomputable number.

Bayesian vs frequentist interpretations allow us to make useful predictions, but they are the map, not the territory.

Bayesian iid data and frequentist iid random variables play roles exactly analogous to those of enthalpy, Gibbs free energy, statistical entropy, information-theoretic entropy, Shannon entropy, etc.

The difference between them is the independent variables that they depend on and the needs of the model they are serving.

You can also approach the property people usually want to communicate with the term entropy via effective measure-zero sets, null covers, martingales, Kolmogorov complexity, compressibility, set shattering, etc.

As a lens, the null cover is the most useful to my mind: a random real number should not have any "uncommon" properties, i.e. it should look like the typical (normal) reals.

This is very different from statistical methods, or any effective usable algorithm/program, which absolutely depend on "uncommon" properties.

Which is exactly the problem with finding the hay in the haystack: hay is boring.

[1]https://www.cs.auckland.ac.nz/~cristian/samplepapers/omegast...

9. canjobear ◴[] No.43687296{3}[source]
Probability is a bunch of numbers that add to 1. Sometimes you can use this to represent subjective beliefs. Sometimes you can use it to represent objectively existing probability distributions. For example, an LLM is a probability distribution over the next token given the previous tokens. If two "observers" disagree about the probability an LLM assigns to some token, then at most one of them can be correct. So the probability is objective.
replies(2): >>43688108 #>>43689218 #
10. canjobear ◴[] No.43687307[source]
> If 2 people are calculating any property and one of them is making a false assumption, they'll end up with a different (false) conclusion.

This implies that there is an objectively true conclusion. The true probability is objective.

replies(1): >>43689052 #
11. pharrington ◴[] No.43687411[source]
Are you basically just saying "we're not oracles"?
12. bloppe ◴[] No.43688108{4}[source]
We're talking about 2 different things. I agree that probability is objective as long as you've already decided on the definition of the macrostate, but that definition is subjective.

From an LLM's perspective, the macrostate is all the tokens in the context window and nothing more. A different observer may be able to take into account other information, such as the identity and mental state of the author, giving rise to a different distribution. Both of these models can be objectively valid even though they're different, because they rely on different definitions of the macrostate.

It can be hard to wrap your head around this, but try taking it to the extreme. Let's say there's an omniscient being that knows absolutely everything there is to know about every single atom within a system. To that observer, probability does not exist, because every macrostate represents a single microstate. In order for something to be repeated (which is core to the definition of probability), it must start from the exact same microstate, and thus always have the same outcome.

You might think that true randomness exists at the quantum level and that means true omniscience is impossible (and thus irrelevant), but that's not provable and, even if it were true, would not invalidate the general point that probabilities are determined by macrostate definition.

replies(1): >>43688531 #
13. canjobear ◴[] No.43688531{5}[source]
Suppose you're training a language model by minimizing cross entropy, and the omniscient being is watching. In each step, your model instantiates some probability distribution, whose gradients are computed. That distribution exists, and is not deterministic to the omniscient entity.
replies(2): >>43689911 #>>43689949 #
14. Geee ◴[] No.43688999[source]
It's both. The system or process has its actual entropy, and the sequence of observations we make has a certain entropy. We can say that "this sequence of numbers has this entropy", which is slightly different from the entropy of the process which created the numbers. For example, as we make more coin tosses, our sequence of observations has an entropy that gets closer and closer to the actual entropy of the coin.
15. mitthrowaway2 ◴[] No.43689052{3}[source]
Ok. I rolled a die and the result was 5. What should the true objective probability have been for the outcome of that roll?
16. kgwgk ◴[] No.43689218{4}[source]
> If two "observers" disagree about an LLM's probability assigned to some token, then only at most one of them can be correct.

The observer who knows the implementation in detail and the state of the pseudo-random number generator can predict the next token with certainty. (Or near certainty, if we allow for bit-flipping cosmic rays, etc.)

replies(1): >>43689432 #
17. canjobear ◴[] No.43689432{5}[source]
That’s the probability of observing a token given the prompt and the seed. The probability assigned to a token given the prompt alone is a separate thing, which is objectively defined independent of any observer and can be found by reading out the model logits.
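
For what it's worth, "reading out the logits" amounts to something like this sketch (the logits here are invented, not taken from any real model):

    import math

    def softmax(logits):
        m = max(logits)                          # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        return [e / z for e in exps]

    def entropy_bits(p):
        return -sum(q * math.log2(q) for q in p if q > 0)

    logits = [3.1, 1.2, 0.3, -2.0]   # hypothetical next-token logits over a tiny vocabulary
    probs = softmax(logits)
    print(probs)                     # the distribution in question, fixed by weights + prompt
    print(entropy_bits(probs))       # its entropy, independent of any RNG seed
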
replies(1): >>43689453 #
18. kgwgk ◴[] No.43689453{6}[source]
Yes, that’s a purely mathematical abstract concept that exists outside of space and time. The labels “objective” and “subjective” are usually used to talk about probabilities in relation to the physical world.
replies(2): >>43689575 #>>43689589 #
20. canjobear ◴[] No.43689589{7}[source]
An LLM distribution exists in the physical world, just as much as this comment does. It didn’t exist before the model was trained. It has relation to the physical world: it assigns probabilities to subword units of text. It has commercial value that it wouldn’t have if its objective probability values were different.
replies(1): >>43689760 #
21. kgwgk ◴[] No.43689760{8}[source]
> It has relation to the physical world: it assigns probabilities to subword units of text.

How is that probability assignment linked to the physical world exactly? In the physical world the computer will produce a token. You rejected before that it was about predicting the token that would be produced.

replies(1): >>43689822 #
22. kgwgk ◴[] No.43689822{9}[source]
Or maybe you mean that the probability assignments are not about the output of a particular LLM implementation in the real world but about subword units of text in the wild.

In that case, how could two different LLMs make different assignments to the same physical world without being wrong? Would they be “objective” but unrelated to the “object”?

23. bloppe ◴[] No.43689911{6}[source]
An LLM is given a definition of the macrostate which creates the probability distribution, but a different definition of the macrostate (such as would be known to the omniscient being) would create a different distribution. According to the omniscient entity, the vast majority of long combinations of tokens would have zero probability because nobody will ever write them down in that order. The infinite monkey theorem is misleading in this regard. The odds of producing Shakespeare's works completely randomly before the heat death of the universe are practically zero, even if all the computing power in the world were dedicated to the cause.
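
Rough arithmetic behind "practically zero", with deliberately generous made-up numbers:

    import math

    chars    = 5_000_000    # rough character count of Shakespeare's complete works (assumed)
    alphabet = 27           # letters plus space, ignoring case and punctuation (assumed)

    # log10 probability of typing it correctly in one uniformly random attempt
    log10_p = -chars * math.log10(alphabet)
    print(log10_p)           # about -7.2 million

    # even with an absurd 10^120 attempts, the exponent barely moves
    print(log10_p + 120)     # still about -7.2 million
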
24. kgwgk ◴[] No.43689949{6}[source]
What’s non-deterministic there?

That “probability distribution” is just a mathematical function assigning numbers to tokens. It is defined using a model that the person creating it and the omniscient entity both know, by applying a set of deterministic mathematical functions to a sequence of observed inputs that the person creating the model and the omniscient entity also know.

25. quietbritishjim ◴[] No.43691399[source]
You're right it reduces to Bayesian vs frequentist views of probability. But you seem to be taking an adamantly frequentist view yourself.

Imagine you're not interested in whether a dice is weighted (in fact assume that it is fair in every reasonable sense), but instead you want to know the outcome of a specific roll. What if that roll has already happened, but you haven't seen it? I've cheekily covered up the dice with my hand straight after I rolled it. It's no longer random at all, in at least some philosophical points of view, because its outcome is now 100% determined. If you're only concerned about "the property of the dice itself" are you now only concerned with the property of the roll itself? It's done and dusted. So the entropy of that "random variable" (which only has one outcome, of probability 1) is 0.

This is actually a valid philosophical point of view. But people who act as though the outcome is still random, and allow themselves to use probability theory as if it hadn't been rolled yet, are going to win a lot more games of chance than those who refuse to.

Maybe this all seems like a straw man. Have I argued against anything you actually said in your post? Yes I have: your core disagreement with OP's statement "entropy is a property of an individual". You see, when I covered up the dice with my hand, I did see it. So if you take the Bayesian view of probability and allow yourself to consider that dice roll probabilistically, then you and I really do have different views about the probability distribution of that dice roll and therefore the entropy. If I tell a third person, secretly and honestly, that the dice roll is even then they have yet another view of the entropy of the same dice roll! All at the same time and all perfectly valid.
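
Putting toy numbers on those three viewpoints of the same covered-up roll (say it actually came up four):

    from math import log2

    def entropy(dist):
        return -sum(p * log2(p) for p in dist.values() if p > 0)

    you   = {i: 1/6 for i in range(1, 7)}   # saw nothing: six equally likely outcomes
    third = {2: 1/3, 4: 1/3, 6: 1/3}        # told only that the roll is even
    me    = {4: 1.0}                        # I peeked under my hand

    print(entropy(you))    # ~2.585 bits
    print(entropy(third))  # ~1.585 bits
    print(entropy(me))     # 0 bits -- one roll, three valid entropies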