
181 points jxmorris12 | 2 comments
nobodywillobsrv No.43111951
Softmax’s exponential comes from counting occupation states. Maximize the number of ways to arrange things with logits as energies, and you get exp(logits) over a partition function, pure Boltzmann style. It’s optimal in the sense that it’s how probability naturally piles up.
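Concretely, a minimal numpy sketch of that correspondence (my own toy example, not from any library; logits play the role of negative energies, with an optional temperature T):

    import numpy as np

    def boltzmann(logits, T=1.0):
        # softmax as a Boltzmann distribution: p_i proportional to exp(logit_i / T)
        z = np.asarray(logits, dtype=float) / T
        w = np.exp(z - z.max())   # shift by the max for numerical stability
        return w / w.sum()        # divide by the partition function Z

    print(boltzmann([2.0, 1.0, 0.1]))  # same result as scipy.special.softmax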
replies(2): >>43111971 >>43113945
semiinfinitely No.43111971
right and it should be totally obvious that we would choose an energy function from statistical mechanics to train our hotdog-or-not classifier
replies(3): >>43112080 >>43112333 >>43112585
C-x_C-f No.43112080
No need to introduce the concept of energy. It's a "natural" probability measure on any space where the outcomes have some weight. In particular, it's the measure that maximizes entropy while fixing the average weight. Of course it's contentious whether this is really "natural," and what that even means. Some hardcore proponents like Jaynes argue along the lines of epistemic humility, but for applications it really just boils down to being a simple and effective choice.
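To illustrate the max-entropy characterization numerically (a sketch with made-up weights; the scipy setup is my own choice): maximize entropy subject to normalization and a fixed average weight, and the optimizer lands on the softmax distribution.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import softmax

    w = np.array([2.0, 1.0, 0.1])   # the weights (logits)
    target = softmax(w) @ w         # fix the average weight at softmax's value

    # minimize negative entropy subject to sum(p) = 1 and the fixed average weight
    res = minimize(lambda p: np.sum(p * np.log(p)),
                   x0=np.full(3, 1/3), bounds=[(1e-9, 1)] * 3,
                   constraints=[{'type': 'eq', 'fun': lambda p: p.sum() - 1},
                                {'type': 'eq', 'fun': lambda p: p @ w - target}])
    print(res.x, softmax(w))        # the two should agree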
replies(1): >>43112729
yorwba No.43112729
In statistical mechanics, fixing the average weight has significance, since the average weight, i.e. the average energy, determines the total energy of a large collection of identical systems, and hence is macroscopically observable.

But in machine learning, it has no significance at all. In particular, to fix the average weight, you need to vary the temperature depending on the individual weights, but machine learning practitioners typically fix the temperature instead, so that the average weight varies wildly.
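To make that concrete (a sketch with made-up numbers): pinning the average weight at a target value forces a different inverse temperature for every set of logits.

    import numpy as np
    from scipy.optimize import brentq
    from scipy.special import softmax

    def beta_for(logits, target):
        # find the inverse temperature that pins the average weight at `target`
        avg = lambda b: softmax(b * logits) @ logits
        return brentq(lambda b: avg(b) - target, 1e-6, 100.0)

    print(beta_for(np.array([2.0, 1.0, 0.1]), 1.5))  # one beta
    print(beta_for(np.array([3.0, 1.0, 0.1]), 1.5))  # different logits, different beta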

So softmax weights (logits) are just one particular way to parameterize a categorical distribution, and there's nothing precluding another parameterization from working just as well or better.
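For example (an arbitrary illustration, not a recommendation), squaring and normalizing the parameters also yields a valid categorical distribution, with no exponential in sight:

    import numpy as np

    def squared_norm(theta):
        # alternative parameterization: p_i = theta_i^2 / sum_j theta_j^2
        t2 = np.asarray(theta, dtype=float) ** 2
        return t2 / t2.sum()

    print(squared_norm([2.0, 1.0, 0.1]))  # sums to 1, just like softmax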

replies(1): >>43112788
C-x_C-f No.43112788
I agree that the choice of softmax is arbitrary, but if I may be nitpicky, the average weight and the temperature determine one another (the average weight is the derivative of the log of the partition function with respect to the inverse temperature). I think the arbitrariness comes more from choosing logits as a weight in the first place.
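A quick numerical check of that identity, using the thread's exp(beta * weight) convention (values made up):

    import numpy as np
    from scipy.special import logsumexp, softmax

    w = np.array([2.0, 1.0, 0.1])
    beta, eps = 1.3, 1e-6

    # log Z(beta) = logsumexp(beta * w); its derivative in beta is the average weight
    dlogZ = (logsumexp((beta + eps) * w) - logsumexp((beta - eps) * w)) / (2 * eps)
    print(dlogZ, softmax(beta * w) @ w)  # the two agree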