Presenting information theory as a series of independent equations like this does a disservice to the learning process. Cross-entropy and KL divergence are directly derived from information entropy: InformationEntropy(P) is the baseline number of bits needed to encode events from the true distribution P, CrossEntropy(P, Q) is the average number of bits needed when events from P are encoded with a suboptimal distribution Q, and KL divergence (better referred to as relative entropy) is the difference between the two, i.e. how many extra bits encoding P with Q costs, quantifying the inefficiency:
relative_entropy(p, q) = cross_entropy(p, q) - entropy(p)
Information theory is some of the most accessible and approachable math for ML practitioners, and it shows up everywhere. In my experience, it's worthwhile to dig into the foundations as opposed to just memorizing the formulas.
(bits here assume base-2 logarithms)
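To make the relationship concrete, here's a minimal sketch for discrete distributions (numpy-based; the toy distributions p and q are made up for illustration):

    import numpy as np

    def entropy(p):
        # Baseline bits needed to encode events drawn from p
        return -np.sum(p * np.log2(p))

    def cross_entropy(p, q):
        # Average bits needed when events from p are encoded with a code optimized for q
        return -np.sum(p * np.log2(q))

    def relative_entropy(p, q):
        # KL divergence: the extra bits paid for using q instead of p
        return np.sum(p * np.log2(p / q))

    p = np.array([0.5, 0.25, 0.25])
    q = np.array([0.25, 0.25, 0.5])

    print(entropy(p))                        # 1.5 bits
    print(cross_entropy(p, q))               # 1.75 bits
    print(relative_entropy(p, q))            # 0.25 bits
    print(cross_entropy(p, q) - entropy(p))  # 0.25, matching the KL divergence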