Presenting information theory as a series of independent equations like this does a disservice to the learning process. Cross-entropy and KL divergence are directly derived from information entropy: InformationEntropy(P) is the baseline number of bits needed to encode events from the true distribution P, CrossEntropy(P, Q) is the average number of bits needed when events from P are encoded with a suboptimal distribution Q, and KL divergence (better referred to as relative entropy) is the difference between the two, i.e. how many extra bits encoding P with Q costs, quantifying the inefficiency:
relative_entropy(p, q) = cross_entropy(p, q) - entropy(p)
Information theory is some of the most accessible and approachable math for ML practitioners, and it shows up everywhere. In my experience, it's worthwhile to dig into the foundations as opposed to just memorizing the formulas.
(bits here assume base-2 logarithms)
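To make the relationship concrete, here's a minimal sketch for discrete distributions (numpy-based; the toy distributions p and q are made up for illustration):

    import numpy as np

    def entropy(p):
        # Baseline bits needed to encode events drawn from p
        return -np.sum(p * np.log2(p))

    def cross_entropy(p, q):
        # Average bits needed when events from p are encoded with a code optimized for q
        return -np.sum(p * np.log2(q))

    def relative_entropy(p, q):
        # KL divergence: the extra bits paid for using q instead of p
        return np.sum(p * np.log2(p / q))

    p = np.array([0.5, 0.25, 0.25])
    q = np.array([0.25, 0.25, 0.5])

    print(entropy(p))                        # 1.5 bits
    print(cross_entropy(p, q))               # 1.75 bits
    print(relative_entropy(p, q))            # 0.25 bits
    print(cross_entropy(p, q) - entropy(p))  # 0.25, matching the KL divergence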