> The amount of information can be viewed as the ‘degree of surprise’ on learning the value of x. If we are told that a highly improbable event has just occurred, we will have received more information than if we were told that some very likely event has just occurred, and if we knew that the event was certain to happen we would receive no information. Our measure of information content will therefore depend on the probability distribution p(x), and we therefore look for a quantity h(x) that is a monotonic function of the probability p(x) and that expresses the information content. The form of h(·) can be found by noting that if we have two events x and y that are unrelated, then the information gain from observing both of them should be the sum of the information gained from each of them separately, so that h(x, y) = h(x) + h(y). Two unrelated events will be statistically independent and so p(x, y) = p(x)p(y). From these two relationships, it is easily shown that h(x) must be given by the logarithm of p(x) and so we have h(x) = − log2 p(x).
This is the definition of information for a single probabilistic event. The definition of entropy of a random variable follows from this by just taking the expectation.