When all outcomes are equally likely, no single event is more surprising than any other, and the expected information content (the entropy) is at its maximum. When some events are much more likely than others, the improbable events tell us a lot when they occur, but they occur rarely; most observations are of the likely events, which tell us little, so the expected information content is lower.
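For example, compare a fair coin with one biased to $p = 0.99$:

$$H(\text{fair}) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit}, \qquad H(\text{biased}) = -0.99\log_2 0.99 - 0.01\log_2 0.01 \approx 0.08 \text{ bits}.$$

The rare outcome carries $-\log_2 0.01 \approx 6.6$ bits when it happens, but it happens only 1% of the time, so the average stays small.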
Shannon’s source coding theorem tells us that a sequence of $N$ random variables drawn i.i.d. from $P$ can be compressed into just over $N\,H(P)$ bits with negligible risk of information loss, and into no fewer. (One intuitive encoding that approaches this bound is Huffman coding.) The raw representation of a sample may take far more bits than that; the entropy $H(P)$ measures the information the sample actually carries, not the size of its uncompressed encoding.
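To make the numbers concrete, here is a minimal Python sketch of the entropy computation (the function name and example distributions are ours, for illustration):

```python
import math

def entropy(p):
    """Shannon entropy, in bits, of a discrete distribution given as a list of probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes: 2.0 bits (the maximum)
print(entropy([0.97, 0.01, 0.01, 0.01]))  # skewed: ~0.24 bits
```

An optimal code for the skewed source spends well under one bit per symbol on average, which is exactly what the source coding theorem licenses.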
Suppose that we have compressed a message drawn from $P$ into an encoding optimized for $Q$. How many bits do we expect to require to encode it? Intuitively, it would be the average of the information content of each outcome under $Q$’s distribution (that is what the code charges per outcome), weighted by the probability of that outcome under $P$’s distribution (that is how often each outcome actually occurs). That is, the required number of bits is the expected information content under $Q$, taken over the distribution of $P$.
The definition of the information content of an event $x$, once it occurs, is $I(x) = -\log_2 p(x)$ under the distribution $p$ the code is built for; here the code is built for $Q$, so outcome $x$ costs $-\log_2 Q(x)$ bits. So we obtain the needed weighted average as

$$H(P, Q) = \mathbb{E}_{x \sim P}\!\left[-\log_2 Q(x)\right] = -\sum_x P(x)\,\log_2 Q(x).$$

This quantity is called the cross-entropy.
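A direct transcription of that formula into Python, continuing the sketch above (the example distributions are made up):

```python
import math

def cross_entropy(p, q):
    """H(P, Q) in bits: expected code length when events drawn from P
    are encoded with a code optimized for Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]         # true distribution
q = [0.6, 0.3, 0.1]         # model / code distribution
print(cross_entropy(p, p))  # encoding P with its own optimal code: H(P) ≈ 1.157 bits
print(cross_entropy(p, q))  # a mismatched code costs more: ≈ 1.196 bits
```

Note that $H(P, Q) \geq H(P)$ always; the mismatch penalty is exactly the divergence discussed below.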
Let $P$ be the ground-truth distribution of some discrete random variable, and let $Q$ be the distribution of a predictive model of that variable. Then we can ask: how many bits does it take to encode the true labels into a representation optimized for our model? Intuitively, such a measure should give us a pretty good idea of how well the model aligns with what it’s modeling.
Another way to look at it is this: when $P = Q$, then $H(P, Q) = H(P)$. That is to say, $D_{\mathrm{KL}}(P \parallel Q) = H(P, Q) - H(P)$ gives us a distance measure, called the Kullback-Leibler divergence, that is zero only when $P = Q$. Cross-entropy differs from this distance measure only by a constant, $H(P)$, which does not depend on the model $Q$. Therefore we can use cross-entropy as an indicator of the fit between a prediction and the true distribution.
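We can check this identity numerically, reusing `cross_entropy`, `p`, and `q` from the sketch above:

```python
def kl_divergence(p, q):
    """D_KL(P || Q) in bits, via the identity D_KL = H(P, Q) - H(P)."""
    return cross_entropy(p, q) - cross_entropy(p, p)

print(kl_divergence(p, q))  # ≈ 0.039 bits: the extra cost of the mismatched code
print(kl_divergence(p, p))  # 0.0: no penalty when Q = P
```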
Why use CE loss instead of using the KL divergence directly? Because KL divergence has an explicit term for the entropy of the true distribution, $H(P)$. We don’t actually know that entropy; we only know the ground-truth labels of the training data, which we can drop into a discrete, empirical version of $H(P, Q)$: a one-hot label $y$ puts all of $P$’s mass on one outcome, so the sum collapses to $-\log_2 Q(y)$. So we use CE, which lets us get away with just the labels.
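A minimal sketch of what this looks like per training example (plain Python with hypothetical data; real frameworks typically work in nats on raw logits for numerical stability, but the minimized quantity is the same up to a constant factor):

```python
import math

def ce_loss(probs, label):
    """Cross-entropy loss for one example: the one-hot true distribution
    zeroes out every term except -log q(y) for the true label y."""
    return -math.log2(probs[label])

# hypothetical 3-class predictions for two training examples, true class 0
print(ce_loss([0.8, 0.1, 0.1], label=0))  # confident and correct: ≈ 0.32 bits
print(ce_loss([0.1, 0.8, 0.1], label=0))  # confident and wrong:   ≈ 3.32 bits
```

Averaging this over the dataset gives the empirical cross-entropy; minimizing it in $Q$ also minimizes the KL divergence, since the unknown $H(P)$ term is constant.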