Recall that Shannon entropy for a discrete frequency distribution $p = (p_1, \ldots, p_N)$ is defined as

$$H(p) = -\sum_{i=1}^{N} p_i \log_2 p_i.$$

Shannon entropy represents the expected information content of the distribution. Exponentiating this quantity, typically with base 2, gives a quantity called perplexity, $PP(p)$:

$$PP(p) = 2^{H(p)}.$$

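As a concrete illustration, here is a minimal sketch of both quantities in Python (NumPy assumed; the function names `entropy_bits` and `perplexity` are ours, not from any particular library):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy of a discrete distribution, in bits."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()              # allow raw frequency counts as input
    nz = p[p > 0]                # convention: 0 * log2(0) = 0
    return -np.sum(nz * np.log2(nz))

def perplexity(p):
    """Perplexity: 2 raised to the entropy in bits."""
    return 2.0 ** entropy_bits(p)
```
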
Why is this useful? To start, consider a uniform distribution over $N$ outcomes, such that $p_i = 1/N$ for every $i$. Then its entropy (in bits) is

$$H(p) = -\sum_{i=1}^{N} \frac{1}{N} \log_2 \frac{1}{N} = \log_2 N.$$

Exponentiating both sides, we see that its perplexity is the number of states, $N$:

$$PP(p) = 2^{H(p)} = 2^{\log_2 N} = N.$$

For non-uniform distributions, perplexity can be read as the “effective number of states,” also called the branching factor of the distribution: a skewed distribution over $N$ outcomes carries less information than a uniform one, so it behaves as though it had fewer than $N$ states.
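To make this concrete, here is the sketch from above applied to a uniform and a skewed distribution over the same eight states (the skewed probabilities are illustrative, not from the text):

```python
p_uniform = np.full(8, 1 / 8)                 # uniform over N = 8 states
print(entropy_bits(p_uniform))                # 3.0 bits = log2(8)
print(perplexity(p_uniform))                  # 8.0: perplexity equals N

p_skewed = np.array([0.7, 0.1, 0.1, 0.05, 0.03, 0.01, 0.005, 0.005])
print(perplexity(p_skewed))                   # ≈ 2.9: effectively ~3 states
```
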

Perplexity in high-cardinality classification tasks

Perplexity is an especially useful idea for systems that predict one of many possible states, such as multi-class classification or large language modeling tasks. For example, consider a large language model with a vocabulary of 50,000 tokens. At the outset of training, when the CE loss is at its highest, we would expect perplexity to be in the ballpark of 50,000, since an untrained model is roughly uniform over the vocabulary. By the end of training, perplexity might fall to the low hundreds or even tens.
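As a rough sanity check, here is a sketch of the loss and perplexity an effectively uniform, untrained model would produce (PyTorch is our assumption here; the all-zero logits simply stand in for uniform predictions):

```python
import torch
import torch.nn.functional as F

V = 50_000                                   # vocabulary size
batch = 64

# All-zero logits give a uniform softmax over the vocabulary,
# approximating an untrained model.
logits = torch.zeros(batch, V)
targets = torch.randint(0, V, (batch,))

loss = F.cross_entropy(logits, targets)      # natural-log CE, in nats
ppl = torch.exp(loss)                        # base must match the loss: exp, not 2**

print(loss.item())   # ≈ 10.82 = ln(50,000)
print(ppl.item())    # ≈ 50,000
```
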

In our example, we can also estimate the expected loss values at the start and end of training: roughly $\ln(50{,}000) \approx 10.8$ at initialization and, for a final perplexity of around 100, roughly $\ln(100) \approx 4.6$ (taking the loss to be the natural-log cross-entropy, so that loss $= \ln(\text{perplexity})$). This is extremely useful, as estimating loss values makes it much easier to select reasonable starting hyperparameter values.
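A quick back-of-the-envelope check, under the same natural-log assumption:

```python
import math

vocab = 50_000
print(math.log(vocab))   # ≈ 10.82: expected loss at initialization
print(math.log(100))     # ≈ 4.61:  loss at a final perplexity of 100
print(math.log(20))      # ≈ 3.00:  loss at a final perplexity of 20
```
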

Note that, for open-domain language models, relatively high perplexity may not be indicative of poor performance or incomplete training, but rather of a wide range of valid responses. (A good decoding strategy often comes into play in these situations.)