The Kullback-Leibler divergence of a true distribution $p$ and the predicted distribution $q$ is the difference between the cross-entropy between them and the Shannon entropy of the true distribution,

$$D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p).$$
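
As a quick numerical sanity check (not part of the derivation itself), this identity can be verified for a small discrete example. The distributions `p` and `q` below are arbitrary illustrative choices.

```python
import numpy as np

# Two arbitrary discrete distributions over the same support (illustrative only).
p = np.array([0.5, 0.3, 0.2])   # "true" distribution
q = np.array([0.4, 0.4, 0.2])   # "predicted" distribution

shannon_entropy = -np.sum(p * np.log(p))      # H(p)
cross_entropy   = -np.sum(p * np.log(q))      # H(p, q)
kl_divergence   =  np.sum(p * np.log(p / q))  # D_KL(p || q)

# D_KL(p || q) = H(p, q) - H(p)
assert np.isclose(kl_divergence, cross_entropy - shannon_entropy)
```
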
Hence, if we can show that the KL divergence is always greater than or equal to zero, we will have also shown that the cross-entropy is greater than or equal to the Shannon entropy of the true distribution, i.e.

$$H(p, q) \geq H(p).$$
Jensen’s inequality states that for a random variable $X$ and a convex function $f$,

$$f\big(\mathbb{E}[X]\big) \leq \mathbb{E}\big[f(X)\big].$$
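
To make the inequality concrete, here is a small Monte Carlo illustration (my own sketch, not part of the original argument) using the convex function $f(x) = -\log x$ that appears below; the log-normal sampling distribution is an arbitrary choice of positive random variable.

```python
import numpy as np

rng = np.random.default_rng(0)

# A positive random variable X (log-normal samples; any positive distribution works).
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

def f(t):
    return -np.log(t)   # a convex function

# Jensen's inequality for convex f: f(E[X]) <= E[f(X)]
print(f(x.mean()), f(x).mean())   # roughly -0.5 vs 0.0 for this choice of X
assert f(x.mean()) <= f(x).mean()
```
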
Let $f(x) = -\log x$, which is convex because

$$\frac{d^2}{dx^2}\left(-\log x\right) = \frac{1}{x^2}$$

is always positive.
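
If you want to double-check that derivative symbolically rather than by hand, a one-line SymPy computation (assuming SymPy is available) confirms it:

```python
import sympy as sp

x = sp.symbols('x', positive=True)
# Second derivative of f(x) = -log(x); it is positive for all x > 0, so f is convex.
print(sp.diff(-sp.log(x), x, 2))   # prints x**(-2)
```
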
Now let $X = \frac{q(x)}{p(x)}$, where $p$ and $q$ are both probability distributions and the expectation is taken with $x$ distributed according to $p$. Then

$$\mathbb{E}[X] = \mathbb{E}_{x \sim p}\!\left[\frac{q(x)}{p(x)}\right] = \int p(x)\,\frac{q(x)}{p(x)}\,dx.$$
Remember that $q$ is a PDF, so $\int q(x)\,dx = 1$. Cancelling the $p(x)$ terms, we therefore observe that the integral must evaluate to $1$.
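
The same cancellation can be checked numerically in the discrete case, where the integral becomes a sum over the support (again just an illustrative sketch with arbitrary distributions):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # distribution the expectation is taken under
q = np.array([0.4, 0.4, 0.2])   # any other distribution on the same support

# E_p[q/p] = sum_x p(x) * q(x) / p(x) = sum_x q(x) = 1
expectation = np.sum(p * (q / p))
assert np.isclose(expectation, 1.0)
```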

Substituting $f(x) = -\log x$, $X = \frac{q(x)}{p(x)}$, and $\mathbb{E}[X] = 1$ into Jensen’s inequality, we obtain:

$$-\log\!\left(\mathbb{E}\!\left[\frac{q(x)}{p(x)}\right]\right) = -\log 1 \leq \mathbb{E}\!\left[-\log\frac{q(x)}{p(x)}\right],$$
which simplifies to

$$0 \leq \mathbb{E}\!\left[-\log\frac{q(x)}{p(x)}\right].$$
So we evaluate the expected value and obtain

$$0 \leq \int p(x)\left(-\log\frac{q(x)}{p(x)}\right)dx = -\int p(x)\log\frac{q(x)}{p(x)}\,dx.$$
Now recall the logarithm property that $\log\frac{a}{b} = -\log\frac{b}{a}$, from which it follows that $-\log\frac{q(x)}{p(x)} = \log\frac{p(x)}{q(x)}$. Then we can rewrite the above as

$$\int p(x)\log\frac{p(x)}{q(x)}\,dx \geq 0.$$
We recognize the left-hand side as the KL divergence, $D_{\mathrm{KL}}(p \,\|\, q)$. Hence the KL divergence is never less than zero. Since the KL divergence is the difference between the cross-entropy and the Shannon entropy, we conclude that the cross-entropy can never be less than the Shannon entropy.
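
As a final sanity check, the conclusion can be probed empirically: for randomly drawn discrete distributions, the KL divergence stays non-negative and the cross-entropy never dips below the Shannon entropy. This is only an illustration of the result, not a substitute for the proof above.

```python
import numpy as np

rng = np.random.default_rng(42)

for _ in range(1_000):
    # Draw two random discrete distributions over 10 outcomes.
    p = rng.dirichlet(np.ones(10))
    q = rng.dirichlet(np.ones(10))

    kl            =  np.sum(p * np.log(p / q))   # D_KL(p || q)
    cross_entropy = -np.sum(p * np.log(q))       # H(p, q)
    entropy       = -np.sum(p * np.log(p))       # H(p)

    assert kl >= -1e-12                          # KL divergence is never negative
    assert cross_entropy >= entropy - 1e-12      # H(p, q) >= H(p)

print("All checks passed.")
```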