Recall that Shannon entropy is the average amount of information needed to describe a random event $X$:

$$
H(X) = -\int p(x) \log p(x) \, dx
$$
Recall also that conditional entropy is the average amount of information needed to describe a random event $X$, given that a random event $Y$ is already known:

$$
H(X \mid Y) = -\iint p(x, y) \log p(x \mid y) \, dx \, dy
$$
Then the difference $H(X) - H(X \mid Y)$ is the average amount of information about $X$ that is explained by $Y$. Put another way, it is the amount of information that $Y$ encodes about $X$. This quantity is the mutual information $I(X; Y)$:

$$
I(X; Y) = H(X) - H(X \mid Y)
$$
As illuminated by the alternative formulation below, mutual information is symmetric; that is:

$$
I(X; Y) = I(Y; X), \quad \text{i.e.} \quad H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)
$$
I find it surprising and unintuitive that the average amount of information $Y$ encodes about $X$ is exactly the same as the average amount of information that $X$ encodes about $Y$, but we can see it clearly in the following.
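
Before getting to that, here is a quick numerical sanity check in Python. The joint distribution below is made up purely for illustration; the script computes $H(X) - H(X \mid Y)$ and $H(Y) - H(Y \mid X)$ for a small discrete joint distribution and confirms they agree.

```python
import numpy as np

# A small made-up joint distribution p(x, y) over 2 values of X and 3 values of Y.
# Rows index x, columns index y; entries sum to 1.
p_xy = np.array([
    [0.10, 0.25, 0.15],
    [0.30, 0.05, 0.15],
])

p_x = p_xy.sum(axis=1)   # marginal p(x)
p_y = p_xy.sum(axis=0)   # marginal p(y)

def H(p):
    """Shannon entropy of a discrete distribution, in nats."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Conditional entropies from the joint:
# H(X|Y) = -sum_{x,y} p(x,y) log p(x|y), with p(x|y) = p(x,y) / p(y).
H_x_given_y = -np.sum(p_xy * np.log(p_xy / p_y[np.newaxis, :]))
H_y_given_x = -np.sum(p_xy * np.log(p_xy / p_x[:, np.newaxis]))

I_from_x = H(p_x) - H_x_given_y   # information Y carries about X
I_from_y = H(p_y) - H_y_given_x   # information X carries about Y

print(I_from_x, I_from_y)         # the two numbers match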

Another way to look at it

A more insightful form of this, suggested by Claude AI, is:

$$
I(X; Y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, dx \, dy
$$
This draws attention to a few facts:

  • It directly compares the joint distribution $p(x, y)$ to the product of the marginal distributions $p(x)\,p(y)$.
  • It shows that mutual information is symmetric: swapping $X$ and $Y$ leaves the expression unchanged.
  • It shows that it’s always non-negative.
  • It shows that mutual information is zero iff $X$ and $Y$ are independent (the forward direction is spelled out just after this list).
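
To spell out the last point in the forward direction: if $X$ and $Y$ are independent, then $p(x, y) = p(x)\,p(y)$ everywhere, so the ratio inside the logarithm is always 1 and the whole integral vanishes:

$$
I(X; Y) = \iint p(x)\, p(y) \log \frac{p(x)\, p(y)}{p(x)\, p(y)} \, dx \, dy = \iint p(x)\, p(y) \cdot \log 1 \, dx \, dy = 0
$$

(The converse direction, that $I(X; Y) = 0$ forces independence, takes a bit more work and isn’t shown here.)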

Derivation of Claude’s formulation

Let’s start with the original formulation:

$$
I(X; Y) = H(X) - H(X \mid Y)
$$
Plugging in the integrals above, we obtain

$$
I(X; Y) = \iint p(x, y) \log p(x \mid y) \, dx \, dy - \int p(x) \log p(x) \, dx
$$
Let’s work on that logarithm in the first integral. We can rewrite it as

$$
\log p(x \mid y) = \log \frac{p(x, y)}{p(y)} = \log \left( \frac{p(x, y)}{p(x)\, p(y)} \cdot p(x) \right)
$$
Then we can use the logarithm property $\log(ab) = \log a + \log b$ to get:

$$
\log p(x \mid y) = \log \frac{p(x, y)}{p(x)\, p(y)} + \log p(x)
$$
We put this back into our integral:

$$
I(X; Y) = \iint p(x, y) \left( \log \frac{p(x, y)}{p(x)\, p(y)} + \log p(x) \right) dx \, dy - \int p(x) \log p(x) \, dx
$$
Then we distribute:

$$
I(X; Y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, dx \, dy + \iint p(x, y) \log p(x) \, dx \, dy - \int p(x) \log p(x) \, dx
$$

When we integrate $p(x, y)$ out over $y$, it becomes $p(x)$. Using this, the second term becomes:

$$
\iint p(x, y) \log p(x) \, dx \, dy = \int p(x) \log p(x) \, dx
$$

The last two terms cancel out, leaving us with:

$$
I(X; Y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, dx \, dy
$$
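
As a sanity check on the result (not part of the derivation), here is a small Python sketch that evaluates both sides of the identity on a made-up discrete joint distribution; the sums play the role of the integrals above.

```python
import numpy as np

# Toy joint distribution p(x, y); the values are made up for illustration.
p_xy = np.array([
    [0.10, 0.25, 0.15],
    [0.30, 0.05, 0.15],
])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Original formulation: H(X) - H(X|Y).
H_x = -np.sum(p_x * np.log(p_x))
H_x_given_y = -np.sum(p_xy * np.log(p_xy / p_y[np.newaxis, :]))
lhs = H_x - H_x_given_y

# Derived form: sum of p(x,y) * log( p(x,y) / (p(x) p(y)) ).
rhs = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))

print(lhs, rhs)   # identical up to floating-point error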