Recall that Shannon entropy is

$$H(X) = -\int p(x) \log p(x)\, dx$$

where $p(x)$ is the density of the random variable $X$. Recall also that conditional entropy is

$$H(X \mid Y) = -\iint p(x, y) \log p(x \mid y)\, dx\, dy$$

Then the difference

$$I(X; Y) = H(X) - H(X \mid Y)$$

is the mutual information between $X$ and $Y$: the average reduction in our uncertainty about $X$ once we observe $Y$.
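To make these definitions concrete, here is a minimal numerical sketch using a small discrete joint distribution (my own illustrative numbers, with sums standing in for the integrals); it computes $H(X)$, $H(X \mid Y)$, and their difference with NumPy.

```python
import numpy as np

# Hypothetical 2x2 joint distribution p(x, y); rows index x, columns index y.
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])

p_x = p_xy.sum(axis=1)   # marginal p(x)
p_y = p_xy.sum(axis=0)   # marginal p(y)

# Shannon entropy: H(X) = -sum_x p(x) log p(x)
H_X = -np.sum(p_x * np.log(p_x))

# Conditional entropy: H(X|Y) = -sum_{x,y} p(x, y) log p(x|y), with p(x|y) = p(x, y) / p(y)
H_X_given_Y = -np.sum(p_xy * np.log(p_xy / p_y))

# Mutual information as the difference
I_XY = H_X - H_X_given_Y
print(f"H(X) = {H_X:.4f}, H(X|Y) = {H_X_given_Y:.4f}, I(X;Y) = {I_XY:.4f}")
```

For this (dependent) example the printed mutual information is positive: learning $Y$ reduces the entropy of $X$.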
As illuminated by the alternative formulation below, mutual information is symmetric; that is:

$$I(X; Y) = I(Y; X)$$
I find it surprising and unintuitive that the average amount of information Y encodes about X is exactly the same as the average amount of information that X encodes about Y, but we can see it clearly in the following.
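A quick way to convince yourself numerically: compute the reduction both ways for the same toy distribution (hypothetical numbers, discrete analogue of the formulas above) and check that they agree.

```python
import numpy as np

p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])    # illustrative joint p(x, y); rows = x, columns = y
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Information Y carries about X: I(X;Y) = H(X) - H(X|Y)
H_X = -np.sum(p_x * np.log(p_x))
H_X_given_Y = -np.sum(p_xy * np.log(p_xy / p_y))           # p(x|y) = p(x,y)/p(y)
info_about_X = H_X - H_X_given_Y

# Information X carries about Y: I(Y;X) = H(Y) - H(Y|X)
H_Y = -np.sum(p_y * np.log(p_y))
H_Y_given_X = -np.sum(p_xy * np.log(p_xy / p_x[:, None]))  # p(y|x) = p(x,y)/p(x)
info_about_Y = H_Y - H_Y_given_X

print(info_about_X, info_about_Y)              # same value
print(np.isclose(info_about_X, info_about_Y))  # True
```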
Another way to look at it
A more insightful form of this, suggested by Claude AI, is:

$$I(X; Y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}\, dx\, dy$$
This draws attention to a few facts:
- It directly compares the joint distribution $p(x, y)$ to the product of the marginal distributions $p(x)\, p(y)$.
- It shows that mutual information is symmetric.
- It shows that it’s always non-negative.
- It shows that mutual information is zero iff $X$ and $Y$ are independent, i.e. exactly when $p(x, y) = p(x)\, p(y)$ (see the numerical sketch after this list).
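Here is a short sketch of that formulation as code (again a discrete analogue with made-up numbers): a helper that compares the joint to the product of the marginals, evaluated once on a dependent joint and once on an exactly independent one.

```python
import numpy as np

def mutual_info(p_xy):
    """I(X;Y) = sum_{x,y} p(x,y) * log(p(x,y) / (p(x) p(y))), skipping zero-probability cells."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x) as a column
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y) as a row
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask]))

# Dependent joint (hypothetical numbers): the value is strictly positive.
p_dep = np.array([[0.30, 0.10],
                  [0.15, 0.45]])
print(mutual_info(p_dep))

# Symmetry: transposing the joint swaps the roles of X and Y but leaves the value unchanged.
print(np.isclose(mutual_info(p_dep), mutual_info(p_dep.T)))   # True

# Independent joint: built as the outer product of its marginals, so I(X;Y) ~ 0.
p_ind = np.outer([0.40, 0.60], [0.45, 0.55])
print(mutual_info(p_ind))
```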
Derivation of Claude’s formulation
Let’s start with the original formulation:

$$I(X; Y) = H(X) - H(X \mid Y)$$

Plugging in the integrals above, we obtain

$$I(X; Y) = -\int p(x) \log p(x)\, dx + \iint p(x, y) \log p(x \mid y)\, dx\, dy$$

Let’s work on the logarithm in the conditional-entropy integral. Using $p(x \mid y) = \frac{p(x, y)}{p(y)}$, we can rewrite it as

$$\log p(x \mid y) = \log \frac{p(x, y)}{p(y)}$$

Then we can use the logarithm property $\log \frac{a}{b} = \log a - \log b$ to split it:

$$\log p(x \mid y) = \log p(x, y) - \log p(y)$$

We put this back into our integral:

$$I(X; Y) = -\int p(x) \log p(x)\, dx + \iint p(x, y) \bigl[\log p(x, y) - \log p(y)\bigr]\, dx\, dy$$

Then we distribute $p(x, y)$ over the difference:

$$I(X; Y) = -\int p(x) \log p(x)\, dx + \iint p(x, y) \log p(x, y)\, dx\, dy - \iint p(x, y) \log p(y)\, dx\, dy$$
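As a sanity check on the algebra so far, here is a small numerical sketch (the same kind of hypothetical discrete example as before) confirming that the three distributed terms add up to the same value as $H(X) - H(X \mid Y)$.

```python
import numpy as np

p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])    # illustrative joint p(x, y); rows = x, columns = y
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Original formulation: I(X;Y) = H(X) - H(X|Y)
H_X = -np.sum(p_x * np.log(p_x))
H_X_given_Y = -np.sum(p_xy * np.log(p_xy / p_y))
I_original = H_X - H_X_given_Y

# Distributed form: the three sums mirroring the three integrals above
term1 = -np.sum(p_x * np.log(p_x))     # -sum_x p(x) log p(x)
term2 = np.sum(p_xy * np.log(p_xy))    # +sum_{x,y} p(x,y) log p(x,y)
term3 = -np.sum(p_xy * np.log(p_y))    # -sum_{x,y} p(x,y) log p(y)
I_distributed = term1 + term2 + term3

print(np.isclose(I_original, I_distributed))   # True
```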