The Wasserstein metric (WM) measures the distance between two probability distributions. Unlike information-theoretic measures such as KL divergence, the population stability index (PSI), and Jensen-Shannon divergence (JSD), the WM works well for non-overlapping distributions. (The JSD is defined for such distributions, but it approaches its maximum value as the overlap shrinks, so it says little about how far apart they are.)
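To make the point about the JSD concrete, here is a small Python sketch. The `jsd` helper is written out from the standard definition (natural log), and the three histograms are invented for illustration. Once two histograms stop overlapping, the JSD returns the same value no matter how far apart they sit.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two normalized histograms."""
    # Mixture distribution: the bin-wise average of p and q.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # KL divergence, skipping empty bins of x (0 * log 0 is taken as 0).
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical histograms: all mass in a single bin.
base = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
near = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # shifted by one bin
far = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]   # shifted by five bins

print(jsd(base, near))  # log(2), about 0.693
print(jsd(base, far))   # the same value: the JSD cannot see the extra distance
```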

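Given distributions $P$ and $Q$, divided into histogram bins $P_i$ and $Q_i$, the Wasserstein metric between them (taking one bin width as the unit of distance) is defined recursively as

$$W(P, Q) = \sum_i |\delta_i|$$

where

$$\delta_0 = 0, \qquad \delta_{i+1} = \delta_i + P_i - Q_i$$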
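The recursive definition can be sketched in a few lines of Python: keep a running surplus of mass, add each bin's difference to it, and sum the absolute surplus over all steps. This is a minimal sketch, assuming both histograms use the same equal-width bins and have the same total mass, with distance measured in bin widths. The name `wasserstein_1d` and the example histograms are made up for illustration.

```python
def wasserstein_1d(p, q):
    """Wasserstein distance between two histograms on the same bins.

    Assumes equal-width bins and equal total mass; the result is in bin widths.
    """
    if len(p) != len(q):
        raise ValueError("histograms must share the same bins")
    carried = 0.0  # surplus mass pushed forward from the bins seen so far
    total = 0.0    # accumulated work
    for p_i, q_i in zip(p, q):
        carried += p_i - q_i   # recursive step: add this bin's surplus or deficit
        total += abs(carried)  # the carried mass crosses one bin boundary
    return total


# Hypothetical histograms: all mass in a single bin.
base = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
near = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # shifted by one bin
far = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]   # shifted by five bins

print(wasserstein_1d(base, near))  # 1.0
print(wasserstein_1d(base, far))   # 5.0
```

The distance grows with the size of the shift even though none of these histograms overlap.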

Earth mover’s distance

The WM is also called the “Earth mover’s distance” based on a metaphor about shoveling dirt. Imagine that each distribution is a pile of dirt. The Wasserstein distance represents the amount of work required to convert one into the other, where work is defined as mass times distance.

I’ll admit, I find this a little hard to see. It helps to see that the sum over the recursive steps is a bit like multiplication: the surplus carried from one bin to the next plays the role of mass, and the number of bins it is carried across plays the role of distance.
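A small worked example may help. The numbers are made up for illustration, and $\delta_i$ denotes the surplus carried forward after each bin, starting from $\delta_0 = 0$. Take $P = (0.5, 0.5, 0, 0)$ and $Q = (0, 0.5, 0, 0.5)$, so that half of the dirt has to travel three bins to the right:

$$
\begin{aligned}
\delta_1 &= 0 + 0.5 - 0 = 0.5 \\
\delta_2 &= 0.5 + 0.5 - 0.5 = 0.5 \\
\delta_3 &= 0.5 + 0 - 0 = 0.5 \\
\delta_4 &= 0.5 + 0 - 0.5 = 0 \\
W(P, Q) &= 0.5 + 0.5 + 0.5 + 0 = 1.5 = 0.5 \times 3
\end{aligned}
$$

The same mass of 0.5 shows up in three consecutive steps, so adding the steps gives the same answer as multiplying a mass of 0.5 by a distance of 3 bins.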