The “curse of dimensionality” refers to a collection of phenomena that limit the interpretability of high-dimensional data, even in the presence of significant underlying structure. The claim that this is a “curse” reflects a particular perspective; high-dimensional data also exhibits various advantageous characteristics, which have cheekily been dubbed “the blessing of dimensionality.”

In my experience, the outstanding “curse” of high-dimensional data is the tendency of distance measures to become increasingly uniform. That is, for very high-dimensional data, points that are meaningfully similar don’t end up much “closer” to one another than points that are far less similar. In such cases, the underlying “closeness” can often be revealed by a dimensionality reduction strategy whose assumptions match the structure of the data.
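As a quick illustration (a minimal simulation sketch, separate from the derivation below; the dimensions, sample size, and uniform coordinates are arbitrary choices), the ratio between the farthest and nearest neighbor of a random reference point drifts toward 1 as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# For each dimension, draw a reference point and a cloud of 1,000 other
# points uniformly from the unit cube, then compare the nearest and
# farthest Euclidean distances from the reference point.
for d in (2, 10, 100, 1_000, 10_000):
    ref = rng.uniform(size=d)
    cloud = rng.uniform(size=(1_000, d))
    dists = np.linalg.norm(cloud - ref, axis=1)
    print(f"d={d:>6}  farthest/nearest ratio: {dists.max() / dists.min():.2f}")
```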

Whether this assertion holds depends on the distribution and the distance measure, but it can be helpful to work through a common example.

A worked example

Suppose you’ve got two points $x$ and $y$ in a $d$-dimensional space. Then the squared Euclidean distance between them is given as

$$D^2 = \lVert x - y \rVert^2 = \sum_{i=1}^{d} (x_i - y_i)^2$$

(We use squared distance for convenience.) If we assume that the coordinates $x_i$ and $y_i$ are i.i.d. with mean $\mu$ and variance $\sigma^2$, then by linearity of expectation

$$\mathbb{E}\left[D^2\right] = \sum_{i=1}^{d} \mathbb{E}\left[(x_i - y_i)^2\right]$$

By definition of variance,

$$\mathbb{E}\left[(x_i - y_i)^2\right] = \mathrm{Var}(x_i - y_i) + \left(\mathbb{E}\left[x_i - y_i\right]\right)^2$$

And by construction, since $x_i$ and $y_i$ are i.i.d.,

$$\mathrm{Var}(x_i - y_i) = 2\sigma^2 \qquad \text{and} \qquad \mathbb{E}\left[x_i - y_i\right] = 0$$

Substituting these into our expectation above, we have

$$\mathbb{E}\left[(x_i - y_i)^2\right] = 2\sigma^2$$

And so

$$\mathbb{E}\left[D^2\right] = \sum_{i=1}^{d} 2\sigma^2 = 2d\sigma^2$$
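This expectation is easy to sanity-check numerically. Below is a minimal sketch assuming standard normal coordinates (so $\mu = 0$ and $\sigma^2 = 1$) and a handful of arbitrarily chosen dimensions; the empirical mean of the squared distance should sit close to $2d\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 1.0      # coordinate variance (standard normal draws)
n_pairs = 5_000   # number of random point pairs per dimension

for d in (1, 10, 100, 1_000):
    x = rng.standard_normal((n_pairs, d))
    y = rng.standard_normal((n_pairs, d))
    sq_dist = ((x - y) ** 2).sum(axis=1)
    print(f"d={d:>5}  empirical E[D^2]={sq_dist.mean():9.1f}   "
          f"predicted 2*d*sigma^2={2 * d * sigma2:9.1f}")
```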

To see why this implies increasing uniformity of distances, we can look at the ratio of the standard deviation to the mean (called the coefficient of variation). For convenience, we will focus on the square of the Euclidean distance, rather than the distance itself. If this ratio becomes arbitrarily small as $d$ increases, it suggests that the differences in distance between pairs of points become negligible relative to the typical distance.

To begin, consider that since $x_i$ and $y_i$ are i.i.d. with mean $\mu$ and variance $\sigma^2$, the squared differences $(x_i - y_i)^2$ are themselves i.i.d., so the variance of their sum is the sum of their variances:

$$\mathrm{Var}\left(D^2\right) = \sum_{i=1}^{d} \mathrm{Var}\left[(x_i - y_i)^2\right]$$

It can be shown from the above that, provided the coordinates have a finite fourth moment, each term in this sum is a constant that does not depend on $d$:

$$\mathrm{Var}\left[(x_i - y_i)^2\right] = \mathbb{E}\left[(x_i - y_i)^4\right] - \left(\mathbb{E}\left[(x_i - y_i)^2\right]\right)^2 = \mathbb{E}\left[(x_i - y_i)^4\right] - 4\sigma^4 \equiv c$$

Therefore

$$\mathrm{Var}\left(D^2\right) = cd$$

Finally, we can use this and our previously calculated expectation of distance to calculate the coefficient of variation as

$$\mathrm{CV}\left(D^2\right) = \frac{\sqrt{\mathrm{Var}\left(D^2\right)}}{\mathbb{E}\left[D^2\right]} = \frac{\sqrt{cd}}{2d\sigma^2} = \frac{\sqrt{c}}{2\sigma^2} \cdot \frac{1}{\sqrt{d}}$$
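For a concrete check of this formula (again a sketch, assuming standard normal coordinates: then $x_i - y_i \sim \mathcal{N}(0, 2\sigma^2)$, the per-term constant works out to $c = 2(2\sigma^2)^2 = 8$, and the predicted coefficient of variation is $\sqrt{2/d}$):

```python
import numpy as np

rng = np.random.default_rng(2)
n_pairs = 5_000

# With standard normal coordinates, x_i - y_i ~ N(0, 2), so each term
# (x_i - y_i)^2 has variance c = 2 * (2)^2 = 8 and the predicted
# coefficient of variation of D^2 is sqrt(c*d) / (2*d) = sqrt(2/d).
for d in (10, 100, 1_000):
    x = rng.standard_normal((n_pairs, d))
    y = rng.standard_normal((n_pairs, d))
    sq_dist = ((x - y) ** 2).sum(axis=1)
    print(f"d={d:>5}  empirical CV={sq_dist.std() / sq_dist.mean():.3f}   "
          f"predicted CV={np.sqrt(2 / d):.3f}")
```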

Hence we see that, as the dimension $d$ increases, the coefficient of variation shrinks like $1/\sqrt{d}$: the mean squared distance grows linearly in $d$, while the typical fluctuation around that mean grows only like $\sqrt{d}$, so eventually the mean completely swamps the fluctuations and the differences between pairwise distances become comparatively trivial.
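And the $1/\sqrt{d}$ decay is not specific to any one coordinate distribution. A final sketch (with uniform coordinates this time, and arbitrarily chosen dimensions) shows the empirical coefficient of variation of the squared distance falling off roughly in proportion to $1/\sqrt{d}$, i.e. $\sqrt{d}\cdot\mathrm{CV}$ hovers around a constant:

```python
import numpy as np

rng = np.random.default_rng(3)
n_pairs = 1_000

# Uniform coordinates on [0, 1]; the coefficient of variation of the
# squared distance should shrink like 1/sqrt(d), so sqrt(d) * CV should
# stay roughly constant as the dimension grows.
for d in (10, 100, 1_000, 10_000):
    x = rng.uniform(size=(n_pairs, d))
    y = rng.uniform(size=(n_pairs, d))
    sq_dist = ((x - y) ** 2).sum(axis=1)
    cv = sq_dist.std() / sq_dist.mean()
    print(f"d={d:>6}  CV={cv:.4f}   sqrt(d)*CV={np.sqrt(d) * cv:.3f}")
```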