Mahalanobis Distance and Outliers

[Figure: Mahalanobis distance illustration]

I wrote a short article on Absolute Deviation Around the Median a few months ago after having a conversation with Ryon regarding robust parameter estimators. I am excited to see a wet lab scientist take a big interest in ways to challenge the usual bench-stats paradigm of t-tests, means, and z-scores. When researching KNN and SOFM, I found it odd that some of the applied literature defaulted to Cartesian coordinates without explanation. My understanding is that how much a model gains from one coordinate system over another depends on the natural groupings or clusters present in the data. Taking simple k-means as an example, the appropriate choice of coordinate system depends on the full covariance of your clusters. Using the Euclidean distance assumes that clusters have identity covariances, i.e. all dimensions are statistically independent and the variance along each dimension (column) is equal to one.
Note: for more on identity covariances, check out this excellent post from Dustin Stansbury (surprising to see so much MATLAB when Fernando Pérez is at Cal):
http://theclevermachine.wordpress.com/tag/identity-covariance/
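
As a quick sanity check on that claim, here's a minimal NumPy/SciPy sketch (the points are arbitrary, chosen just for illustration) showing that with an identity covariance matrix the Mahalanobis distance collapses to the plain Euclidean distance:

```python
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis

# Two arbitrary points in 3-dimensional space.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 1.5])

# scipy's mahalanobis() takes the *inverse* covariance matrix;
# the inverse of the identity is the identity itself.
VI = np.eye(3)

print(euclidean(x, y))        # ~2.6926
print(mahalanobis(x, y, VI))  # identical: identity covariance = Euclidean
```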

Simple Conceptualization:
If we were to visualize this characteristic (clusters with identity covariances) in 2-dimensional space, our clusters would appear as circles. Conversely, if the covariances of the clusters in the data are not identity matrices, the clusters might appear as ellipses. I suspect some of the affinity for simpler distance measures in these papers comes down to maintaining computational efficiency (both runtime and numerical stability), as the Mahalanobis distance requires inverting the covariance matrix, which can be costly. A short sketch illustrating the circle-versus-ellipse intuition follows below.
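
To make that picture concrete, here is a short NumPy sketch (the covariance values are invented for illustration): two points at the same Euclidean distance from the cluster center sit at very different Mahalanobis distances once the elongated shape of the cluster is taken into account.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an elongated (elliptical) cluster: high variance along x,
# low variance along y, with some correlation between the two.
cov = np.array([[4.0, 1.2],
                [1.2, 0.5]])
data = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=500)

center = data.mean(axis=0)
S_inv = np.linalg.inv(np.cov(data, rowvar=False))

def mahalanobis_from_center(p, center, S_inv):
    d = p - center
    return np.sqrt(d @ S_inv @ d)

# Two points at the same Euclidean distance from the center...
a = np.array([2.0, 0.0])  # along the long (high-variance) axis
b = np.array([0.0, 2.0])  # along the short (low-variance) axis

# ...but very different Mahalanobis distances: b is far more unusual.
print(mahalanobis_from_center(a, center, S_inv))  # small, ~1.9
print(mahalanobis_from_center(b, center, S_inv))  # large, ~5.4
```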

Consider the image above to be a scatterplot of multivariate data where the axes are defined by the data rather than the data being plotted in a predetermined coordinate space. The center (black dot) is the point whose coordinates are the average values of the coordinates of all points in the figure. The black line that runs along the direction of highest variance through the field of points can be called the x-axis. The second axis (red line) is orthogonal to the first. The scale can be set by the 68–95–99.7 (three-sigma) rule, under which roughly 68% of points should lie within one unit (one standard deviation) of the center (origin). Essentially, the Mahalanobis distance is a Euclidean distance that accounts for the covariance of the data by down-weighting directions of higher variance.
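
For reference, the quantity being described is the standard definition: for a point x, sample mean vector μ, and sample covariance matrix S, the distance is D(x) = sqrt((x − μ)ᵀ S⁻¹ (x − μ)), which reduces to the ordinary Euclidean distance from the mean when S is the identity matrix.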

I'll move on to a quick Python implementation of an outlier detection function based on the Mahalanobis distance calculation. See below for the IPython notebook:
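
The notebook itself isn't reproduced here, but a minimal sketch of the idea might look like the following. The function name, the chi-square cutoff, and the default alpha are my own choices for illustration, not necessarily what the notebook uses; it assumes a rows-as-samples array with more rows than columns so the sample covariance is invertible.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.01):
    """Flag rows of X whose squared Mahalanobis distance from the
    sample mean exceeds the chi-square cutoff at significance alpha.

    Assumes X is an (n_samples, n_features) array with n > p so that
    the sample covariance matrix is invertible.
    """
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diffs = X - mu
    # Squared Mahalanobis distance for every row (row-wise quadratic form).
    d2 = np.einsum('ij,jk,ik->i', diffs, S_inv, diffs)
    # Under multivariate normality, d2 is approximately chi-square
    # distributed with p degrees of freedom.
    cutoff = chi2.ppf(1 - alpha, df=X.shape[1])
    return np.sqrt(d2), d2 > cutoff

# Example usage with made-up data:
rng = np.random.default_rng(42)
X = rng.multivariate_normal([0, 0], [[2.0, 0.8], [0.8, 1.0]], size=200)
X = np.vstack([X, [[8.0, -6.0]]])  # plant an obvious outlier
distances, flags = mahalanobis_outliers(X)
print(np.where(flags)[0])  # should include the planted point (index 200)
```

One caveat worth noting, which ties back to the robust-estimator conversation that opened this post: the sample mean and covariance used here are themselves not robust to outliers, so heavily contaminated data can mask its own extremes. A robust covariance estimate (e.g. Minimum Covariance Determinant) is a common refinement.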