abidibo.net

Notes on the Pearson correlation coefficient

statistics algorithms machine learning

The Pearson correlation coefficient is a measure of the linear correlation between two variables X and Y. It has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. It plays a role in the algorithms which try to find similarities between different set of data.

Think of the common problem of finding the best recommendation for a user visiting your site. One of the strategies is to recommend products (movies, books, ...) by looking at what similar users have previously bought. So the first problem is to find a way to define similar users.

The Pearson correlation coefficient can be used in such case, because it can measure the distance between different datasets.

It can be a better measure of such similarity than the Euclidean distance, because it can find a good match even if data are not so similar, but the data trend is. For example, me and my friend both think that 'Natural Born Killer' is the best movie in the world, but I don't like to give the maximum vote because I think that something better can be made, so I give 8, while my friends gives 10. At the same time I hate 'Batman Vs Superman', and my friend too, but I'm very angry when giving bad votes, so I give it 2, while my friend will not go below 4. And so for a "bad but not too much" movie, I'll give 4, while my friend 6.

Our votes are quite different if seen in an Euclidean way, but our tastes are similar, the difference is just that I keep my votes lower, while my friend generally keeps them higher. The Pearson coefficient can see that our trends are similar and the best fit line between our votes will show an high correlation.

In other words, the Pearson coefficient corrects for grade inflation.

The Pearson correlation coefficient is defined has:

\[ r = \frac{cov(X, Y)}{\sigma_X \sigma_Y} \]

Now we'll just manipulate this formula a bit, in order to get something easier to express in our code.

Given the definitions of covariance and standard deviation:

\[r = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n}}} \]

Let's call Num the numerator:

\[Num = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]

\[Num = \frac{1}{n} \sum_{i=1}^{n} (x_i y_i - x_i \bar{y} - \bar{x} y_i + \bar{x}\bar{y}) \]

\[Num = \frac{1}{n} (\sum_{i=1}^{n} {x_i y_i} - \bar{y} \sum_{i=1}^{n} x_i - \bar{x} \sum_{i=1}^{n} y_i + n \bar{x} \bar{y}) \]

We all now what \(\bar{x}\) is, so:

\[Num = \frac{1}{n} (\sum_{i=1}^{n} {x_i y_i} - \frac{\sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i}{n} - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n} + n \frac{\sum_{i=1}^{n} x_i}{n} \frac{\sum_{i=1}^{n} y_i}{n}) \]

and finally we get

\[Num = \frac{1}{n} (\sum_{i=1}^{n} {x_i y_i} - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n})\]

I'll not write all the stuff for the denominator, but look at this:

\[ \sum (x_i - \bar{x})^2 = \sum (x_i^2 + \bar{x}^2 - 2x_i \bar{x}) \] \[ = \sum x_i^2 + n \bar{x}^2 - 2\bar{x} \sum x_i = \sum x_i^2 + n (\frac{\sum x_i}{n})^2 - 2 \frac{\sum x_i \sum x_i}{n} \] \[ = \sum x_i^2 - \frac{(\sum x_i)^2}{n} \]

So we'll get that the denominator Den is:

\[ \frac{1}{n} \sqrt{(\sum_{i=1}^{n} x_i^2 - \frac{(\sum_{i=1}^{n} x_i)^2}{n})(\sum_{i=1}^{n} y_i^2 - \frac{(\sum_{i=1}^{n} y_i)^2}{n})} \]

Giving us that

\[ r = \frac{\sum_{i=1}^{n} {x_i y_i} - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}}{\sqrt{(\sum_{i=1}^{n} x_i^2 - \frac{(\sum_{i=1}^{n} x_i)^2}{n})(\sum_{i=1}^{n} y_i^2 - \frac{(\sum_{i=1}^{n} y_i)^2}{n})}} \]

Which is exactly the formula used in the fantastic book "Programming Collective Intelligence" by Toby Segaran, as you can see in this repository.

Subscribe to abidibo.net!

If you want to stay up to date with new contents published on this blog, then just enter your email address, and you will receive blog updates! You can set you preferences and decide to receive emails only when articles are posted regarding a precise topic.

I promise, you'll never receive spam or advertising of any kind from this subscription, just content updates.

Subscribe to this blog

Comments are welcome!

blog comments powered by Disqus

Your Smartwatch Loves Tasker!

Your Smartwatch Loves Tasker!

Now available for purchase!

Featured