# Notes on the Pearson correlation coefficient

The **Pearson correlation coefficient** is a measure of the linear correlation between two variables *X* and *Y*. It has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. It plays a role in the algorithms which try to find similarities between different set of data.

Think of the common problem of finding the best recommendation for a user visiting your site. One of the strategies is to recommend products (movies, books, ...) by looking at what similar users have previously bought. So the first problem is to find a way to define **similar users**.

The Pearson correlation coefficient can be used in such case, because it can measure the distance between different datasets.

It can be a better measure of such similarity than the Euclidean distance, because it can find a good match even if data are not so similar, but the data trend is. For example, me and my friend both think that 'Natural Born Killer' is the best movie in the world, but I don't like to give the maximum vote because I think that something better can be made, so I give 8, while my friends gives 10. At the same time I hate 'Batman Vs Superman', and my friend too, but I'm very angry when giving bad votes, so I give it 2, while my friend will not go below 4. And so for a "bad but not too much" movie, I'll give 4, while my friend 6.

Our votes are quite different if seen in an Euclidean way, but our tastes are similar, the difference is just that I keep my votes lower, while my friend generally keeps them higher. The Pearson coefficient can see that our trends are similar and the best fit line between our votes will show an high correlation.

In other words, the Pearson coefficient corrects for *grade inflation*.

The Pearson correlation coefficient is defined has:

\[ r = \frac{cov(X, Y)}{\sigma_X \sigma_Y} \]

Now we'll just manipulate this formula a bit, in order to get something easier to express in our code.

Given the definitions of covariance and standard deviation:

\[r = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n}}} \]

Let's call *Num* the numerator:

\[Num = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]

\[Num = \frac{1}{n} \sum_{i=1}^{n} (x_i y_i - x_i \bar{y} - \bar{x} y_i + \bar{x}\bar{y}) \]

\[Num = \frac{1}{n} (\sum_{i=1}^{n} {x_i y_i} - \bar{y} \sum_{i=1}^{n} x_i - \bar{x} \sum_{i=1}^{n} y_i + n \bar{x} \bar{y}) \]

We all now what \(\bar{x}\) is, so:

\[Num = \frac{1}{n} (\sum_{i=1}^{n} {x_i y_i} - \frac{\sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i}{n} - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n} + n \frac{\sum_{i=1}^{n} x_i}{n} \frac{\sum_{i=1}^{n} y_i}{n}) \]

and finally we get

\[Num = \frac{1}{n} (\sum_{i=1}^{n} {x_i y_i} - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n})\]

I'll not write all the stuff for the denominator, but look at this:

\[ \sum (x_i - \bar{x})^2 = \sum (x_i^2 + \bar{x}^2 - 2x_i \bar{x}) \] \[ = \sum x_i^2 + n \bar{x}^2 - 2\bar{x} \sum x_i = \sum x_i^2 + n (\frac{\sum x_i}{n})^2 - 2 \frac{\sum x_i \sum x_i}{n} \] \[ = \sum x_i^2 - \frac{(\sum x_i)^2}{n} \]

So we'll get that the denominator *Den* is:

\[ \frac{1}{n} \sqrt{(\sum_{i=1}^{n} x_i^2 - \frac{(\sum_{i=1}^{n} x_i)^2}{n})(\sum_{i=1}^{n} y_i^2 - \frac{(\sum_{i=1}^{n} y_i)^2}{n})} \]

Giving us that

\[ r = \frac{\sum_{i=1}^{n} {x_i y_i} - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}}{\sqrt{(\sum_{i=1}^{n} x_i^2 - \frac{(\sum_{i=1}^{n} x_i)^2}{n})(\sum_{i=1}^{n} y_i^2 - \frac{(\sum_{i=1}^{n} y_i)^2}{n})}} \]

Which is exactly the formula used in the fantastic book "*Programming Collective Intelligence*" by Toby Segaran, as you can see in this repository.