Correlation and Covariance

So far we have mostly been focusing on the distribution of a single variable, but most experiments are designed to study the dependence of one or more variables on another. Thus we now turn to the relationships between variables. One way to visualize such a relationship is a scatterplot. While a scatterplot may show a clear functional dependence between the two variables, in most situations it does not, either because of observational errors or because no clear functional dependence exists.
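As a minimal sketch, here is a scatterplot of synthetic data with a noisy linear relationship (the data and the linear trend are assumptions for illustration, not from the text):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + rng.normal(0, 3, size=100)  # linear trend plus observational noise

fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.savefig("scatter.png")
```

Because of the added noise, the points cluster around the underlying line rather than falling exactly on it, which is the typical situation described above.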

There are a number of ways to try to measure the relationship between variables. One is the covariance, which measures how much two variables vary together. For n data points x and y it can be computed as:

$Cov(x,y) = {{1}\over{n}} \sum^n_{i=1} (x_i - \bar{x})(y_i - \bar{y})$

where $\bar{x}$ is the mean value of x and $\bar{y}$ is the mean value of y. The covariance carries units (those of x times those of y) and has no bound on its range. Alternatively, we can use the Pearson correlation, which divides the covariance by the product of the standard deviations, making it dimensionless and bounded between -1 and 1.
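The covariance formula above translates directly into code. This is a sketch using the 1/n (population) normalization to match the formula; note that `np.cov` uses 1/(n-1) by default, and matches only with `bias=True`:

```python
import numpy as np

def covariance(x, y):
    """Population covariance, per the formula above (1/n normalization)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((x - x.mean()) * (y - y.mean()))

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
c = covariance(x, y)  # same value as np.cov(x, y, bias=True)[0, 1]
```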

$Corr(x,y) = r = {{1}\over{n}} \sum^n_{i=1} {(x_i - \bar{x})\over \sigma_x }{(y_i - \bar{y})\over \sigma_y}$
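A sketch of the Pearson correlation, using population (1/n) standard deviations so the normalizations cancel consistently; for perfectly linear data it returns exactly 1:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # np.std defaults to the population (1/n) standard deviation,
    # matching the 1/n in the covariance
    return np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())

a = [0.0, 1.0, 2.0, 3.0, 4.0]
b = [1.0, 3.0, 2.0, 5.0, 4.0]
r = pearson_r(a, b)  # agrees with np.corrcoef(a, b)[0, 1]
```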

One can instead rank-order the data and compute the correlation of the ranks; this is the Spearman correlation. Because it depends only on ranks, it captures any monotonic relationship, not just a linear one.
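A sketch of the Spearman correlation as ranking followed by Pearson correlation of the ranks. The double-`argsort` ranking here is an assumption for simplicity and ignores ties; a library routine such as `scipy.stats.spearmanr` handles ties properly:

```python
import numpy as np

def spearman_r(x, y):
    """Spearman correlation: Pearson correlation of the rank-ordered data."""
    rx = np.argsort(np.argsort(x)).astype(float)  # simple ranks; ignores ties
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return np.sum(rx * ry) / np.sqrt(np.sum(rx**2) * np.sum(ry**2))

# A monotonic but nonlinear relationship (y = x**3) still gives r = 1
x = [1, 2, 3, 4]
y = [1, 8, 27, 64]
```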

Readings:

think-stats: relationships between variables

Code:

pandas:corr – Pearson, Kendall, or Spearman

numpy:corrcoef – Pearson correlation
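A brief usage sketch of the two library routines listed above (the toy data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                   "y": [2, 1, 4, 3, 5]})

pearson = df["x"].corr(df["y"])                       # default method="pearson"
spearman = df["x"].corr(df["y"], method="spearman")   # rank-based
r_np = np.corrcoef(df["x"], df["y"])[0, 1]            # Pearson via numpy
```

Since the y values here are a permutation of 1..5, the data are already their own ranks, so the Pearson and Spearman values coincide.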