Comprehensive Guide on Sample Variance
Sample variance
The sample variance of a sample $(x_1,x_2,\cdots,x_n)$ is computed by:

$$s^2=\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2$$
Where $n$ is the sample size and $\bar{x}$ is the sample mean. For the intuition behind this formula, please consult our guide on measures of spread.
Notice how we compute the average by dividing by $n-1$ instead of $n$. This is because dividing by $n-1$ makes the sample variance an unbiased estimator of the population variance. We give the proof below, but please consult our guide to understand what bias means.
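To make the formula concrete, here is a minimal Python sketch of it. The helper name sample_variance is our own choice for illustration, not a standard library routine:

def sample_variance(xs):
    # Dividing by n - 1 (not n) gives the unbiased sample variance
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)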
Computing the sample variance of a sample
Compute the sample variance of the following sample:

$$(1,3,5,7)$$
Solution. Here, the size of the sample is $n=4$. We first start by computing the sample mean:

$$\bar{x}=\frac{1+3+5+7}{4}=4$$
Let's now compute the sample variance $s^2$ using the formula:

$$s^2=\frac{(1-4)^2+(3-4)^2+(5-4)^2+(7-4)^2}{4-1}=\frac{20}{3}\approx6.67$$
This means that the squared difference between each data point and the sample mean is, on average, around $6.67$. This interpretation is precise but quite awkward, so instead of quoting the sample variance of a single sample in isolation, we often compare the sample variances of two different samples to understand which sample is more spread out.
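We can verify this hand computation with a few lines of plain Python, mirroring each step above:

xs = [1, 3, 5, 7]                                 # the sample from the example
n = len(xs)
x_bar = sum(xs) / n                               # sample mean: 4.0
s2 = sum((x - x_bar) ** 2 for x in xs) / (n - 1)  # (9 + 1 + 1 + 9) / 3
print(x_bar, s2)                                  # 4.0 6.666666666666667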
Intuition behind why we divide by n-1 instead of n
Although we will formally prove below that dividing by $n-1$ gives us an unbiased estimator of the population variance, let's first understand from another perspective why we should divide by $n-1$.
Ideally, our estimate of the population variance would be:

$$\frac{1}{n}\sum_{i=1}^n(x_i-\mu)^2 \label{eq:ohGzVCDYbDArl9d4nZX}$$
Where $\mu$ is the population mean. In fact, if the population mean is known, then the sample variance should be computed as above, without dividing by $n-1$. However, in most cases the population mean is unknown, so the best we can do is to replace $\mu$ with the sample mean $\bar{x}$ like so:

$$\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2$$
However, when we replace $\mu$ with $\bar{x}$, it turns out that we would, on average, underestimate the population variance. We will now mathematically prove this.
Let's focus on the sum of squared differences. Instead of the sample mean $\bar{x}$, let's replace it with a variable $t$ and consider the expression as a function of $t$ like so:

$$f(t)=\sum_{i=1}^n(x_i-t)^2$$
Using calculus, our goal is to show that $t=\bar{x}$ minimizes this function. Let's take the first derivative of $f(t)$ with respect to $t$ like so:

$$f'(t)=-2\sum_{i=1}^n(x_i-t)$$
Setting this equal to zero gives:

$$\begin{align*}-2\sum_{i=1}^n(x_i-t)&=0\\\sum_{i=1}^nx_i-nt&=0\\t&=\frac{1}{n}\sum_{i=1}^nx_i=\bar{x}\end{align*}$$
Let's also check the nature of this stationary point by referring to the second derivative:

$$f''(t)=2n$$
Since the sample size $n$ is positive, the second derivative is always positive. This means that the stationary point $t=\bar{x}$ is indeed a minimum! In other words, out of all the values $t$ can take, setting $t=\bar{x}$ will minimize the sum of squared differences:

$$\sum_{i=1}^n(x_i-\bar{x})^2\le\sum_{i=1}^n(x_i-t)^2$$
The population mean $\mu$ is some unknown constant, but since this inequality holds for every value of $t$, it holds in particular for $t=\mu$:

$$\sum_{i=1}^n(x_i-\bar{x})^2\le\sum_{i=1}^n(x_i-\mu)^2 \label{eq:kUfz4YNwhBVtS8B1ZF0}$$
Even though we don't know what $\mu$ is, we know that the sum of squared differences when $t=\mu$ must be at least as large as the sum of squared differences when $t=\bar{x}$.
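Here is a small numerical sketch of this fact, reusing the sample from our earlier example: evaluating the sum of squared differences at several values of $t$ shows that $t=\bar{x}$ gives the smallest value.

xs = [1, 3, 5, 7]
x_bar = sum(xs) / len(xs)  # 4.0

def f(t):
    # Sum of squared differences between each point and t
    return sum((x - t) ** 2 for x in xs)

for t in [2.0, 3.0, x_bar, 5.0, 6.0]:
    print(t, f(t))  # f(t) is smallest at t = 4.0, the sample mean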
Let's divide both sides of \eqref{eq:kUfz4YNwhBVtS8B1ZF0} by $n$ to get:

$$\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2\le\frac{1}{n}\sum_{i=1}^n(x_i-\mu)^2 \label{eq:Vd8ISUnkMkIvhi6wExH}$$
The right-hand side is our ideal estimate \eqref{eq:ohGzVCDYbDArl9d4nZX} from earlier. To make this clear, let's write \eqref{eq:Vd8ISUnkMkIvhi6wExH} as:

$$\underbrace{\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2}_{\text{estimate using }\bar{x}}\le\underbrace{\frac{1}{n}\sum_{i=1}^n(x_i-\mu)^2}_{\text{ideal estimate}} \label{eq:mfxzwx5FHb6tVM1v3Zl}$$
This means that estimating the population variance using the left-hand side of \eqref{eq:mfxzwx5FHb6tVM1v3Zl} will generally fall below the ideal estimate. To compensate for this underestimation, we must make the left-hand side larger. One way of doing so is by dividing by a smaller amount, say $n-1$:

$$\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2$$
Of course, this leads to further questions, such as why we should divide specifically by $n-1$ instead of, say, $n-2$ or $n-3$, which would also make the left-hand side of \eqref{eq:mfxzwx5FHb6tVM1v3Zl} larger. The point of this exercise is merely to understand that dividing by some number less than $n$ accounts for the underestimation. As for why we specifically divide by $n-1$, we prove mathematically below that dividing by $n-1$ adjusts our estimate exactly such that we neither underestimate nor overestimate.
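A quick simulation makes the underestimation visible. The sketch below repeatedly draws samples from a normal population with known variance (the distribution, its parameters, and the sample size are our own choices for this demo) and averages the divide-by-$n$ estimates:

import numpy as np

rng = np.random.default_rng(42)
sigma2 = 4.0      # true population variance (chosen for this demo)
n = 5             # sample size
trials = 200_000

# np.var with its default ddof=0 divides by n
estimates = [np.var(rng.normal(0.0, sigma2 ** 0.5, size=n))
             for _ in range(trials)]

print(np.mean(estimates))  # close to 3.2 = sigma2 * (n - 1) / n, not 4.0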
Properties of sample variance
Unbiased estimator of the population variance
The sample variance $S^2$ is an unbiased estimator of the population variance $\sigma^2$, that is:

$$\mathbb{E}(S^2)=\sigma^2$$
Proof. We start off with the following algebraic manipulation, where we use the fact that $\sum_{i=1}^nX_i=n\bar{X}$:

$$\begin{align*}\sum_{i=1}^n(X_i-\bar{X})^2&=\sum_{i=1}^n\left(X_i^2-2X_i\bar{X}+\bar{X}^2\right)\\&=\sum_{i=1}^nX_i^2-2\bar{X}\sum_{i=1}^nX_i+n\bar{X}^2\\&=\sum_{i=1}^nX_i^2-2n\bar{X}^2+n\bar{X}^2\\&=\sum_{i=1}^nX_i^2-n\bar{X}^2\end{align*}$$
Multiplying both sides by $1/(n-1)$ gives:

$$\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2=\frac{1}{n-1}\left(\sum_{i=1}^nX_i^2-n\bar{X}^2\right)$$
The left-hand side is the formula for the sample variance $S^2$, so:

$$S^2=\frac{1}{n-1}\left(\sum_{i=1}^nX_i^2-n\bar{X}^2\right)$$
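As a quick numerical sanity check of this identity, we can evaluate both sides on an arbitrary sample (the values below are just an example):

import numpy as np

x = np.array([2.0, 4.0, 4.0, 5.0, 7.0])  # arbitrary example values
n = len(x)
x_bar = x.mean()

lhs = np.sum((x - x_bar) ** 2) / (n - 1)
rhs = (np.sum(x ** 2) - n * x_bar ** 2) / (n - 1)
print(lhs, rhs)  # both print 3.3 (up to floating-point rounding)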
Now, let's take the expected value of both sides of this expression for $S^2$ and use the linearity of expected values to simplify:

$$\mathbb{E}(S^2)=\frac{1}{n-1}\left(\sum_{i=1}^n\mathbb{E}(X_i^2)-n\,\mathbb{E}(\bar{X}^2)\right) \label{eq:MGFWQ0zdxObMW1zhXiV}$$
Now, from the property of variance, we know that for any random variable $X$:

$$\mathbb{V}(X)=\mathbb{E}(X^2)-\big[\mathbb{E}(X)\big]^2 \label{eq:ZQMklBf4CcDEfOxcVdJ}$$
We have previously derived the variance as well as the expected value of $\bar{X}$ to be:

$$\mathbb{V}(\bar{X})=\frac{\sigma^2}{n},\qquad\mathbb{E}(\bar{X})=\mu$$
Substituting these values into \eqref{eq:ZQMklBf4CcDEfOxcVdJ} with $X=\bar{X}$ and rearranging gives:

$$\mathbb{E}(\bar{X}^2)=\frac{\sigma^2}{n}+\mu^2 \label{eq:LlQymAMmsVKtv6MIqTc}$$
Once again, from the same property of variance, we have that:

$$\mathbb{E}(X_i^2)=\mathbb{V}(X_i)+\big[\mathbb{E}(X_i)\big]^2=\sigma^2+\mu^2 \label{eq:OPC1YMGbDIHlCRGd6IJ}$$
Substituting \eqref{eq:LlQymAMmsVKtv6MIqTc} and \eqref{eq:OPC1YMGbDIHlCRGd6IJ} into \eqref{eq:MGFWQ0zdxObMW1zhXiV} gives:

$$\begin{align*}\mathbb{E}(S^2)&=\frac{1}{n-1}\left(\sum_{i=1}^n(\sigma^2+\mu^2)-n\left(\frac{\sigma^2}{n}+\mu^2\right)\right)\\&=\frac{1}{n-1}\left(n\sigma^2+n\mu^2-\sigma^2-n\mu^2\right)\\&=\frac{(n-1)\sigma^2}{n-1}\\&=\sigma^2\end{align*}$$
This proves that the sample variance $S^2$ is an unbiased estimator for the population variance $\sigma^2$.
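We can also check this unbiasedness empirically: averaging the divide-by-$(n-1)$ estimate over many simulated samples should land close to the true $\sigma^2$ (again, the population and sample size below are our own choices for the demo):

import numpy as np

rng = np.random.default_rng(7)
sigma2 = 4.0      # true population variance (chosen for this demo)
n = 5             # sample size
trials = 200_000

# np.var with ddof=1 divides by n - 1
estimates = [np.var(rng.normal(0.0, sigma2 ** 0.5, size=n), ddof=1)
             for _ in range(trials)]

print(np.mean(estimates))  # close to 4.0, matching sigma2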
Computing sample variance using Python
We can easily compute the sample variance using Python's NumPy library. By default, the var(~) method returns the following biased sample variance:

$$\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2$$
To compute the unbiased sample variance instead, supply the argument ddof=1:
import numpy as np

x = [1, 3, 5, 7]  # the sample from our earlier example
np.var(x, ddof=1)

6.666666666666667
Note that ddof stands for delta degrees of freedom, and var(~) divides the sum of squared differences by the following quantity:

$$n-\text{ddof}$$

Passing ddof=1 therefore makes var(~) divide by $n-1$, which yields the unbiased sample variance.
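For example, with the same sample as above, changing ddof changes the divisor:

import numpy as np

x = [1, 3, 5, 7]
print(np.var(x))          # 5.0               -> divides by n = 4
print(np.var(x, ddof=1))  # 6.666666666666667 -> divides by n - 1 = 3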