Algorithms for calculating variance
|
In statistics, a formula for calculating the variance of a population of size n is:
- <math>\mathrm{Variance} = \frac {n\sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2}{n^2}.<math>
A formula for calculating the unbiased estimation of the population variance from n finite samples is:
- <math>\mathrm{Variance} = \frac {n\sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2}{n(n-1)}.<math>
The method of calculation may be more easily understood from the table below where the mean is 8.
i | xi | xi − mean | (xi − mean)2 |
(index) | (datum) | (deviation) | (squared deviation) |
1 | 5 | −3 | 9 |
2 | 7 | −1 | 1 |
3 | 8 | 0 | 0 |
4 | 10 | 2 | 4 |
5 | 10 | 2 | 4 |
n = 5 | sum = 40 | 0 | 18 |
- mean = 40/5 = 8
- variance = (5 · 338 − 402)/(5 · 4) = 4.5
- standard deviation = <math>\sqrt{\mathrm{Variance}} = 2.12<math>
Note: Details of the variance calculation:
338 = [52 + 72 + 82 + 102 + 102]
40 = [5 + 7 + 8 + 10 + 10]
Algorithm
Therefore a simple algorithm to calculate variance can be described by the following pseudocode:
long n = 0; double sum = 0; double sum_sqr = 0; double variance; foreach x in data: n += 1; sum += x; sum_sqr += x*x; end for variance = (sum_sqr - sum*sum/n)/(n-1);
Algorithm
Another algorithm which avoids large numbers in sum_sqr while summing up
double avg = 0; double var = 0; long n = data.length; // number of elements // the array is 1-indexed for i = 1 to n avg = (avg*(i-1) + data[i]) / i; var = (var * (i - 1) + (data[i] - avg)*(data[i] - avg)) / i; end for return var; // resulting variance