Benford's law
- A separate article exists for Benford's law of controversy.
Benford's law, also called the first-digit law, states that in lists of numbers from many real-life sources of data, the leading digit 1 occurs much more often than the others (about 30% of the time). Furthermore, the higher the digit, the less likely it is to occur as the leading digit of a number. This applies to figures related to the natural world or of social significance, whether the numbers are taken from electricity bills, newspaper articles, street addresses, stock prices, population numbers, death rates, areas or lengths of rivers, or physical and mathematical constants.
Mathematical statement
More precisely, Benford's law states that the leading digit n in base b (n = 1, ..., b − 1) occurs with probability proportional to logb(n + 1) − logb(n). In base 10, the leading digits have the following distribution by Benford's law:
| Leading digit | Probability |
| --- | --- |
| 1 | 30.1 % |
| 2 | 17.6 % |
| 3 | 12.5 % |
| 4 | 9.7 % |
| 5 | 7.9 % |
| 6 | 6.7 % |
| 7 | 5.8 % |
| 8 | 5.1 % |
| 9 | 4.6 % |
One can also formulate a law for the first two digits: the probability that the first two-digit block equals n (n = 10, ..., 99) is log10(n + 1) − log10(n), and similarly for three-digit blocks (without leading zeros) and longer blocks.
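The stated probabilities are straightforward to compute; a minimal sketch using only the standard library (the helper name `benford_prob` is made up for illustration):

```python
import math

def benford_prob(n: int, base: int = 10) -> float:
    """Probability that the leading digit block equals n under Benford's law."""
    return math.log(n + 1, base) - math.log(n, base)

# First-digit distribution in base 10 (digits 1..9):
first_digit_probs = {d: benford_prob(d) for d in range(1, 10)}
for d, prob in first_digit_probs.items():
    print(f"{d}: {100 * prob:.1f} %")

# The same formula covers the first two-digit blocks (n = 10, ..., 99):
two_digit_probs = {n: benford_prob(n) for n in range(10, 100)}
```

Both distributions sum to 1, since the logarithms telescope from log(1) = 0 to log(base) = 1.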
Explanation
That the leading digit 1 should in general be more common than the other digits can be understood as follows. Start counting from 1: 1, 2, 3, ... By the time you reach 9, every digit has appeared equally often. But from 10 to 19 the leading digit is always 1, so 1 gets a huge head start. Only when you reach 99 are all digits equally represented again, and then 1 pulls ahead once more from 100 to 199. And so it continues: 1 is always in the lead, except at the rare instants 9, 99, 999, 9999, ... This is not particularly satisfactory as an explanation unless some probability of stopping the count at a given point is also specified.
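The counting argument can be made concrete with a few lines of code, a sketch showing how the share of the numbers 1..N with leading digit 1 oscillates as N grows:

```python
def share_leading_one(n: int) -> float:
    """Fraction of the integers 1..n whose leading digit is 1."""
    return sum(1 for k in range(1, n + 1) if str(k)[0] == "1") / n

# The share dips to 1/9 just after 9, 99, 999, ... and peaks above 1/2
# just after 19, 199, 1999, ...
for n in (9, 19, 99, 199, 999, 1999):
    print(n, round(share_leading_one(n), 3))
```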
Perhaps somewhat more precisely, suppose X is a random variable whose probability of being equal to any positive integer x is a constant times x^(−s), where s > 1. The aforementioned "constant" must then be 1/ζ(s), where ζ is the Riemann zeta function (see zeta distribution). The probability that the first digit of X is n approaches log10(n + 1) − log10(n) as s approaches 1.
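This limit can be illustrated numerically. The sketch below truncates the zeta sums at a fixed number of terms, so the values are only approximate, and convergence near s = 1 is slow:

```python
import math

def truncated_zeta_first_digit(s: float, cutoff: int = 10**5) -> float:
    """Approximate P(first digit of X is 1) for X with P(X = x) ~ x^(-s),
    truncating both zeta sums at `cutoff` terms (exact only as cutoff grows)."""
    num = den = 0.0
    for x in range(1, cutoff + 1):
        w = x ** -s
        den += w
        if str(x)[0] == "1":
            num += w
    return num / den

# As s decreases toward 1, the probability drifts toward log10(2) = 0.301...
for s in (2.0, 1.5, 1.1):
    print(f"s = {s}: P(first digit = 1) ~ {truncated_zeta_first_digit(s):.3f}")
```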
The precise form of Benford's law can be explained if one assumes that the logarithms of the numbers are uniformly distributed; this means, for instance, that a number is just as likely to be between 100 and 1,000 (logarithm between 2 and 3) as between 10,000 and 100,000 (logarithm between 4 and 5). For many sets of numbers, especially ones that grow exponentially such as incomes and stock prices, this is a reasonable assumption.
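This assumption is easy to test by simulation; a sketch in which the exponent range [0, 6) is an arbitrary choice:

```python
import math
import random

random.seed(0)

# If log10(X) is uniform, here over [0, 6), the first digit of X should
# follow Benford's law.
N = 100_000
counts = [0] * 10
for _ in range(N):
    x = 10 ** random.uniform(0, 6)
    counts[int(str(x)[0])] += 1  # first digit of x

# Empirical frequency vs. the Benford probability log10(1 + 1/d):
for d in range(1, 10):
    print(d, round(counts[d] / N, 3), round(math.log10(1 + 1 / d), 3))
```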
Another explanation is that if a distribution of first digits exists, it should be scale invariant. For example, the first (non-zero) digit of the lengths or distances of objects should have the same distribution whether the unit of measurement is Planck lengths, inches, feet, yards, metres, miles, light years, or anything else. But there are three feet in a yard, so the probability that the first digit of a length in yards is 1 must be the same as the probability that the first digit of the same length in feet is 3, 4, or 5. Applying this requirement across all possible measurement scales forces a logarithmic distribution, which, combined with the facts that log10(1) = 0 and log10(10) = 1, gives Benford's law. That is, if there is a universal distribution of first digits, it must apply to a set of data regardless of the measuring units used, and the only first-digit distribution with that property is Benford's law.
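The scale-invariance claim can be checked empirically; a sketch using a synthetic log-uniform sample in place of real measurements:

```python
import math
import random

random.seed(1)

def first_digit(x: float) -> int:
    """First significant digit of a positive number, via its mantissa."""
    return int(10 ** (math.log10(x) % 1.0))

# A synthetic Benford-distributed sample (uniform base-10 logarithms).
sample = [10 ** random.uniform(0, 6) for _ in range(100_000)]

# Rescaling by 3 (yards -> feet): leading digit 1 in yards corresponds to
# leading digit 3, 4 or 5 in feet, and the two shares agree.
share_yards_1 = sum(first_digit(x) == 1 for x in sample) / len(sample)
share_feet_345 = sum(first_digit(3 * x) in (3, 4, 5) for x in sample) / len(sample)
print(share_yards_1, share_feet_345)  # both close to log10(2) = 0.301...
```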
Note that for numbers drawn from many distributions, for example IQ scores or human heights (variables following normal distributions), the law is not valid. However, if one mixes numbers from several such distributions, for example by taking numbers from newspaper articles, Benford's law reappears. This can be proven mathematically: if one repeatedly chooses a probability distribution "at random" and then randomly chooses a number according to that distribution, the resulting list of numbers will obey Benford's law.
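The mixing effect can be sketched in simulation. The particular mixture below (a random order of magnitude times a random distribution shape) is an illustrative stand-in, not Hill's actual construction:

```python
import math
import random

random.seed(2)

def first_digit(x: float) -> int:
    return int(10 ** (math.log10(x) % 1.0))

# Each draw picks a random scale and a random distribution, then samples one
# number; the first digits of the mixture land on Benford's law even though
# no single source distribution obeys it.
draws = []
for _ in range(50_000):
    scale = 10 ** random.uniform(0, 4)          # random order of magnitude
    shape = random.choice([
        lambda: random.uniform(1, 2),           # uniform
        lambda: random.expovariate(1.0) + 0.1,  # shifted exponential
        lambda: abs(random.gauss(5, 2)) + 0.1,  # folded normal
    ])
    draws.append(scale * shape())

share = [0.0] * 10
for x in draws:
    share[first_digit(x)] += 1 / len(draws)
for d in range(1, 10):
    print(d, round(share[d], 3), round(math.log10(1 + 1 / d), 3))
```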
Applications and limitations
In 1972, Hal Varian suggested that the law could be used to detect possible fraud in lists of socio-economic data submitted in support of public planning decisions. Based on the plausible assumption that people who make up figures tend to distribute their digits fairly uniformly, a simple comparison of first-digit frequency distribution from the data with the expected distribution according to Benford's law ought to show up any anomalous results.
In the same vein, Benford's law can be (and is) used to analyse insurance, accounting or expenses data and identify possible fraud.
Other uses, for example to analyse the results of clinical trials and election results, have also been proposed.
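A minimal sketch of such a first-digit comparison (the helper names are made up, and the chi-square-style distance shown is an uncalibrated illustration, not a rigorous statistical test):

```python
import math
import random
from collections import Counter

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x: float) -> int:
    """First significant digit of a nonzero number."""
    return int(10 ** (math.log10(abs(x)) % 1.0))

def benford_chi2(values) -> float:
    """Chi-square-style distance between observed first-digit counts and
    those Benford's law predicts; larger values flag anomalous data."""
    counts = Counter(first_digit(v) for v in values if v)
    n = sum(counts.values())
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p)
               for d, p in BENFORD.items())

random.seed(3)
genuine = [10 ** random.uniform(0, 4) for _ in range(2000)]   # Benford-like
invented = [random.uniform(100, 999) for _ in range(2000)]    # uniform digits
print(benford_chi2(genuine), benford_chi2(invented))  # the second is far larger
```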
Limitations
Care must be taken with these applications, however. Strictly speaking, only a set of numbers chosen at random (from a given probability distribution) is sure to obey the law. A set of real-life data may or may not obey the law, depending on the extent to which the distribution of numbers it contains is skewed by the category of data under consideration.
For instance, one might well expect a list of numbers representing 'populations of UK villages beginning with A' or 'small insurance claims' to obey Benford's law. But if it turns out that the definition of a 'village' in this case is 'settlement with population between 300 and 999', or that the definition of a 'small insurance claim' is 'claim between $50 and $100', then Benford's law in its usual form will manifestly fail, because certain leading digits have been excluded by the definition of the data category. In the case of the villages, a truncated form of the law could still be applied: only the leading digits 3 to 9 occur, each keeping its relative Benford probability, so the probability of leading digit d (d = 3, ..., 9) becomes p(d)/(p(3) + p(4) + ... + p(9)), where p(d) is the probability under the general law. In the case of the insurance claims, the data will probably not fit even a truncated law, since people may deliberately adjust claim amounts to fall inside or outside the qualifying range.
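The truncated-digit adjustment for the villages can be sketched as follows, assuming the underlying populations still follow Benford's law within the allowed range:

```python
import math

# General Benford probabilities p(d) for d = 1..9.
p = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Villages restricted to populations 300-999: only leading digits 3-9 occur,
# so each Benford probability is renormalized over the allowed digits.
allowed = range(3, 10)
total = sum(p[d] for d in allowed)
p_truncated = {d: p[d] / total for d in allowed}
print({d: round(q, 3) for d, q in p_truncated.items()})
```

Each truncated probability is larger than its general counterpart, since the excluded mass of digits 1 and 2 (about 0.477) is redistributed proportionally.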
History
The discovery of this fact goes back to 1881, when the American astronomer Simon Newcomb noticed that the first pages of logarithm books (used at that time to perform calculations), the ones containing numbers that started with 1, were much more worn than the other pages. However, it has been argued that any book that is used from the beginning would show more wear and tear on the earlier pages. This story might thus be apocryphal, just like Isaac Newton's supposed discovery of gravity from observation of a falling apple.
The phenomenon was rediscovered in 1938 by the physicist Frank Benford, who checked it on a wide variety of data sets; the law was subsequently named after him. In 1996, Ted Hill proved the result about mixed distributions mentioned above.
References
- Frank Benford: "The law of anomalous numbers", Proceedings of the American Philosophical Society 78 (1938), p. 551.
- Ted Hill: "The first digit phenomenon", American Scientist 86 (July-August 1998), p. 358. 10-page PDF (http://www.math.gatech.edu/~hill/publications/cv.dir/1st-dig.pdf)
- Hal Varian: "Benford's law", The American Statistician 26 (1972), p. 65.
External links
- Benford's Law and Zipf's Law (http://www.cut-the-knot.org/do_you_know/zipfLaw.shtml)
- A general article (http://www.rexswain.com/benford.html)