Bayesian inference
Bayesian inference is statistical inference in which probabilities are interpreted not as frequencies or proportions, but as degrees of belief. The name comes from the frequent use of Bayes' theorem in this discipline.
Bayes' theorem is named after the Reverend Thomas Bayes. However, it is not clear that Bayes would endorse the very broad interpretation of probability now called "Bayesian". This topic is treated at greater length in the article Thomas Bayes.
Evidence and the scientific method
Bayesian statisticians claim that methods of Bayesian inference are a formalisation of the scientific method involving collecting evidence which points towards or away from a given hypothesis. There can never be certainty, but as evidence accumulates, the degree of belief in a hypothesis changes; with enough evidence it will often become very high (almost 1) or very low (near 0).
- An example of such reasoning:
- The sun has risen and set for billions of years. The sun has set tonight. With very high probability, the sun will rise tomorrow.
Bayesian statisticians believe that Bayesian inference is the most suitable logical basis for discriminating between conflicting hypotheses. It uses an estimate of the degree of belief in a hypothesis before the advent of some evidence to give a numerical value to the degree of belief in the hypothesis after the advent of the evidence. Because it relies on subjective degrees of belief, however, it is not able to provide a completely objective account of induction. See scientific method.
Bayes' theorem also provides a method for adjusting degrees of belief in the light of new information. Bayes' theorem is
- <math>P(H_0|E) = \frac{P(E|H_0)\;P(H_0)}{P(E)}. </math>
For our purposes, <math>H_0</math> can be taken to be a hypothesis which may have been developed ab initio or induced from some preceding set of observations, but before the new observation or evidence <math>E</math>.
- The term <math>P(H_0)</math> is called the prior probability of <math>H_0</math>.
- The term <math>P(E|H_0)</math> is the conditional probability of seeing the observation <math>E</math> given that the hypothesis <math>H_0</math> is true; as a function of <math>H_0</math> given <math>E</math>, it is called the likelihood function.
- The term <math>P(E)</math> is called the marginal probability of <math>E</math>; it is a normalizing constant and can be calculated as a sum over all mutually exclusive and exhaustive hypotheses, <math>\sum_i P(E|H_i) P(H_i)</math>.
- The term <math>P(H_0|E)</math> is called the posterior probability of <math>H_0</math> given <math>E</math>.
The scaling factor <math>P(E|H_0) / P(E)</math> gives a measure of the impact that the observation has on belief in the hypothesis. If it is unlikely that the observation will be made unless the particular hypothesis being considered is true, then this scaling factor will be large. Multiplying this scaling factor by the prior probability of the hypothesis being correct gives the posterior probability of the hypothesis being correct given the observation.
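As a rough illustration of how the four terms fit together, the following sketch computes posteriors over a finite set of hypotheses. The function name `posterior` and the numbers are hypothetical and chosen only for illustration; the hypotheses are assumed to be mutually exclusive and exhaustive.

```python
def posterior(priors, likelihoods):
    """Return P(H_i | E) for each hypothesis, given P(H_i) and P(E | H_i)."""
    # Marginal probability P(E): sum over mutually exclusive, exhaustive hypotheses.
    p_e = sum(p * l for p, l in zip(priors, likelihoods))
    return [p * l / p_e for p, l in zip(priors, likelihoods)]


# Hypothetical numbers: H_0 has prior 0.2 and makes E likely (0.9);
# the alternative has prior 0.8 and makes E unlikely (0.1).
priors = [0.2, 0.8]
likelihoods = [0.9, 0.1]

post = posterior(priors, likelihoods)

# Scaling factor P(E | H_0) / P(E) for the first hypothesis.
p_e = sum(p * l for p, l in zip(priors, likelihoods))
scaling_factor = likelihoods[0] / p_e

print(post[0], scaling_factor)
```

Because the observation is much more probable under the first hypothesis than under the alternative, the scaling factor exceeds 1 and the posterior for that hypothesis is larger than its prior.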
The keys to making the inference work are the assignment of prior probabilities to the hypothesis and its possible alternatives, and the calculation of the conditional probabilities of the observation under the different hypotheses.
Some Bayesian statisticians believe that if the prior probabilities can be given some objective value, then the theorem can be used to provide an objective measure of the probability of the hypothesis. But to others there is no clear way in which to assign objective probabilities. Indeed, doing so appears to require one to assign probabilities to all possible hypotheses.
Alternatively, and more often, the probabilities can be taken as a measure of the subjective degree of belief on the part of the participant, and the potential hypotheses can be restricted to a constrained set within a model. The theorem then provides a rational measure of the degree to which some observation should alter the subject's belief in the hypothesis. But in this case the resulting posterior probability remains subjective. So the theorem can be used to rationally justify belief in some hypothesis, but at the expense of rejecting objectivism.
It is unlikely that two individuals will start with the same subjective degree of belief. Supporters of the Bayesian method argue that, even with very different assignments of prior probabilities, sufficient observations are likely to bring their posterior probabilities closer together. This assumes that they do not completely reject each other's initial hypotheses, and that they assign similar conditional probabilities. Thus Bayesian methods are useful only in situations in which there is already a high level of subjective agreement.
In many cases, the impact of observations as evidence can be summarised in a likelihood ratio, as expressed in the law of likelihood. This can be combined with the prior probability to reflect the original degree of belief and any earlier evidence already taken into account. For example, if we have the likelihood ratio
- <math>\Lambda = \frac{L(H_0\mid E)}{L(\mbox{not } H_0|E)} = \frac{P(E \mid H_0)}{P(E \mid \mbox{not } H_0)} </math>
then we can rewrite Bayes' theorem as
- <math>P(H_0|E) = \frac{\Lambda P(H_0)}{\Lambda P(H_0) + P(\mbox{not } H_0)} = \frac{P(H_0)}{P(H_0) +\left(1-P(H_0)\right)/\Lambda }. </math>
With two independent pieces of evidence <math>E_1</math> and <math>E_2</math>, one possible approach is to move from the prior to the posterior probability on the first evidence and then use that posterior as a new prior and produce a second posterior with the second piece of evidence; an arithmetically equivalent alternative is to multiply the likelihood ratios. So
- if <math>P(E_1, E_2 | H_0) = P(E_1 | H_0) \times P(E_2 | H_0)</math>
- and <math>P(E_1, E_2 | \mbox{not }H_0) = P(E_1 | \mbox{not }H_0) \times P(E_2 | \mbox{not }H_0)</math>
- then <math>P(H_0|E_1, E_2) = \frac{\Lambda_1 \Lambda_2 P(H_0)}{\Lambda_1 \Lambda_2 P(H_0) + P(\mbox{not } H_0)} </math>,
and this can be extended to more pieces of evidence.
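The equivalence between updating sequentially and multiplying likelihood ratios can be checked numerically. The sketch below is only illustrative: the helper `update`, the prior of 0.1 and the two likelihood ratios are hypothetical values, and the two pieces of evidence are assumed to be conditionally independent as stated above.

```python
def update(prior, lr):
    """Posterior P(H_0 | E) from the prior P(H_0) and a likelihood ratio Lambda."""
    return lr * prior / (lr * prior + (1.0 - prior))


prior = 0.1          # hypothetical P(H_0)
lr1, lr2 = 4.0, 2.5  # hypothetical likelihood ratios for E_1 and E_2

# Route 1: the posterior after E_1 becomes the prior for E_2.
sequential = update(update(prior, lr1), lr2)

# Route 2: multiply the likelihood ratios and update once.
combined = update(prior, lr1 * lr2)

print(sequential, combined)
```

Both routes give the same posterior up to floating-point rounding, which is why the order in which independent evidence is processed does not matter.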
Before a decision is made, the loss function also needs to be considered to reflect the consequences of making an erroneous decision.
Simple examples of Bayesian inference
From which bowl is the cookie?
To illustrate, suppose there are two bowls full of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1?
Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1. The precise answer is given by Bayes' theorem. Let H1 correspond to bowl #1, and H2 to bowl #2. It is given that the bowls are identical from Fred's point of view, thus P(H1) = P(H2), and the two must add up to 1, so both are equal to 0.5. The datum D is the observation of a plain cookie. From the contents of the bowls, we know that P(D | H1) = 30/40 = 0.75 and P(D | H2) = 20/40 = 0.5. Bayes' formula then yields
- <math>\begin{matrix} P(H_1 | D) &=& \frac{P(H_1) \cdot P(D | H_1)}{P(H_1) \cdot P(D | H_1) + P(H_2) \cdot P(D | H_2)} \\ \\ \ & =& \frac{0.5 \times 0.75}{0.5 \times 0.75 + 0.5 \times 0.5} \\ \\ \ & =& 0.6. \end{matrix}</math>
Before observing the cookie, the probability that Fred chose bowl #1 is the prior probability, P(H1), which is 0.5. After observing the cookie, we revise the probability to P(H1|D), which is 0.6.
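The cookie calculation can be reproduced exactly with rational arithmetic. This is a minimal sketch; the dictionary names and keys are illustrative choices, not part of the original example.

```python
from fractions import Fraction

# Prior: Fred picks either bowl with equal probability.
prior = {"bowl1": Fraction(1, 2), "bowl2": Fraction(1, 2)}

# Likelihood of drawing a plain cookie from each bowl.
plain = {"bowl1": Fraction(30, 40), "bowl2": Fraction(20, 40)}

# Marginal probability of a plain cookie, summed over both bowls.
evidence = sum(prior[b] * plain[b] for b in prior)

# Posterior probability of each bowl given a plain cookie.
post = {b: prior[b] * plain[b] / evidence for b in prior}

print(post["bowl1"])  # 3/5
```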
False positives in a medical test
False positives are a problem in any kind of test: no test is perfect, and sometimes the test will incorrectly report a positive result. For example, if a test for a particular disease is performed on a patient, then there is a chance (usually small) that the test will return a positive result even if the patient does not have the disease. The problem lies, however, not just in the chance of a false positive prior to testing, but determining the chance that a positive result is in fact a false positive. As we will demonstrate, using Bayes' theorem, if a condition is rare, then the majority of positive results may be false positives, even if the test for that condition is (otherwise) reasonably accurate.
Suppose that a test for a particular disease has a very high success rate:
- if a tested patient has the disease, the test accurately reports this, a 'positive', 99% of the time (or, with probability 0.99), and
- if a tested patient does not have the disease, the test accurately reports that, a 'negative', 95% of the time (i.e. with probability 0.95).
Suppose also, however, that only 0.1% of the population have that disease (i.e. a randomly chosen patient has it with probability 0.001). We now have all the information required to use Bayes' theorem to calculate the probability that a positive test result is a false positive.
Let A be the event that the patient has the disease, and B be the event that the test returns a positive result. Then, using Bayes' theorem with the marginal probability P(B) expanded over the two cases (disease and no disease), the probability of a true positive is
- <math>\begin{matrix}P(A|B) &= &\frac{0.99 \times 0.001}{0.99\times 0.001 + 0.05\times 0.999}\, ,\\ ~\\ &\approx &0.019\, .\end{matrix}</math>
and hence the probability of a false positive is about (1 − 0.019) = 0.981.
Despite the apparent high accuracy of the test, the incidence of the disease is so low (one in a thousand) that the vast majority of patients who test positive (98 in a hundred) do not have the disease. (Nonetheless, the proportion of patients who tested positive who do have the disease is about 20 times the proportion before we knew the outcome of the test! Thus the test is not useless, and re-testing may improve the reliability of the result.) In particular, a test must be very reliable in reporting a negative result when the patient does not have the disease, if it is to avoid the problem of false positives. In mathematical terms, this would ensure that the second term in the denominator of the above calculation is small, relative to the first term. For example, if the test reported a negative result in patients without the disease with probability 0.999, then using this value in the calculation yields a probability of a false positive of roughly 0.5, since 1 − 0.99 × 0.001 / (0.99 × 0.001 + 0.001 × 0.999) ≈ 0.502.
In this example, Bayes' theorem helps show that the accuracy of tests for rare conditions must be very high in order to produce reliable results from a single test, due to the possibility of false positives. (The probability of a 'false negative' could also be calculated using Bayes' theorem, to completely characterise the possible errors in the test results.)
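The arithmetic of this example can be checked with a short sketch. The function name and the parameter names `prevalence`, `sensitivity` and `specificity` are labels introduced here for the three probabilities given in the text (0.001, 0.99, and 0.95 or 0.999 respectively).

```python
def p_disease_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) by Bayes' theorem."""
    true_pos = sensitivity * prevalence                    # P(B | A) P(A)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # P(B | not A) P(not A)
    return true_pos / (true_pos + false_pos)


# Test as described: 99% of the diseased test positive, 95% of the healthy test negative.
base = p_disease_given_positive(0.001, 0.99, 0.95)

# Same test, but with healthy patients testing negative with probability 0.999.
better = p_disease_given_positive(0.001, 0.99, 0.999)

print(f"P(disease | positive) = {base:.3f}, false-positive share = {1 - base:.3f}")
print(f"P(disease | positive) = {better:.3f}, false-positive share = {1 - better:.3f}")
```

With the original figures about 98% of positives are false; raising the probability of a correct negative to 0.999 brings that share down to roughly one half.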
In the courtroom
Bayesian inference can be used in a court setting by an individual juror to coherently accumulate the evidence for and against the guilt of the defendant, and to see whether, in totality, it meets their personal threshold for 'beyond a reasonable doubt'.
- Let G be the event that the defendant is guilty.
- Let E be the event that the defendant's DNA matches DNA found at the crime scene.
- Let p(E | G) be the probability of seeing event E assuming that the defendant is guilty. (Usually this would be taken to be unity.)
- Let p(G | E) be the probability that the defendant is guilty assuming the DNA match event E
- Let p(G) be the juror's personal estimate of the probability that the defendant is guilty, based on the evidence other than the DNA match. This could be based on his responses under questioning, or previously presented evidence.
Bayesian inference tells us that if we can assign a probability p(G) to the defendant's guilt before we take the DNA evidence into account, then we can revise this probability to the conditional probability p(G | E), since
- p(G | E) = p(G) p(E | G) / p(E)
Suppose, on the basis of other evidence, a juror decides that there is a 30% chance that the defendant is guilty. Suppose also that the forensic evidence is that the probability that a person chosen at random would have DNA that matched that at the crime scene was 1 in a million, or 10⁻⁶.
The event E can occur in two ways. Either the defendant is guilty (with prior probability 0.3) and thus his DNA is present with probability 1, or he is innocent (with prior probability 0.7) and he is unlucky enough to be one of the 1 in a million matching people.
Thus the juror could coherently revise his opinion to take into account the DNA evidence as follows:
- p(G | E) = (0.3 × 1.0) / (0.3 × 1.0 + 0.7 × 10⁻⁶) = 0.99999766667.
The benefit of adopting a Bayesian approach is that it gives the juror a formal mechanism for combining the evidence presented. The approach can be applied successively to all the pieces of evidence presented in court, with the posterior from one stage becoming the prior for the next.
The juror would still have to have a prior for the guilt probability before the first piece of evidence is considered. It has been suggested that this could be the guilt probability of a random person of the appropriate sex taken from the town where the crime occurred. Thus, for a crime committed by an adult male in a town containing 50,000 adult males, the appropriate initial prior probability might be 1/50,000.
For the purpose of explaining Bayes' theorem to jurors, it will usually be appropriate to give it in the form of betting odds rather than probabilities, as these are more widely understood. In this form Bayes' theorem states that
- Posterior odds = prior odds × Bayes factor
In the example above, the juror who has a prior probability of 0.3 for the defendant being guilty would now express that in the form of odds of 3:7 in favour of the defendant being guilty, the Bayes factor is one million, and the resulting posterior odds are 3 million to 7 or about 429,000 to one in favour of guilt.
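The odds form and the probability form give the same answer, as the following sketch shows for the juror's figures. The variable names are illustrative; the Bayes factor is taken to be p(E | G) / p(E | not G) with the values 1 and one in a million used in the example.

```python
from fractions import Fraction

prior = Fraction(3, 10)                          # juror's prior p(G)
bayes_factor = Fraction(1) / Fraction(1, 10**6)  # p(E | G) / p(E | not G)

prior_odds = prior / (1 - prior)                 # 3 : 7 in favour of guilt
posterior_odds = prior_odds * bayes_factor       # 3,000,000 : 7

# Convert the odds back to a probability.
posterior = posterior_odds / (1 + posterior_odds)

print(float(posterior_odds))  # about 428,571 to one
print(float(posterior))       # about 0.99999766667
```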
In the United Kingdom, Bayes' theorem was explained to the jury in the odds form by a statistician expert witness in the rape case of Regina versus Denis John Adams. A conviction was secured but the case went to Appeal, as no means of accumulating evidence had been provided for those jurors who did not want to use Bayes' theorem. The Court of Appeal upheld the conviction and gave their opinion that "To introduce Bayes' Theorem, or any similar method, into a criminal trial plunges the Jury into inappropriate and unnecessary realms of theory and complexity, deflecting them from their proper task." No further appeal was allowed and the issue of Bayesian assessment of forensic DNA data remains controversial.
Gardner-Medwin argues that the criterion on which a verdict in a criminal trial should be based is not the probability of guilt, but rather the probability of the evidence, given that the defendant is innocent. He argues that if the posterior probability of guilt is to be computed by Bayes' theorem, the prior probability of guilt must be known. This will depend on the incidence of the crime and this is an odd piece of evidence to consider in a criminal trial. Consider the following three propositions:
A: The known facts and testimony could have arisen if the defendant is guilty,
B: The known facts and testimony could have arisen if the defendant is innocent,
C: The defendant is guilty.
Gardner-Medwin argues that the jury should believe both A and not-B in order to convict. A and not-B implies the truth of C, but the reverse is not true. It is possible that B and C are both true, but in this case he argues that a jury should acquit, even though they know that they are probably acquitting a guilty person.
Other court cases in which probabilistic arguments played some role were the Howland Will forgery trial and the Sally Clark case.
Search theory
In May 1968 the US nuclear submarine USS Scorpion (SSN-589) failed to arrive as expected at her home port of Norfolk, Virginia. The US Navy was convinced that the vessel had been lost off the Eastern seaboard but an extensive search failed to discover the wreck. The US Navy's deep water expert, John Craven, believed that it was elsewhere and he organised a search south-west of the Azores based on a controversial approximate triangulation by hydrophones. He was allocated only a single ship, the USNS Mizar, and he took advice from a firm of consultant mathematicians in order to maximise his resources. A Bayesian search methodology was adopted.
Experienced submarine commanders were interviewed to construct hypotheses about what could have caused the loss of the Scorpion. The sea area was divided up into grid squares and a probability assigned to each square, under each of the hypotheses, to give a number of probability grids, one for each hypothesis. These were then added together to produce an overall probability grid. The probability attached to each square was then the probability that the wreck was in that square. A second grid was constructed with probabilities that represented the probability of successfully finding the wreck if that square were to be searched and the wreck were actually there. This was a known function of water depth. Combining this grid with the previous grid gives, for each grid square, the probability of finding the wreck there if that square were to be searched.
This sea grid was searched systematically, starting with the high-probability regions and working down to the low-probability regions. Each time a grid square was searched and found to be empty, its probability was reassessed using Bayes' theorem. This then forced the probabilities of all the other grid squares to be reassessed (upwards), also by Bayes' theorem.
The use of this approach was a major computational challenge for the time but it was eventually successful and the Scorpion was found in October of that year. Suppose a grid square has a probability p of containing the wreck and that the probability of successfully detecting the wreck if it is there is q. If the square is searched and no wreck is found, then, by Bayes' theorem, the revised probability of the wreck being in the square is given by
- <math> p' = \frac{p(1-q)}{(1-p)+p(1-q)}.</math>
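A single step of such a search can be sketched as follows. This is only an illustration on a hypothetical three-square grid, not a reconstruction of the Scorpion search; the function name `search_update` and all numbers are assumptions, and the prior probabilities are assumed to sum to 1 (the wreck is somewhere in the grid).

```python
def search_update(probs, detect, k):
    """Revise all square probabilities after searching square k and finding nothing.

    probs[i]  -- probability that the wreck is in square i (assumed to sum to 1)
    detect[i] -- probability of detecting the wreck in square i if it is there
    """
    # Probability that the search of square k fails: (1 - p) + p(1 - q) = 1 - p*q.
    miss = 1.0 - probs[k] * detect[k]
    # Every other square is revised upwards by the same factor.
    updated = [p / miss for p in probs]
    # The searched square follows the formula p' = p(1 - q) / ((1 - p) + p(1 - q)).
    updated[k] = probs[k] * (1.0 - detect[k]) / miss
    return updated


probs = [0.5, 0.3, 0.2]    # hypothetical prior grid
detect = [0.8, 0.6, 0.4]   # hypothetical detection probabilities

# Search first where the chance of actually finding the wreck is highest.
best = max(range(len(probs)), key=lambda i: probs[i] * detect[i])
new_probs = search_update(probs, detect, best)

print(best, new_probs)
```

After the unsuccessful search the probability of the searched square drops, the others rise, and the grid still sums to 1, so the same step can be repeated for the next square.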
More mathematical examples
Naive Bayes classifier
See: naive Bayes classifier.
Posterior distribution of the binomial parameter
In this example we consider the computation of the posterior distribution for the binomial parameter. This is the same problem considered by Bayes in Proposition 9 of his essay.
We are given m observed successes and n observed failures in a binomial experiment. The experiment may be tossing a coin, drawing a ball from an urn, or asking someone their opinion, among many other possibilities. What we know about the parameter (let's call it a) is stated as the prior distribution, p(a).
For a given value of a, the probability of m successes in m+n trials is
- <math> p(m,n|a) = \begin{pmatrix} n+m \\ m \end{pmatrix} a^m (1-a)^n. </math>
Since m and n are fixed, and a is unknown, this is a likelihood function for a. From Bayes' theorem, with the denominator given by the continuous form of the law of total probability, we have
- <math> p(a|m,n) = \frac{p(m,n|a)\,p(a)}{\int_0^1 p(m,n|a)\,p(a)\,da} = \frac{\begin{pmatrix} n+m \\ m \end{pmatrix} a^m (1-a)^n\,p(a)} {\int_0^1 \begin{pmatrix} n+m \\ m \end{pmatrix} a^m (1-a)^n\,p(a)\,da}. </math>
For some special choices of the prior distribution p(a), the integral can be solved and the posterior takes a convenient form. In particular, if p(a) is a beta distribution with parameters m0 and n0, then the posterior is also a beta distribution with parameters m+m0 and n+n0.
A conjugate prior is a prior distribution, such as the beta distribution in the above example, which has the property that the posterior is the same type of distribution.
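The conjugate update can be compared against a direct numerical evaluation of the posterior integral. The sketch below is illustrative only: the function names and the data (7 successes, 3 failures) are hypothetical, and the prior is taken to be uniform on [0, 1], which is the beta distribution with parameters 1 and 1.

```python
def beta_posterior(m, n, m0, n0):
    """Parameters of the beta posterior after m successes and n failures."""
    return m + m0, n + n0


def beta_mean(alpha, beta):
    """Mean of a beta distribution with parameters alpha and beta."""
    return alpha / (alpha + beta)


def numeric_posterior_mean(m, n, prior, steps=100_000):
    """Posterior mean of a by midpoint-rule integration over [0, 1]."""
    h = 1.0 / steps
    num = den = 0.0
    for i in range(steps):
        a = (i + 0.5) * h
        # Unnormalised posterior; the binomial coefficient cancels in the ratio.
        w = a**m * (1.0 - a)**n * prior(a)
        num += a * w
        den += w
    return num / den


m, n = 7, 3  # hypothetical data: 7 successes, 3 failures
alpha, beta = beta_posterior(m, n, m0=1, n0=1)
analytic = beta_mean(alpha, beta)
numeric = numeric_posterior_mean(m, n, prior=lambda a: 1.0)

print(alpha, beta, analytic, numeric)
```

The closed-form posterior mean and the numerically integrated one agree closely, which is the practical appeal of a conjugate prior: the integral never has to be evaluated.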
What is "Bayesian" about Proposition 9 is that Bayes presented it as a probability for the parameter a. That is, not only can one compute probabilities for experimental outcomes, but also for the parameter which governs them, and the same algebra is used to make inferences of either kind. Interestingly, Bayes actually states his question in a way that might make the idea of assigning a probability distribution to a parameter palatable to a frequentist. He supposes that a billiard ball is thrown at random onto a billiard table, and that the probabilities p and q are the probabilities that subsequent billiard balls will fall above or below the first ball. By making the binomial parameter a depend on a random event, he cleverly sidesteps a philosophical quagmire of which he was most likely not even aware.
Computer applications
Bayesian inference has applications in artificial intelligence and expert systems. Bayesian inference techniques have been a fundamental part of computerized pattern recognition since the late 1950s. There is also an ever-growing connection between Bayesian methods and simulation-based Monte Carlo techniques, since complex models cannot be processed in closed form by a Bayesian analysis, while the graphical model structure inherent to all statistical models, even the most complex ones, allows for efficient simulation algorithms such as Gibbs sampling and other Metropolis-Hastings schemes.
As a particular application of statistical classification, Bayesian inference has been used in recent years to develop algorithms for identifying unsolicited bulk e-mail spam. Applications which make use of Bayesian inference for spam filtering include Bogofilter, SpamAssassin, InBoxer, and Mozilla. Spam classification is treated in more detail in the article on the naive Bayes classifier.
In some applications fuzzy logic is an alternative to Bayesian inference. Fuzzy logic and Bayesian inference, however, are mathematically and semantically not compatible: one cannot, in general, interpret the degree of truth in fuzzy logic as a probability, or vice versa.
References
- On-line textbook: Information Theory, Inference, and Learning Algorithms (http://www.inference.phy.cam.ac.uk/mackay/itila/book.html), by David MacKay, has many chapters on Bayesian methods, including introductory examples; compelling arguments in favour of Bayesian methods (in the style of Edwin Jaynes); state-of-the-art Monte Carlo methods, message-passing methods, and variational methods; and examples illustrating the intimate connections between Bayesian inference and data compression.
- Jaynes, E.T. (1998) Probability Theory : The Logic of Science (http://www-biba.inrialpes.fr/Jaynes/prob.html).
- Bretthorst, G. Larry, 1988, Bayesian Spectrum Analysis and Parameter Estimation (http://bayes.wustl.edu/glb/book.pdf) in Lecture Notes in Statistics, 48, Springer-Verlag, New York, New York
- Berger, J.O. (1999) Statistical Decision Theory and Bayesian Statistics. Second Edition. Springer Verlag, New York. ISBN 0-387-96098-8 and also ISBN 3-540-96098-8.
- O'Hagan, A. and Forster, J. (2003) Kendall's Advanced Theory of Statistics, Volume 2B: Bayesian Inference. Arnold, New York. ISBN 0-340-52922-9.
- Robert, C.P. (2001) The Bayesian Choice. Springer Verlag, New York.
- Lee, Peter M. Bayesian Statistics: An Introduction. Second Edition. (1997). ISBN 0-340-67785-6.
- Dawid, A.P. and Mortera, J. Coherent analysis of forensic identification evidence. Journal of the Royal Statistical Society, Series B, 58,425-443.
- Foreman, L.A; Smith, A.F.M. and Evett, I.W. (1997). Bayesian analysis of deoxyribonucleic acid profiling data in forensic identification applications (with discussion). Journal of the Royal Statistical Society, Series A, 160, 429-469.
- Robertson, B. and Vignaux, G.A. (1995) Interpreting Evidence: Evaluating Forensic Science in the Courtroom. John Wiley and Sons. Chichester.
- Gardner-Medwin, A. What probability should the jury address?. Significance. Volume 2, Issue 1, March 2005
See also
- Bayesian model comparison
- Bayesian probability
- Bayesian filtering
- Bayes factor
- Occam's Razor
- Prosecutor's fallacy
- Minimum message length
- Minimum description length
- Gaussian process regression
- Important publications in Bayesian statistics
Bolstad, William M. (2004) Introduction to Bayesian Statistics, John Wiley ISBN 0-471-27020-2
External links
- On-line textbook: Information Theory, Inference, and Learning Algorithms (http://www.inference.phy.cam.ac.uk/mackay/itila/), by David MacKay, has many chapters on Bayesian methods, including introductory examples; compelling arguments in favour of Bayesian methods; state-of-the-art Monte Carlo methods, message-passing methods, and variational methods; and examples illustrating the intimate connections between Bayesian inference and data compression.
- Cause, chance and Bayesian statistics (http://www.abelard.org/briefings/bayes.htm), to facilitate understanding Bayesian statistics. The statistical theory developed by Thomas Bayes enables analysis of conditional and marginal probabilities. Bayesian statistics enables logical inference
- Naive Bayesian learning paper (http://citeseer.ist.psu.edu/30545.html)
- A Tutorial on Learning With Bayesian Networks (http://citeseer.ist.psu.edu/heckerman96tutorial.html)
- Paul Graham. "A Plan for Spam" (http://www.paulgraham.com/spam.html) (exposition of a popular approach for spam classification)
- Commentary on Regina versus Adams (http://www.mcs.vuw.ac.nz/~vignaux/docs/Adams_NLJ.html)