Metropolis-Hastings algorithm
In mathematics and physics, the Metropolis-Hastings algorithm is a method for generating a sequence of samples from a probability distribution that is difficult to sample from directly. The sequence can be used to approximate the distribution (as with a histogram) or to compute an integral (such as an expected value). The algorithm is an example of a Markov chain Monte Carlo method. It is a generalization of the Metropolis algorithm of Metropolis et al. (1953), proposed by Hastings (1970); both papers are cited below. The Gibbs sampling algorithm is a special case of the Metropolis-Hastings algorithm.
The Metropolis-Hastings algorithm can draw samples from any probability distribution <math>p(x)</math>, requiring only that a function proportional to the density can be calculated at <math>x</math>. The algorithm generates a sequence of states <math>x^t</math> which forms a Markov chain, because each state <math>x^t</math> depends only on the previous state <math>x^{t-1}</math>. The algorithm relies on a proposal density <math>Q(x'; x^t)</math>, which depends on the current state <math>x^t</math> and from which a new proposed sample <math>x'</math> can be generated. For example, the proposal density could be a Gaussian function centred on the current state <math>x^t</math>,
- <math>Q(x'; x^t) \sim N(x^t, \sigma^2 I),</math>
reading <math>Q(x'; x^t)</math> as the probability density of generating <math>x'</math> given the previous value <math>x^t</math>.
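As a minimal sketch of such a proposal in Python (the function name, the NumPy generator, and the one-dimensional state are assumptions made for this illustration):

<pre>
import numpy as np

rng = np.random.default_rng(seed=0)

def propose(x_t, sigma=1.0):
    # Draw x' from the Gaussian proposal Q(x'; x^t) = N(x^t, sigma^2),
    # centred on the current state x^t.
    return rng.normal(loc=x_t, scale=sigma)
</pre>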
This proposal density would generate samples centred around the current state with variance <math>\sigma^2 I</math>. We draw a new proposal state <math>x'</math> from <math>Q(x'; x^t)</math> and then calculate a value
- <math>a = a_1 a_2</math>
where
- <math>a_1 = \frac{p(x')}{p(x^t)}</math>
is the likelihood ratio between the proposed sample <math>x'</math> and the previous sample <math>x^t</math>, and
- <math>a_2 = \frac{Q(x^t; x')}{Q(x'; x^t)}</math>
is the ratio of the proposal density in the two directions (from <math>x^t</math> to <math>x'</math> and vice versa). This ratio is equal to 1 if the proposal density is symmetric. The new state <math>x^{t+1}</math> is then chosen according to the rule
- <math>x^{t+1} = \left\{ \begin{matrix} x' & \mbox{if } a \geq 1 \\ x' \mbox{ with probability } a & \mbox{if } a < 1 \end{matrix} \right.</math>
When <math>a < 1</math> and the proposal is rejected (which happens with probability <math>1 - a</math>), the chain stays at the current state, <math>x^{t+1} = x^t</math>. Equivalently, the proposal is accepted with probability <math>\min(1, a)</math>.
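Putting the pieces together, one Metropolis-Hastings update can be sketched as follows, continuing the Python example above. The callables <code>p</code>, <code>q_density</code> and <code>q_sample</code> are hypothetical stand-ins for the target density, the proposal density <math>Q</math> and a sampler for <math>Q</math>; for the symmetric Gaussian proposal above, <math>a_2 = 1</math> and the proposal ratio could be skipped.

<pre>
def mh_step(x_t, p, q_density, q_sample):
    # One Metropolis-Hastings update.
    #   p(x):                    target density (up to a normalizing constant)
    #   q_density(x_new, x_old): proposal density Q(x_new; x_old)
    #   q_sample(x_old):         draws a proposal x' from Q(.; x_old)
    x_prime = q_sample(x_t)
    a1 = p(x_prime) / p(x_t)                                # likelihood ratio
    a2 = q_density(x_t, x_prime) / q_density(x_prime, x_t)  # proposal ratio
    a = a1 * a2
    if a >= 1 or rng.random() < a:
        return x_prime    # accept the proposal
    return x_t            # reject: the chain stays at x^t
</pre>

In practice the ratios are usually computed on the log scale, to avoid numerical underflow when the densities are very small.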
The Markov chain is started from a random initial value <math>x^0</math>, and the algorithm is run for a few thousand iterations until this initial state has been "forgotten". The samples from this period, which are discarded, are known as the burn-in.

The algorithm works best if the proposal density matches the shape of the target distribution <math>p(x)</math>, that is <math>Q(x'; x^t) \approx p(x')</math>, but in most cases this shape is unknown. If a Gaussian proposal density is used, the variance parameter <math>\sigma^2</math> has to be tuned during the burn-in period. This is usually done by calculating the acceptance rate, the fraction of proposed samples that was accepted in a window of the last <math>N</math> samples; a common target is an acceptance rate of roughly 25–50% (theoretical results for Gaussian random-walk proposals suggest about 44% is optimal in one dimension and about 23% in high dimensions). If the proposal steps are too small, the acceptance rate will be high but the chain will mix slowly (i.e., it will move around the space slowly and converge slowly to <math>p(x)</math>). If the proposal steps are too large, the acceptance rate will be very low, because the proposals are likely to land in regions of much lower probability density, so <math>a_1</math> will be very small.
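A complete run with burn-in and acceptance-rate monitoring, as described above, might then look as follows; the unnormalized standard normal target and the step size <code>sigma = 2.0</code> are arbitrary choices for illustration:

<pre>
def run_chain(p, q_density, q_sample, x0, n_samples=5000, burn_in=2000):
    # Run the chain, discard the burn-in, and report the acceptance rate.
    x = x0
    accepted = 0
    samples = []
    for t in range(burn_in + n_samples):
        x_new = mh_step(x, p, q_density, q_sample)
        if x_new != x:            # a rejected proposal returns x unchanged
            accepted += 1
        x = x_new
        if t >= burn_in:          # keep only post-burn-in samples
            samples.append(x)
    return np.array(samples), accepted / (burn_in + n_samples)

# Example: unnormalized standard normal target with a Gaussian random walk.
# Normalizing constants cancel in the ratios, so they are omitted.
sigma = 2.0
p = lambda x: np.exp(-0.5 * x * x)
q_density = lambda x_new, x_old: np.exp(-0.5 * ((x_new - x_old) / sigma) ** 2)
q_sample = lambda x_old: rng.normal(loc=x_old, scale=sigma)
samples, rate = run_chain(p, q_density, q_sample, x0=0.0)
</pre>

If the measured acceptance rate falls far outside the desired range, <code>sigma</code> is adjusted and the burn-in repeated; the tuning should be finished before the retained samples are collected, since a proposal that keeps changing can invalidate the convergence guarantees of the chain.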
References
- Chib, Siddhartha and Edward Greenberg. "Understanding the Metropolis–Hastings Algorithm". The American Statistician, 49:327–335, 1995.
- Hastings, W.K. "Monte Carlo Sampling Methods Using Markov Chains and Their Applications". Biometrika, 57:97–109, 1970.
- Metropolis, N., A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. "Equation of State Calculations by Fast Computing Machines". Journal of Chemical Physics, 21:1087–1092, 1953.