Fisher information
In statistics and information theory, the Fisher information (denoted <math>\mathcal{I}(\theta)</math>) is the variance of the score. It is thought of as the amount of information that an observable random variable <math>X</math> carries about an unobservable parameter <math>\theta</math> upon which the probability distribution of <math>X</math> depends. Since the expectation of the score is zero, the variance of the score equals its second moment, and so the Fisher information can be written
- <math>
\mathcal{I}(\theta) = \mathrm{E} \left[
\left( \frac{\partial}{\partial\theta} \log f(X;\theta) \right)^2
\right] </math>
where <math>f</math> is the probability density function of the random variable <math>X</math> and, consequently, <math>0 \leq \mathcal{I}(\theta) < \infty</math>. The Fisher information is thus the expectation of the square of the score. A random variable carrying high Fisher information implies that the absolute value of the score is frequently large (recall that the expectation of the score is zero).
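For concreteness, the following minimal sketch (in Python, assuming an exponential model <math>f(x;\theta) = \theta e^{-\theta x}</math> purely as an example) estimates <math>\mathcal{I}(\theta)</math> by averaging the squared score over simulated draws and compares the result with the known value <math>1/\theta^2</math> for that model.

```python
import numpy as np

# Monte Carlo check of I(theta) = E[(d/dtheta log f(X; theta))^2]
# for the assumed exponential model f(x; theta) = theta * exp(-theta * x),
# whose score is d/dtheta log f = 1/theta - x and whose Fisher
# information is known in closed form to be 1/theta^2.

rng = np.random.default_rng(0)
theta = 2.0
x = rng.exponential(scale=1.0 / theta, size=1_000_000)

score = 1.0 / theta - x          # derivative of the log-density at the true theta
fisher_mc = np.mean(score ** 2)  # expectation of the squared score

print(fisher_mc, 1.0 / theta ** 2)  # both close to 0.25
```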
This concept is named in honor of the geneticist and statistician Ronald Fisher.
Note that the information as defined above is not a function of a particular observation, as the random variable <math>X</math> has been averaged out. The concept of information is useful when comparing two methods of observation of some random process.
Information as defined above may also be written as
- <math>
\mathcal{I}(\theta) = -\mathrm{E} \left[
\frac{\partial^2}{\partial\theta^2} \log f(X;\theta)
\right] </math>
and is thus the negative of the expectation of the second derivative of <math>\log f(X;\theta)</math> with respect to <math>\theta</math>. Information may thus be seen as a measure of the "sharpness" of the support curve near the maximum likelihood estimate of <math>\theta</math>. A "blunt" support curve (one with a shallow maximum) would have a small expected second derivative in magnitude, and thus low information; a sharp one would have a large expected second derivative in magnitude, and thus high information.
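The two expressions for <math>\mathcal{I}(\theta)</math> can be checked against each other numerically. The sketch below (assuming a Poisson model with mean <math>\theta</math> as an example) estimates both the expectation of the squared score and the negative expected second derivative of the log-density; for this model both should be close to <math>1/\theta</math>.

```python
import numpy as np

# For the assumed Poisson model with mean theta:
#   log f(x; theta)   = x*log(theta) - theta - log(x!)
#   d/dtheta  log f   = x/theta - 1
#   d2/dtheta2 log f  = -x/theta^2
# Both Fisher-information formulas should give 1/theta.

rng = np.random.default_rng(1)
theta = 4.0
x = rng.poisson(lam=theta, size=1_000_000)

score = x / theta - 1.0
second_deriv = -x / theta ** 2

print(np.mean(score ** 2))      # E[score^2]               ~ 0.25
print(-np.mean(second_deriv))   # -E[d2 log f / dtheta2]   ~ 0.25
```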
Information is additive, in the sense that the information gathered by two independent experiments is the sum of the information of each of them:
- <math>
\mathcal{I}_{X,Y}(\theta) = \mathcal{I}_X(\theta) + \mathcal{I}_Y(\theta) </math>
This is because, under independence, the joint log-density is the sum of the individual log-densities, so the joint score is the sum of the individual scores, and the variance of the sum of two independent random variables is the sum of their variances. It follows that the information in a random sample of size <math>n</math> is <math>n</math> times that in a sample of size one (if the observations are independent and identically distributed).
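As a simple illustration, suppose <math>X \sim N(\theta, \sigma_1^2)</math> and <math>Y \sim N(\theta, \sigma_2^2)</math> are independent measurements of the same quantity <math>\theta</math> (an assumed example). Then
- <math>
\mathcal{I}_X(\theta) = \frac{1}{\sigma_1^2}, \qquad \mathcal{I}_Y(\theta) = \frac{1}{\sigma_2^2}, \qquad \mathcal{I}_{X,Y}(\theta) = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} </math>
so the more precise instrument (smaller variance) contributes more information about <math>\theta</math>.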
The information provided by a sufficient statistic is the same as that of the sample <math>X</math>. This may be seen by using Fisher's factorization criterion for a sufficient statistic: if <math>T(X)</math> is sufficient for <math>\theta</math>, then
- <math>
f(X;\theta) = g(T(X); \theta)\, h(X) </math>
for some functions <math>g</math> and <math>h</math> (see sufficient statistic for a more detailed explanation). The equality of information follows from the fact that
- <math>
\frac{\partial}{\partial\theta} \log \left[f(X ;\theta)\right] = \frac{\partial}{\partial\theta} \log \left[g(T(X);\theta)\right] </math>
(which is the case because <math>h(X)</math> is independent of <math>\theta</math>) and the definition for information given above. More generally, if <math>T=t(X)</math> is a statistic, then
- <math>
\mathcal{I}_T(\theta) \leq \mathcal{I}_X(\theta) </math>
with equality if and only if <math>T</math> is a sufficient statistic.
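For example (with an assumed normal model), if <math>X = (X_1, \ldots, X_n)</math> consists of independent <math>N(\theta, \sigma^2)</math> observations with <math>\sigma^2</math> known, then the sample mean <math>\bar{X}</math> is sufficient and carries the full information, while a single observation does not:
- <math>
\mathcal{I}_{\bar{X}}(\theta) = \mathcal{I}_X(\theta) = \frac{n}{\sigma^2}, \qquad \mathcal{I}_{X_1}(\theta) = \frac{1}{\sigma^2} \leq \mathcal{I}_X(\theta) </math>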
The Cramér–Rao inequality states that the reciprocal of the Fisher information is a lower bound on the variance of any unbiased estimator of <math>\theta</math>.
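To make the bound concrete, the sketch below (assuming <math>n</math> Bernoulli trials with success probability <math>\theta</math> as an example) simulates the unbiased estimator <math>\hat{\theta} = A/n</math> and compares its variance with the Cramér–Rao bound <math>1/\mathcal{I}(\theta) = \theta(1-\theta)/n</math>, which this estimator attains exactly.

```python
import numpy as np

# Cramer-Rao check for n Bernoulli(theta) trials (assumed example model).
# The sample proportion A/n is unbiased for theta, and its variance
# theta*(1-theta)/n equals the reciprocal of the Fisher information
# n/(theta*(1-theta)), so the bound is attained here.

rng = np.random.default_rng(2)
theta, n, replications = 0.3, 50, 200_000

a = rng.binomial(n=n, p=theta, size=replications)  # successes per experiment
theta_hat = a / n

print(np.var(theta_hat))        # simulated variance of the estimator
print(theta * (1 - theta) / n)  # Cramer-Rao bound 1/I(theta) = 0.0042
```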
Matrix form
When there are <math>d</math> parameters, so that <math>\theta</math> is a vector of length <math>d</math>, the Fisher information matrix (FIM) has <math>(i, j)</math> element
- <math>
{\left(\mathcal{I} \left(\theta \right) \right)}_{i, j} = \mathrm{E} \left[
\frac{\partial}{\partial\theta_i} \log f(X;\theta) \, \frac{\partial}{\partial\theta_j} \log f(X;\theta)
\right] </math>
The FIM is a <math>d \times d</math> symmetric matrix.
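As a numerical sketch of this definition (assuming a normal model with parameter vector <math>\theta = (\mu, \sigma)</math> as an example), the FIM can be estimated by averaging outer products of the score vector; for this model the analytic FIM is <math>\operatorname{diag}(1/\sigma^2,\ 2/\sigma^2)</math>.

```python
import numpy as np

# Monte Carlo estimate of the Fisher information matrix for an assumed
# N(mu, sigma^2) model with theta = (mu, sigma).  The score components are
#   d log f / d mu    = (x - mu) / sigma^2
#   d log f / d sigma = -1/sigma + (x - mu)^2 / sigma^3
# and the analytic FIM is diag(1/sigma^2, 2/sigma^2).

rng = np.random.default_rng(3)
mu, sigma = 1.0, 2.0
x = rng.normal(loc=mu, scale=sigma, size=1_000_000)

score = np.stack([(x - mu) / sigma**2,
                  -1.0 / sigma + (x - mu)**2 / sigma**3])  # shape (2, m)

fim_mc = score @ score.T / x.size   # average of outer products of the score
print(fim_mc)                       # ~ [[0.25, 0], [0, 0.5]]
print(np.diag([1 / sigma**2, 2 / sigma**2]))
```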
For the multivariate normal distribution
The FIM for an <math>N</math>-variate normal distribution <math>X \sim N(\mu(\theta), \Sigma(\theta))</math>, with mean vector <math>\mu(\theta)</math> and covariance matrix <math>\Sigma(\theta)</math> depending on the parameters <math>\theta</math>, takes a special form. Its <math>(m,n)</math> element is
- <math>
\mathcal{I}_{m,n} = \frac{\partial \mu}{\partial \theta_m} \Sigma^{-1} \frac{\partial \mu^\top}{\partial \theta_n} + \frac{1}{2} \mathrm{tr} \left(
\Sigma^{-1} \frac{\partial \Sigma}{\partial \theta_m} \Sigma^{-1} \frac{\partial \Sigma}{\partial \theta_n}
\right) </math>
where
- <math>
\frac{\partial \mu}{\partial \theta_m} = \begin{bmatrix}
\frac{\partial \mu_1}{\partial \theta_m} & \frac{\partial \mu_2}{\partial \theta_m} & \cdots & \frac{\partial \mu_N}{\partial \theta_m}
\end{bmatrix} </math>
- <math>
\frac{\partial \mu^\top}{\partial \theta_m} = \left(
\frac{\partial \mu}{\partial \theta_m}
\right)^\top = \begin{bmatrix}
\frac{\partial \mu_1}{\partial \theta_m} \\ \frac{\partial \mu_2}{\partial \theta_m} \\ \vdots \\ \frac{\partial \mu_N}{\partial \theta_m}
\end{bmatrix} </math>
- <math>
\frac{\partial \Sigma}{\partial \theta_m} = \begin{bmatrix}
\frac{\partial \Sigma_{1,1}}{\partial \theta_m} & \frac{\partial \Sigma_{1,2}}{\partial \theta_m} & \cdots & \frac{\partial \Sigma_{1,N}}{\partial \theta_m} \\
\frac{\partial \Sigma_{2,1}}{\partial \theta_m} & \frac{\partial \Sigma_{2,2}}{\partial \theta_m} & \cdots & \frac{\partial \Sigma_{2,N}}{\partial \theta_m} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \Sigma_{N,1}}{\partial \theta_m} & \frac{\partial \Sigma_{N,2}}{\partial \theta_m} & \cdots & \frac{\partial \Sigma_{N,N}}{\partial \theta_m}
\end{bmatrix} </math>
- <math>\mathrm{tr}</math> is the trace function.
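A minimal numerical sketch of this formula (with a hypothetical two-parameter model: <math>\theta = (a, s)</math>, <math>\mu(\theta) = (a, 2a)</math>, <math>\Sigma(\theta) = s^2 I_2</math>) is given below; for this model the formula yields <math>\mathcal{I}_{a,a} = 5/s^2</math>, <math>\mathcal{I}_{s,s} = 4/s^2</math>, and <math>\mathcal{I}_{a,s} = 0</math>.

```python
import numpy as np

# FIM of X ~ N(mu(theta), Sigma(theta)) via the formula above:
#   I_{m,n} = dmu/dtheta_m  Sigma^{-1}  dmu^T/dtheta_n
#           + 1/2 tr( Sigma^{-1} dSigma/dtheta_m Sigma^{-1} dSigma/dtheta_n )
# for a hypothetical 2-parameter model: theta = (a, s),
# mu(theta) = (a, 2a), Sigma(theta) = s^2 * I_2.

a, s = 1.0, 2.0
sigma = s**2 * np.eye(2)
sigma_inv = np.linalg.inv(sigma)

dmu = [np.array([1.0, 2.0]),   # dmu/da (row vector, as in the text)
       np.array([0.0, 0.0])]   # dmu/ds
dsigma = [np.zeros((2, 2)),    # dSigma/da
          2 * s * np.eye(2)]   # dSigma/ds

fim = np.zeros((2, 2))
for m in range(2):
    for n in range(2):
        fim[m, n] = dmu[m] @ sigma_inv @ dmu[n] + 0.5 * np.trace(
            sigma_inv @ dsigma[m] @ sigma_inv @ dsigma[n])

print(fim)   # expected: [[5/s^2, 0], [0, 4/s^2]] = [[1.25, 0], [0, 1.0]]
```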
Example: single parameter
The information contained in <math>n</math> independent Bernoulli trials, each with probability of success <math>\theta</math>, may be calculated as follows. (Each trial has two possible outcomes, called "success" and "failure", and can be thought of as a coin flip in which the probability of a "head" is <math>\theta</math> and the probability of a "tail" is <math>1 - \theta</math>.) In the following, <math>A</math> denotes the number of successes, <math>B</math> the number of failures, and <math>n = A + B</math> the total number of trials.
- <math>
\mathcal{I}(\theta) = -\mathrm{E} \left[
\frac{\partial^2}{\partial\theta^2} \log(f(A;\theta))
\right] </math>
- <math>
= -\mathrm{E} \left[
\frac{\partial^2}{\partial\theta^2} \log \left[ \theta^A(1-\theta)^B\frac{(A+B)!}{A!B!} \right]
\right] </math>
- <math>
= -\mathrm{E} \left[
\frac{\partial^2}{\partial\theta^2} \left[ A \log (\theta) + B \log(1-\theta) \right]
\right] </math>
- <math>
= -\mathrm{E} \left[
\frac{\partial}{\partial\theta} \left[ \frac{A}{\theta} - \frac{B}{1-\theta} \right]
\right] </math> (see the logarithm article for the derivative of <math>\log x</math>)
- <math>
= +\mathrm{E} \left[
\frac{A}{\theta^2} + \frac{B}{(1-\theta)^2}
\right] </math>
- <math>
= \frac{n\theta}{\theta^2} + \frac{n(1-\theta)}{(1-\theta)^2} </math> (since <math>\mathrm{E}[A] = n \theta</math> and <math>\mathrm{E}[B] = n(1-\theta)</math>)
- <math>= \frac{n}{\theta(1-\theta)}</math>
The first line is just the definition of Fisher information; the second uses the fact that the information contained in a sufficient statistic is the same as that of the sample itself (here <math>A</math> is sufficient and has a binomial distribution); the third expands the log term and drops a constant; the fourth and fifth carry out the differentiation with respect to <math>\theta</math>; the sixth replaces <math>A</math> and <math>B</math> with their expectations; and the seventh is algebraic manipulation.
The overall result, viz.
- <math>\mathcal{I}(\theta) = \frac{n}{\theta(1-\theta)}</math>
accords with what one would expect, since it is the reciprocal of the variance of the mean of the <math>n</math> Bernoulli random variables (the sample proportion <math>A/n</math>, whose variance is <math>\theta(1-\theta)/n</math>).
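This closed form can also be checked numerically; the sketch below computes <math>-\mathrm{E}\left[\frac{\partial^2}{\partial\theta^2} \log f(A;\theta)\right]</math> exactly by summing over the binomial distribution of <math>A</math> (using example values of <math>n</math> and <math>\theta</math>) and compares it with <math>n/(\theta(1-\theta))</math>.

```python
from math import comb

# Exact check of I(theta) = n / (theta * (1 - theta)) for n Bernoulli trials.
# Since d2/dtheta2 log f(A; theta) = -A/theta^2 - (n - A)/(1 - theta)^2,
# the information is E[A]/theta^2 + E[n - A]/(1 - theta)^2, computed here
# by summing over the binomial distribution of A.

theta, n = 0.3, 10

info = sum(
    comb(n, a) * theta**a * (1 - theta)**(n - a)     # P(A = a)
    * (a / theta**2 + (n - a) / (1 - theta)**2)      # -d2 log f / dtheta2
    for a in range(n + 1)
)

print(info, n / (theta * (1 - theta)))   # both equal ~47.619
```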
When the parameter <math>\theta</math> is vector-valued, the information is a positive-definite matrix, which defines a metric on the parameter space; consequently, differential geometry is applied to this topic.
See Fisher information metric.
Physical information
Information channels are generally imperfect, and this imperfection gives rise to a method of deriving physical laws. The amount of Fisher information that is lost in observing a physical effect is called the physical information. The mathematical procedure of extremizing the physical information by varying the system's probability amplitudes is called the principle of extreme physical information; the amplitudes that result from this extremization define the physics of the source effect.
Books
- Science from Fisher Information: A Unification by B. Roy Frieden (ISBN 0-521-00911-1)