Exponential family
In probability and statistics, the exponential family is an important class of probability distributions. Its importance stems both from mathematical convenience, on account of its members' nice algebraic properties, and from generality, as its members are in a sense very natural distributions to consider.
There are both discrete and continuous members of the exponential family which are useful and important in theoretical or practical work. We use cumulative distribution functions in order to encompass both discrete and continuous distributions. A member of the exponential family has cdf
- <math>dF(x|\eta) = e^{-\eta^{\top} T(x) - A(\eta)}\, dH(x)</math>
If F is a continuous distribution with a density, one can write dF(x) = f(x) dx. The meanings of the different symbols on the right-hand side are as follows:
- H(x) is a Lebesgue-Stieltjes integrator for the reference measure. When the reference measure is finite, it can be normalized and H is actually the cdf of a probability distribution. If F is continuous with a density, then so is H, which can then be written dH(x) = h(x) dx. If F is discrete, then so is H (with the same support).
- η is the natural parameter, a column vector, so that η⊤ = (η1, ..., ηn), its transpose, is a row vector. The parameter space—i.e., the set of values of η for which e^{-η⊤T(x)} is integrable with respect to dH—is necessarily convex.
- T(x) is the sufficient statistic of the distribution, and it is a column vector whose number of scalar components is the same as that of η, so that η⊤T(x) is a scalar. (Note that the concept of sufficient statistic applies more broadly than just to members of the exponential family.)
- and A(η) is a normalization factor without which F would not be a probability distribution. A is important in its own right, as it is the cumulant-generating function of the probability distribution of the sufficient statistic T(X).
The term exponential family is also frequently used to refer to any particular concrete case, i.e., any parametrized family of probability distributions of this form, determined by a choice of H and T.
The Bernoulli, normal, gamma, Poisson and binomial distributions are all exponential families. The Weibull distributions do not comprise an exponential family, nor do the Cauchy distributions.
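For a worked illustration of the definition, consider the Poisson distribution with mean λ. Take dH to be the measure placing mass 1/x! at each nonnegative integer x, take T(x) = x, and set η = −ln λ (the minus sign matching the convention above). Then
- <math>A(\eta)=\ln\sum_{x=0}^\infty {e^{-\eta x}\over x!}=\ln e^{e^{-\eta}}=e^{-\eta}=\lambda,</math>
and the formula above assigns mass
- <math>e^{-\eta x-A(\eta)}\,{1\over x!}={\lambda^x e^{-\lambda}\over x!}</math>
to the point x, which is exactly the Poisson distribution.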
Maximum entropy derivation
The exponential family arises naturally as the answer to the following question: what is the maximum entropy distribution consistent with given constraints on expected values?
The information entropy of a probability distribution dF(x) can only be computed with respect to some other probability distribution (or, more generally, a positive measure), and both measures must be mutually absolutely continuous. Accordingly, we need to pick a reference measure dH(x) with the same support as dF(x). As an aside, for frequentists this is a largely arbitrary choice, while Bayesians can simply make the choice part of their prior.
The entropy of dF(x) relative to dH(x) is
- <math>S[dF|dH]=-\int {dF\over dH}\ln{dF\over dH}\,dH</math>
or
- <math>S[dF|dH]=\int\ln{dH\over dF}\,dF</math>
where dF/dH and dH/dF are Radon-Nikodym derivatives. Note that the ordinary definition of entropy for a discrete distribution supported on a set I, namely
- <math>S=-\sum_{i\in I} p_i\ln p_i</math>
assumes (though this is seldom pointed out) that dH is chosen to be counting measure on I.
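To spell out this reduction: if dH is counting measure on I, then the Radon-Nikodym derivative is (dF/dH)(i) = pi, and the first formula above becomes
- <math>S[dF|dH]=-\sum_{i\in I}p_i\ln p_i.</math>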
Consider now a collection of observable quantities (random variables) Ti. The probability distribution dF whose entropy with respect to dH is greatest, subject to the conditions that the expected value of Ti be equal to ti, is a member of the exponential family with dH as reference measure and (T1, ..., Tn) as sufficient statistic.
The derivation is a simple variational calculation using Lagrange multipliers. Normalization is imposed by letting T0 = 1 be one of the constraints. The natural parameters of the distribution are the Lagrange multipliers, and the normalization factor is the Lagrange multiplier associated to T0.
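In slightly more detail, the calculation runs as follows. Writing f = dF/dH and introducing a multiplier η^α for each constraint, one extremizes
- <math>-\int f\ln f\,dH-\sum_\alpha\eta^\alpha\left(\int T_\alpha f\,dH-t_\alpha\right).</math>
Setting the variation with respect to f to zero gives
- <math>-\ln f-1-\sum_\alpha\eta^\alpha T_\alpha=0,</math>
so that
- <math>f=e^{-1-\sum_\alpha\eta^\alpha T_\alpha},</math>
and the constant −1 is absorbed into the multiplier η0 associated to T0 = 1, leaving exactly the exponential-family form given above.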
Role in statistics
Classical frequentist hypothesis testing is seriously impeded in the case of likelihoods which are not exponential families because of the lack of low-dimensional sufficient statistics. By contrast, Bayesian inference can still be carried out if the requisite numerical integrals can be performed either directly or (more usually) by simulation. The exponential family makes Bayesian estimation procedures very straightforward, because they can be simply expressed in terms of using the observed values of the sufficient statistics to update the parameters of the conjugate prior.
Classical estimation: sufficiency
According to the Pitman-Koopman-Darmois theorem, only in exponential families is there a sufficient statistic whose dimension remains bounded as sample size increases. More long-windedly, suppose Xn, n = 1, 2, 3, ... are independent identically distributed random variables whose distribution is known to be in some family of probability distributions. Only if that family is an exponential family is there a (possibly vector-valued) sufficient statistic T(X1, ..., Xn) whose number of scalar components does not increase as the sample size n increases.
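For example, if X1, ..., Xn are independent normal random variables with unknown mean and variance, then
- <math>T(X_1,\ldots,X_n)=\left(\sum_{j=1}^n X_j,\ \sum_{j=1}^n X_j^2\right)</math>
is a sufficient statistic, and its dimension remains two however large n becomes. No analogous bounded-dimensional sufficient statistic exists for, say, a sample of Cauchy random variables, consistent with the fact noted above that the Cauchy distributions are not an exponential family.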
Bayesian estimation: conjugate distributions
Exponential families are also important in Bayesian statistics. In Bayesian statistics a prior distribution is multiplied by a likelihood function and then normalized to produce a posterior distribution. In the case of a likelihood which belongs to the exponential family there exists a conjugate prior, which is often also in the exponential family. A conjugate prior is one which, when combined with the likelihood and normalized, produces a posterior distribution of the same type as the prior.
For example, if one is estimating the success probability of a binomial distribution, then if one chooses to use a beta distribution as one's prior, the posterior is another beta distribution. This makes the computation of the posterior particularly simple. Similarly, if one is estimating the parameter of a Poisson distribution the use of a gamma prior will lead to another gamma posterior. Conjugate priors are often very flexible and can be very convenient. However, if one's belief about the likely value of the theta parameter of a binomial is represented by (say) a bimodal (two-humped) prior distribution, then this cannot be represented by a beta distribution.
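To make the binomial case explicit: if the prior for the success probability θ is Beta(α, β) and one observes s successes in n trials, then
- <math>p(\theta\mid s)\propto\theta^s(1-\theta)^{n-s}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}=\theta^{\alpha+s-1}(1-\theta)^{\beta+n-s-1},</math>
so the posterior is Beta(α + s, β + n − s): the observed value of the sufficient statistic simply updates the parameters of the prior.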
An arbitrary likelihood will not belong to the exponential family, and thus in general no conjugate prior exists. The posterior will then have to be computed by numerical methods.
Statistical inference
Sampling distributions
As discussed above, the sufficient statistic (T1, ..., Tn) plays a pivotal role in statistical inference, whether classical or Bayesian. Accordingly, it is interesting to study its sampling distribution: if X1, ..., Xm is a random sample—that is, a collection of independent, identically-distributed random variables—drawn from a distribution in the exponential family, we want to know the probability distribution of the statistics
- <math>\widehat t_i={1\over m}\sum_{j=1}^m T_i(X_j).</math>
Letting T0=1, we can write
- <math>dF(\eta)=e^{-\eta^\alpha T_\alpha}\,dH</math>
using Einstein's summation convention, namely
- <math>\eta^\alpha T_\alpha=\eta^0 T_0+\eta^i T_i=\eta^0T_0+\eta^1T_1+\cdots+\eta^nT_n</math>
Then,
- <math>Z[\eta]=\int dF=e^{-\eta^0+A(\eta)}</math>
is what physicists call the partition function in statistical mechanics. The condition that dF be normalized implies that η0=A(η), as anticipated in the above section on information entropy.
Next, it is straightforward to check that
- <math>-{\partial\over\partial\eta^i}\ln Z(\eta)=-{\partial\over\partial\eta^i}A(\eta)=E[T_i\mid\eta],</math>
denoted ti, and
- <math>{\partial^2\over\partial\eta^i\partial\eta^j}\ln Z(\eta)={\partial^2\over\partial\eta^i\partial\eta^j}A(\eta)={\rm Cov}[T_i,T_j\mid\eta]</math>
denoted tij. As the same information can be obtained from either Z or A, it is not necessary to normalize the probability distribution dF by setting η0=A before taking the derivatives. Also, up to the sign convention used here, the function A(η) is the cumulant generating function of the distribution of T, not just for a single dF but for the entire exponential subfamily with the given dH and T.
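As a check, continue the Poisson example from the introduction, where A(η) = e^{−η} and η = −ln λ. In the sign convention of this article the kth cumulant of T is (−1)^k times the kth derivative of A, so
- <math>\kappa_k[T]=(-1)^k{\partial^k\over\partial\eta^k}e^{-\eta}=e^{-\eta}=\lambda,</math>
recovering the familiar fact that every cumulant of a Poisson distribution equals its mean λ.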
The equations
- <math>E[T_i(X)\mid\eta]=t_i</math>
can usually be solved to find η as a function of ti, which means that either set of parameters can be used to completely specify a member of the specific subfamily under consideration. In that case, the covariances tij can also be expressed in terms of the ti, which is useful for estimation purposes as we shall see below.
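In the Poisson example, for instance, t1 = E[T|η] = e^{−η}, which inverts to η = −ln t1, and then
- <math>t_{11}={\partial^2\over\partial\eta^2}A(\eta)=e^{-\eta}=t_1,</math>
so within this subfamily the covariance of the sufficient statistic is determined by its mean.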
We are now ready to consider the random samples mentioned earlier. By linearity of expectation, it follows that
- <math>E[\widehat t_i]=t_i,</math>
that is, the statistic <math>\widehat{t_i}</math> is an unbiased estimator of ti. Moreover, since the elements of a random sample are assumed to be mutually independent,
- <math>{\rm Cov}[\widehat{t_i},\widehat{t_j}]={1\over m^2}\sum_{k,l=1}^m{\rm Cov}[T_i(X_k),T_j(X_l)]={1\over m^2}\sum_{k=1}^m{\rm Cov}[T_i(X_k),T_j(X_k)]={1\over m}t_{ij},</math>
where the middle equality holds because the cross terms (k ≠ l) vanish by independence.
Because the covariance vanishes in the limit of large samples, the estimators <math>\widehat{t_i}</math> are said to be consistent.
More generally, the kth cumulant of the distribution of <math>\widehat{t_i}</math> can be seen to decay as the (k−1)th power of the inverse sample size, so the distribution of these statistics is asymptotically a multivariate normal distribution. To use asymptotic normality (as one would in the construction of confidence intervals) one needs an estimate of the covariances. Therefore we also need to look at the sampling distribution of
- <math>\widehat t_{ij}={1\over m-1}\sum_{k=1}^m (T_i(X_k)-\widehat{t_i})(T_j(X_k)-\widehat{t_j}).</math>
This is easily seen to be an unbiased estimator of tij, but consistency and asymptotic chi-squared behaviour are rather more involved, and depend on the third and fourth cumulants of dF.
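For illustration, combining the asymptotic normality of <math>\widehat{t_i}</math> with the covariance estimator <math>\widehat{t_{ij}}</math> yields approximate confidence intervals of the familiar form
- <math>\widehat t_i\pm z_{\alpha/2}\sqrt{\widehat t_{ii}/m},</math>
where z_{α/2} is the appropriate quantile of the standard normal distribution.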
Hypothesis testing
Confidence intervals
External links
- A primer on the exponential family of distributions (http://www.casact.org/pubs/dpp/dpp04/04dpp117.pdf)