Information entropy

Binary_entropy_plot.png
Entropy is a concept in thermodynamics (see thermodynamic entropy), statistical mechanics and information theory. The concepts of information and entropy have deep links with one another, although it took many years for the development of the theories of statistical mechanics and information theory to make this apparent. This article is about information entropy, the informationtheoretic formulation of entropy.
Contents 
Basic concept
The basic concept of entropy in information theory has to do with how much randomness is in a signal or in a random event. An alternative way to look at this is to talk about how much information is carried by the signal.
As an example consider some English text, encoded as a string of letters, spaces and punctuation (so our signal is a string of characters). Since some characters are not very likely (e.g. 'z') while others are very common (e.g. 'e') the string of characters is not really as random as it might be. On the other hand, since we cannot predict what the next character will be, it does have some 'randomness'. Entropy is a measure of this randomness, suggested by Claude E. Shannon in his 1949 paper A Mathematical Theory of Communication (http://cm.belllabs.com/cm/ms/what/shannonday/paper.html).
Shannon derives his definition of entropy from the assumptions that:
 The measure should be proportional (continuous)  i.e. changing the value of one of the probabilities by a very small amount should only change the entropy by a small amount.
 If all the outcomes (letters in the example above) are equally likely then increasing the number of letters should always increase the entropy.
 We should be able to make the choice (in our example of a letter) in two steps, in which case the entropy of the final result should be a weighted sum of the entropies of the two steps.
Formal definitions
Claude E. Shannon defines entropy in terms of a discrete random event x, with possible states 1..n as:
 <math>H(x)=\sum_{i=1}^np(i)\log_2 p(i).\,\!<math>
That is, the entropy of the event x is the sum, over all possible outcomes i of x, of the product of the probability of outcome i times the log of the probability of i. We can also apply this to a general probability distribution, rather than a discretevalued event.
Shannon shows that any definition of entropy satisfying his assumptions will be of the form:
 <math>K\sum_{i=1}^np(i)\log p(i).\,\!<math>
where K is a constant (and is really just a choice of measurement units).
Shannon defined a measure of entropy (H = − p_{1} log_{2} p_{1} − … − p_{n} log_{2} p_{n}) that, when applied to an information source, could determine the minimum channel capacity required to reliably transmit the source as encoded binary digits. Shannon's formula can be derived by calculating the mathematical expectation of the amount of information contained in a digit from the information source. Shannon's entropy measure came to be taken as a measure of the uncertainty about the realization of a random variable. It thus served as a proxy capturing the concept of information contained in a message as opposed to the portion of the message that is strictly determined (hence predictable) by inherent structures. For example, redundancy in language structure or statistical properties relating to the occurrence frequencies of letter or word pairs, triplets etc. See Markov chain.
Shannon's definition of entropy is closely related to thermodynamic entropy as defined by physicists and many chemists. Boltzmann and Gibbs did considerable work on statistical thermodynamics, which became the inspiration for adopting the word entropy in information theory. There are relationships between thermodynamic and informational entropy. For example, Maxwell's demon reverses thermodynamic entropy with information but getting that information exactly balances out the thermodynamic gain the demon would otherwise achieve.
It is important to remember that entropy is a quantity defined in the context of a probabilistic model for a data source. Independent fair coin flips have an entropy of 1 bit per flip. A source that always generates a long string of A's has an entropy of 0, since the next character will always be an 'A'. Empirically, it seems that entropy of English text is about 1.5 bits per character (try compressing it with the PPM compression algorithm!), though clearly that will vary from text source to text source. The entropy rate of a data source means the average number of bits per symbol needed to encode it.
 Many data bits may not convey information. For example, data structures often store information redundantly, or have identical sections regardless of the information in the data structure.
 The amount of entropy is not always an integer number of bits.
Entropy effectively bounds the performance of the strongest nonlossy (or nearly nonlossy) compression possible, which can be realized in theory by using the typical set or in practice using Huffman, LempelZiv or arithmetic coding.
A common way to define entropy for text is based on the Markov model of text. For an order0 source (each character is selected independent of the last characters), the binary entropy is:
 <math>H(\mathcal{S}) =  \sum p_i \log_2 p_i, \,\!<math>
where p_{i} is the probability of i. For a firstorder Markov source (one in which probabilities are dependent on the immediately preceding character but not on older history except through the immediately preceding character), the entropy rate is:
 <math>H(\mathcal{S}) =  \sum_i p_i \sum_j \ p_i (j) \log_2 p_i (j), \,\!<math>
where i is a state (certain preceding characters) and <math>p_i(j)<math> is the probability of <math>j<math> given <math>i<math> as the previous character (s).
For a second order Markov source, the entropy rate is
 <math> H(\mathcal{S}) = \sum_i p_i \sum_j p_i(j) \sum_k p_{i,j}(k)\ \log \ p_{i,j}(k). \,\!<math>
In general the bary entropy of a source <math>\mathcal{S}<math> = (S,P) with source alphabet S = {a_{1}, …, a_{n}} and discrete probability distribution P = {p_{1}, …, p_{n}} where p_{i} is the probability of a_{i} (say p_{i} = p(a_{i})) is defined by:
 <math> H_b(\mathcal{S}) =  \sum_{i=1}^n p_i \log_b p_i \,\!<math>
Note: the b in "bary entropy" is the number of different symbols of the "ideal alphabet" which is being used as the standard yardstick to measure source alphabets. In information theory, two symbols are necessary and sufficient for an alphabet to be able to encode information, therefore the default is to let b = 2 ("binary entropy"). Thus, the entropy of the source alphabet, with its given empiric probability distribution, is a number equal to the number (possibly fractional) of symbols of the "ideal alphabet", with an optimal probability distribution, necessary to encode for each symbol of the source alphabet. Also note that "optimal probability distribution" here means a uniform distribution: a source alphabet with n symbols has the highest possible entropy (for an alphabet with n symbols) when the probability distribution of the alphabet is uniform. This optimal entropy turns out to be <math> log_b \, n <math>.
Another way to define the entropy function H (not using the Markov model) is by proving that H is uniquely defined (as earlier mentioned) iff H satisfies 1)  3):
1) H(p_{1}, …, p_{n}) is defined and continuous for all p_{1}, …, p_{n} where p_{i} <math>\in<math>[0,1] for all i = 1, …, n and p_{1} + … + p_{n} = 1. (Remark that the function solely depends on the probability distribution, not the alphabet.)
2) For all positive integers n, H satisfies
 <math>
H\underbrace{\left(\frac{1}{n}, \ldots, \frac{1}{n}\right)}_{n\ \mathrm{arguments}} < H\underbrace{\left(\frac{1}{n+1}, \ldots, \frac{1}{n+1}\right).}_{n+1\ \mathrm{arguments}} <math>
3) For positive integers b_{i} where b_{1} + … + b_{n} = n, H satisfies
 <math>
H\underbrace{\left(\frac{1}{n}, \ldots, \frac{1}{n}\right)}_n = H\underbrace{\left(\frac{b_1}{n}, \ldots, \frac{b_n}{n}\right)}_n + \sum_{i=1}^n \frac{b_i}{n} H\underbrace{\left(\frac{1}{b_i}, \ldots, \frac{1}{b_i}\right)}_{b_i}. <math>
Efficiency
A source alphabet encountered in practice should be found to have a probability distribution which is less than optimal. If the source alphabet has n symbols, then it can be compared to an "optimized alphabet" with n symbols, whose probability distribution is uniform. The ratio of the entropy of the source alphabet with the entropy of its optimized version is the efficiency of the source alphabet, which can be expressed as a percentage.
This implies that the efficiency of a source alphabet with n symbols can be defined simply as being equal to its nary entropy.
Derivation of Shannon's entropy
Since the entropy was given as a definition, it does not need to be derived. On the other hand, a "derivation" can be given which gives a sense of the motivation for the definition as well as the link to thermodynamic entropy.
Q. Given a roulette with n pockets which are all equally likely to be landed on by the ball, what is the probability of obtaining a distribution (A_{1}, A_{2}, …, A_{n}) where A_{i} is the number of times pocket i was landed on and
 <math> P = \sum_{i=1}^n A_i \,\!<math>
is the total number of balllanding events?
A. The probability is a multinomial distribution, viz.
 <math> p = {\Omega \over \Tau} = {P! \over A_1! \ A_2! \ A_3! \ \dots \ A_n!} \left(\frac1n\right)^P \,\!<math>
where
 <math> \Omega = {P! \over A_1! \ A_2! \ A_3! \ \dots \ A_n!} \,\!<math>
is the number of possible combinations of outcomes (for the events) which fit the given distribution, and
 <math> \Tau = n^P \ <math>
is the number of all possible combinations of outcomes for the set of P events.
Q. And what is the entropy?
A. The entropy of the distribution is obtained from the logarithm of Ω:
 <math> H = \log \Omega = \log \frac{P!}{A_1! \ A_2! \ A_3! \dots \ A_n!} \,\!<math>
 <math> = \log P!  \log A_1!  \log A_2!  \log A_3!  \dots  \log A_n! \ <math>
 <math> = \sum_i^P \log i  \sum_i^{A_1} \log i  \sum_i^{A_2} \log i  \dots  \sum_i^{A_n} \log i \,\!<math>
The summations can be approximated closely by being replaced with integrals:
 <math> H = \int_1^P \log x \, dx  \int_1^{A_1} \log x \, dx  \int_1^{A_2} \log x \, dx  \dots  \int_1^{A_n} \log x \, dx. \,\!<math>
The integral of the logarithm is
 <math> \int \log x \, dx = x \log x  \int x \, {dx \over x} = x \log x  x. \,\!<math>
So the entropy is
 <math> H = (P \log P  P + 1)  (A_1 \log A_1  A_1 + 1)  (A_2 \log A_2  A_2 + 1)  \dots  (A_n \log A_n  A_n + 1) <math>
 <math> = (P \log P + 1)  (A_1 \log A_1 + 1)  (A_2 \log A_2 + 1)  \dots  (A_n \log A_n + 1) <math>
 <math> = P \log P  \sum_{x=1}^n A_x \log A_x + (1  n) \,\!<math>
Change A_{x} to p_{x} = A_{x}/P and change P to 1 (in order to measure the "bias" or "unevenness", in the probability distribution of the pockets for a single event), then
 <math> H = (1  n)  \sum_{x=1}^n p_x \log p_x \,\!<math>
and the term (1 − n) can be dropped since it is a constant, independent of the p_{x} distribution. The result is
 <math> H =  \sum_{x=1}^n p_x \log p_x \,\!<math>.
Thus, the Shannon entropy is a consequence of the equation
 <math> H = \log \Omega \ <math>
which relates to Boltzmann's definition,
 <math> \mathcal{S} = K \ln \Omega <math>,
of thermodynamic entropy.de:Entropie (Informationstheorie) it:Entropia (teoria dell'informazione) nl:entropie (informatietheorie) pl:Entropia (teoria informacji) da:Entropi zh:熵 (信息论) ru:Информационная энтропия