Probability: Information entropy

Friday, 7 February 2014

Information entropy

I spoke about the notion of information entropy: $H(X)=-\sum_i p_i\log p_i$. This has already featured on Examples sheet 1, #16. There we were to consider the multinomial distribution for dice rolls,

$\phi(k_1,\dotsc,k_6)=\displaystyle\frac{n!}{k_1!\cdots k_6!}\frac{1}{6^n}$, where $\sum_ik_i=n$ and $\sum_iik_i=\rho n$.

Using Stirling's formula and setting $k_i=np_i$ we see that this is maximized by maximizing the entropy of $-\sum_i p_i\log p_i$, subject to $\sum_i ik_i = \rho n$. You'll find something like this at the start of the Wikipedia article on Large deviations theory.

Entropy is an important concept in many fields, particular in communication/information theory. Shannon's coding theorem says, informally,

"$N$ i.i.d. random variables each with entropy $H(X)$ can be compressed into more than $N H(X)$ bits with negligible risk of information loss, as $N$ tends to infinity; but conversely, if they are compressed into fewer than $N H(X)$ bits it is virtually certain that information will be lost."

You can learn about this in the Part II course: Coding and Cryptology.