Wednesday 11 March 2015

Lecture 24

The overhead slides I used for our last lecture are here (best downloaded and viewed with a pdf reader, rather than as a web page).

I showed you a bit of Large deviation theory and Chernoff's upper bound. The lower bound  was quoted without proof (except that in Example 24.1 we did the lower bound for the special case of $B(1,p)$). If perhaps some time mid-summer you have a moment and are curious to see the lower bound proved (which you have learned enough to understand), and more about large deviations, see these slides from a talk I once gave, pages 5–9. This note on the Cramer-Chernoff theorem is also good. Interestingly, Chernoff says in an interview with John Bather that the Chernoff bound should really be named after someone else!

I have used large deviation theory in my research to analyse buffer overflows in queues. But I know almost nothing about the subject of Random matrices. I prepared the content for Section 24.2 because I was intrigued by the results, wanted to learn, and thought it would be something entertaining with which to conclude. I did some of my reading in Chapter 2 of An Introduction to Random Matrices, by Anderson, Guiomet and Zeotouni, and then simplified their treatment to make it suitable for IA.

I thought it was nice that in showing you these two advanced topics, I could bring into play so many of the ideas we have had in our course: Markov and Chebyshev inequalities, moment generating function, sums of Bernoulli r.vs, Stirling’s formula, normal distribution, gambler’s ruin, Dyke words, generating functions, and the Central limit theorem.

So I hope you will consider the applicable courses in your next year of study. In Appendix H I have written some things about applicable mathematics courses in IB.

Monday 9 March 2015

Lecture 23

Lévy's continuity theorem is the same thing as the continuity theorem given in today's lecture, but for characteristic functions.

Here is a little more history about the Central Limit Theorem.

Henk Tijms writes in his book, Understanding Probability: Chance Rules in Everyday Life, Cambridge: Cambridge University Press, 2004,

"The Central Limit Theorem for Bernoulli trials was first proved by Abraham de Moivre and appeared in his book, The Doctrine of Chances, first published in 1718. He used the normal distribution to approximate the distribution of the number of heads resulting from many tosses of a fair coin. This finding was far ahead of its time, and was nearly forgotten until the famous French mathematician Pierre-Simon Laplace rescued it from obscurity in his monumental work Théorie Analytique des Probabilités, which was published in 1812.

De Moivre spend his years from age 18 to 21 in prison in France because of his Protestant background. When he was released he left France for England, where he worked as a tutor to the sons of noblemen. Newton had presented a copy of his Principia Mathematica to the Earl of Devonshire. The story goes that, while de Moivre was tutoring at the Earl's house, he came upon Newton's work and found that it was beyond him. It is said that he then bought a copy of his own and tore it into separate pages, learning it page by page as he walked around London to his tutoring jobs. De Moivre frequented the coffeehouses in London, where he started his probability work by calculating odds for gamblers. He also met Newton at such a coffeehouse and they became fast friends. De Moivre dedicated his book to Newton."

The Wikipedia article on the Central limit theorem mentions two things that would be suitable for the television programme QI.

1. There is a quite interesting explanation of why the term "Central" is used.

The actual term "central limit theorem" (in German: "zentraler Grenzwertsatz") was first used by George Pólya in 1920 in the title of a paper. Pólya referred to the theorem as "central" due to its importance in probability theory. According to Le Cam, the French school of probability interprets the word central in the sense that "it describes the behaviour of the centre of the distribution as opposed to its tails".

Personally, I had always thought it was the second of these, but the first is also plausible.

2. There is a quite interesting Cambridge connection.

A proof of a result similar to the 1922 Lindeberg CLT was the subject of Alan Turing's 1934 Fellowship Dissertation for King's College at the University of Cambridge. Only after submitting the work did Turing learn it had already been proved. Consequently, Turing's dissertation was never published.

Friday 6 March 2015

Lecture 22

You might like to experiment with the bivariate normal. Here is the Mathematica code that I used to plot the joint density function. You can copy and paste this code into Mathematica and they play with viewing from different angles or changing the value of the correlation, r.

S = ParallelTable[
  Plot3D[PDF[
    MultinormalDistribution[{1, 1}, {{1, r }, {r , 1}}], {x, y}],
    {x, -3, 4}, {y, -3, 4}, PlotRange -> All, 
   MeshFunctions -> {#3 &}, PlotPoints -> 50], {r, {0, 0.6}}]

You can also get the code here, and for Kelly betting here.

Inversion of p.g.f and m.g.f.

For discrete random variables working with the p.g.f. is easy. We have

Theorem 12.2The distribution of $X$ is uniquely determined by the p.g.f. $p(z)$. 

The proof is trivial because the distribution of $X$ is contained in coefficients of the power series $p(z)=p_0+p_1z+p_2z^2+\cdots$. It is also easy to tell if any given power series is a p.g.f. All we need is check is that the coefficients are nonnegative and sum to 1.

However, things are not so easy for moment generating functions. Given a power series $m(\theta)=1+m_1\theta  + m_2\frac{1}{2}\theta^2 +\cdots$
it is not easy to tell if there exists a r.v. that has moments $m_1,m_2,\dotsc$.
Also we had without proof the theorem analogous to Theorem 12.2,

Theorem 21.4 The moment generating function determines the distribution of $X$, provided $m(\theta)$ is finite for all $\theta$ in some interval containing the origin.

To prove Theorem 21.4 we would need to show how to recover the p.d.f. from the m.g.f.
In fact, it is easier to use the characteristic function. Let's take a small look at that.

In Lecture 22 we saw that the moment generating function of $N(0,1)$ is $m(\theta)=e^{\theta^2/2}$. This looks a lot like the p.d.f., which is $f(x)=e^{-x^2/2}/\sqrt{2\pi}$.

The characteristic function (c.f.) $\chi$ is obtained by replacing $\theta$ with $it$, $t$ real,

$\chi(t) = \int_{-\infty}^\infty f(x)e^{itx} dx$. This is $\chi(t)=e^{-t^2/2}$ for $N(0,1)$.

We can compute nearly the same integral of $\chi$ to recover $f$:

$\int_{-\infty}^\infty \chi(t)e^{-itx} dt=2\pi e^{-x^2/2}=2\pi f(x)$.

In general, given the c.f. we can recover the p.d.f. using the inversion formula

$f(x)=\frac{1}{2\pi}\int_{-\infty}^\infty \chi(t)e^{-itx} dt$.

The m.g.f. exists only if all moments are finite. However, the c.f. exists even when moments do not exist, since $e^{itX}$ is just some point on a unit circle in the complex plane, and hence nothing is going off to infinity. The Cauchy distribution is an example of a distribution that has no m.g.f, but does have a c.f., namely $e^{-|t|}$. 

Feedback

On Friday March 6, I distributed the usual feedback forms. Alternatively, you can complete this equivalent on-line form, or print the paper form here and hand it to me. If you prefer, you may wait until closer to the last lecture to complete the on-line form or to hand me a paper form.

The on-line version will be sent to my email anonymously (unless you choose to sign your name). After reading the responses, I will forward printouts to the Faculty Office.

Examples Sheet 3

Most of you will have now done Examples Sheet 3. Here are a few notes.

 #4. You'll want to compare this with Examples sheet 4 #5.

 #5. What would be the analogous question to this one, posed for continuous r.vs?

 #8. You should know how to answer this two ways: by using p.g.fs, and also directly via that tower property of conditional expectation.

 #16. How do we know that the answer is $a_z=a^z$, where $a$ the smallest positive root of $a=pa^3+1-p$? We have such a theorem for the extinction probability in a branching processes. How can this coin tossing problem be viewed through the lens of some branching process?

#17. If you need a hint, here is one: $P(|X+Y|>\epsilon)\leq P(|X|\geq \epsilon/2)+P(|Y|\geq \epsilon/2)$.

 #18. The hint is for the third part. But it also useful in answering the second part: "Could you load the die so that the totals {2,3,4,5,6,7} are obtained with equal probabilities?"

Planet Zog problems

Here are two challenging tripos questions from recent years. The first is a problem in geometric probability (as in Lecture 18). To do this question it will help to know that the surface area of a spherical cap is $2\pi R h$, where $R$ is the radius of the sphere and $h$ is the height of the cap. The surface area of the entire sphere is $4\pi R^2$.

2003/II/12
Planet Zog is a ball with centre $O$. Three spaceships $A$, $B$ and $C$ land at random on its surface, their positions being independent and each uniformly distributed on its surface. Calculate the probability density function of the angle $AOB$ formed by the lines $OA$ and $OB$.

Spaceships $A$ and $B$ can communicate directly by radio if  $\angle AOB  < \pi /2$, and similarly for spaceships $B$ and $C$ and spaceships $A$ and $C$. Given angle $\angle AOB =\gamma < \pi /2$, calculate the probability that $C$ can communicate directly with either $A$ or $B$. Given $\angle AOB =\gamma > \pi /2$, calculate the probability that $C$ can communicate directly with both $A$ and $B$. Hence, or otherwise, show that the probability that all three spaceships can keep in in touch (with, for example, $A$ communicating with $B$ via $C$ if necessary) is $( \pi +2)/(4\pi )$.

Spoiler alert! Here is the answer.

The second question is rather different. Here you are being tested on the idea that a sample space can be partitioned into disjoint events.

2004/II/12
Let $A_1,A_2,\dotsc,A_r$ be events such that $A_i\cap A_j = \emptyset$; for $i \neq j$. Show that the number $N$ of events that occur satisfies

$ P(N = 0) = 1 −\sum_{i=1}^rP(A_i) . $

Planet Zog is a sphere with centre $O$. A number $N$ of spaceships land at random on its surface, their positions being independent, each uniformly distributed over the surface. A spaceship at $A$ is in direct radio contact with another point $B$ on the surface if $\angle AOB < \pi/2$. Calculate the probability that every point on the surface of the planet is in direct radio contact with at least one of the $N$ spaceships.

[Hint: The intersection of the surface of a sphere with a plane through the centre of the sphere is called a great circle. You may find it helpful to use the fact that $N$ random great circles partition the surface of a sphere into $N(N − 1) + 2$ disjoint regions with probability one.]

Wednesday 4 March 2015

Lecture 21

I mentioned the beta distribution today. There is more about that distribution in Appendix D. I became much better acquainted with this distribution myself when I was doing some work for a reinsurance company in 2013.

Monday 2 March 2015

Lecture 20

Today we saw how to transform random variables. By this means we can discover further interesting relationships between distributions.

There is a particularly strange fact in Examples sheet 4, #19.

By the way, if (like me) you have ever been puzzled why it is that the absolute value of the Jacobian, $|J|$, gives the volume change of a $n$-dimensional parallelepiped under an linear transformation in $\mathbb{R}^n$, you might enjoy reading this essay A short thing about determinants, or “an attempt to explain volume forms, determinants and adjugates by staying in 3-d”, by Gareth Taylor.

I have also now added to the notes an Appendix C, in which I explain how under a linear transformation a sphere is mapped into an ellipsoid of a volume that greater by a factor $|J|$. I think that in previous years lecturers have said at this point "you already know this from Vector Calculus". But I am not sure that's that case, and so decided it would be nice to try to remove some of the mystery.

The convolution we need to add two Cauchy random variables is tricky. On page 81, the partial fractions method of computing this integral is to write

$\frac{1}{(1+x^2)(1+(z-x)^2)}=\frac{A+Bx}{(1+x^2)}+\frac{C+D(x-z)}{(1+(z-x)^2)}$

for $(A,B,C,D)=\frac{1}{4+z^2}(1,2/z,1,2/z)$. Upon integrating w.r.t. $x$ the terms with $B$ and $D$ give $0$ (by symmetry either side of $0$), and the terms with $A$ and $C$ provide the claimed result of $\frac{2}{\pi(4+z^2)}$.

Using the ideas in today's lectures, one can show that if $X,Y$ are independent $N(0,1)$ then $W=X^2+Y^2$ and $T=\tan^{-1}(Y/X)$ are independent $\mathscr{E}(1/2)$ and $U[-\pi/2,\pi/2]$. We can use this result in the reverse direction to generate normal random variables. Make two independent samples from $U[0,1]$, say $U$ and $V$. Let $R^2=-2\log U$ (which is $\mathscr{E}(1/2)$) and let $ \Theta=(V-\frac{1}{2})\pi$. Then $X=R\cos\Theta$ and $Y=R\sin\Theta$ are independent $N(0,1)$.

Friday 27 February 2015

Lecture 19

The fictional example in Section 19.2 about the heights of UK and Netherlands basketball players showed something striking. Can you think of any real life instances of this sort of effect?

The graphs on page 7 of this report on climate change are interesting in respect of showing how a small change in the mean can give rise to an amplified impact in the tail of a normal distribution. See Full SREX report: Managing the Risks of Extreme Events and Disasters to Advance Climate Change Adaptation, Special Report of the Intergovernmental Panel on Climate Change.

In this graph we see "hot weather" increasing by a factor of 2. 
But "extreme hot weather" increases even more, by a factor of 3

Wednesday 25 February 2015

Lecture 18

The Wikipedia article on the Bertrand paradox has more interesting information and pictures. The is also a Wolfram demonstration project called the Random chord paradox that you can play with.

Examples sheet 4, #2 is this classic question: "A stick is broken in two places, independently uniformly distributed along its length. What is the probability that the three pieces will make a triangle?" For a more a more complicated geometric probability problem see the blog posting on Planet Zog problems.

Monday 23 February 2015

Table of distributions

Here is a table of 5 discrete and 6 continuous distributions. It is supplied to second year students studying Statistics IB, as a reminder of things they are to know from Probability IA.

For each of distribution the table shows its usual notation, p.m.f. or p.d.f., mean, variance, and p.g.f. or m.g.f.

You will want to have at your fingertips most of the information in this table (or the method to by which to produce it). You might find it helpful to make your own table, using this one as a guide.

Some of these 11 distributions are not explicitly mentioned in the schedules for Probability IA. Rather they appear as examples of other things which are mentioned in the schedules. For example, the gamma distribution arises as an example of summing i.i.d. r.vs (in this case exponential r.vs). The multivariate normal provides an example of the p.d.f. of jointly distributed continuous r.vs. The beta distribution appears as an exercise on the joint distribution of order statistics of i.i.d. uniform r.vs. We have not mentioned in a lecture the negative binomial, but this is simply the sum of i.i.d. geometric r.vs. (as done in Examples sheet 3, #5).

Lecture 17

An important special case of Theorem 17.1 is when $Y=aX$, for some constant $a$.
The p.d.f. of $Y$ is then $f_Y(y)=(1/a)f_X(y/a)$.

There are many ways to define transitive ordering relations between random variables. Now you know about $EX\geq EY$ and  $EX\geq_{\text{st}} EY$. Such relationships are important in applications, such as when $X$ and $Y$ are lifetimes of components and we are interested ordering their reliabilities. (Reliability engineering is concerned with determining the ability of a system or component to function under some stated conditions for a specified period of time.)

There are other ordering relations, such as "hazard rate ordering", and "convex ordering". These are described within the Wikipedia article on stochastic ordering. I explained in the lecture that $EX\geq_{\text{hr}} EY\implies EX\geq_{\text{st}} EY\implies EX\geq EY$. The first implication is from
\[
1-F(x) = \exp\left(-\int_0^x h(t)dt\right)
\]and the second from Theorem 17.4. An important order relation in statistics is "likelihood ratio ordering". If $X$ and $Y$ have p.d.fs $f$ and $g$ such that $f(t)/g(t)$ increases with $t$ then $X\geq_{\text{lr}}Y$. It is a stronger ordering than the two we have met, since $X\geq_{\text{lr}}Y\implies X\geq_{\text{st}}Y$. For example, if $X$ and $Y$ are exponential random variables with parameters $1/2$ and $1$, then $X\geq_{\text{lr}}Y$

The treatment of the inspection paradox is this lecture derived from a paper by Sheldon Ross, The inspection paradox, Probability in the Engineering and Informational Sciences, 17, 2003, 47–51. Of course the use of the word "paradox" is somewhat in appropriate, since it is not hard to understand why $X_J\geq_{\text{st}}X_1$.

There are other such inspection paradoxes. Suppose $X_1,X_2,\dotsc$ are i.i.d. $U[0,1]$ and $N$ is the least integer such that $X_1+\cdots+X_N\geq 1$. Then conditional on this,

$X_N\geq_{\text{st}}X_{N-1}\geq_{\text{st}}\cdots \geq_{\text{st}}X_1$.

I did not discuss in the lecture the coda at the end of page 69. I leave you with this puzzle, "Do girls have more brothers than boys do?" The "proof" that the answer is "yes" is false, though initially it looks almost convincing. What is wrong?

What do you think is the truth of the matter? Let $G$ and $B$ be the mean number of brothers possessed by a typical girl and typical boy, respectively.

If every family were to consists of some number of boys and just one girl then $G > B$.

If every family were to consists of only boys or only girls then $G < B$.

What are the minimal conditions under which $G=B$? Must you assume that each child is independently a boy or girl with probability $1/2$?

Friday 20 February 2015

Probability distributions

The chart I showed you, with the distributions we are studying marked in yellow, can be downloaded here. It is from the paper Univariate Distribution Relationships, L. M. Leemis and J. T. McQueston. In that paper you can find explanations of the various relationships that are indicated by the arrows.

Here also is a simpler chart about Relationships between probability distributions. This one shows some relations that the above chart misses out, such as how the binomial can come from the Poisson as $X \mid X+Y$ (see Examples sheet 2, #5).

Wikipedia has a nice List of probability distributions.

In 16.5 of the notes I mention that if $X$ and $Y$ are i.i.d. exponential r.vs then $X/(X+Y)$ is a uniformly distributed r.v. You will prove this in Examples sheet 4, #8. A general method will be given in Lecture 20. But it can also be done this way:

$P(X/(X+Y) \leq t)=P(X(1-t)/t\leq Y)=\int_{x=0}^\infty\left(\int_{y=x(1-t)/t}^{\infty}\lambda e^{-\lambda  y}dy\right)\lambda e^{-\lambda x}dx = t$

which is the c.d.f. of $U[0,1]$.

Lecture 16

The human life plots of force of mortality that I showed in today's lecture can be viewed on the Understanding Uncertainly page: How long are you going to live?

David Spiegelhalter (Winton Professor for the Public Understanding of Risk) makes this comment:

When 7 years old, there's only 1 in 10,000 chance of not making your 8th birthday - this is the safest that anyone has been, anywhere, ever, in the history of humanity. But the jump from 16 to 17 is the biggest relative increase in risk in your life (all this, of course, assumes you're utterly average). 

But I think the most amazing thing is the old Gompertz observation of log-linearity - the log-hazard is quite extraordinarily straight over such a long range (it would be 7 to 90 if it were not for idiotic youth). So, on average, every year older means an extra 9% chance of dying before your next birthday - so your force of mortality doubles every 8 years, like a nasty compound interest. And it will get you in the end.
He makes reference to the Gompertz distribution, which is one with a linearly increasing hazard rate.

Notice that the hazard rate curve (force of mortality) has a 'bath-tub' shape - there is an early high-risk period immediately following birth, followed by long low period of annual risk, until accidents start kicking in for 17-year-old-boys, and then steadily increases. At the above link there are some tools for interactively exploring how your life expectancy changes as a function of your present age, gender and lifestyle (non-smoker, moderate alcohol, 5 a day, exercise).

Most things (such as car engines, tyres, light bulbs) have an increasing hazard rate. But some things have a decreasing hazard rate. A business that has lasted two centuries is less likely to go bankrupt next year than one that only started two years ago.

Notice that if $X\sim U[0,1]$ then $h(x)=f(x)/(1-F(x))=1/(1-x)$, which tends to $\infty$ as $x\to 1$.

Some fine fine print. I said that a continuous random variable is one that has a continuous c.d.f., and it is this that puts the "continuous" in its name. However, the fine print (which you can find in my notes near the middle of page 63) is that we actually need the c.d.f. to absolutely continuous, and this will ensure that a p.d.f. exists. The qualifier "absolutely" rules out weirdly pathological c.d.fs. For example, the Cantor distribution has a continuous c.d.f., but it has no p.d.f. Of course none of this fine print is examinable in Part IA. But I know that some of you find this type of thing interesting.

Wednesday 18 February 2015

Simulation of a random walk

Here is something for Lecture 15 (on random walk).

Simulating the Simple Random Walk
from the Wolfram Demonstrations Project by Heikki Ruskeepää

Click above to play with a simulation of random walk produced by tossing a fair coin. The "seed" reruns with a different set of random coin tosses. The curves are confidence intervals, within which the walk will lie with high probability. Random walks are very important in physics, biology, and finance.



There are many interesting questions that one can ask about symmetric random walk, i.e. the random paths produced by tossing a fair coin (up 1 for a head, down 1 for a tail).
  • How many times on-average does a walk of length $n$ cross the $x$-axis?
  • What is the distribution of the last crossing of the $x$-axis? (recall Lecture 1!)
  • What is the distribution of the terminal value $X_n$?
  • As you try different seeds, do you think some walks look more "typical" than others?

Lecture 15

The Wikipedia page about random walk is worth a look. Today, we focused upon the Gambler's ruin problem. However, there are many more interesting things that can be said about random walks. For example, one can consider random walks on arbitrary graphs. More about random walks will be presented to you in the course Markov Chains, Part IB.

If you wish to ensure you never go bankrupt then rather than bet £1 at each go you could bet £ $fz$, where $0<f<1$. That way you come next to $(1+f)z$ or $(1-f)z$. Kelly betting, is about choosing the optimal value of $f$.

The continuous version of random walk is Brownian motion, which is presented in Stochastic Financial Models, Part II. One of its beautiful properties is that it is self-similar. The graphic below shows how Brownian motion looks when we place it under the microscope and gradually increase the magnification. It is possible to say of Brownian motion in 2 dimensions that "almost surely every arc of the path contains the complete works of Shakespeare in the handwriting of Bacon." (Peter Whittle). Just as we might say in dimension 1 that at some magnifications the path below looks like a graph of FTSE all-share index for the past 5 years.

Monday 16 February 2015

Simulation of a branching process

Here is something for Lecture 14 (on branching processes).

Readme first. You can run this demo in your browser, but you will need to install the Wolfram CDF player. It may take a few moments to load, so please be patient. Also, there seems to a problem at present with the plugin running in Chrome, so you might have to use a different browser. The best thing is to download the demo from the link on the page ("Download Demonstration as CDF") and then run the demo in Mathematica on your own computer. Here is a screenshot from the demo.

This Demonstration shows up to six simulated generations of a branching process. The distribution of the number of descendants of each member of the population can have either a Bernoulli, Poisson, or geometric distribution; the parameters of these distributions can be set with sliders. From the three figures, the left one shows the simulated branching process, the upper-right one shows the probability of ultimate extinction as a function of a parameter of the distribution, and the lower-right one shows the probabilities of the size of the population at the selected generation.

To run the demo click on Simulating the Branching Process
[from the Wolfram Demonstrations Project by Heikki Ruskeepää]

Lecture 14

In this lecture you have had a taste of the theory of branching processes.The calculations we did today in Theorem 14.2 are similar to the sort of thing you need to do in answering Examples Sheet 3, #10, 14, 15. Question #16 does not look like one about branching processes, but you might think about trying to map into a version of a branching process so that you can apply theorems from this lecture.

If you would like to read something more about branching processes, I recommend to A Short Introduction to Branching Processes by Gesine Reinert (a professor of statistics at Oxford).

Branching processes have interesting applications beyond the obvious ones of population biology, e.g. in rumours and advertising. I asked Gesine if she might recommend something beyond the basics. She has written to me, "One application of branching processes is to assess the shortest path length between two randomly chosen vertices in sparse networks; but that may be a bit much for first-year students."

In talking about branching processes with more than one type of individual I was drawing on  Gesine's slides 26-28, and 30-32. In these you can see the idea of a multi-type branching process. We see that it is possible to prove a theorem similar to our Theorem 14.3, but now the probability of extinction depends on the maximum eigenvalue of the matrix $M=(m_{ij})$, where $m_{ij}$ is the mean number of offspring of type $j$ produced by a parent of type $i$. The branching process becomes extinct with a probability 1 or $<1$ as the eigenvalue is $\leq1$ or $>1$.

Suppose there are two types. The key quantities are the two multi-dimensional p.g.f.s

$F_i(z_1,z_2)=E_i\Bigl[z_1^{X_{i1}}z_2^{X_{i2}}\Bigr]$, $i=1,2$,

where $X_{ij}$ is a r.v. having the distribution of the number of offspring of type $j$ produced by a parent of type $i$.

 Let $q_i$ be the probability of extinction when we start with one individual of type $i$. Then, analogous to  Theorem 14.3, $(q_1,q_2)$ are smallest solutions in $[0,1]^2$ to

$q_1 = F_1(q_1,q_2)$
$q_2 = F_2(q_1,q_2)$.

Saturday 14 February 2015

Examples Sheet 2

Most of you will now be doing Examples Sheet 2. Let me make a few comments about some things I hope you learn from that sheet. Some of what I say below is "in brief', and you may need to fill in details or ask your supervisor.

 #3,4. There is often more than one nice way to solve a problem. I think that both of these problems have at least two nice methods of solution. Have you been able to think of more than one way to solve them? For example, #4 can be solved by summing a geometric series. It can also be solved via a simple order 0 recurrence relation.

 #6 (ii). You should discover that $Y$ and $X$ are independent. So $X=10$ provides no information about $Y$. What is the intuition for this? Well, we might imagine that $X$ and $Y$ are initially numbers of "detectable" and "non-detectable" misprints, created as independent samples from the Poission distribution with parameter $\lambda/2$. The page has $X+Y$ misprints (which is distributed Poisson, parameter $\lambda$). The proofreader finds the detectable ones. In doing this she discovers nothing about $Y$. This model gives the same distribution of $X$ and $Y$ as one in which there are $X+Y$ misprints and each is detected with probability $1/2$.

 #7,8. The results in these questions are often proved by students in something like 2 pages of working. But they can also be set out in perhaps only 2-4 lines of working. Make sure you learn (from your supervisor) the fast way to derive these results.

 #10. There are several ways to show that $Y_1,\dotsc,Y_n$ are independent. One way is via the fact that $P(Y_1=1,Y_2=1,\dotsc,Y_n=1)=1/n!=P(Y_1=1)\cdots P(Y_n=1)$. Then apply the result of question #2.

 #12. The coupon collector's problem. This question is one that should lodge permanently in your brain. There are many problems in probability, design of computer science algorithms, and elsewhere, which at some point involve this problem. I think it is interesting that the expected time to collect the $n$ figures is about $n\log n$. However, the expected time needed to collect $99\%$ of the figures is only about $n\log 100$. So it is the last fraction of figures that are costly to collect.

 #14. Did you notice that this method would work for simulating any probability $\alpha\in (0,1)$? For example, if we want $P(A)=1/5$, we write this in binary as $\alpha=0.0011001100...$ and then toss a coin. If we see TTHHH we would say $A$ has not occurred, because translating this to $.00111$, it differs from the binary form of $\alpha$ by representing a number $>\alpha$.

 #16. This questions has been set to IA students for many years. In my experience few students are able to solve it. But the solution is so beautiful that it would be a pity if your supervisor were not given the opportunity to explain it to you.

 #17,18. When I say, "This section is for enthusiasts", I am hoping that most students will think "This means me". After all, if you are reading Mathematics at Cambridge you are by definition a mathematics enthusiast. Even if you cannot solve these questions fully you should at least make some attempt. I would expect almost all students to be able to do #17 for the case $k=2$, $n=3$. It is a good general principle to tackle a difficult problem by first trying to solve a simple special case.

Discrete Fourier transforms and p.g.f.s

In non-examinable Section 13.4 and 13.5 I showed you some moderately sophisticated applications of the simple ideas we are learning about p.g.f. and conditional expectation. I have added an Appendix B to the notes to say a bit more about the fast Fourier transform.

The brief story is this. To find the distribution of $Y$, the sum of two i.i.d. r.vs $X_1$ and $X_2$ taking values in $\{0,1,\dotsc,N-1\}$, we might find its p.g.f. $$p_Y(z)=p(z)^2=(p_0+p_1z+\cdots+p_{N-1}z^{N-1})^2.$$ This would take $O(N^2)$ multiplications of the form $p_ip_j$.

However, it is quicker to evaluate $p(z)$ at $2N$ complex values $z=\omega^0,\dotsc,\omega^{2N-1}$, square those values, and then recover the real coefficients of the polynomial $p(z)^2$. The calculation simplifies because by taking $\omega$ as a $2N$th root of unity, $\omega=\exp(-i\pi/N)$, the calculation of the $p(\omega^k)$, $k=0,\dotsc,2N-1$, can be done by a clever method, the discrete fast Fourier transform, which needs only $O(N\log N)$ multiplications.

Further details are in Appendix B of the notes, and the Part II course, Numerical Analysis.

Friday 13 February 2015

Lecture 13

The tower property of conditional expectation (also  called the law of iterated expectations) is akin to the law of total probability that we met in Section 6.3. Another name is law of total expectation. A special case is

$E[X]=\sum_{i:P(A_i)>0} E[X \mid A_i]P(A_i)$,

where $A_1,\dotsc,A_n$ are events giving a partition of the sample space $\Omega$. Compare the above to

$E[X]=E\big[E[X \mid Y]\bigr]=\sum_y E[X \mid Y=y]P(Y=y)$.

Sometimes we might write $E\Big[E[X \mid Y]\Bigr]$ as $E_Y\Big[E[X \mid Y]\Bigr]$, just to remind ourselves that the outer expectation is being applied to the variable $Y$. Of course it is implicit in all this that $E[X]$ exists.

Remember that $P(A | B)$ is only defined for $P(B)>0$.  If you want to see something that illustrates this in a cute way, then read this nice description of Borel's paradox. The paradox occurs because of a mistaken step in conditioning on an event of probability $0$, which can be reached as a limit of a sequence of events in different ways.

Wednesday 11 February 2015

Lecture 12

Let me remind you that you can install copies of Mathematica and Matlab software. It can be fun to use these to help with an appreciation of things we are doing. Today I used this Mathematica program to calculate some probability generating functions.

(* Roll a fair die 10 times *) 
(* What is the probability the sum is 50? *)

p[z_] = (1/6) (z + z^2 + z^3 + z^4 + z^5 + z^6)
SeriesCoefficient[p[z]^10, {z, 0, 50}]
% // N

Series[1/(1 - x - x^2), {x, 0, 10}]

Table[Fibonacci[n], {n, 0, 10}]

Series[(1 - Sqrt[1 - 4 x])/(2 x), {x, 0, 10}]

Table[CatalanNumber[n], {n, 10}]

Mathematica is available here http://www.maths.cam.ac.uk/computing/software/mat... 
Matlab is here http://www.maths.cam.ac.uk/undergrad/catam/softwa...


A very nice tripos question on probability generating functions is the following, 2004/2/II/9F. You should now be able to do this (and could discuss in your next supervision)
    A non-standard pair of dice is a pair of six-sided unbiased dice whose faces are numbered with strictly positive integers in a non-standard way (for example, (2,2,2,3,5,7) and (1,1,5,6,7,8)). Show that there exists a non-standard pair of dice A and B such that when thrown

    P(total shown by A and B is $n$) = P(total shown by pair of ordinary dice is $n$).
    for all $2\leq n\leq 12$,

    [Hint: $(x+x^2+x^3+x^4+x^5+x^6)=x(1+x)(1+x^2+x^4)=x(1+x+x^2)(1+x^3)$.]
We saw today that generating functions are useful in realms beyond probability, as example being that

"the $n$th Fibonacci number, $F_n$, is the coefficient of $x^n$ in the expansion of the
function $1/(1−x−x^2)$ as a power series about the origin."

That is a quote from chapter 1 of Herbert Wilf's fun book called generatingfunctionology, whose second edition is free to download. In his book you can read about many weird and wonderful things that can be done with generating functions. He starts Chapter 1 by writing

"A generating function is a clothesline on which we hang up a sequence of numbers for display."

Generating functions can encapsulate complicated things in a lovely way.  For example, the coefficient of $x^n/n!$ in the power series expansion of $e^{e^x-1}$ is the Bell number $B(n)$, i.e. the number of different partitions that can be made from a set of $n$ elements. In Chapter 1 of Wilf's book you can see a derivation of this fact.

Examples Sheet 1

Most of you will have now done Examples Sheet 1. Let me make a few comments about some things I hope you learn from that sheet.

#7. (about infinite sequence of coin tosses and limsup of events) This was done in Section 4.4 of the notes. An even quicker solution can be done using the fact that $P$ is a continuous function (proved in Section 7.1 of notes).

$P(C)=P(\bigcap_{n=1}^\infty\bigcup_{k=n}^\infty A_k)=\lim_{n\to\infty}P(\bigcup_{k=n}^\infty A_k)\leq\lim_{n\to\infty}(p_n+p_{n+1}+\cdots)=0$.

This was part of the tripose question 2014, 2, II9.

#8. (sets none of which is a subset of the other) Think how this question would look if everything except the first and last sentences of its statement were removed. It would look hard! See how adding a layer of probability helps prove a theorem that seems to have nothing to do with probability. We saw this in Example 4.5 (Erdos's probabilistic method) and we will see this a couple more times in the course, next in Section 11.4 (Weierstrass approximation theorem).

#13 (Mary tosses more heads than John) and #18 (41 bags of red and blue balls). Both these questions have very beautiful solutions. Please make sure your supervisor shows you the beautiful solutions, if you have not found them yourself.

#16 This question relates to what I did in Lecture 10 about information entropy.

#17 (goat ate the cards!) I invented this question as a sort of joke - but I am pleased that it has turned out to be very thought provoking for many students. If the "penny has dropped" for you about this question then you are on your way to developing some good intuition in the field of probability.

Comments are welcome if you have any ideas for good questions on this part of the course.

Monday 9 February 2015

Lecture 11

The probabilistic proof of the Weierstrass approximation theorem was to show you how probability theory can be used to prove an important theorem in a different field. We did something similar in Example 4.5 (Erdos probabilistic method in combinatorics) and in Examples Sheet 1, #8. The Wikipedia page on Bernstein polynomials has a list of Bernstein polynomials and an animation showing how they are used to approximate a continuous function on $[0,1]$.

I cheated today by using a result from the schedules for Analysis II ("a continuous function on a closed bounded interval is uniformly continuous"). I had thought about giving an easier proof by not appealing to Analysis II, but I decided it would be nicer to show you the proper full result.

The proof used Chebyshev's inequality and illustrated a nice idea that is frequently used in probability calculations. We partitioned the sample space into $A$ and $A^c$ and reasoned differently in each. For $A$ we had $P(A)$ small and $|f(X/n)-f(x)|\leq 2$, whereas for $A^c$ we had $P(A^c)\leq 1$ and $|f(X/n)-f(x)|$ small.

Benford's law is surprising when first encountered. The fact that the leading digit should be "1" with a frequency of 0.301 (rather than $1/9$) has been used as a test to check for fraud in financial records. The idea is that if a person is making up numbers, he unlikely to arrange that his distribution of first digits agree with Benford's law. You can read more about that, and many other examples of Benford's law, in the Wikipedia article referenced above.

In discussing why Benford's Law applies to leading digits of Fibonacci numbers I used a result that if $\beta=\log_{10}\alpha$ is irrational, then the sequence of fractional parts $\{]n\beta[\}_{n=1}^\infty$ is uniformly distributed. This result is certainly very plausible, but a proper proof is beyond our scope. It involves an analysis via the discrete Fourier transform of the sequence of length $N$, showing that this approaches the Fourier transform of a sequence of numbers that are uniformly distributed on $[0,1]$.

Friday 6 February 2015

Lecture 10

I mentioned  a 318 page book completely devoted to the Cauchy-Schwarz inequality. It is The Cauchy-Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities. Its author, Michael J. Steele, is a past President of the Institute of Mathematical Statistics. One reviewer wrote "... this is one of that handful of mathematics books that you can read almost like a novel.Chapter 1 is free to download and really interesting to read. I can also recommend his highly entertaining web site, which includes material for advanced courses in probability and finance, and also Semi-Random Rants, Videos and Favorite Quotes.

There are many proofs of this inequality. Here are two in addition to the one I did in lectures.

1. Consider $E(tX-Y)^2]=t^2E[X^2]-2t E[XY]+E[Y^2]\geq 0$, and $>0$ unless $tX=Y$. This quadratic has most one real roots and so the discriminant must be nonpositive. This is essentially the proof that was given by Schwarz and is a textbook favorite.

2. We started the lecture with Jensen's inequality. It is natural to ask if the Cauchy-Schwarz inequality can be deduced from it. One version of the Cauchy-Schwarz inequality is

$(\sum_i a_ib_i)^2\leq \sum_ia_i^2\sum_ib_i^2$,

for all numbers $(a_1,\dotsc,a_n)$ and $(b_1,\dotsc,b_n)$. WLOG we may assume all $a_j,b_j> 0$. Consider a random variable $X$ for which $P(X=a_j/b_j)=b_j^2/\sum_ib_i^2$. Apply Jensen's inequality with the choice of convex function $f(x)=x^2$. $Ef(X)\geq f(EX)$ gives the result.

This paper presents 12 different proofs of the Cauchy-Schwarz inequality, and this web page has 10.

Wednesday 4 February 2015

Non-transitive dice

A set of 3 non-transitive dice can be bought from Maths Gear. They have sides like this:

Die A: 333336
Die B: 222555
Die C: 144444   $P(A>B)=P(B>C)=21/36$ and $P(C>A)=25/36$.

These are optimal, in that they maximize the smallest winning probability. However, there are other dice for which the sum of the 3 winning probabilities is greater.

Die A: 114444
Die B: 333333
Die C: 222255   $P(A>B)=P(B>C)=24/36$ and $P(C>A)=20/36$. 

There is an interesting article by James Grime (of the Millennium Mathematics Project) in which he says lots more about non-transitive dice. He has devised a set of 5 non-transitive dice. In another interesting article there is a video in which James discusses non-transitive dice with David Spiegelhalter.

James made a contribution to this blog last year, and you can read his comment at the end of the post here: http://weberprobability.blogspot.co.uk/2014/02/nontransitive-dice.html

Lecture 9

The account of Warren Buffett, Bill Gates and Efron's dice that I gave in lectures you can find here.

The "Names in Boxes Problem", which I talked about today, is a puzzle which I first heard described around a lunch table at AT&T Laboratories, New Jersey, around 2003. The problem was posed in a paper, with a less catchy title:

Anna Gal and Peter Bro Miltersen. The Cell Probe Complexity of Succinct Data Structures.
Proceedings of 30th Intemational Colloquium on Automata. Languages and Programming (ICALP) 2003, 332-344.

It won a best paper award at this ICALP conference. The solution was found by Sven Skyum, a colleague of Miltersen's at the University of Aarhus If you would like to read more about the history of the problem, I recommend this paper: The Locker Problem, (by Max Warshauer, Eucene Curtin, Michael Kleber, Ravi Vakil, The Mathematical Intelligencer, 2006, Volume 28, Issue 1, pp 28-31).

The puzzle is also described in a collection complied by Peter Winkler callled, "Seven Puzzles You Think You Must Not Have Heard Correctly". I think it is remarkable that we can work out all the needed facts about cycle lengths in a random permutation by using only the simple ideas from probability that we have encountered in our course thus far.

Peter is great collector of good puzzles. He has some nice ones that are geographic. How about these (answer without looking at a map):
  1. If you travel due south from Miami, Florida, what part of the South American continent do you first hit?
  2. What two cities, one in North America and one Africa, are closest?
I have been told that Question 2 in "Seven Puzzles You Think You Must Not Have Heard Correctly" has been used by some colleagues as an entrance interview question. But I think that is rather tough!

Monday 2 February 2015

Lecture 8

In Example 8.3 we came to the conclusion that the expected number of couples sitting together is $EN=2$. However, --- as someone pointed out to me afterwards ---, this is wrong for $n=1$, since we would be double-counting when at a table of two places Mr Smith has Mrs Smith on both his left and right. I have changed the notes so that this example specifies $n\geq 2$.

Cycling home I was thinking how to prove that as $n\to\infty$ the distribution of $N$ tends to the Poisson distribution with mean 2. One way to do this would be to calculate $EN^r$ for $r=1,2,\dotsc\,$, just as we have already done for $r=1,2$. That will be difficult, but not impossible. Then we show that as $n\to\infty$ these moments tend to the moments of the Poisson random variable. As we shall see in Lecture 21, there is a 1-1 relationship between distributions and their moments.

Zipf's law.
Simply as a fun fact, I told you today about Zipf's law. This is a probability mass function such that $P(X=i)\propto 1/i^s$, $i=1,2,\dotsc,m$, for some $s > 0$, typically $s=1$. It is very mysterious why this distribution should be a good fit to frequencies with which words are used in a language.

According to the Wikipedia article linked above, "Zipf himself proposed that neither speakers nor hearers using a given language want to work any harder than necessary to reach understanding, and the process that results in approximately equal distribution of effort leads to the observed Zipf distribution". My guess is thinking that he is thinking that to retrieve the $i$th most popular word from your memory has a cost proportional to $i$ (like you were digging down into a list), and that the average workload that each word puts on the brain should be constant, so $ip(i)$ should constant. But this is clearly hand-waving. Maybe you can think of a better explanation

In queueing theory and other theories of congestion, delays and long waiting times can be caused by variability in service times. Suppose customers enter a queue at a constant rate of 1 every 5 minutes. They are served one at a time, service times are independent, taking $1,2,\dotsc$ minutes with probabilities $p_1,p_2,\dotsc$, respectively

An interesting fact (researched extensively during the past 30 years) is that there are large qualitative difference between how such queues behave when $p_i$ decays geometrically or exponentially (as do the Poisson, or geometric distribution, which has $p_i= q^ip$), and when it decays according to a power law, like $p_i\propto 1/i$ or $\propto 1/i^{0.9}$.

Friday 30 January 2015

The St Petersburg Paradox

In Lecture 7 I mentioned that $EX$ may not exist, and that for a random variable $X$ such that $X\geq 0$ it is possible for $EX=\infty$. Here's a famous example arising from the case $EX=\infty$. See also, Examples sheet 3, #19.

The St Petersburg Paradox
How much would you be willing to pay to play the following game? We will toss a fair coin repeatedly. We start you with £1 and then double this every time the coin shows a head. We stop when the first tail occurs. You keep your winnings, denoted by the random variable $X$. Now

$EX = (1/2)1+(1/4)2+(1/8)4+(1/16)8 +\cdots = \infty$.

Since the expected value of your winnings is infinite, it seems like you should be willing to pay an arbitrarily large amount to play this game! But would you really be willing to pay £1,000,000? (if you could!) Surely not. If you paid £1,000,000 for the right to play this game, the probability is 0.9749 that you will win less than you paid.

This is the St. Petersburg paradox  posed by Nicolas Bernoulli 1713.The paradox concerns what the mathematics predicts versus what a rational person might actually do in practice. One resolution is that a person's utility for money is not linear. We should really be looking at $Eu(X)$, where perhaps $u(X)=\log X$. In that case $Eu(X) < \infty$.

Lecture 7

You might like to revisit the blog for Lecture 4. The example I do there about tossing coins for which $\sum_k p_k=\infty$ makes use of both continuity of $P$ (proved today), and independence of events (Lecture 5).

We were careful today in defining the notion of a random variable. It is a function mapping elements of the sample space $\Omega$ to some other set $\Omega_X$ (which is typically some set of real numbers). So $X$ is a function and $X(\omega)$ is value in its range. Most of the time we leave out the $\omega$ and speak of an event like $X\leq 1.45$. This really means the event $\{\omega : X(\omega)\leq 1.45\}$.

I remarked that the terminology ‘random variable’ is somewhat inaccurate, since a random variable is neither random nor a variable. However, a random variable has an associated probability distribution and one can think informally about a random variable as "a variable which takes its value according to a probability distribution".

The reason we like to define a random variable $X$ as a function on $\Omega$ is that it ties $X$ back to an underlying probability space, for which we have postulated axioms and know their consequences. If we try to define a random variable $X$ only in terms of its probability distribution, we would not have such a strong axiomatic basis from which to derive further properties of $X$.

Fine print. In Lecture 4 I gave you the definition of a probability space $(\Omega,\mathscr{F},P)$. For $T\subset\Omega_X$, we can calculate $P(X\in T)=P(\{\omega:X(\omega)\in T\})$ only if $\{\omega:X(\omega)\in T\}=X^{-1}(T)\in\mathscr{F}$. This may put restrictions on allowable $T$ and $X$. But this will not bother us in Probability IA, since we will only look at situations in which $\mathscr{F}$ consists of all subsets of $\Omega$, or $T$ is something nice like a subinterval of the reals.

Wednesday 28 January 2015

Bayes's theorem

Bayes' theorem looks innocent enough, but from it emerges an entire school of statistical reasoning, Bayesian statistics. We have a hypothesis $A$, which might be true or false. A priori we think it is true with probability $p$. We collect some data and then want to update our belief that $A$ is true, that is find $P(A\mid\text{data})$. We can find the answer using $P(\text{data}\mid A)$ and Bayes' theorem.

Here is a YouTube video of students in a sing-a-long of George Box's "There's no theorem like Bayes' theorem", sung to the tune of "There's no business like show business".

There's no theorem like Bayes' theorem
Like no theorem we know
Everything about it is appealing
Everything about it is a wow
Let out all that a priori feeling
You've been concealing right up to now!

Further words are here.

Lecture 6

I mentioned David Aldous's blog post Presenting probability via math puzzles is harmful.

Someone says to you: "I have two children one of whom is a boy"? To assess the probability that both are boys we must make assumptions about what the speaker really means, and what other things he might have said in other circumstances. You might like to read David Aldous's twins example:

In casual conversation with a stranger in the next airplane seat, you ask "do you have children?" and the response is "yes, two boys". Based on this, what is the chance that the two boys are twins?

Aldous argues that the probability is very close to $0$, because most people, if asked this question, would mention having twins if that were the case. We should not abandon common sense when doing maths problems. He gives this example:

"If you ask a politician in government "has unemployment gone up?" and they answer "long-term unemployment has gone down" then you can be fairly confident that short-term unemployment has indeed gone up!"

Here is some interesting reading for you about the law courts and conditional probability.
  • A formula for justice
    Bayes' theorem is a mathematical equation used in court cases to analyse statistical evidence. But a judge has ruled it can no longer be used. Will it result in more miscarriages of justice?
  • Court of Appeal bans Bayesian probability (and Sherlock Holmes)
    In a recent judgement the English Court of Appeal has not only rejected the Sherlock Holmes doctrine shown above, but also denied that probability can be used as an expression of uncertainty for events that have either happened or not.
  • Prosecutor's fallacy
    The prosecutor's fallacy is a fallacy of statistical reasoning, typically used by the prosecution to argue for the guilt of a defendant during a criminal trial. 

I told you that an alternative name for Simpson's paradox is the Yule-Simpson effect. Udny Yule (a Fellow of St John's College, and a lecturer in statistics at Cambridge) appears to be the first person to have commented on this phenomenon (in 1903). Coincidentally, David Aldous (mentioned above) was also an undergraduate and Fellow at St John's (1970-79).

Example 6.7 was artificial, so it would be easy to see what was happening (women predominated amongst independent school applicants and women were more likely to gain admission.) In the Wikipedia article Simpson's paradox you can read about more practical examples. One of the best-known real-life examples of Simpson's paradox occurred when the University of California, Berkeley was sued for bias against women who had applied for admission to graduate schools there.

I once extracted data from Cambridge University Tripos results to show that in each of certain set of carefully selected subjects (including Engineering, English, ...) women were more likely to obtain Firsts than men, but that men were more likely to obtain Firsts when results over all subjects were combined. This sounds paradoxical. However, it happened because some subjects award greater percentages of Firsts than do others. But also some subjects are relatively more popular with women. 

Tuesday 27 January 2015

limsup of a sequence of sets (or events)

When seen for the first time it can be difficult to immediately grasp the meaning of  $\bigcap_{i=1}^\infty \bigcup_{k=i}^\infty A_k$.

I found that when I was a student and encountering this for the first time. This is one reason I included #7 on Examples Sheet 1. My advice is to pick it apart slowly.

$A_k$ is the event that the $k$th toss is a head.

First: The event $\bigcup_{k=i}^\infty A_k$ is true (occurs) if and only if there is at least one head somewhere amongst the tosses $i,i+1,i+2,\dots$ .

Second: When we put $\bigcap_{i=1}^\infty$ in front we create an event that is true if and only if the above is true for all $i$. So how can the result be true? Well, if and only if there are an infinite number of heads. (There might also be an infinite number of tails.)

In general, $\bigcap_{i=1}^\infty \bigcup_{k=i}^\infty E_k$ is occurs (or is true) if and only if an infinite number of the events $E_1,E_2,\dotsc$ occur.

The idea is similar to the "$\limsup$" of a sequence $\{a_1,a_2,\dots\}$, which is defined as

$\lim_{i\to\infty} \sup_{k\geq i} a_k$.

How about $\bigcup_{i=1}^\infty \bigcap_{k=i}^\infty A_k$? That is the event that eventually we see only heads. Now perhaps you can see that

$\bigcup_{i=1}^\infty \bigcap_{k=i}^\infty E_k \subseteq\bigcap_{i=1}^\infty \bigcup_{k=i}^\infty E_k $.

Here is an article on it that might also help. 

Monday 26 January 2015

Lecture 5

You might compare the answer to Examples Sheet 1 #10 to a Bonferroni inequality working in the reverse direction, $\geq$.

We saw by examples today that it is possible to have three events $A_1,A_2,A_3$ such that every pair of events is independent, but the three events are not mutually independent. Here is another example. Suppose the sample sample consists of 4 equally likely outcomes $\{w_1,w_2,w_3,w_4\}$. Let

$A_1=\{w_1,w_2\}$,
$A_2=\{w_1,w_3\}$,
$A_3=\{w_1,w_4\}$.

Then $P(A_i\cap A_j)=P(A_i)P(A_j)=1/4$, but $1/4=P(A_1\cap A_2\cap A_3)\neq P(A_1)P(A_2)P(A_3)=1/8$.

We might generalize this to a puzzle. Can you find five events $A_1,A_2,A_3,A_4,A_5$ such that every subset of either two, three or four events is a set of mutually independent events, but the five events are not mutually independent?

Last year a student told me this nice answer. Roll a die 4 times. Let $A_i$ be the probability that the $i$th dice roll is odd, $i=1,2,3,4$ and let $A_5$ be the event the sum of the four rolls is even.

You can find another answer at the end of this page, but I think the above is nicer.

A famous example of data that is Poisson distributed is due to von Bortkiewicz (1898), who studied the numbers of Prussian cavalryman being killed by the kick of a horse in each of 20 years. This is an example with $n$ (number of cavalryman) large, and $p$ (chance a particular cavalryman being killed) small. You can read more about this famous example here.

A non-measureable set

In Lecture 4 we met the idea of a probability space, defined by the triple $(\Omega,\mathscr{F},P)$. It may not be possible for $\mathscr{F}$ to be all subsets of $\Omega$. Here is an example to illustrate the point.

Let $\Omega$ be the set of all points on the circumference of a unit circle. Define a equivalence relation on points in $\Omega$ such that $x\sim y$ if the angle between them is a rational fraction of $\pi$. This partitions $\Omega$ into equivalence classes (in fact an uncountable number of them). Now pick one point from each equivalence class and call the set of these points $A$. If we shift every point in $A$ around the circumference of the circle by the same rational angle $q$ we get a new set, say $A_q$. Clearly, $A$ and $A_q$ are disjoint, and it ought to be that $P(A_q)=P(A)$. Now taking the union over all rationals, $\Omega=\bigcup_q A_q$ is a union of a countable number of disjoint sets (since every point in $\Omega$ is some rational angle away from some point in $A$). But it is impossible to satisfy the probability axioms, which would require $P(\bigcup_q A_q)=1=\sum_q P(A_q)$, and also have $P(A)=P(A_q)$ for all $q$. We are forced to conclude that no probability measure $P$ can cope with a subset of $\Omega$ like $A$.

Mozart Cafe Problem

Here is one of my favorite problems, related to the calculation of the probability of derangement in Lecture 4.

Two friends travelling independently to Vienna wish to meet for a coffee on the afternoon they arrive. Neither has been to Vienna before, but they guess it must have a Mozart Cafe. So in an email exchange they agree to meet at the Mozart Cafe. Unfortunately, upon arrival they each discover that there are in fact $m >1$ Mozart Cafes in Vienna. Assuming they have no way to communicate, what should they do?

Suppose that each does one of the following. He either (a) chooses one cafe at random and stays there all afternoon, or (b) visits all $m$ cafes in random order spending equal time in each. On average, the best outcome is achieved when one does (a) and the other does (b) — on average they will meet half way through the afternoon. However, this cannot be guaranteed since they have no way to agree who should do (a) and who (b). The worst outcome is when both do (a) (unless they are lucky and choose the same cafe).

Suppose each friend decides to choose his strategy randomly, doing (a) with probability $p$ and (b) with probability $1-p$. If they fail to meet then they repeat the same strategy on subsequent afternoons, with new random choices, until they meet.

The calculation of the probability of derangement done in Example 4.4 helps with the evaluation of what happens when they both choose (b), an event that happens with probability $(1-p)^2$.

They wish to minimize the expected time until they meet. It turns out that for $m=3$, the optimal strategy is with $p=1/3$. As $m\to\infty$ the optimal $p\to \approx 0.24749$.

Here is a completely unsolved problem to which we would love to know the answer:

Two astronauts land at random spots on a planet (which is assumed to be a uniform sphere, without any known distinguishing marks or directions). They can each move at a constant speed. How should they move so as to be within 1 kilometre of one another in the least expected time?

Friday 23 January 2015

Lecture 4

The inclusion-exclusion formula is included in the list of Problem Solving Strategies.

I showed you in Example 4.5 a use of Erdos's probabilistic method in combinatorics. If you liked that, you might like to read about other examples, for instance in this link.

I made a small slip at the end of the lecture.
$$
1 - \frac{1}{2!}+\frac{1}{3!}+\cdots =1-e^{-1}.
$$ So the probability of derangement tends to $e^{-1}=0.368$.

In Example 4.4 we considered an infinite sequence of tosses of biased coins and saw that if $\sum_{i=1}^\infty p_i<\infty$  then $P(\text{infinite number of heads})=0.$

Now let's prove that $\sum_{i=1}^\infty p_i=\infty \implies P(\text{finite number of heads})=0.$ This is trickier.

We are assuming the coin tosses are independent (which is terminology from Lecture 5, but you can guess intuitively what this means). Let $A_k$ be the event that the $k$th toss is a head. Let $E_n=\bigcap_{k=n}^\infty A_k^c=[\text{no heads after the $(n-1)$th toss}]$. Then for $m > n$

$P(E_n)\leq P(\bigcap_{k=n}^m A_k^c)=\prod_{k=n}^m P(A_k^c) =\prod_{k=n}^m (1-p_k)\leq \prod_{k=n}^m e^{-p_k} =e^{-\sum_{k=n}^m p_k}$.

The right hand side $\to 0$ as $m\to\infty$.

Hence $P(E_n)=0$, and then using Property (v) of $P$ (continuity), we have that

$P(\text{finite number of heads})=P(\bigcup_{n=1}^\infty\bigcap_{k=n}^\infty A_k^c)=\lim_{n\to\infty}P(E_n)=0.$

These two results are examples of applications of the Borel-Cantelli lemmas, which appear in the Part II course Probability and Measure.

I did not talk through Example 4.8 (counting surjective functions), but it is another nice use of the inclusion-exclusion formula. See also Examples Sheet 1, #9.

Wednesday 21 January 2015

Big O notation

I wrote Stirling's formula as
$$
\log\left(\frac{n!e^n}{n^{n+1/2}}\right)=\log\sqrt{2\pi}+O(1/n)
$$ and thought later that maybe you do not know what $O(1/n)$ means (which is read as "Big O of $1/n$").

Jacqueline Kennedy Onassis was known to her friends as "the Big O". That is not what we mean here.

We say $f(n)=O(g(n))$ if there exists some $B$ such that $|f(n)| \leq B|g(n)|$ for all $n$ sufficiently large. We proved that $A<d_n<A+1/(12n)$, so
$$
0<\log\left(\frac{n!e^n}{n^{n+1/2}\sqrt{2\pi}}\right)<\frac{1}{12n}
$$ is consistent with this, with $B=1/12$.

There is also a notation for lower bounds. We say $f(n)=\Omega(g(n))$ if there exists some $B'$ such that $|f(n)| \geq B'|g(n)|$ for all $n$ sufficiently large. This is the same as saying $g(n)=O(f(n))$.

And $g(n)=\Theta(g(n))$ if both $f(n)=O(g(n))$ and $f(n)=\Omega(g(n))$ are true. Since
$$
\log\left(\frac{n!e^n}{n^{n+1/2}\sqrt{2\pi}}\right) > \frac{1}{12n+1}>\frac{1}{13n}
$$ we may actually claim that
$$
\log\left(\frac{n!e^n}{n^{n+1/2}\sqrt{2\pi}}\right)=\Theta(1/n).
$$ Because this tells how how fast is the convergence to $0$ it is stronger than just saying
$$
\lim_{n\to \infty}\log\left(\frac{n!e^n}{n^{n+1/2}\sqrt{2\pi}}\right)=0.
$$ Read more about "Big O", "Big omega" and "Big theta" notations. They are very commonly used in theoretical computer science.

Wallis's product

Here's a thought. Suppose we knew Wallis's product
$$ \frac{2}{1}\frac{2}{3}\frac{4}{3}\frac{4}{5}\frac{6}{5}\frac{6}{7}\cdots =\frac{\pi}{2}. $$ Define $I_{2n}=\tfrac{1}{2}\tfrac{3}{4}\cdots \frac{2n-1}{2n}\frac{\pi}{2}$ and $I_{2n+1}=\tfrac{2}{3}\tfrac{4}{5}\cdots \frac{2n}{2n+1}$.
If we truncate the left hand side of Wallis' product after $2n$ terms we make it less, and this implies $I_{2n+1}/I_{2n} < 1$. Now we can complete the proof of Stirling's formula without ever needing to consider $I_n=\int_0^{\pi/2}\sin^n\theta\,d\theta$.

Of course a standard way to prove Wallis's product is to work with $I_n=\int_0^{\pi/2}\sin^n\theta\,d\theta=(n-1)(I_{n-2}-I_n)$,

But there are other ways to prove it that require just basic algebra, Pythagoras' theorem and the formula for the area of a circle. See here.

John Wallis  (1616-1703) was a student of Emmanuel College and a Fellow of Queens' College. He is credited with introducing the symbol $\infty$. 

Lecture 3

On the basis of today's lecture you can attempt Examples Sheet 1, #3, #4, #16. You might also look at #14 and think about how using a multinomial coefficient might help sort this one out.

The proof of Stirling's formula is not examinable. So why do we include it? Well, it would be a pity to so often use in our other work a result whose proof we did not understand. I also think it is interesting for you to see how a careful analysis of an asymptotic like this can be made.

Section 3.3 on Improved Stirling's formula (not examinable), which shows that $n!/\left(\sqrt{2\pi}n^{n+1/2}e^{-n}\right)$ lies between $e^{1/(12n+1)}$ and $e^{1/(12n)}$. This is from a paper, A remark on Stirling's formula, by Herbert Robbins.

Tim Gowers once lectured this course and wrote a blog post called Removing the magic from Stirling’s formula. He was looking for a more natural proof. He comments: It turned out to be hard to find a proof that didn’t look magic: the arguments had a “consider the following formula, do such-and-such, observe that this is less than that, bingo it drops out” sort of flavour.

I tried to explain in this lecture how we can understand the motivation behind most of the steps in this proof so that it feels less like pulling-a-rabbit-out-of-a-hat.

Why do we need to make the definition $I_n=\int_0^{\pi/2} \sin^n\theta\, d\theta$? Why not directly define $I_{2n}=\tfrac{1}{2}\tfrac{3}{4}\cdots \frac{2n-1}{2n}\frac{\pi}{2}$ and $I_{2n+1}=\tfrac{2}{3}\tfrac{4}{5}\cdots \frac{2n}{2n+1}$?

The answer is that we need to show that $I_n$ is decreasing in $n$, and this follows from the fact that $\sin^n\theta$ is decreasing in $n$ since $0<\sin\theta<1$.

Our calculation today of $I_{2n}/I_{2n+1}\to 1$ is related to a proof of Wallis's product. This is $$ \frac{2}{1}\frac{2}{3}\frac{4}{3}\frac{4}{5}\frac{6}{5}\frac{6}{7}\cdots =\frac{\pi}{2}. $$ Perhaps the trickiest thing in the proof was the evaluation of the constant $\sqrt{2\pi}$. The following is a method of doing it in a less magical way. Suppose $n!\sim B n^{n+1/2}e^{-n}$. Then $p=2^{-2n}\binom{2n}{n}=(2n)!/(2^{n}n!)^2\sim \sqrt{2/n}/B$ is the probability that in $2n$ tosses of a fair coin there are $n$ heads and $n$ tails. If we knew the central limit theorem (Lecture 23), we could use it to estimate $p=1/\sqrt{n\pi}$, and from that find $B=\sqrt{2\pi}$.

Monday 19 January 2015

Lecture 2

At the end of the lecture someone asked me about keys sampled without replacement and how it can be easily seen that the probability that you succeed on the $r$th attempt is $1/n$. This can be found in several ways. Either you think  about the number of ways the first $r$ keys can be seen and get
$$
\frac{(n-1)(n-2)\cdots(n-r+1)1}{n(n-1)\cdots(n-r+1)} =\frac{1}{n}.
$$
Or you think about the key you want being at the $r$th position and the other $(n-1)$ keys being in any of $(n-1)!$ possible orders, giving
$$
\frac{(n-1)!}{n!} =\frac{1}{n}.
$$
You might also think about a fixed ordering of taking the keys from you pocket, like 1,3,4,2,5. Perhaps 3 is the key you want. Notice that all cyclic permutations must be equally likely. So key 3 is equally likely to be seen as the $1,2,\dotsc,5$th key drawn from your pocket. This example serves to show that can be many ways to calculate the required probability. In the above three, we have used a different sample space $\Omega$ in each case.

There are some good Wikipedia articles on topics that we are covering. The one on the Birthday Problem is nice. You will find in that article more than ever thought it would be possible to say about such a simple problem. The article explains that the problem is not just of recreational interest — it is relevant to several hashing algorithms in computer science.

Here is a quiz to test your understanding. Complete the following sentence:

There are several ways to count the number of distinct injective functions that map elements of a set $N=\{1,\dotsc,n\}$ to elements of a set $X=\{1,\dotsc,x\}$. This is because ...

You might like to look at the Wikipedia article about the Twelvefold Way classification system for counting functions between two finite sets. I talked about a subset of four of these ways today. The paper of Robert Proctor, Let's expand Rota's twelvefold way for counting partitions! takes this systematic approach even further in an attempt to organize what he calls, "a bewildering zoo of  enumeration problems". The other counts involve some more complicated notions, such as the Bell numbers and Stirling numbers.

Please regard this as a bit of fun. If as in (ii) we say two functions $f,g:N\to X$ are the same if there exists a permutation $\pi$ for which $g(i)=f(\pi(i))$ we are defining an equivalence relation, and so this bit of work has really been about counting the number of equivalence classes for certain equivalence relations defined on the set of functions that map $N$ to $X$. However, in practice, it is often useful to work from scratch, relying on what I called the Fundamental Rule of Counting.

Try Examples sheet 1, #14. This apparently innocuous question may do your head in — or maybe you'll find it easy. There is really no systematic way to a answer question like this. Try it different ways (perhaps colouring the balls or not.) But it can help to figure out the answer for small $n$ before attempting to find the general formula.

Friday 16 January 2015

Note to supervisors

I like to know who is supervising this course. I can then put you on an email circulation list, where we can discuss how students are getting on with the course and examples sheets.

It is unlikely supervisors will read this blog, so perhaps helpful students might point it out to their supervisor.

Lecture 1

I spoke about the Problem of Points, which was puzzling people as long as 500 years ago. On pages 32-34 of Grimstead and Snell you can read translations of the letters exchanged by Fermat and Pascal about this problem in 1654. G&S also write:

The problem had been a standard problem in mathematical texts; it appeared in Fra Luca Paccioli’s book summa de Arithmetica, Geometria, Proportioni et Proportionalita, printed in Venice in 1494 in the form:
    "A team plays ball such that a total of 60 points are required to win the game, and each inning counts 10 points. The stakes are 10 ducats. By some incident they cannot finish the game and one side has 50 points and the other 20. One wants to know what share of the prize money belongs to each side. In this case I have found that opinions differ from one to another but all seem to me insufficient in their arguments, but I shall state the truth and give the correct way."
In fact his "correct way" turns out to be wrong!

I mentioned classical, frequentist, and subjective approaches to probability. There are others: such as the propensity approach. Indeed, the question, "What is Probability?" can take us into the realm of philosophy. If I roll a die and do not show you the answer, what is the probability it is a six? Some people might say it is 1/6. Others might say it is 1 or 0, because it either is a six or not. (If it appeals to you to delve into philosophy, then have a look at this article on Probability interpretations.) Fortunately, in our course we work within the context of some mathematical axioms and therefore have a well-defined approach that does not depend on interpretation.

At the start of each chapter of my notes the non-examinable topics are placed between *s. The *arcsine law* that I mentioned today is the second of the three laws described here. The other arcsine laws concern the proportion of the time that the random walk is positive, and the time at which the random walk achieves its maximum. My purpose in including this in our first lecture was to show you something that is surprising and which is proved in a clever way (by the reflection principle and matching paths in a 1-1 manner).

There is a space after each post for you to write comments or ask questions using the IntenseDebate commenting system. Please feel free to do so. Other students will probably benefit.

Wednesday 14 January 2015

Course starts Friday, 16 January

This is the first post on the blog page for my IA Probability course, starting 16 January 2015. I am leaving in place the blog entries from last year's course. However, by writing new entries as we go, I will provide material that is relevant for 2015.

What is this blog for? Sometimes upon leaving a lecture I think, "I wish I had said ...". This blog gives me a place to say it. Or I can use this space to tell you about some extra interesting things, talk about a question that a student has asked during a lecture. Or I might comment on an examples sheet question. I use this blog as a way of continuing a conversation with you. I am happy to take questions and comments, either in email or in the comment boxes below each blog.

Mathematics in this blog is written in LaTeX and rendered by MathJax. You need JavaScript on your viewing device. These pages should look great on a PC, Mac or a tablet like an iPad or Nexus. If it is working properly for you then \$(a+b)^2=a^2+2ab+b^2\$ should be rendered as: $(a+b)^2=a^2+2ab+b^2$.

If you would like to learn more about how I installed MathJax in these pages, see here.

At the time this course began, the statcounter for page views, shown below, stood at just over 26,000.