Wednesday, 14 May 2014

Continuity theorem

The lecture schedules mention "Moment generating functions and statement (no proof) of continuity theorem.''

The continuity theorem (which I forgot to name) is in fact described in Remark 1 on page 91 of the notes. It is the result that if a sequence of random variables $X_1,X_2,\dotsc$ have moment generating functions $m_i(\theta)$, $i=1,2,\dotsc$, and $m_i(\theta)\to m(\theta)$ as $i\to\infty$, pointwise for every $\theta$, then $X_i$ tends in distribution to the random variable having m.g.f. $m(\theta)$.

We use this when we give a "sketch of proof" of the Central limit theorem on page 91.

Lévy's continuity theorem is the same thing, but for characteristic functions.

Wednesday, 12 March 2014

Final lecture notes

The notes have been a work in progress, and many small corrections have been made along our way. You should make sure now to obtain an up-to-date copy.

I appreciate you patience. Students in another year will have perfect notes from the start. But you have had to cope with changing notes, and temporary typos. Perhaps it has been a compensation that you have had the fun of seeing a course being designed, modified along the way, and delivered freshly the first time.

I think that as of now, March 12, 12.30pm, very few typos remain. But if, during your reading this vacation, you think you find any further typos please let me know! If you find some part hard to understand because is it too cryptic, also let me know. Perhaps I can make a small improvement. Obviously, I could include more details, and more worked examples, if I allowed myself to write more than 4 pages per lecture, but I think that page limit is a good one. I am trying to cover everything, without being boringly verbose. For more lengthy expositions you can go to textbooks.

I have appreciated the comments that some people have made, either in the blog, or privately to me in email. If any of you come across something in your reading that you think would make a nice addition to this course, then do please send me an email about it. For instance, someone has added a comment to Lecture 23 about something recently on the BBC web site about the Zipf distribution.

Non-examinable bits are put in * * just after the title of each lecture. But in most cases the non-examinable bits have some bearing on the examinable topics. So although you would not want to learn the * * bits for regurgitation in exams, I have been hoping that they will help you in developing a better over-all appreciation of some of the examinable topics.

Lecture 24

The overhead slides I used for our last lecture are here (best downloaded and viewed with a pdf reader, rather than as a web page).

I showed you a bit of Large deviation theory and Chernoff's upper bound. The lower bound  was quoted without proof (except that in Example 24.1 we did the lower bound for the special case of $B(1,p)$). If perhaps some time mid-summer you have a moment and are curious to see the lower bound proved (which you have learned enough to understand), and more about large deviations, see these slides from a talk I once gave, pages 5–9. This note on the Cramer-Chernoff theorem is also good. Interestingly, Chernoff says in an interview with John Bather that the Chernoff bound should really be named after someone else!

I have used large deviation theory in my research to analyse buffer overflows in queues. But I know almost nothing about the subject of Random matrices. I prepared the content for Section 24.2 because I was intrigued by the results, wanted to learn, and thought it would be something entertaining with which to conclude. I did some of my reading in Chapter 2 of An Introduction to Random Matrices, by Anderson, Guiomet and Zeotouni, and then simplified their treatment to make it suitable for IA.

I thought it was nice that in showing you these two advanced topics, I could bring into play so many of the ideas we have had in our course: Markov and Chebyshev inequalities, moment generating function, sums of Bernoulli r.vs, Stirling’s formula, normal distribution, gambler’s ruin, Dyke words, generating functions, and the Central limit theorem.

Feedback. I am pleased that over 80% have said that the course has been interesting (4) or very interesting (5). So I hope you will consider the applicable courses in your next year of study. In Appendix H I have written some things about applicable mathematics courses in IB.

Monday, 10 March 2014

Lecture 23

On the way home today, I realised that I had forgotten to tell you why the Central limit theorem has that name. See below. First, here is a little more history about this important theorem.

Henk Tijms writes in his book, Understanding Probability: Chance Rules in Everyday Life, Cambridge: Cambridge University Press, 2004,

"The Central Limit Theorem for Bernoulli trials was first proved by Abraham de Moivre and appeared in his book, The Doctrine of Chances, first published in 1718. He used the normal distribution to approximate the distribution of the number of heads resulting from many tosses of a fair coin. This finding was far ahead of its time, and was nearly forgotten until the famous French mathematician Pierre-Simon Laplace rescued it from obscurity in his monumental work Théorie Analytique des Probabilités, which was published in 1812.

De Moivre spend his years from age 18 to 21 in prison in France because of his Protestant background. When he was released he left France for England, where he worked as a tutor to the sons of noblemen. Newton had presented a copy of his Principia Mathematica to the Earl of Devonshire. The story goes that, while de Moivre was tutoring at the Earl's house, he came upon Newton's work and found that it was beyond him. It is said that he then bought a copy of his own and tore it into separate pages, learning it page by page as he walked around London to his tutoring jobs. De Moivre frequented the coffeehouses in London, where he started his probability work by calculating odds for gamblers. He also met Newton at such a coffeehouse and they became fast friends. De Moivre dedicated his book to Newton."

I was looking at the Wikipedia article on the Central limit theorem. It mentions two things that I did not already know, but which would be suitable for the television programme QI.

1. There is a quite interesting explanation of why the term "Central" is used.

The actual term "central limit theorem" (in German: "zentraler Grenzwertsatz") was first used by George Pólya in 1920 in the title of a paper. Pólya referred to the theorem as "central" due to its importance in probability theory. According to Le Cam, the French school of probability interprets the word central in the sense that "it describes the behaviour of the centre of the distribution as opposed to its tails".

Personally, I had always thought it was the second of these, but the first is also plausible.

2. There is a quite interesting Cambridge connection.

A proof of a result similar to the 1922 Lindeberg CLT was the subject of Alan Turing's 1934 Fellowship Dissertation for King's College at the University of Cambridge. Only after submitting the work did Turing learn it had already been proved. Consequently, Turing's dissertation was never published.

Friday, 7 March 2014

Lecture 22

You might like to experiment with the bivariate normal. Here is the Mathematica code that I used to plot the joint density function. You can copy and paste this code into Mathematica and they play with viewing from different angles or changing the value of the correlation, r.

S = ParallelTable[
  Plot3D[PDF[
    MultinormalDistribution[{1, 1}, {{1, r }, {r , 1}}], {x, y}],
    {x, -3, 4}, {y, -3, 4}, PlotRange -> All, 
   MeshFunctions -> {#3 &}, PlotPoints -> 50], {r, {0, 0.6}}]

Inversion of probability and moment generating functions

For discrete random variables working with the p.g.f. is easy. We have

Theorem 12.2The distribution of $X$ is uniquely determined by the p.g.f. $p(z)$. 

The proof is trivial because the distribution of $X$ is contained in coefficients of the power series $p(z)=p_0+p_1z+p_2z^2+\cdots$. It is also easy to tell if any given power series is a p.g.f. All we need is check is that the coefficients are nonnegative and sum to 1.

However, things are not so easy for moment generating functions. Given a power series $m(\theta)=1+m_1\theta  + m_2\frac{1}{2}\theta^2 +\cdots$
it is not easy to tell if there exists a r.v. that has moments $m_1,m_2,\dotsc$.
Also we had without proof the theorem analogous to Theorem 12.2,

Theorem 21.4 The moment generating function determines the distribution of $X$, provided $m(\theta)$ is finite for all $\theta$ in some interval containing the origin.

To prove Theorem 21.4 we would need to show how to recover the p.d.f. from the m.g.f.
In fact, it is easier to use the characteristic function. Let's take a small look at that.

Today we saw that the moment generating function of $N(0,1)$ is $m(\theta)=e^{\theta^2/2}$. This looks a lot like the p.d.f., which is $f(x)=e^{-x^2/2}/\sqrt{2\pi}$.

The characteristic function (c.f.) $\chi$ is obtained by replacing $\theta$ with $it$, $t$ real,

$\chi(t) = \int_{-\infty}^\infty f(x)e^{itx} dx$. This is $\chi(t)=e^{-t^2/2}$ for $N(0,1)$.

We can compute nearly the same integral of $\chi$ to recover $f$:

$\int_{-\infty}^\infty \chi(t)e^{-itx} dt=2\pi e^{-x^2/2}=2\pi f(x)$.

In general, given the c.f. we can recover the p.d.f. using the inversion formula

$f(x)=\frac{1}{2\pi}\int_{-\infty}^\infty \chi(t)e^{-itx} dt$.

The m.g.f. exists only if all moments are finite. However, the c.f. exists even when moments do not exist, since $e^{itX}$ is just some point on a unit circle in the complex plane, and hence nothing is going off to infinity. The Cauchy distribution is an example of a distribution that has no m.g.f, but does have a c.f., namely $e^{-|t|}$. 

Examples sheet 3

Most of you will have now done Examples Sheet 3. Here are a few notes.

 #3. I have amended this question to read "Prove that for $t > 0$, ...." Why?

 #4. You'll want to compare this with Examples sheet 4 #5.

 #5. What would be the analogous question to this one, posed for continuous r.vs?

 #8. You should know how to answer this two ways: by using p.g.fs, and also directly via that tower property of conditional expectation.

 #16. How do we know that the answer is the smallest positive root of $a=pa^3+1-p$? We have such a theorem for the extinction probability in a branching processes. How can this coin tossing problem be viewed through the lens of some branching process?

 #18. The hint is for the third part. But it also useful in answering the second part: "Could you load the die so that the totals {2,3,4,5,6,7} are obtained with equal probabilities?"

Wednesday, 5 March 2014

Lecture 21

For Lectures 21–23 I plan to focus only on examinable topics. Lecture 24 should have space for some digressions, my favourite mathematical joke, and some wrapping up.

I mentioned the beta distribution today. There is more about that distribution in Appendix D. I became much better acquainted with this distribution myself when I was doing some work for a reinsurance company last year.

Today I handed out the feedback forms. If you were not at the lecture and would like to give feedback you can do it on-line as described here: Feedback request.

Last few lectures

I have just done some reorganization of the remaining lectures, so that we will finish all examinable material by the end of Lecture 23.

In Lecture 24 I will be free to chat about some other things and wrap up.

Tuesday, 4 March 2014

Feedback request

In tomorrow's lecture, Wednesday March 5, I will distribute the usual feedback forms. Alternatively, in the comfort of your room you can complete this equivalent on-line form, or print the paper form here and hand it to me. If you prefer, you may wait until closer to the last lecture to complete the on-line form or hand me a paper form.

The on-line version will be sent to my email anonymously (unless you choose to sign your name). After reading the responses, I will forward printouts to the Faculty Office.

New to the course this year were
  • printed notes,
  • blog,
  • examples sheet "puzzles" section,
  • follow-up blogs about the examples sheets,
  • problem solving strategies page,
  • non-examinable "fun" topics (arsine law, Zipf's law, Benford's distribution, information entropy, actuarial mathematics in insurance, stochastic bin-packing, Kelly criterion, random matrices, and so on).
It will be interesting for both me, and perhaps other lecturers, to hear what you thought about those things - and whatever else you wish to say.

I could have skipped all the non-examinable stuff, and used the extra time to write-out more careful lecture notes. Instead, I have chosen to put (eventually ...) perfect lecture notes on-line and use part of each hour to chat off-schedule about some unconventional things which would not usually be included in a first (or even second) course in Probability, but which I think make interesting connections with our core syllabus.

Monday, 3 March 2014

Lecture 20

Today we saw how to transform random variables. By this means we can discover further interesting relationships between distributions.

I mentioned in this lecture that if $X,Y$ are independent $N(0,1)$ then $W=X^2+Y^2$ and $T=\tan^{-1}(Y/X)$ are independent $\mathscr{E}(1/2)$ and $U[-\pi/2,\pi/2]$. We can use this result in the reverse direction to generate normal random variables. Make two independent samples from $U[0,1]$, say $U$ and $V$. Let $R^2=-2\log U$ (which is $\mathscr{E}(1/2)$) and let $ \Theta=(V-\frac{1}{2})\pi$. Then $X=R\cos\Theta$ and $Y=R\sin\Theta$ are independent $N(0,1)$.

There is a particularly strange fact in Examples sheet 4, #19.

By the way, if (like me) you have ever been puzzled why it is that the absolute value of the Jacobian, $|J|$, gives the volume change of a $n$-dimensional parallelepiped under an linear transformation in $\mathbb{R}^n$, you might enjoy reading this essay A short thing about determinants, or “an attempt to explain volume forms, determinants and adjugates by staying in 3-d”, by Gareth Taylor.

I have also now added to the notes an Appendix C, in which I explain how under a linear transformation a sphere is mapped into an ellipsoid of a volume that greater by a factor $|J|$. I think that in previous years lecturers have said at this point "you already know this from Vector Calculus". But I am not sure that's that case, and so decided it would be nice to try to remove some of the mystery.

The convolution we need to add two Cauchy random variables is tricky. On page 81, the partial fractions method of computing this integral is to write

$\frac{1}{(1+x^2)(1+(z-x)^2)}=\frac{A+Bx}{(1+x^2)}+\frac{C+D(x-z)}{(1+(z-x)^2)}$

for $(A,B,C,D)=\frac{1}{4+z^2}(1,2/z,1,2/z)$. Upon integrating w.r.t. $x$ the terms with $B$ and $D$ give $0$ (by symmetry either side of $0$), and the terms with $A$ and $C$ provide the claimed result of $\frac{2}{\pi(4+z^2)}$.

Friday, 28 February 2014

Lecture 19

The fictional example in Section 19.2 about the heights of UK and Netherlands basketball players showed something striking. Can you think of any real life instances of this sort of effect?

The graphs on page 7 of this report on climate change are interesting in respect of showing how a small change in the mean can give rise to an amplified impact in the tail of a normal distribution. See Full SREX report: Managing the Risks of Extreme Events and Disasters to Advance Climate Change Adaptation, Special Report of the Intergovernmental Panel on Climate Change.

In this graph we see "hot weather" increasing by a factor of 2. 
But "extreme hot weather" increases even more, by a factor of 3

Wednesday, 26 February 2014

Last arrivals problem

The last arrivals problem that I discussed in 18.5 is similar to the famous Secretary Problem that is addressed in the Part II course Optimization and Control. In that problem the task is sequentially to interview $n$ secretaries for a job, stop & hire one who has the quality of better than any previously interviewed, and then find that choice remains better than any others that could be subsequently interviewed. This is one example of an interesting problem at the interface of probability and optimization.

The problem in 18.5 is slightly different because the number of gold coins is completely unknown. This problem has been studied by F. Thomas Bruss and Marc Yor in Stochastic Processes with Proportional Increments and The Last-Arrival Problem, Stochastic processes and their Applications, 122:3239–3261, 2012.

A game version of the problem is addressed by Johan Wästlund in the paper When only the last one will do. He discusses the following discrete game.

There is a deck of $d$ cards, and at the start of the game a devil labels the face side of some of them. There must be at least one labelled card, and at most $N$. The deck is then shuffled and the cards are turned up one by one. The selector (John), knowing $d$ and $N$, must select online the last labelled card.

The devil’s strategy is simply a probability distribution on $\{1,\dotsc, N\}$. Wästlund shows that, as $N\to\infty$, John's success probability will be about 0.352917000207196 (which is $<1/e=0.367879$).

It might be fun to find the answer for $d=52$ and $N=2$. The devil marks either 1 or 2 cards, with probabilities $p$ and $1-p$, respectively, and John will stop at the first marked card if it occurs $k+1$ or deeper in the deck of cards, or if it is the $k$th card he will stop w.p. $\theta$. Otherwise he continues, hoping to find a second marked card. What are optimal $p$, $k$ and $\theta$?

Answer:  I couldn't resist figuring this out after lunch today. Both players in this game use mixed strategies. The devil will choose $p=52/103$ and John will stop if he finds the first marked card as the $21$st or deeper; if it is $20$th card then he will stop with probability $\theta=28/115$. His success probability is $927/1495\approx 0.62$.

A student suggested at the end of the lecture that we might consider a game in which our task is to stop at the point that only finitely many gold coins remain to be found. I'm not sure how we set up the problem so as to initially distribute an infinite number of gold coins in the early part of the road. But it would be certainly be interesting to ask how we might stop so as to maximize the probability that we have found all but exactly $k$ coins, generalizing the original problem, which was the case $k=0$.

Planet Zog problems

Here are two challenging tripos questions from recent years. The first is a problem in geometric probability (as in Lecture 18). To do this question it will help to know that the surface area of a spherical cap is $2\pi R h$, where $R$ is the radius of the sphere and $h$ is the height of the cap. The surface area of the entire sphere is $4\pi R^2$.

2003/II/12
Planet Zog is a ball with centre $O$. Three spaceships $A$, $B$ and $C$ land at random on its surface, their positions being independent and each uniformly distributed on its surface. Calculate the probability density function of the angle $AOB$ formed by the lines $OA$ and $OB$.

Spaceships $A$ and $B$ can communicate directly by radio if  $\angle AOB  < \pi /2$, and similarly for spaceships $B$ and $C$ and spaceships $A$ and $C$. Given angle $\angle AOB =\gamma < \pi /2$, calculate the probability that $C$ can communicate directly with either $A$ or $B$. Given $\angle AOB =\gamma > \pi /2$, calculate the probability that $C$ can communicate directly with both $A$ and $B$. Hence, or otherwise, show that the probability that all three spaceships can keep in in touch (with, for example, $A$ communicating with $B$ via $C$ if necessary) is $( \pi +2)/(4\pi )$.

Spoiler alert! Here is the answer.

The second question is rather different. Here you are being tested on the idea that a sample space can be partitioned into disjoint events.

2004/II/12
Let $A_1,A_2,\dotsc,A_r$ be events such that $A_i\cap A_j = \emptyset$; for $i \neq j$. Show that the number $N$ of events that occur satisfies

$ P(N = 0) = 1 −\sum_{i=1}^rP(A_i) . $

Planet Zog is a sphere with centre $O$. A number $N$ of spaceships land at random on its surface, their positions being independent, each uniformly distributed over the surface. A spaceship at $A$ is in direct radio contact with another point $B$ on the surface if $\angle AOB < \pi/2$. Calculate the probability that every point on the surface of the planet is in direct radio contact with at least one of the $N$ spaceships.

[Hint: The intersection of the surface of a sphere with a plane through the centre of the sphere is called a great circle. You may find it helpful to use the fact that $N$ random great circles partition the surface of a sphere into $N(N − 1) + 2$ disjoint regions with probability one.]

Lecture 18

The Wikipedia article on the Bertrand paradox has more interesting information and pictures. The is also a Wolfram demonstration project called the Random chord paradox that you can play with.

Examples sheet 4, #2 is this famous question: "A stick is broken in two places, independently uniformly distributed along its length. What is the probability that the three pieces will make a triangle?" For a more a more complicated geometric probability problem see the blog posting on Planet Zog problems.

Tuesday, 25 February 2014

Table of distributions

Here is a table of 5 discrete and 6 continuous distributions. It is supplied to second year students studying Statistics IB, as a reminder of things they are to know from Probability IA.

For each of distribution the table shows its usual notation, p.m.f. or p.d.f., mean, variance, and p.g.f. or m.g.f.

You will want to have at your fingertips most of the information in this table (or the method to by which to produce it). You might find it helpful to make your own table, using this one as a guide.

Some of these 11 distributions are not explicitly mentioned in the schedules for Probability IA. Rather they appear as examples of other things which are mentioned in the schedules. For example, the gamma distribution arises as an example of summing i.i.d. r.vs (in this case exponential r.vs). The multivariate normal provides an example of the p.d.f. of jointly distributed continuous r.vs. The beta distribution appears as an exercise on the joint distribution of order statistics of i.i.d. uniform r.vs. We have not mentioned in a lecture the negative binomial, but this is simply the sum of i.i.d. geometric r.vs. (as done in Examples sheet 3, #5).

Monday, 24 February 2014

Lecture 17

An important special case of Theorem 17.1 is when $Y=aX$, for some constant $a$.
The p.d.f. of $Y$ is then $f_Y(y)=(1/a)f_X(y/a)$.

There are many ways to define transitive ordering relations between random variables. Now you know about $EX\geq EY$ and  $EX\geq_{\text{st}} EY$. Such relationships are important in applications, such as when $X$ and $Y$ are lifetimes of components and we are interested ordering their reliabilities. (Reliability engineering is concerned with determining the ability of a system or component to function under some stated conditions for a specified period of time.)

There are other ordering relations, such as "hazard rate ordering", and "convex ordering". These are described within the Wikipedia article on stochastic ordering. An important order relation in statistics is "likelihood ratio ordering". If $X$ and $Y$ have p.d.fs $f$ and $g$ such that $f(t)/g(t)$ increases with $t$ then $X\geq_{\text{lr}}Y$. It is a stronger ordering than the two we have met, since $X\geq_{\text{lr}}Y\implies X\geq_{\text{st}}Y$.

The treatment of the inspection paradox is this lecture derived from a paper by Sheldon Ross, The inspection paradox, Probability in the Engineering and Informational Sciences, 17, 2003, 47–51. Of course the use of the word "paradox" is somewhat in appropriate, since it is not hard to understand why $X_J\geq_{\text{st}}X_1$.

There are other such inspection paradoxes. Suppose $X_1,X_2,\dotsc$ are i.i.d. $U[0,1]$ and $N$ is the least integer such that $X_1+\cdots+X_N\geq 1$. Then conditional on this,

$X_N\geq_{\text{st}}X_{N-1}\geq_{\text{st}}\cdots \geq_{\text{st}}X_1$.

I decided not to discuss it in the lecture, but at the end of page 69 of today's notes is the puzzle, "Do girls have more brothers than boys do?" The "proof" that the answer is "yes" is false, though initially it looks almost convincing. What is wrong?

What do you think is the truth of the matter? Let $G$ and $B$ be the mean number of brothers possessed by a typical girl and typical boy, respectively.

If every family were to consists of some number of boys and just one girl then $G > B$.

If every family were to consists of only boys or only girls then $G < B$.

What are the minimal conditions under which $G=B$? Must you assume that each child is independently a boy or girl with probability $1/2$?

Friday, 21 February 2014

Probability distributions

The chart I showed you, with the distributions we are studying marked in yellow, can be downloaded here. It is from the paper Univariate Distribution Relationships, L. M. Leemis and J. T. McQueston. In that paper you can find explanations of the various relationships that are indicated by the arrows.

Here also is a simpler chart about Relationships between probability distributions. This one shows some relations that the above chart misses out, such as how the binomial can come from the Poisson as $X \mid X+Y$ (see Examples sheet 2, #5).

Wikipedia has a nice List of probability distributions.

In 16.5 of the notes I mention that if $X$ and $Y$ are i.i.d. exponential r.vs then $X/(X+Y)$ is a uniformly distributed r.v. You will prove this in Examples sheet 4, #8. A general method will be given in Lecture 20. But it can also be done this way:

$P(X/(X+Y) \leq t)=P(X(1-t)/t\leq Y)=\int_{x=0}^\infty\left(\int_{y=x(1-t)/t}^{\infty}\lambda e^{-\lambda  y}dy\right)\lambda e^{-\lambda x}dx = t$

which is the c.d.f. of $U[0,1]$.

Lecture 16

The human life plots of force of mortality that I showed in today's lecture can be viewed on the Understanding Uncertainly page: How long are you going to live?

David Spiegelhalter (Winton Professor for the Public Understanding of Risk) has now added a comment for us in the comments section at the end of this post. He makes reference to the Gompertz distribution, which is one with a linearly increasing hazard rate.

Notice that the hazard rate curve (force of mortality) has a 'bath-tub' shape - there is an early high-risk period immediately following birth, followed by long low period of annual risk, until accidents start kicking in for 17-year-old-boys, and then steadily increases. At the above link there are some tools for interactively exploring how your life expectancy changes as a function of your present age, gender and lifestyle (non-smoker, moderate alcohol, 5 a day, exercise).

Most things (such as car engines, tyres, light bulbs) have an increasing hazard rate. But some things have a decreasing hazard rate. A business that has lasted two centuries is less likely to go bankrupt next year than one that only started two years ago.

Notice that if $X\sim U[0,1]$ then $h(x)=f(x)/(1-F(x))=1/(1-x)$, which tends to $\infty$ as $x\to 1$.

Some fine fine print. I said that a continuous random variable is one that has a continuous c.d.f., and it is this that puts the "continuous" in its name. However, the fine print (which you can find in my notes near the middle of page 63) is that we actually need the c.d.f. to absolutely continuous, and this will ensure that a p.d.f. exists. The qualifier "absolutely" rules out weirdly pathological c.d.fs. For example, the Cantor distribution has a continuous c.d.f., but it has no p.d.f. Of course none of this fine print is examinable in Part IA. But I know that some of you find this type of thing interesting.

Wednesday, 19 February 2014

Examples sheet 2

Most of you will have now done Examples Sheet 2. Let me make a few comments about some things I hope you learn from that sheet. Some of what I say below is "in brief', and you may need to fill in details or ask your supervisor.

 #3,4. There is often more than one nice way to solve a problem. I think that both of these problems have at least two nice methods of solution. Have you been able to think of more than one way to solve them? For example, #4 can be solved by summing a geometric series. It can also be solved via a simple order 0 recurrence relation.

 #6 (ii). You should have discovered that $Y$ and $X$ are independent. So $X=10$ provides no information about $Y$. What is the intuition for this? Well, we might imagine that $X$ and $Y$ are initially numbers of "detectable" and "non-detectable" misprints, created as independent samples from the Poission distribution with parameter $\lambda/2$. The page has $X+Y$ misprints (which is distributed Poisson, parameter $\lambda$). The proofreader finds the detectable ones. Clearly, in doing this she discovers nothing about $Y$. This model gives the same distribution of $X$ and $Y$ as one in which there are $X+Y$ misprints and each is detected with probability $1/2$.

 #7,8. The results in these questions are often proved by students in something like 2 pages of working. But they can also be set out in perhaps only 2-4 lines of working. Make sure you learn (from your supervisor) the fast way to derive these results.

 #10. There are several ways to show that $Y_1,\dotsc,Y_n$ are independent. One way is via the fact that $P(Y_1=1,Y_2=1,\dotsc,Y_n=1)=1/n!=P(Y_1=1)\cdots P(Y_n=1)$. Then apply the result of question #2.

 #12. The coupon collector's problem. This question is one that should lodge permanently in your brain. There are many problems in probability, design of computer science algorithms, and elsewhere, which at some point involve this problem. I think it is interesting that the expected time to collect the $n$ figures is about $n\log n$. However, the expected time needed to collect $99\%$ of the figures is only about $n\log 100$. So it is the last fraction of figures that are costly to collect.

 #14. Did you notice that this method would work for simulating any probability $\alpha\in (0,1)$? For example, if we want $P(A)=1/5$, we write this in binary as $\alpha=0.0011001100...$ and then toss a coin. If we see TTHHH we would say $A$ has not occurred, because translating this to $.00111$, it differs from the binary form of $\alpha$ by representing a number $>\alpha$.

 #16. This questions has been set to IA students for many years. In my experience few students are able to solve it. But the solution is so beautiful that it would be a pity if your supervisor were not given the opportunity to explain it to you.

 #17,18. When I say, "This section is for enthusiasts", I am hoping that most students will think "This means me". After all, if you are reading Mathematics at Cambridge you are by definition a mathematics enthusiast. Even if you cannot solve these questions fully you should at least make some attempt. I would expect almost all students to be able to do #17 for the case $k=2$, $n=3$. It is a good general principle to tackle a difficult problem by first trying to solve a simple special case.

Generating functions for random walk

Here is Mathematica code for the generating functions for the step at which the random walk first hits a barrier.
$U_z(s)=\sum_{n=0}^\infty P(\text{hit $0$ at $n$}\mid\text{start at $z$})s^n$,
$V_z(s)=\sum_{n=0}^\infty P(\text{hit $a$ at $n$}\mid\text{start at $z$})s^n$.

As one would expect, $U_z(1)=q_z$, $V_z(1)=p_z$ and $U_z'(1)+V_z'(1)=D_z$.

I used these to compute the table on page 60 of the notes. You could use these to find the variance of the duration of the game. By the way, notice in this table that in lines 2-4 the duration of the game is 10 times the expected loss. This is because the expected loss per game is $EX_1=-0.1$. So we are seeing $E[G]=E[X_1+\cdots+X_N]=EN\cdot EX_1=D_z\cdot EX_1$, where the r.v. $N$ is the length of the game.
L1[s_] = (1 + Sqrt[1 - 4 p q s^2])/(2 p s);
L2[s_] = (1 - Sqrt[1 - 4 p q s^2])/(2 p s);

U[s_] = (q/p)^z (L1[s]^(a - z) - L2[s]^(a - z))/(L1[s]^a - L2[s]^a) 
U[s] /. {a -> 10, z -> 9, p -> 45/100, q -> 55/100};
Series[%, {s, 0, 15}] // Simplify

M1[s_] = (1 + Sqrt[1 - 4 p q s^2])/(2 q s);
M2[s_] = (1 - Sqrt[1 - 4 p q s^2])/(2 q s);
V[s_] = (p/q)^(a - z) (M1[s]^z - M2[s]^z)/(M1[s]^a - M2[s]^a) 
V[s] /. {a -> 10, z -> 9, p -> 45/100, q -> 55/100};
Series[%, {s, 0, 15}] // Simplify

Lecture 15

The Wikipedia page about random walk is worth a look. Today, we focused upon the Gambler's ruin problem. However, there are many more interesting things that can be said about random walks. For example, one can consider random walks on arbitrary graphs. More about random walks will be presented to you in the course Markov Chains, Part IB.

I mentioned that if you wish to ensure you never go bankrupt then rather than bet £1 at each go you could bet £ $fz$, where $0<f<1$. That way you come next to $(1+f)z$ or $(1-f)z$. Kelly betting, that I will talk about later (probably in Lecture 21), is about choosing the optimal size of $f$.

The continuous version of random walk is Brownian motion, which is presented in Stochastic Financial Models, Part II. One of its beautiful properties is that it is self-similar. The graphic below shows how Brownian motion looks when we place it under the microscope and gradually increase the magnification. It is possible to say of Brownian motion in 2 dimensions that "almost surely every arc of the path contains the complete works of Shakespeare in the handwriting of Bacon." (Peter Whittle). Just as we might say in dimension 1 that at some magnifications the path below looks like a graph of FTSE all-share index for the past 5 years.

Monday, 17 February 2014

Simulation of a random walk

Here is something for Lecture 15 (on random walk).

Simulating the Simple Random Walk
from the Wolfram Demonstrations Project by Heikki Ruskeepää

Click above to play with a simulation of random walk produced by tossing a fair coin. The "seed" reruns with a different set of random coin tosses. The curves are confidence intervals, within which the walk will lie with high probability. Random walks are very important in physics, biology, and finance.



There are many interesting questions that one can ask about symmetric random walk, i.e. the random paths produced by tossing a fair coin (up 1 for a head, down 1 for a tail).
  • How many times on-average does a walk of length $n$ cross the $x$-axis?
  • What is the distribution of the last crossing of the $x$-axis? (recall Lecture 1!)
  • What is the distribution of the terminal value $X_n$?
  • As you try different seeds, do you think some walks look more "typical" than others?

Lecture 14

In this lecture you have had a taste of the theory of branching processes. Please check that you have an up-to-date copy of my notes, as I corrected several typos this morning. The calculations we did today in Theorem 14.2 are similar to the sort of thing you need to do in answering Examples Sheet 3, #10, 14, 15. Question #16 does not look like one about branching processes, but you might think about trying to map into a version of a branching process so that you can apply theorems from this lecture.

If you would like to see something more about branching processes, I recommend to A Short Introduction to Branching Processes by Gesine Reinert (a professor of statistics at Oxford).

I mentioned that branching processes have interesting applications beyond the obvious ones of population biology, e.g. in rumours and advertising. I asked Gesine if she might recommend something beyond the basics to tell you about. She has written to me this morning, "One application of branching processes is to assess the shortest path length between two randomly chosen vertices in sparse networks; but that may be a bit much for first-year students."

I said a bit at the end of the lecture that was drawn from Gesine's slides 26-28, and 30-32. In these you can see the idea of a multi-type branching process. We see that it is possible to prove a theorem similar to our Theorem 14.3, but now the probability of extinction depends on the maximum eigenvalue of the matrix $M=(m_{ij})$, where $m_{ij}$ is the mean number of offspring of type $j$ produced by a parent of type $i$. The branching process becomes extinct with a probability 1 or $<1$ as the eigenvalue is $\leq1$ or $>1$.

Suppose there are two types. The key quantities are the two multi-dimensional p.g.f.s

$F_i(z_1,z_2)=E_i\Bigl[z_1^{X_{i1}}z_2^{X_{i2}}\Bigr]$, $i=1,2$,

where $X_{ij}$ is a r.v. having the distribution of the number of offspring of type $j$ produced by a parent of type $i$.

 Let $q_i$ be the probability of extinction when we start with one individual of type $i$. Then, analogous to  Theorem 14.3, $(q_1,q_2)$ are the unique solution in $[0,1]^2$ to

$q_1 = F_1(q_1,q_2)$
$q_2 = F_2(q_1,q_2)$.

Friday, 14 February 2014

Simulation of a branching process

Here is something for Lecture 14 (on branching processes).

Readme first. You can run this demo in your browser, but you will need to install the Wolfram CDF player. It may take a few moments to load, so please be patient. If it seems slow to load then an alternative is to download the demo from the link on the page ("Download Demonstration as CDF") and then run the demo in Mathematica on your own computer. Here is a screenshot from the demo.

This Demonstration shows up to six simulated generations of a branching process. The distribution of the number of descendants of each member of the population can have either a Bernoulli, Poisson, or geometric distribution; the parameters of these distributions can be set with sliders. From the three figures, the left one shows the simulated branching process, the upper-right one shows the probability of ultimate extinction as a function of a parameter of the distribution, and the lower-right one shows the probabilities of the size of the population at the selected generation.

To run the demo click on Simulating the Branching Process
[from the Wolfram Demonstrations Project by Heikki Ruskeepää]

Discrete Fourier transforms and p.g.f.s

In non-examinable sections Section 13.4 and 13.5 I have tried to show you some moderately sophisticated applications of the simple ideas we are learning about p.g.f. and conditional expectation. I have added an Appendix B to the notes to say a bit more about the fast Fourier transform.

The brief story is this. To find the distribution of $Y$, the sum of two i.i.d. r.vs $X_1$ and $X_2$ taking values in $\{0,1,\dotsc,N-1\}$, we might find its p.g.f. $p_Y(z)=p(z)^2=(p_0+p_1z+\cdots+p_{N-1}z^{N-1})^2$.
This would take $O(N^2)$ multiplications of the form $p_ip_j$.

However, it is quicker to evaluate $p(z)$ at $2N$ complex values $z=\omega^0,\dotsc,\omega^{2N-1}$, square those values, and then recover the real coefficients of the polynomial $p(z)^2$. The calculation simplifies because by taking $\omega$ as a $2N$th root of unity the calculation of the $p(\omega^k)$, $k=0,\dotsc,2N-1$, can be done by a clever method, the discrete fast Fourier transform, which needs only $O(N\log N)$ multiplications.

Further details are in the notes on pages 52, 53 and 99, and then in the Part II course, Numerical Analysis.

Valentine's Day story

Today's lecture was on Valentine's Day. It is appropriate that you hear this story.

 “You haven’t told me yet,” said Lady Nuttal, “what it is your fiancé does for a living.”

“He’s a statistician,” replied Lamia, with an annoying sense of being on the defensive.

Lady Nuttal was obviously taken aback. It had not occurred to her that statisticians entered into normal social relationships. The species, she would have surmised, was perpetuated in some collateral manner, like mules.

“But Aunt Sara, it’s a very interesting profession,” said Lamia warmly.

“I don’t doubt it,” said her aunt, who obviously doubted it very much. “To express anything important in mere figures is so plainly impossible that there must be endless scope for well-paid advice on how to do it. But don’t you think that life with a statistician would be rather, shall we say, humdrum?”

Lamia was silent. She felt reluctant to discuss the surprising depth of emotional possibility which she had discovered below Edward’s numerical veneer. “It’s not the figures themselves,” she said finally, “it’s what you do with them that matters.”

— Ascribed to K. A. C. Manderville, The Undoing of Lamia Gurdleneck, in Maurice G. Kendall and Alan Stuart, The Advanced Theory of Statistics, Volume 2 (1979, frontispiece).

Lecture 13

The tower property of conditional expectation (also  called the law of iterated expectations) is akin to the law of total probability that we met in Section 6.3. Another name is law of total expectation. A special case is

$E[X]=\sum_{i:P(A_i)>0} E[X \mid A_i]P(A_i)$,

where $A_1,\dotsc,A_n$ are events giving a partition of the sample space $\Omega$. Compare the above to

$E[X]=E\big[E[X \mid Y]\bigr]=\sum_y E[X \mid Y=y]P(Y=y)$.

Sometimes we might write $E\Big[E[X \mid Y]\Bigr]$ as $E_Y\Big[E[X \mid Y]\Bigr]$, just to remind ourselves that the outer expectation is being applied to the variable $Y$. Of course it is implicit in all this that $E[X]$ exists.

Remember that $P(A | B)$ is only defined for $P(B)>0$.  If you want to see something that illustrates this in a cute way, then read this nice description of Borel's paradox. The paradox occurs because of a mistaken step in conditioning on an event of probability $0$, which can be reached as a limit of a sequence of events in different ways.

Wednesday, 12 February 2014

Mathematica and MATLAB

I highly recommend that you install Mathematica and use it while you study and work on examples sheets (in many courses). I personally use it on almost a daily basis.
You can download a free copy:
http://www.maths.cam.ac.uk/computing/software/mathematica/

It is very easy install and learn to use. The time you spend learning to use it (and similarly MATLAB) is a very good investment.
http://www.maths.cam.ac.uk/undergrad/catam/software/matlabinstall/matlab-personal.htm

Here is a Mathematica program for Example Sheet 1, #15.

(* POLYA URN *)
{w, b} = {1, 1}
For[i = 1, i <= 50, i++, {
  For[j = 1, j <= 1000, j++, {
      m = RandomInteger[{1, w + b}];
      If[m <= w, w = w + 1, b = b + 1];
      }]
    Print[w/(w + b) // N];
  }]

Lecture 12

Today's lecture was on probability generating functions. A very nice tripos question is the following, 2004/2/II/9F. You should now be able to do this (and could discuss in your next supervision)
    A non-standard pair of dice is a pair of six-sided unbiased dice whose faces are numbered with strictly positive integers in a non-standard way (for example, (2,2,2,3,5,7) and (1,1,5,6,7,8)). Show that there exists a non-standard pair of dice A and B such that when thrown

    P(total shown by A and B is $n$) = P(total shown by pair of ordinary dice is $n$).
    for all $2\leq n\leq 12$,

    [Hint: $(x+x^2+x^3+x^4+x^5+x^6)=x(1+x)(1+x^2+x^4)=x(1+x+x^2)(1+x^3)$.]
We saw today that generating functions are useful in realms beyond probability, as example being that

"the $n$th Fibonacci number, $F_n$, is the coefficient of $x^n$ in the expansion of the
function $1/(1−x−x^2)$ as a power series about the origin."

That is a quote from chapter 1 of Herbert Wilf's fun book called generatingfunctionology, whose second edition is free to download. In his book you can read about many weird and wonderful things that can be done with generating functions. He starts Chapter 1 by writing

"A generating function is a clothesline on which we hang up a sequence of numbers for display."

Generating functions can encapsulate complicated things in a lovely way.  For example, the coefficient of $x^n/n!$ in the power series expansion of $e^{e^x-1}$ is the Bell number $B(n)$, i.e. the number of different partitions that can be made from a set of $n$ elements. In Chapter 1 of Wilf's book you can see a derivation of this fact.

Tuesday, 11 February 2014

Supervisors

If you are reading this, supervising Probability IA, and have not already been in touch with me, then please do so. I would enjoy knowing that names of people who are supervising. Together we can share experiences and improve the examples sheets and course. Here are people who I know to be supervising the course this year.

Anthony Ashton
Nathanael Berestycki
Stephen Burgess
Nilanjana Datta
Henry Jackson
Frank Kelly
Ross Lawther
Roger Sewell
Alexandra Turner
Richard Weber

Monday, 10 February 2014

Lecture 11

The probabilistic proof of the Weierstrass approximation theorem was to show you how probability theory can be used to prove an important theorem in a different field. We did something similar in Example 4.5 (Erdos probabilistic method in combinatorics) and in Examples Sheet 1, #8. The Wikipedia page on Bernstein polynomials has a list of Berstein polynomials and an animation showing how they are used to approximate a continuous function on $[0,1]$.

I cheated today by using a result from the schedules for Analysis II ("a continuous function on a closed bounded interval is uniformly continuous"). I had thought about giving an easier proof by not appealing to Analysis II, but I decided it would be nicer to show you the proper full result.

The "easy" route would have been to make the stronger assumption that there exists $\lambda$ such that $|f(x)-f(y)|\leq \lambda |x-y|$ for all $x,y\in[0,1]$. This is  Lipschitz continuity. Then the proof could have been done using only Jensen's inequality:

$E\Bigl[|f(X/n)-f(x)|\Bigr] \leq \lambda E\Bigl[|X/n-x|\Bigr] \leq \lambda \sqrt{E\Bigl[(X/n-x)^2\Bigr]}=\lambda\sqrt{\displaystyle\frac{x(1-x)}{n}}\leq  \displaystyle\frac{\lambda}{2\sqrt{n}}$

which $\to 0$ as $n\to\infty$. However, I like giving you the stronger result. This needed Chebyshev's inequality and illustrated a nice idea that is frequently used in probability calculations. We partitioned the sample space into $A$ and $A^c$ and reasoned differently in each. For $A$ we had $P(A)$ small and $|f(X/n)-f(x)|\leq 2$, whereas for $A^c$ we had $P(A^c)\leq 1$ and $|f(X/n)-f(x)|$ small.

Benford's law is surprising when first encountered. The fact that the leading digit should be "1" with a frequency of 0.301 (rather than $1/9$) has been used as a test to check for fraud in financial records. The idea is that if a person is making up numbers, he unlikely to arrange that his distribution of first digits agree with Benford's law. You can read more about that, and many other examples of Benford's law, in the Wikipedia article referenced above.

Friday, 7 February 2014

Examples Sheet 1

Most of you will have now done Examples Sheet 1. Let me make a few comments about some things I hope you learn from that sheet.

#7. (about infinite sequence of coin tosses and limsup of events) This was done in Section 4.4 of the notes. An even quicker solution can be done using the fact that $P$ is a continuous function (proved in Section 7.1 of notes).

$P(C)=P(\bigcap_{n=1}^\infty\bigcup_{k=n}^\infty A_k)=\lim_{n\to\infty}P(\bigcup_{k=n}^\infty A_k)\leq\lim_{n\to\infty}(p_n+p_{n+1}+\cdots)=0$.

#8. (sets none of which is a subset of the other) Think how this question would look if everything except the first and last sentences of its statement were removed. It would look hard! See how adding a layer of probability helps prove a theorem that seems to have nothing to do with probability. We saw this in Example 4.5 (Erdos's probabilistic method) and we will see this a couple more times in the course, next in Section 11.4 (Weierstrass approximation theorem).

#13 (Mary tosses more heads than John) and #18 (41 bags of red and blue balls). Both these questions have very beautiful solutions that should give you a WOW! feeling. Please make sure your supervisor shows you the beautiful solutions, if you have not found them yourself.

#16 Some students found this hard because they were not really sure what I was asking you to show. I have rewritten the question and posted a new version of Sheet 1. See if the new version on #16 is easier to understand. That may help your successors in IA 2015.  Note that this question relates to what I did in Lecture 10 about information entropy.

#17 (goat ate the cards!) I invented this question as a sort of joke - but I am pleased that it has turned out to be very thought provoking for many students. If the "penny has dropped" for you about this question then you are on your way to developing some good intuition in the field of probability.

Comments are welcome and will help your successors.

Information entropy

I spoke about the notion of information entropy: $H(X)=-\sum_i p_i\log p_i$. This has already featured on Examples sheet 1, #16. There we were to consider the multinomial distribution for dice rolls,

$\phi(k_1,\dotsc,k_6)=\displaystyle\frac{n!}{k_1!\cdots k_6!}\frac{1}{6^n}$, where $\sum_ik_i=n$ and $\sum_iik_i=\rho n$.

Using Stirling's formula and setting $k_i=np_i$ we see that this is maximized by maximizing the entropy of $-\sum_i p_i\log p_i$, subject to $\sum_i ik_i = \rho n$. You'll find something like this at the start of the Wikipedia article on Large deviations theory.

Entropy is an important concept in many fields, particular in communication/information theory. Shannon's coding theorem says, informally,

"$N$ i.i.d. random variables each with entropy $H(X)$ can be compressed into more than $N H(X)$ bits with negligible risk of information loss, as $N$ tends to infinity; but conversely, if they are compressed into fewer than $N H(X)$ bits it is virtually certain that information will be lost."

You can learn about this in the Part II course: Coding and Cryptology.

Lecture 10

I mentioned that there is a 318 page book completely devoted to the Cauchy-Schwarz inequality. It is The Cauchy-Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities. Its author, Michael J. Steele, is a past President of the Institute of Mathematical Statistics. One reviewer wrote "... this is one of that handful of mathematics books that you can read almost like a novel." Chapter 1 is free to download and really interesting to read. I can also recommend his highly entertaining web site, which includes material for advanced courses in probability and finance, and also Semi-Random Rants, Videos and Favorite Quotes.

There are many proofs of this inequality. Here are two in addition to the one I did in lectures.

1. Consider $E(tX-Y)^2]=t^2E[X^2]-2t E[XY]+E[Y^2]\geq 0$, and $>0$ unless $tX=Y$. This quadratic has most one real roots and so the discriminant must be nonpositive. This is essentially the proof that was given by Schwarz and is a textbook favorite.

2. We started the lecture with Jensen's inequality. It is natural to ask if the Cauchy-Schwarz inequality can be deduced from it. One version of the Cauchy-Schwarz inequality is

$(\sum_i a_ib_i)^2\leq \sum_ia_i^2\sum_ib_i^2$,

for all numbers $(a_1,\dotsc,a_n)$ and $(b_1,\dotsc,b_n)$. WLOG we may assume all $a_j,b_j> 0$. Consider a random variable $X$ for which $P(X=a_j/b_j)=b_j/\sum_ib_i^2$. Apply Jensen's inequality with the choice of convex function $f(x)=x^2$. $Ef(X)\geq f(EX)$ gives the result.

This paper presents 12 different proofs of the Cauchy-Schwarz inequality, and this web page has 10.

Thursday, 6 February 2014

Non-transitive dice

In a comment to the Lecture 9's blog someone has alerted me to the fact that a set of 3 non-transitive dice can be bought from Maths Gear. They have sides like this:

Die A: 333336
Die B: 222555
Die C: 144444   $P(A>B)=P(B>C)=21/36$ and $P(C>A)=25/36$.

These are optimal, in that they maximize the smallest winning probability. However, there are other dice for which the sum of the 3 winning probabilities is greater.

Die A: 114444
Die B: 333333
Die C: 222255   $P(A>B)=P(B>C)=24/36$ and $P(C>A)=20/36$. 

Following some links on that page I have found this interesting article by James Grime (of the Millennium Mathematics Project) in which he says lots more about non-transitive dice. He has devised a set of 5 non-transitive dice. In another interesting article there is a video in which James discusses non-transitive dice with David Spiegelhalter.

Reading this made me wonder more about optimal designs. If dice were 2-sided (i.e. coins with a number on each side) then non-transitivity clearly is impossible. But what about 3-sided dice? Is a set of three 3-sided non-transitive dice possible? I quickly found one: A: 333, B: 225, C:441. Is there a set of 4 non-transitive 3-sided dice? What is the greatest number of 6-sided non-transitive dice possible? Is it 5, as found by James Grime, or could we have 6 non-transitive dice?

James Grime has now added a comment below.

Wednesday, 5 February 2014

Expectations of sums and products

In Theorem 8.1 (expectation of a sum) and Theorem 9.2 (expectation of a product)  we proved results for a finite number of r.vs, $X_1,\dotsc,X_n$. In general, the corresponding results for a countably infinite number of r.vs are not true. You might like to think about why the proofs do not work if $n$ is replaced by $\infty$. Can you find $X_1,X_2,\dotsc$ such that the following is not true?
    $E\sum_{n=1}^\infty X_n = \sum_{n=1}^\infty EX_n$
Here is a hint. There do exist special cases for which the above is true, such as when $\sum_{n=1}^\infty E|X_n|$ converges.

Lecture 9

The account of Warren Buffett, Bill Gates and Efron's dice that I gave in lectures you can find here.

The "Names in Boxes Problem", which I talked about today, is a puzzle which I first heard described around a lunch table at AT&T Laboratories, New Jersey, around 2003. The problem was posed in a paper, with a less catchy title:

Anna Gal and Peter Bro Miltersen. The Cell Probe Complexity of Succinct Data Structures.
Proceedings of 30th Intemational Colloquium on Automata. Languages and Programming (ICALP) 2003, 332-344.

It won a best paper award at this ICALP conference. The solution was found by Sven Skyum, a colleague of Miltersen's at the University of Aarhus If you would like to read more about the history of the problem, I recommend this paper: The Locker Problem, (by Max Warshauer, Eucene Curtin, Michael Kleber, Ravi Vakil, The Mathematical Intelligencer, 2006, Volume 28, Issue 1, pp 28-31).

Someone asked me today about whether $1-\log 2\approx 0.30$ is the best possible success probability for the prisoners. I think this question is answered "yes" by the discussion on page 30 of The Locker Problem.

The puzzle is listed as #1 in a collection complied by Peter Winkler callled, "Seven Puzzles You Think You Must Not Have Heard Correctly". I think it is remarkable that we can work out all the needed facts about cycle lengths in a random permutation by using only the simple ideas from probability that we have encountered in our course thus far.

Peter is great collector of good puzzles. He has some nice ones that are geographic. How about these (answer without looking at a map):
  1. If you travel due south from Miami, Florida, what part of the South American continent do you first hit?
  2. What two cities, one in North America and one Africa, are closest?
I have been told that Question 2 in "Seven Puzzles You Think You Must Not Have Heard Correctly" has been used by some colleagues as an entrance interview question. But I think that is rather tough!

Tuesday, 4 February 2014

Notes for Lectures 1–24

I have a pretty clear idea how the rest of the course should be put into lectures. So I have done a bit more work on the notes and am posting everything I have prepared through Lecture 24.

There are some gaps, which I will fill before we come to each lecture, and I will do some more proof-reading and cosmetics. However, even if there are occasional typos and changes to be made, these notes are close to the eventual finished product. So I think it might help you to have these now — particularly for those of you whose supervisions on Sheets 2, 3, 4 arise sooner that we reach that point in lectures.

Monday, 3 February 2014

Problem solving strategies

Students sometimes say that the theory in Probability IA is all very well, but that the tripos questions require ingenuity. That is sometimes true, of questions like 2004/II/12: "Planet Zog is a sphere with centre O. A number N of spaceships land at random on its surface. ... ". Of course this makes the subject fun.

But to help with this, let's compile a collection of some frequently helpful problem solving strategies. More will be added as the course progresses.
  1. You are asked to find $P(A)$. For example, $A$ might be the event that at least two out of $n$ people share the same birthday. Might it be easier to calculate $P(A^c)$? The simple fact that $P(A)=1-P(A^c)$ can sometimes be remarkably useful. For example, in Lecture 1 we calculated the probability that amongst $n$ people no two have the same birthday.
  2. You are asked to find $P(A)$. Is there a partition of $A$ into disjoint events, $B_1,B_2,\dotsc, B_n$, so that $P(A)=\sum_i P(B_i)$?
  3. Probabilities of intersections are usually easier to calculate than probabilities of unions. Fortunately, the inclusion-exclusion formula gives us a way to convert between the two. You are asked to find $P(A)$.  Can $A$ be written as the union of some other events? Might the inclusion-exclusion formula be helpful?
  4. The expectation of a sum of random variables is the sum of their expectations. The random variables need not be independent. This is particularly useful when applied to indicator random variables. You are asked to find $EN$, the expected number of times that something occurs. Can you write $N=I_1+\cdots+I_k$, where this is a sum of indicator variables, each of which is concerned with a distinct way in which $N$ can be incremented? This idea was used Example 8.3 in lectures, where $N$ was the number of couples seated next to one another. It is the way to do the Planet Zog question, mentioned above. You can use it in Examples sheet 2, #11.
  5. You are asked to place a bound on some probability. Can you use one of the inequalities that you know? Boole, Bonferroni, Markov, Chebyshev, Jensen, Cauchy-Schwarz, AM-GM. The ones in bold are key ones.
  6. You are asked something about sums of independent random variables. Might a probability generating function help? For example, suppose $X$ has the discrete uniform distribution $P(X=r)=1/(n+1)$, $r=0,1,\dotsc,n$. Is it possible for $X$ to have the same distribution as $Y_1+Y_2$, where $Y_1$, $Y_2$ are some two independent random variables with the same distribution? Hint: what would the p.g.f. of $Y_i$ have to be?
  7. This is like 2 above, but with the tower property of conditional expectation. You are asked to find $EX$. Maybe there is some $Y$ so that $E[X]$ is most easily computed as $E[E[X\mid Y]]=\sum_y E[X \mid Y=y]\,P(Y=y)$.
  8. It is often useful to use what you know about the relations between different standard distributions. For example, if $X\sim U[0,1]$ then $-\log X\sim \mathscr{E}(1)$. So make sure to learn well all the distributions that we cover in the course and understand the relations between them. Recall their special properties: such the memoryless property of the geometric and exponential distributions. As you approach a question ask yourself, what distribution(s) are involved in this question?
  9. You are given the joint density function of continuous random variables $X_1,\dotsc,X_n$ and want to prove that they are independent. Try to spot, by inspection, how to factor this as $f(x_1,\dotsc,x_n)=f_1(x_i)\cdots f_n(x_n)$, where each $f_i$ is a p.d.f.
  10. In questions about transformation of continuous random variables there are a couple things to look out for. Always start with a bijection between $R\subseteq\mathbb{R}^n$ and $S\subseteq\mathbb{R}^n$ and make sure that you specify $S$ correctly. If $Y_1,\dotsc,Y_n$ is more variables that really interest you, then you can always integrate the unneeded ones out at the end. In computing the Jacobian $J$, remember that it is sometimes easier to compute $1/J= \partial(y_1,\dotsc,y_n)/\partial(x_1,\dotsc,x_n)$.

Lecture 8

Zipf's law.
As a small digression, or "fun fact", I told you today about Zipf's law. This is a probability mass function such that $P(X=i)\propto 1/i^s$, $i=1,2,\dotsc,m$, for some $s > 0$, typically $s=1$. It's certainly mysterious why this distribution should be a good fit to frequencies with which words are used in a language. According to the Wikipedia article linked above, "Zipf himself proposed that neither speakers nor hearers using a given language want to work any harder than necessary to reach understanding, and the process that results in approximately equal distribution of effort leads to the observed Zipf distribution". My guess is thinking that he is thinking that to retrieve the $i$th most popular word from your memory has a cost proportional to $i$ (like you were digging down into a list), and that the average workload that each word puts on the brain should be constant, so $ip(i)$ should constant. But this is clearly hand-waving. Maybe you can think of a better explanation

In queueing theory and other theories of congestion, delays and long waiting times can be caused by variability in service times. Suppose customers enter a queue at a constant rate of 1 every 5 minutes. They are served one at a time, service times are independent, taking $1,2,\dotsc$ minutes with probabilities $p_1,p_2,\dotsc$, respectively

An interesting fact (researched extensively during the past 30 years) is that there are large qualitative difference between how such queues behave when $p_i$ decays geometrically or exponentially (as do the Poisson, or geometric distribution, which has $p_i= q^ip$), and when it decays according to a power law, like $p_i\propto 1/i$ or $\propto 1/i^{0.9}$.

Sunday, 2 February 2014

The “sexiest job of the 21st century”

Probability lies at the foundations of statistics. In today's Sunday Times (2 February, 2014) there is an article entitled "Mystic Nate Silver bets on making statistics 100% sexy". Here are a few quotes from it
    NATE SILVER, an American mathematician who predicts elections and sport results with uncanny accuracy, is poised to do what fellow number-crunchers would have thought 101% impossible: make the application of statistics “cool”.

    Silver, who made his name by predicting the precise results of both of Barack Obama’s presidential wins, is turning his FiveThirtyEight blog into a fully fledged online magazine this month with backing from two of America’s biggest broadcasters.

    The Harvard Business Review recently described data scientists as having the “sexiest job of the 21st century” — not bad for a subject once derided as a profession for people without the personality to become accountants. “Their sudden appearance on the business scene reflects the fact that companies are now wrestling with information that comes in varieties and volumes never encountered before,” it said.

    The US Bureau of Labour Statistics predicts a 14% surge in demand for statisticians over the next eight years.

    The starting salary for a graduate statistician at Google is $125,000. Hal Varian, the internet giant’s chief economist, said: “I keep saying the sexy job in the next 10 years will be statisticians. People think I’m joking, but who would have thought computer engineering would have been the sexy job of the 1990s?”

    John Harlow, Los Angeles, Published: 2 February 2014

More lecture notes

I am now posting notes up to Lecture 12.  Most of Examples Sheet 2 is on topics we will have covered up to Lecture 9. However, in Question 9 you need the definition of covariance, and this comes in Lecture 10. Remember also that if you want to read ahead there are other notes you can access (listed on the course information page). Here are some other things.
  • Corrections. I am continuing to receive help from students and others in finding small typos in the notes. By now the notes for Lectures 1-9 should be close to their perfect final form. If anyone wants to read ahead and sees any typos in Lectures 10-12, please let me know and I can correct them before we reach that point.
  • My handwriting. A student asked me to write more clearly. Fair comment. Since receiving that comment I have been trying to pay attention to this. I hope my writing on the visualiser is improving.
  • Visualiser (document camera). A student has written that the projection from the camera is dim and asked if it can be brighter. Have others a problem with the brightness? I will investigate. I think the visualiser is nice for a number of reasons. I can be always facing you (unlike blackboards), and it is centrally placed to give everyone a good view (unlike the two overhead projectors).  Some mathematicians like chalkboards. But I find them to be very messy and in the past I have had bad experiences with dust getting in my eyes.
You can write suggestions below, anonymously if you wish (but then I cannot reply to you). Comments are moderated by me before appearing. I will "read, mark, learn and inwardly digest" anything you want to tell me about the above points, or other things. When the message is in the spirit of a 1-1 communication I will not release the post for public view.

Friday, 31 January 2014

The St Petersburg Paradox

In Lecture 7 I mentioned that $EX$ may not exist, and that for a random variable $X$ such that $X\geq 0$ it is possible for $EX=\infty$. Here's a famous example arising from the case $EX=\infty$. See also, Examples sheet 3, #19.

The St Petersburg Paradox
How much would you be willing to pay to play the following game? We will toss a fair coin repeatedly. We start you with £1 and then double this every time the coin shows a head. We stop when the first tail occurs. You keep your winnings, denoted by the random variable $X$. Now

$EX = (1/2)1+(1/4)2+(1/8)4+(1/16)8 +\cdots = \infty$.

Since the expected value of your winnings is infinite, it seems like you should be willing to pay an arbitrarily large amount to play this game! But would you really be willing to pay £1,000,000? (if you could!) Surely not. If you paid £1,000,000 for the right to play this game, the probability is 0.9749 that you will win less than you paid.

This is the St. Petersburg paradox  posed by Nicolas Bernoulli 1713.The paradox concerns what the mathematics predicts versus what a rational person might really do in practice. One resolution is that a person's utility for money is not linear. We should really be looking at $Eu(X)$, where perhaps $u(X)=\log X$. In that case $Eu(X) < \infty$.

Lecture 7

You might like to revisit the blog for Lecture 4. The example I do there about tossing coins for which $\sum_k p_k=\infty$ makes use of both continuity of $P$ (proved today), and independence of events (Lecture 5).

We were careful today in defining the notion of a random variable. It is a function mapping elements of the sample space $\Omega$ to some other set $\Omega_X$ (which is typically some set of real numbers). So $X$ is a function and $X(\omega)$ is value in its range. Most of the time we leave out the $\omega$ and speak of an event like $X\leq 1.45$. This really means the event $\{\omega: X(\omega)\leq 1.45\}$.

The terminology ‘random variable’ is somewhat inaccurate, since a random variable is neither random nor a variable. However, a random variable has an associated probability distribution and one can think informally about a random variable as "a variable which takes its value according to a probability distribution".

Fine print. In Lecture 4 I gave you the definition of a probability space $(\Omega,\mathscr{F},P)$. For $T\subset\Omega_X$, we can calculate $P(X\in T)=P(\{\omega:X(\omega)\in T\})$ only if $\{\omega:X(\omega)\in T\}=X^{-1}(T)\in\mathscr{F}$. This may put restrictions on allowable $T$ and $X$. But this will not bother us in Probability IA, since we will only look at situations in which $\mathscr{F}$ consists of all subsets of $\Omega$, or $T$ is something nice like a subinterval of the reals.

Wednesday, 29 January 2014

Bayes' theorem

Bayes' theorem looks innocent enough, but from it emerges an entire school of statistical reasoning, Bayesian statistics. We have a hypothesis $A$, which might be true or false. A priori we think it is true with probability $p$. We collect some data and then want to update our belief that $A$ is true, that is find $P(A\mid\text{data})$. We can find the answer using $P(\text{data}\mid A)$ and Bayes' theorem.

Here is a YouTube video of students in a sing-a-long of George Box's "There's no theorem like Bayes' theorem", sung to the tune of "There's no business like show business".

There's no theorem like Bayes' theorem
Like no theorem we know
Everything about it is appealing
Everything about it is a wow
Let out all that a priori feeling
You've been concealing right up to now!

Further words are here.

Lecture 6

Here is some interesting reading for you about the law courts and conditional probability.
  • A formula for justice
    Bayes' theorem is a mathematical equation used in court cases to analyse statistical evidence. But a judge has ruled it can no longer be used. Will it result in more miscarriages of justice?
  • Court of Appeal bans Bayesian probability (and Sherlock Holmes)
    In a recent judgement the English Court of Appeal has not only rejected the Sherlock Holmes doctrine shown above, but also denied that probability can be used as an expression of uncertainty for events that have either happened or not.
  • Prosecutor's fallacy
    The prosecutor's fallacy is a fallacy of statistical reasoning, typically used by the prosecution to argue for the guilt of a defendant during a criminal trial. 
I mentioned David Aldous's blog post Presenting probability via math puzzles is harmful.
Do you think it has harmed you to see the problem "I have two children one of whom is a boy (born on a Thursday)"?

If you read the Wikipedia article on Simpson's paradox you will find that an alternative name is the Yule-Simpson effect. Udny Yule (a Fellow of St John's College, and a lecturer in statistics at Cambridge) appears to be the first one to have commented on this phenomenon (in 1903).

Our Example 6.7 was artificial, so it would be easy to see what was happening (women predominated amongst independent school applicants and women were more likely to gain admission.) In the Wikipedia article you can read about more practical examples. One of the best-known real-life examples of Simpson's paradox occurred when the University of California, Berkeley was sued for bias against women who had applied for admission to graduate schools there.

I remember once extracting data from Cambridge University Tripos results to show that in each of certain set of carefully selected subjects (including Engineering, English, ...) women were more likely to obtain Firsts than men, but that men were more likely to obtain Firsts when results over all subjects were combined. This sounds paradoxical. Can you figure out how it can happen? Hint: some subjects award greater percentages of Firsts than do others. Some subjects are relatively more popular with women. 

Monday, 27 January 2014

Lecture 5

I have updated the notes to go as far as next Wednesday, Lecture 9.  By the way, if you are printing these notes, you can choose the format you most prefer. I use Acrobat to print with "Fit" which gives me a large typeface on A4. But "Actual size" (to give you wide margins) or "multiple pages" 2 per sheet can look nice also.

I have made some small changes to Example 5.2 in the notes for Lecture 5, and one important correction to a typo in the formula for the Poisson distribution. By the way, I mentioned horse kicks. I was imperfectly recalling that a famous example of data that is Poisson distributed is due to von Bortkiewicz (1898), who studied the numbers of Prussian cavalryman being killed by the kick of a horse in each of 20 years. This is an example with $n$ (number of cavalryman) large, and $p$ (chance a particular cavalryman being killed) small. You can read more about this famous example here.

You might compare the answer to Examples Sheet 1 #10 to a Bonferroni inequality working in the reverse direction, $\geq$.

We saw by examples today that it is possible to have three events $A_1,A_2,A_3$ such that every pair of events is independent, but the three events are not mutually independent. Here is another example. Suppose the sample sample consists of 4 equally likely outcomes $\{w_1,w_2,w_3,w_4\}$. Let

$A_1=\{w_1,w_2\}$,
$A_2=\{w_1,w_3\}$,
$A_3=\{w_1,w_4\}$.

Then $P(A_i\cap A_j)=P(A_i)P(A_j)=1/4$, but $1/4=P(A_1\cap A_2\cap A_3)\neq P(A_1)P(A_2)P(A_3)=1/8$.

I mentioned in the lecture that we might generalize this to a puzzle. Can you find four events $A_1,A_2,A_3,A_4$ such that every subset of either two or three events is a set of mutually independent events, but the four events are not mutually independent?

At the end of the lecture a student told me this nice answer. Let $A_i$ be the probability the $i$th dice roll is odd, $i=1,2,3$ and $A_4$ be the event the sum of the three rolls is even. That works! And it looks like we could generalize this example to give $A_1,\dotsc,A_{k+1}$ such that every  $k$ or fewer events are independent, but the $k+1$ events are not independent.

You can find another answer at the end of this page, but I think the above is nicer.

Friday, 24 January 2014

limsup of a sequence of sets (or events)

When seen for the first time it can be difficult to immediately grasp the meaning of  $\bigcap_{i=1}^\infty \bigcup_{k=i}^\infty A_k$.

I found that when I was a student and encountering this for the first time. This is one reason I included #7 on Examples Sheet 1. My advice is to pick it apart slowly.

$A_k$ is the event that the $k$th toss is a head.

First: The event $\bigcup_{k=i}^\infty A_k$ is true (occurs) if and only if there is at least one head somewhere amongst the tosses $i,i+1,i+2,\dots$ .

Second: When we put $\bigcap_{i=1}^\infty$ in front we create an event that is true if and only if the above is true for all $i$. So how can the result be true? Well, if and only if there are an infinite number of heads. (There might also be an infinite number of tails.)

In general, $\bigcap_{i=1}^\infty \bigcup_{k=i}^\infty E_k$ is occurs (or is true) if and only if an infinite number of the events $E_1,E_2,\dotsc$ occur.

The idea is similar to the "$\limsup$" of a sequence $\{a_1,a_2,\dots\}$, which is defined as

$\lim_{i\to\infty} \sup_{k\geq i} a_k$.

How about $\bigcup_{i=1}^\infty \bigcap_{k=i}^\infty A_k$? That is the event that eventually we see only heads. Now perhaps you can see that

$\bigcup_{i=1}^\infty \bigcap_{k=i}^\infty E_k \subseteq\bigcap_{i=1}^\infty \bigcup_{k=i}^\infty E_k $.

Here is an article on it that might also help. 

The Mozart Cafe problem

Two friends travelling independently to Vienna wish to meet for a coffee on the afternoon they arrive. Neither has been to Vienna before, but they guess it must have a Mozart Cafe. So in an email exchange they agree to meet at the Mozart Cafe. Unfortunately, upon arrival they each discover that there are in fact $m >1$ Mozart Cafes in Vienna. Assuming they have no way to communicate, what should they do?

Suppose that each does one of the following. He either (a) chooses one cafe at random and stays there all afternoon, or (b) visits all $m$ cafes in random order spending equal time in each. On average, the best outcome is achieved when one does (a) and the other does (b) — on average they will meet half way through the afternoon. However, this cannot be guaranteed since they have no way to agree who should do (a) and who (b). The worst outcome is when both do (a) (unless they are lucky and choose the same cafe).

Suppose each friend decides to choose his strategy randomly, doing (a) with probability $p$ and (b) with probability $1-p$. If they fail to meet then they repeat the same strategy on subsequent afternoons, with new random choices, until they meet.

The calculation of the probability of derangement done in Example 4.4 helps with the evaluation of what happens when they both choose (b), an event that happens with probability $(1-p)^2$.

They wish to minimize the expected time until they meet. It turns out that for $m=3$, the optimal strategy is with $p=1/3$. As $m\to\infty$ the optimal $p\to \approx0.24749$.

You might enjoy looking at the slides for my recreational talk Symmetric Rendezvous Search Games. Here is a completely unsolved problem to which we would love to know the answer:

Two astronauts land at random spots on a planet (which is assumed to be a uniform sphere, without any known distinguishing marks or directions). They can each move at a constant speed. How should they move so as to be within 1 kilometre of one another in the least expected time?

A non-measurable set

I remarked that it may not be possible for $\mathscr{F}$ to be all subsets of $\Omega$. Here is an example to illustrate the point.

Let $\Omega$ be the set of all points on the circumference of a unit circle. Define a equivalence relation on points in $\Omega$ such that $x\sim y$ if the angle between them is a rational fraction of $\pi$. This partitions $\Omega$ into equivalence classes (in fact an uncountable number of them). Now pick one point from each equivalence class and call the set of these points $A$. If we shift every point in $A$ around the circumference of the circle by the same rational angle $q$ we get a new set, say $A_q$. Clearly, $A$ and $A_q$ are disjoint, and it ought to be that $P(A_q)=P(A)$. Now taking the union over all rationals, $\Omega=\bigcup_q A_q$ is a union of a countable number of disjoint sets (since every point in $\Omega$ is some rational angle away from some point in $A$). But it is impossible to satisfy the probability axioms, which would require $P(\bigcup_q A_q)=1=\sum_q P(A_q)$, and also have $P(A)=P(A_q)$ for all $q$. We are forced to conclude that no probability measure $P$ can cope with a subset of $\Omega$ like $A$.

Lecture 4

I have added to the list of Problem Solving Strategies a note about the inclusion-exclusion formula.

I showed you in Example 4.5 a use of the probabilistic method in combinatorics. If you liked that, you might like to read about other examples, for instance in this link.

In Example 4.4 we considered an infinite sequence of tosses of biased coins and saw that if $\sum_{i=1}^\infty p_i<\infty$  then $P(\text{infinite number of heads})=0.$

Now let's prove that $\sum_{i=1}^\infty p_i=\infty \implies P(\text{finite number of heads})=0.$ This is trickier.

We are assuming the coin tosses are independent (which is terminology from Lecture 5, but you can guess intuitively what this means). Let $A_k$ be the event that the $k$th toss is a head. Let $E_n=\bigcap_{k=n}^\infty A_k^c=[\text{no heads after the $(n-1)$th toss}]$. Then for $m > n$

$P(E_n)\leq P(\bigcap_{k=n}^m A_k^c)=\prod_{k=n}^m P(A_k^c) =\prod_{k=n}^m (1-p_k)\leq \prod_{k=n}^m e^{-p_k} =e^{-\sum_{k=n}^m p_k}$.

The right hand side $\to 0$ as $m\to\infty$.

Hence $P(E_n)=0$, and then using Property (v) of $P$ (continuity), we have that

$P(\text{finite number of heads})=P(\bigcup_{n=1}^\infty\bigcap_{k=n}^\infty A_k^c)=\lim_{n\to\infty}P(E_n)=0.$

These two results are examples of applications of the Borel-Cantelli lemmas, which appear in the Part II course Probability and Measure.

I did not have time to talk through Example 4.8, but it is another nice use of the inclusion-exclusion formula.