## Friday, 28 February 2014

### Lecture 19

The fictional example in Section 19.2 about the heights of UK and Netherlands basketball players showed something striking. Can you think of any real life instances of this sort of effect?

The graphs on page 7 of this report on climate change are interesting in showing how a small change in the mean can give rise to an amplified impact in the tail of a normal distribution. See Full SREX report: Managing the Risks of Extreme Events and Disasters to Advance Climate Change Adaptation, Special Report of the Intergovernmental Panel on Climate Change.

In this graph we see "hot weather" increasing by a factor of 2.
But "extreme hot weather" increases even more, by a factor of 3.

## Wednesday, 26 February 2014

### Last arrivals problem

The last arrivals problem that I discussed in 18.5 is similar to the famous Secretary Problem that is addressed in the Part II course Optimization and Control. In that problem the task is to interview $n$ secretaries for a job one by one, stopping to hire the first who is better than all those previously interviewed, in the hope that this choice will also prove better than all the candidates who would have been interviewed subsequently. This is one example of an interesting problem at the interface of probability and optimization.

The problem in 18.5 is slightly different because the number of gold coins is completely unknown. This problem has been studied by F. Thomas Bruss and Marc Yor in Stochastic Processes with Proportional Increments and The Last-Arrival Problem, Stochastic processes and their Applications, 122:3239–3261, 2012.

A game version of the problem is addressed by Johan Wästlund in the paper When only the last one will do. He discusses the following discrete game.

There is a deck of $d$ cards, and at the start of the game a devil labels the face side of some of them. There must be at least one labelled card, and at most $N$. The deck is then shuffled and the cards are turned up one by one. The selector (John), knowing $d$ and $N$, must select online the last labelled card.

The devil’s strategy is simply a probability distribution on $\{1,\dotsc, N\}$. Wästlund shows that, as $N\to\infty$, John's success probability will be about 0.352917000207196 (which is $<1/e=0.367879$).

It might be fun to find the answer for $d=52$ and $N=2$. The devil marks either 1 or 2 cards, with probabilities $p$ and $1-p$, respectively, and John will stop at the first marked card if it occurs $k+1$ or deeper in the deck of cards, or if it is the $k$th card he will stop w.p. $\theta$. Otherwise he continues, hoping to find a second marked card. What are optimal $p$, $k$ and $\theta$?

Answer:  I couldn't resist figuring this out after lunch today. Both players in this game use mixed strategies. The devil will choose $p=52/103$ and John will stop if he finds the first marked card as the $21$st or deeper; if it is $20$th card then he will stop with probability $\theta=28/115$. His success probability is $927/1495\approx 0.62$.

A student suggested at the end of the lecture that we might consider a game in which our task is to stop at the point that only finitely many gold coins remain to be found. I'm not sure how we set up the problem so as to initially distribute an infinite number of gold coins in the early part of the road. But it would certainly be interesting to ask how we might stop so as to maximize the probability that we have found all but exactly $k$ coins, generalizing the original problem, which was the case $k=0$.

### Planet Zog problems

Here are two challenging tripos questions from recent years. The first is a problem in geometric probability (as in Lecture 18). To do this question it will help to know that the surface area of a spherical cap is $2\pi R h$, where $R$ is the radius of the sphere and $h$ is the height of the cap. The surface area of the entire sphere is $4\pi R^2$.

2003/II/12
Planet Zog is a ball with centre $O$. Three spaceships $A$, $B$ and $C$ land at random on its surface, their positions being independent and each uniformly distributed on its surface. Calculate the probability density function of the angle $AOB$ formed by the lines $OA$ and $OB$.

Spaceships $A$ and $B$ can communicate directly by radio if  $\angle AOB < \pi /2$, and similarly for spaceships $B$ and $C$ and spaceships $A$ and $C$. Given angle $\angle AOB =\gamma < \pi /2$, calculate the probability that $C$ can communicate directly with either $A$ or $B$. Given $\angle AOB =\gamma > \pi /2$, calculate the probability that $C$ can communicate directly with both $A$ and $B$. Hence, or otherwise, show that the probability that all three spaceships can keep in touch (with, for example, $A$ communicating with $B$ via $C$ if necessary) is $( \pi +2)/(4\pi )$.
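If you want a numerical sanity check of that final answer, here is a quick Monte Carlo sketch in Python (illustrative only; the function and variable names are mine). It samples triples of uniform points on the sphere and counts how often all three ships can keep in touch:

```python
import math, random

random.seed(1)

def random_point():
    # A normalised triple of standard normals is uniform on the sphere.
    while True:
        v = [random.gauss(0, 1) for _ in range(3)]
        r = math.sqrt(sum(x * x for x in v))
        if r > 1e-12:
            return [x / r for x in v]

def in_touch(a, b):
    # Direct contact iff the angle at the centre is < pi/2, i.e. dot product > 0.
    return sum(x * y for x, y in zip(a, b)) > 0

trials = 100_000
hits = 0
for _ in range(trials):
    A, B, C = random_point(), random_point(), random_point()
    # With three ships, the contact graph is connected iff it has >= 2 edges,
    # since any two of the three possible edges share a vertex.
    if in_touch(A, B) + in_touch(B, C) + in_touch(A, C) >= 2:
        hits += 1
estimate = hits / trials
exact = (math.pi + 2) / (4 * math.pi)  # about 0.409
```

With this many trials the estimate should land within about 0.01 of $(\pi+2)/(4\pi)$.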

The second question is rather different. Here you are being tested on the idea that a sample space can be partitioned into disjoint events.

2004/II/12
Let $A_1,A_2,\dotsc,A_r$ be events such that $A_i\cap A_j = \emptyset$ for $i \neq j$. Show that the number $N$ of events that occur satisfies

$P(N = 0) = 1 - \sum_{i=1}^r P(A_i).$

Planet Zog is a sphere with centre $O$. A number $N$ of spaceships land at random on its surface, their positions being independent, each uniformly distributed over the surface. A spaceship at $A$ is in direct radio contact with another point $B$ on the surface if $\angle AOB < \pi/2$. Calculate the probability that every point on the surface of the planet is in direct radio contact with at least one of the $N$ spaceships.

[Hint: The intersection of the surface of a sphere with a plane through the centre of the sphere is called a great circle. You may find it helpful to use the fact that $N$ random great circles partition the surface of a sphere into $N(N − 1) + 2$ disjoint regions with probability one.]

### Lecture 18

The Wikipedia article on the Bertrand paradox has more interesting information and pictures. There is also a Wolfram Demonstrations Project called the Random chord paradox that you can play with.

Examples sheet 4, #2 is this famous question: "A stick is broken in two places, independently uniformly distributed along its length. What is the probability that the three pieces will make a triangle?" For a more complicated geometric probability problem see the blog posting on Planet Zog problems.
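The well-known answer to the broken-stick question is $1/4$, and it is easy to check by simulation. Here is a short Python sketch (variable names are mine):

```python
import random

random.seed(2)

trials = 200_000
triangles = 0
for _ in range(trials):
    u, v = sorted((random.random(), random.random()))
    # Piece lengths when the stick is broken at u and v (u <= v).
    pieces = (u, v - u, 1 - v)
    # A triangle can be formed iff every piece is shorter than the other
    # two combined, i.e. iff no piece has length >= 1/2.
    if max(pieces) < 0.5:
        triangles += 1
estimate = triangles / trials  # the exact answer is 1/4
```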

## Tuesday, 25 February 2014

### Table of distributions

Here is a table of 5 discrete and 6 continuous distributions. It is supplied to second year students studying Statistics IB, as a reminder of things they are to know from Probability IA.

For each distribution the table shows its usual notation, p.m.f. or p.d.f., mean, variance, and p.g.f. or m.g.f.

You will want to have at your fingertips most of the information in this table (or the method by which to produce it). You might find it helpful to make your own table, using this one as a guide.

Some of these 11 distributions are not explicitly mentioned in the schedules for Probability IA. Rather they appear as examples of other things which are mentioned in the schedules. For example, the gamma distribution arises as an example of summing i.i.d. r.vs (in this case exponential r.vs). The multivariate normal provides an example of the p.d.f. of jointly distributed continuous r.vs. The beta distribution appears as an exercise on the joint distribution of order statistics of i.i.d. uniform r.vs. We have not mentioned the negative binomial in a lecture, but this is simply the sum of i.i.d. geometric r.vs (as done in Examples sheet 3, #5).

## Monday, 24 February 2014

### Lecture 17

An important special case of Theorem 17.1 is when $Y=aX$, for some constant $a>0$.
The p.d.f. of $Y$ is then $f_Y(y)=(1/a)f_X(y/a)$.

There are many ways to define transitive ordering relations between random variables. Now you know about $EX\geq EY$ and $X\geq_{\text{st}} Y$. Such relationships are important in applications, such as when $X$ and $Y$ are lifetimes of components and we are interested in ordering their reliabilities. (Reliability engineering is concerned with determining the ability of a system or component to function under some stated conditions for a specified period of time.)

There are other ordering relations, such as "hazard rate ordering", and "convex ordering". These are described within the Wikipedia article on stochastic ordering. An important order relation in statistics is "likelihood ratio ordering". If $X$ and $Y$ have p.d.fs $f$ and $g$ such that $f(t)/g(t)$ increases with $t$ then $X\geq_{\text{lr}}Y$. It is a stronger ordering than the two we have met, since $X\geq_{\text{lr}}Y\implies X\geq_{\text{st}}Y$.

The treatment of the inspection paradox in this lecture derived from a paper by Sheldon Ross, The inspection paradox, Probability in the Engineering and Informational Sciences, 17, 2003, 47–51. Of course the use of the word "paradox" is somewhat inappropriate, since it is not hard to understand why $X_J\geq_{\text{st}}X_1$.

There are other such inspection paradoxes. Suppose $X_1,X_2,\dotsc$ are i.i.d. $U[0,1]$ and $N$ is the least integer such that $X_1+\cdots+X_N\geq 1$. Then conditional on this,

$X_N\geq_{\text{st}}X_{N-1}\geq_{\text{st}}\cdots \geq_{\text{st}}X_1$.
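A quick simulation makes this paradox vivid: $X_1$ is just a $U[0,1]$ sample with mean $1/2$, but the summand $X_N$ that pushes the running total past 1 is biased towards larger values. Here is an illustrative Python sketch (names are mine):

```python
import random

random.seed(3)

trials = 100_000
first_sum, last_sum = 0.0, 0.0
for _ in range(trials):
    xs = []
    total = 0.0
    # Draw U[0,1] samples until the running total first reaches 1.
    while total < 1:
        x = random.random()
        xs.append(x)
        total += x
    first_sum += xs[0]   # X_1: an ordinary U[0,1] sample
    last_sum += xs[-1]   # X_N: the summand that carried the total past 1
mean_first = first_sum / trials   # should be close to 1/2
mean_last = last_sum / trials     # noticeably larger than 1/2
```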

I decided not to discuss it in the lecture, but at the end of page 69 of today's notes is the puzzle, "Do girls have more brothers than boys do?" The "proof" that the answer is "yes" is false, though initially it looks almost convincing. What is wrong?

What do you think is the truth of the matter? Let $G$ and $B$ be the mean number of brothers possessed by a typical girl and typical boy, respectively.

If every family were to consist of some number of boys and just one girl then $G > B$.

If every family were to consist of only boys or only girls then $G < B$.

What are the minimal conditions under which $G=B$? Must you assume that each child is independently a boy or girl with probability $1/2$?

## Friday, 21 February 2014

### Probability distributions

The chart I showed you, with the distributions we are studying marked in yellow, can be downloaded here. It is from the paper Univariate Distribution Relationships, L. M. Leemis and J. T. McQueston. In that paper you can find explanations of the various relationships that are indicated by the arrows.

Here also is a simpler chart about Relationships between probability distributions. This one shows some relations that the above chart misses out, such as how the binomial can come from the Poisson as $X \mid X+Y$ (see Examples sheet 2, #5).

Wikipedia has a nice List of probability distributions.

In 16.5 of the notes I mention that if $X$ and $Y$ are i.i.d. exponential r.vs then $X/(X+Y)$ is a uniformly distributed r.v. You will prove this in Examples sheet 4, #8. A general method will be given in Lecture 20. But it can also be done this way:

$P(X/(X+Y) \leq t)=P(X(1-t)/t\leq Y)=\int_{x=0}^\infty\left(\int_{y=x(1-t)/t}^{\infty}\lambda e^{-\lambda y}dy\right)\lambda e^{-\lambda x}dx = t$

which is the c.d.f. of $U[0,1]$.
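You can also see this empirically. The following Python sketch (illustrative; the rate $\lambda=2$ is an arbitrary choice, since the distribution of $X/(X+Y)$ does not depend on it) compares the empirical c.d.f. of $X/(X+Y)$ with $F(t)=t$:

```python
import random

random.seed(4)

lam = 2.0  # rate parameter; the result does not depend on it
trials = 200_000
ratios = []
for _ in range(trials):
    x = random.expovariate(lam)
    y = random.expovariate(lam)
    ratios.append(x / (x + y))

# Empirical c.d.f. at a few points, to compare with F(t) = t for U[0,1].
ecdf = {t: sum(r <= t for r in ratios) / trials for t in (0.1, 0.3, 0.5, 0.7, 0.9)}
```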

### Lecture 16

The human life plots of force of mortality that I showed in today's lecture can be viewed on the Understanding Uncertainty page: How long are you going to live?

David Spiegelhalter (Winton Professor for the Public Understanding of Risk) has now added a comment for us in the comments section at the end of this post. He makes reference to the Gompertz distribution, which is one with a linearly increasing hazard rate.

Notice that the hazard rate curve (force of mortality) has a 'bath-tub' shape - there is an early high-risk period immediately following birth, followed by a long low period of annual risk, until accidents start kicking in for 17-year-old boys, after which the risk steadily increases. At the above link there are some tools for interactively exploring how your life expectancy changes as a function of your present age, gender and lifestyle (non-smoker, moderate alcohol, 5 a day, exercise).

Most things (such as car engines, tyres, light bulbs) have an increasing hazard rate. But some things have a decreasing hazard rate. A business that has lasted two centuries is less likely to go bankrupt next year than one that only started two years ago.

Notice that if $X\sim U[0,1]$ then $h(x)=f(x)/(1-F(x))=1/(1-x)$, which tends to $\infty$ as $x\to 1$.

Some fine print. I said that a continuous random variable is one that has a continuous c.d.f., and it is this that puts the "continuous" in its name. However, the fine print (which you can find in my notes near the middle of page 63) is that we actually need the c.d.f. to be absolutely continuous, and this will ensure that a p.d.f. exists. The qualifier "absolutely" rules out weirdly pathological c.d.fs. For example, the Cantor distribution has a continuous c.d.f., but it has no p.d.f. Of course none of this fine print is examinable in Part IA. But I know that some of you find this type of thing interesting.

## Wednesday, 19 February 2014

### Examples sheet 2

Most of you will have now done Examples Sheet 2. Let me make a few comments about some things I hope you learn from that sheet. Some of what I say below is "in brief", and you may need to fill in details or ask your supervisor.

#3,4. There is often more than one nice way to solve a problem. I think that both of these problems have at least two nice methods of solution. Have you been able to think of more than one way to solve them? For example, #4 can be solved by summing a geometric series. It can also be solved via a simple order 0 recurrence relation.

#6 (ii). You should have discovered that $Y$ and $X$ are independent. So $X=10$ provides no information about $Y$. What is the intuition for this? Well, we might imagine that $X$ and $Y$ are initially numbers of "detectable" and "non-detectable" misprints, created as independent samples from the Poisson distribution with parameter $\lambda/2$. The page has $X+Y$ misprints (which is distributed Poisson, parameter $\lambda$). The proofreader finds the detectable ones. Clearly, in doing this she discovers nothing about $Y$. This model gives the same distribution of $X$ and $Y$ as one in which there are $X+Y$ misprints and each is detected with probability $1/2$.

#7,8. The results in these questions are often proved by students in something like 2 pages of working. But they can also be set out in perhaps only 2-4 lines of working. Make sure you learn (from your supervisor) the fast way to derive these results.

#10. There are several ways to show that $Y_1,\dotsc,Y_n$ are independent. One way is via the fact that $P(Y_1=1,Y_2=1,\dotsc,Y_n=1)=1/n!=P(Y_1=1)\cdots P(Y_n=1)$. Then apply the result of question #2.

#12. The coupon collector's problem. This question is one that should lodge permanently in your brain. There are many problems in probability, design of computer science algorithms, and elsewhere, which at some point involve this problem. I think it is interesting that the expected time to collect the $n$ figures is about $n\log n$. However, the expected time needed to collect $99\%$ of the figures is only about $n\log 100$. So it is the last fraction of figures that are costly to collect.
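The two expectations quoted above are easy to compute exactly. While $k$ of the $n$ figures are still missing, each purchase yields a new figure with probability $k/n$, so that stage lasts $n/k$ purchases on average; summing over stages gives the answer. A short Python check (names are mine):

```python
import math

def expected_draws(n, m):
    # Expected number of purchases to collect m distinct figures out of n.
    # While k figures are missing, a purchase is new with probability k/n,
    # so that stage takes n/k purchases on average; sum over the stages.
    return sum(n / k for k in range(n - m + 1, n + 1))

n = 10_000
all_figures = expected_draws(n, n)                # about n log n
most_figures = expected_draws(n, n * 99 // 100)   # 99% of figures: about n log 100
```

For $n=10{,}000$ the full collection takes about 97,900 purchases on average, while 99% of it takes only about 46,000, close to $n\log 100$.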

#14. Did you notice that this method would work for simulating any probability $\alpha\in (0,1)$? For example, if we want $P(A)=1/5$, we write this in binary as $\alpha=0.0011001100...$ and then toss a coin. If we see TTHHH we would say $A$ has not occurred, because translating this to $.00111$, it differs from the binary form of $\alpha$ by representing a number $>\alpha$.
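The method can be coded up directly: compare coin flips (the binary digits of a uniform $U$) with the binary digits of $\alpha$, and the first disagreement decides whether $U<\alpha$. Here is an illustrative Python sketch (function names are mine):

```python
import random
from fractions import Fraction

random.seed(6)

def bernoulli(alpha):
    # Simulate an event of probability alpha using fair coin flips. Compare
    # coin flips (binary digits of a uniform U) with the binary digits of
    # alpha; the first disagreement tells us whether U < alpha. On average
    # only 2 flips are needed, since each flip disagrees w.p. 1/2.
    a = Fraction(alpha)
    while True:
        a *= 2
        alpha_bit = 1 if a >= 1 else 0
        if alpha_bit:
            a -= 1
        coin_bit = random.randint(0, 1)  # heads = 1
        if coin_bit != alpha_bit:
            return coin_bit < alpha_bit  # U < alpha iff U's bit is the smaller

trials = 100_000
estimate = sum(bernoulli(Fraction(1, 5)) for _ in range(trials)) / trials
```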

#16. This question has been set to IA students for many years. In my experience few students are able to solve it. But the solution is so beautiful that it would be a pity if your supervisor were not given the opportunity to explain it to you.

#17,18. When I say, "This section is for enthusiasts", I am hoping that most students will think "This means me". After all, if you are reading Mathematics at Cambridge you are by definition a mathematics enthusiast. Even if you cannot solve these questions fully you should at least make some attempt. I would expect almost all students to be able to do #17 for the case $k=2$, $n=3$. It is a good general principle to tackle a difficult problem by first trying to solve a simple special case.

### Generating functions for random walk

Here is Mathematica code for the generating functions for the step at which the random walk first hits a barrier.
$U_z(s)=\sum_{n=0}^\infty P(\text{hit $0$ at $n$}\mid\text{start at $z$})s^n$,
$V_z(s)=\sum_{n=0}^\infty P(\text{hit $a$ at $n$}\mid\text{start at $z$})s^n$.

As one would expect, $U_z(1)=q_z$, $V_z(1)=p_z$ and $U_z'(1)+V_z'(1)=D_z$.

I used these to compute the table on page 60 of the notes. You could use these to find the variance of the duration of the game. By the way, notice in this table that in lines 2-4 the duration of the game is 10 times the expected loss. This is because the expected loss per game is $EX_1=-0.1$. So we are seeing $E[G]=E[X_1+\cdots+X_N]=EN\cdot EX_1=D_z\cdot EX_1$, where the r.v. $N$ is the length of the game.
(* Roots of the auxiliary equation p s x^2 - x + q s = 0, which arises from
   the recurrence U_z(s) = s (p U_{z+1}(s) + q U_{z-1}(s)). *)
L1[s_] = (1 + Sqrt[1 - 4 p q s^2])/(2 p s);
L2[s_] = (1 - Sqrt[1 - 4 p q s^2])/(2 p s);

(* U_z(s): generating function for first hitting 0 from z, before a. *)
U[s_] = (q/p)^z (L1[s]^(a - z) - L2[s]^(a - z))/(L1[s]^a - L2[s]^a)
U[s] /. {a -> 10, z -> 9, p -> 45/100, q -> 55/100};
Series[%, {s, 0, 15}] // Simplify

(* V_z(s): generating function for first hitting a from z, before 0. *)
M1[s_] = (1 + Sqrt[1 - 4 p q s^2])/(2 q s);
M2[s_] = (1 - Sqrt[1 - 4 p q s^2])/(2 q s);
V[s_] = (p/q)^(a - z) (M1[s]^z - M2[s]^z)/(M1[s]^a - M2[s]^a)
V[s] /. {a -> 10, z -> 9, p -> 45/100, q -> 55/100};
Series[%, {s, 0, 15}] // Simplify

### Lecture 15

The Wikipedia page about random walk is worth a look. Today, we focused upon the Gambler's ruin problem. However, there are many more interesting things that can be said about random walks. For example, one can consider random walks on arbitrary graphs. More about random walks will be presented to you in the course Markov Chains, Part IB.

I mentioned that if you wish to ensure you never go bankrupt then rather than bet £1 at each go you could bet £$fz$, where $0<f<1$. That way your fortune moves next to $(1+f)z$ or $(1-f)z$. Kelly betting, which I will talk about later (probably in Lecture 21), is about choosing the optimal size of $f$.

The continuous version of random walk is Brownian motion, which is presented in Stochastic Financial Models, Part II. One of its beautiful properties is that it is self-similar. The graphic below shows how Brownian motion looks when we place it under the microscope and gradually increase the magnification. It is possible to say of Brownian motion in 2 dimensions that "almost surely every arc of the path contains the complete works of Shakespeare in the handwriting of Bacon." (Peter Whittle). Just as we might say in dimension 1 that at some magnifications the path below looks like a graph of FTSE all-share index for the past 5 years.

## Monday, 17 February 2014

### Simulation of a random walk

Here is something for Lecture 15 (on random walk).

Simulating the Simple Random Walk
from the Wolfram Demonstrations Project by Heikki Ruskeepää

Click above to play with a simulation of random walk produced by tossing a fair coin. The "seed" control reruns the simulation with a different set of random coin tosses. The curves are confidence intervals, within which the walk will lie with high probability. Random walks are very important in physics, biology, and finance.

There are many interesting questions that one can ask about symmetric random walk, i.e. the random paths produced by tossing a fair coin (up 1 for a head, down 1 for a tail).
• How many times on-average does a walk of length $n$ cross the $x$-axis?
• What is the distribution of the last crossing of the $x$-axis? (recall Lecture 1!)
• What is the distribution of the terminal value $X_n$?
• As you try different seeds, do you think some walks look more "typical" than others?
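The second bullet has a surprising answer, related to the arcsine law mentioned in Lecture 1: the last crossing of the $x$-axis is most likely to be near the start or the end of the walk, not the middle. Here is an illustrative Python simulation (names are mine):

```python
import random

random.seed(7)

n = 100        # length of the walk (even, so returns to 0 are possible)
trials = 20_000
last_zero = []
for _ in range(trials):
    s, last = 0, 0
    for t in range(1, n + 1):
        s += random.choice((-1, 1))
        if s == 0:
            last = t
    last_zero.append(last)

# Arcsine law: P(last zero <= xn) is about (2/pi) arcsin(sqrt(x)), so the last
# crossing is symmetric about n/2 and most likely to be near 0 or n.
frac_first_half = sum(t <= n // 2 for t in last_zero) / trials
frac_extremes = sum(t <= n // 10 or t >= 9 * n // 10 for t in last_zero) / trials
```

By symmetry `frac_first_half` is close to $1/2$, while `frac_extremes` is close to $0.41$, roughly double what a uniform distribution of the last zero would give.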

### Lecture 14

In this lecture you have had a taste of the theory of branching processes. Please check that you have an up-to-date copy of my notes, as I corrected several typos this morning. The calculations we did today in Theorem 14.2 are similar to the sort of thing you need to do in answering Examples Sheet 3, #10, 14, 15. Question #16 does not look like one about branching processes, but you might think about trying to map into a version of a branching process so that you can apply theorems from this lecture.

If you would like to see something more about branching processes, I recommend A Short Introduction to Branching Processes by Gesine Reinert (a professor of statistics at Oxford).

I mentioned that branching processes have interesting applications beyond the obvious ones of population biology, e.g. in rumours and advertising. I asked Gesine if she might recommend something beyond the basics to tell you about. She has written to me this morning, "One application of branching processes is to assess the shortest path length between two randomly chosen vertices in sparse networks; but that may be a bit much for first-year students."

I said a bit at the end of the lecture that was drawn from Gesine's slides 26-28, and 30-32. In these you can see the idea of a multi-type branching process. We see that it is possible to prove a theorem similar to our Theorem 14.3, but now the probability of extinction depends on the maximum eigenvalue of the matrix $M=(m_{ij})$, where $m_{ij}$ is the mean number of offspring of type $j$ produced by a parent of type $i$. The branching process becomes extinct with probability 1 or $<1$ according as the eigenvalue is $\leq 1$ or $>1$.

Suppose there are two types. The key quantities are the two multi-dimensional p.g.f.s

$F_i(z_1,z_2)=E_i\Bigl[z_1^{X_{i1}}z_2^{X_{i2}}\Bigr]$, $i=1,2$,

where $X_{ij}$ is a r.v. having the distribution of the number of offspring of type $j$ produced by a parent of type $i$.

Let $q_i$ be the probability of extinction when we start with one individual of type $i$. Then, analogous to  Theorem 14.3, $(q_1,q_2)$ are the unique solution in $[0,1]^2$ to

$q_1 = F_1(q_1,q_2)$
$q_2 = F_2(q_1,q_2)$.
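Just as in the single-type case, $(q_1,q_2)$ can be found by iterating $q \leftarrow F(q)$ from $(0,0)$; the iterates increase to the minimal solution. Here is a Python sketch with two hypothetical offspring distributions of my own choosing (not from Gesine's slides):

```python
def F1(z1, z2):
    # Hypothetical type-1 offspring p.g.f.: no child w.p. 1/8,
    # one type-1 child w.p. 3/8, one type-2 child w.p. 1/2.
    return 1/8 + (3/8) * z1 + (1/2) * z2

def F2(z1, z2):
    # Hypothetical type-2 offspring p.g.f.: no child w.p. 1/4,
    # two type-1 children w.p. 3/4.
    return 1/4 + (3/4) * z1 ** 2

# Iterating q <- F(q) from (0, 0) increases to the minimal solution of
# q = F(q) in [0,1]^2, the pair of extinction probabilities.
q1, q2 = 0.0, 0.0
for _ in range(500):
    q1, q2 = F1(q1, q2), F2(q1, q2)
# For these p.g.f.s the exact answer is (q1, q2) = (2/3, 7/12).
```

This example is supercritical (the mean matrix has maximum eigenvalue just above 1), so the extinction probabilities are strictly less than 1.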

## Friday, 14 February 2014

### Simulation of a branching process

Here is something for Lecture 14 (on branching processes).

This Demonstration shows up to six simulated generations of a branching process. The distribution of the number of descendants of each member of the population can have either a Bernoulli, Poisson, or geometric distribution; the parameters of these distributions can be set with sliders. From the three figures, the left one shows the simulated branching process, the upper-right one shows the probability of ultimate extinction as a function of a parameter of the distribution, and the lower-right one shows the probabilities of the size of the population at the selected generation.

To run the demo click on Simulating the Branching Process
[from the Wolfram Demonstrations Project by Heikki Ruskeepää]

### Discrete Fourier transforms and p.g.f.s

In the non-examinable Sections 13.4 and 13.5 I have tried to show you some moderately sophisticated applications of the simple ideas we are learning about p.g.f. and conditional expectation. I have added an Appendix B to the notes to say a bit more about the fast Fourier transform.

The brief story is this. To find the distribution of $Y$, the sum of two i.i.d. r.vs $X_1$ and $X_2$ taking values in $\{0,1,\dotsc,N-1\}$, we might find its p.g.f. $p_Y(z)=p(z)^2=(p_0+p_1z+\cdots+p_{N-1}z^{N-1})^2$.
This would take $O(N^2)$ multiplications of the form $p_ip_j$.

However, it is quicker to evaluate $p(z)$ at $2N$ complex values $z=\omega^0,\dotsc,\omega^{2N-1}$, square those values, and then recover the real coefficients of the polynomial $p(z)^2$. The calculation simplifies because by taking $\omega$ as a $2N$th root of unity the calculation of the $p(\omega^k)$, $k=0,\dotsc,2N-1$, can be done by a clever method, the discrete fast Fourier transform, which needs only $O(N\log N)$ multiplications.
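As a concrete illustration, here is a Python sketch (my own names; a hand-rolled radix-2 Cooley-Tukey transform rather than a library routine) that convolves two p.m.f.s by evaluating at roots of unity, squaring pointwise, and transforming back, recovering the distribution of the total of two fair dice:

```python
import cmath

def fft(a, invert=False):
    # Radix-2 Cooley-Tukey FFT; len(a) must be a power of two.
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def convolve(p, q):
    # Distribution of X + Y from the p.m.f.s of independent X and Y.
    size = 1
    while size < len(p) + len(q) - 1:
        size *= 2
    fp = fft(p + [0.0] * (size - len(p)))
    fq = fft(q + [0.0] * (size - len(q)))
    inv = fft([x * y for x, y in zip(fp, fq)], invert=True)
    # The unnormalised inverse transform is size times the coefficients.
    return [abs(v) / size for v in inv][: len(p) + len(q) - 1]

die = [1/6] * 6                 # p.m.f. of one fair die (values 1..6, offset dropped)
two_dice = convolve(die, die)   # p.m.f. of the total, 11 values for totals 2..12
```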

Further details are in the notes on pages 52, 53 and 99, and then in the Part II course, Numerical Analysis.

### Valentine's Day story

Today's lecture was on Valentine's Day. It is appropriate that you hear this story.

“You haven’t told me yet,” said Lady Nuttal, “what it is your fiancé does for a living.”

“He’s a statistician,” replied Lamia, with an annoying sense of being on the defensive.

Lady Nuttal was obviously taken aback. It had not occurred to her that statisticians entered into normal social relationships. The species, she would have surmised, was perpetuated in some collateral manner, like mules.

“But Aunt Sara, it’s a very interesting profession,” said Lamia warmly.

“I don’t doubt it,” said her aunt, who obviously doubted it very much. “To express anything important in mere figures is so plainly impossible that there must be endless scope for well-paid advice on how to do it. But don’t you think that life with a statistician would be rather, shall we say, humdrum?”

Lamia was silent. She felt reluctant to discuss the surprising depth of emotional possibility which she had discovered below Edward’s numerical veneer. “It’s not the figures themselves,” she said finally, “it’s what you do with them that matters.”

— Ascribed to K. A. C. Manderville, The Undoing of Lamia Gurdleneck, in Maurice G. Kendall and Alan Stuart, The Advanced Theory of Statistics, Volume 2 (1979, frontispiece).

### Lecture 13

The tower property of conditional expectation (also called the law of iterated expectations) is akin to the law of total probability that we met in Section 6.3. Another name is law of total expectation. A special case is

$E[X]=\sum_{i:P(A_i)>0} E[X \mid A_i]P(A_i)$,

where $A_1,\dotsc,A_n$ are events giving a partition of the sample space $\Omega$. Compare the above to

$E[X]=E\bigl[E[X \mid Y]\bigr]=\sum_y E[X \mid Y=y]P(Y=y)$.

Sometimes we might write $E\Bigl[E[X \mid Y]\Bigr]$ as $E_Y\Bigl[E[X \mid Y]\Bigr]$, just to remind ourselves that the outer expectation is being applied to the variable $Y$. Of course it is implicit in all this that $E[X]$ exists.
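The tower property is easy to check on a small example. Here is a Python sketch with a hypothetical joint distribution for $(X,Y)$ (the numbers are chosen only for illustration), computing $E[X]$ both directly and via conditioning on $Y$:

```python
from fractions import Fraction

# A small hypothetical joint distribution: (x, y) -> P(X = x, Y = y).
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8), (2, 1): Fraction(1, 2),
}

# E[X] directly from the joint distribution.
EX = sum(x * p for (x, y), p in joint.items())

# E[X] via the tower property: sum over y of E[X | Y = y] P(Y = y).
PY = {}
for (x, y), p in joint.items():
    PY[y] = PY.get(y, Fraction(0)) + p
EX_tower = Fraction(0)
for y0, py in PY.items():
    EX_given = sum(x * p for (x, y), p in joint.items() if y == y0) / py
    EX_tower += EX_given * py
```

Both routes give $E[X]=9/8$, as the tower property guarantees.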

Remember that $P(A | B)$ is only defined for $P(B)>0$.  If you want to see something that illustrates this in a cute way, then read this nice description of Borel's paradox. The paradox occurs because of a mistaken step in conditioning on an event of probability $0$, which can be reached as a limit of a sequence of events in different ways.

## Wednesday, 12 February 2014

### Mathematica and MATLAB

I highly recommend that you install Mathematica and use it while you study and work on examples sheets (in many courses). I personally use it on almost a daily basis.
http://www.maths.cam.ac.uk/computing/software/mathematica/

It is very easy to install and learn to use. The time you spend learning to use it (and similarly MATLAB) is a very good investment.

Here is a Mathematica program for Example Sheet 1, #15.

(* POLYA URN: start with 1 white and 1 black ball; repeatedly draw a ball
   at random and return it together with a new ball of the same colour.
   Print the fraction of white balls after every 1000 draws. *)
{w, b} = {1, 1};
For[i = 1, i <= 50, i++,
  For[j = 1, j <= 1000, j++,
   m = RandomInteger[{1, w + b}];
   If[m <= w, w++, b++]];
  Print[w/(w + b) // N]]

### Lecture 12

Today's lecture was on probability generating functions. A very nice tripos question is the following, 2004/2/II/9F. You should now be able to do this (and could discuss it in your next supervision).
A non-standard pair of dice is a pair of six-sided unbiased dice whose faces are numbered with strictly positive integers in a non-standard way (for example, (2,2,2,3,5,7) and (1,1,5,6,7,8)). Show that there exists a non-standard pair of dice A and B such that when thrown

P(total shown by A and B is $n$) = P(total shown by pair of ordinary dice is $n$)
for all $2\leq n\leq 12$.

[Hint: $(x+x^2+x^3+x^4+x^5+x^6)=x(1+x)(1+x^2+x^4)=x(1+x+x^2)(1+x^3)$.]
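Regrouping the factors in the hint leads to the well-known "Sicherman dice", and it takes only a few lines of Python to verify that they give the same distribution of totals as an ordinary pair (a check, not a proof; the proof is the generating-function argument):

```python
from collections import Counter
from itertools import product

standard = (1, 2, 3, 4, 5, 6)
# Regrouping x(1+x)(1+x^2+x^4) and x(1+x+x^2)(1+x^3) yields the Sicherman dice:
A = (1, 2, 2, 3, 3, 4)
B = (1, 3, 4, 5, 6, 8)

def total_distribution(d1, d2):
    # Count, for each total, the number of equally likely ways it can occur.
    return Counter(a + b for a, b in product(d1, d2))

same = total_distribution(A, B) == total_distribution(standard, standard)
```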
We saw today that generating functions are useful in realms beyond probability, an example being that

"the $n$th Fibonacci number, $F_n$, is the coeﬃcient of $x^n$ in the expansion of the
function $1/(1−x−x^2)$ as a power series about the origin."

That is a quote from chapter 1 of Herbert Wilf's fun book called generatingfunctionology, whose second edition is free to download. In his book you can read about many weird and wonderful things that can be done with generating functions. He starts Chapter 1 by writing

"A generating function is a clothesline on which we hang up a sequence of numbers for display."

Generating functions can encapsulate complicated things in a lovely way.  For example, the coefficient of $x^n/n!$ in the power series expansion of $e^{e^x-1}$ is the Bell number $B(n)$, i.e. the number of different partitions that can be made from a set of $n$ elements. In Chapter 1 of Wilf's book you can see a derivation of this fact.
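You can check this claim for small $n$ by expanding $e^{e^x-1}$ as a truncated power series with exact rational arithmetic and comparing with the standard Bell-number recurrence $B(n+1)=\sum_k \binom{n}{k}B(k)$. An illustrative Python sketch (names are mine):

```python
from fractions import Fraction
from math import comb, factorial

N = 8  # work with power series truncated after the x^(N-1) term

def mul(p, q):
    # Multiply two truncated power series given as coefficient lists.
    out = [Fraction(0)] * N
    for i, a in enumerate(p):
        if a:
            for j, b in enumerate(q):
                if i + j < N:
                    out[i + j] += a * b
    return out

# g(x) = e^x - 1 = x + x^2/2! + x^3/3! + ...  (no constant term)
g = [Fraction(0)] + [Fraction(1, factorial(k)) for k in range(1, N)]

# e^{g(x)} = sum_m g(x)^m / m!; since g starts at x, terms with m >= N
# contribute nothing below x^N, so the truncated sum is exact.
series = [Fraction(0)] * N
power = [Fraction(1)] + [Fraction(0)] * (N - 1)  # g^0 = 1
for m in range(N):
    series = [s + c / factorial(m) for s, c in zip(series, power)]
    power = mul(power, g)

# The coefficient of x^n/n! should be the Bell number B(n).
bell_from_series = [int(series[n] * factorial(n)) for n in range(N)]

# Compare with the recurrence B(n+1) = sum_k C(n,k) B(k).
bell = [1]
for n in range(N - 1):
    bell.append(sum(comb(n, k) * bell[k] for k in range(n + 1)))
```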

## Tuesday, 11 February 2014

### Supervisors

If you are reading this, supervising Probability IA, and have not already been in touch with me, then please do so. I would enjoy knowing the names of people who are supervising. Together we can share experiences and improve the examples sheets and course. Here are people who I know to be supervising the course this year.

Anthony Ashton
Nathanael Berestycki
Stephen Burgess
Nilanjana Datta
Henry Jackson
Frank Kelly
Ross Lawther
Roger Sewell
Alexandra Turner
Richard Weber

## Monday, 10 February 2014

### Lecture 11

The probabilistic proof of the Weierstrass approximation theorem was to show you how probability theory can be used to prove an important theorem in a different field. We did something similar in Example 4.5 (Erdos probabilistic method in combinatorics) and in Examples Sheet 1, #8. The Wikipedia page on Bernstein polynomials has a list of Bernstein polynomials and an animation showing how they are used to approximate a continuous function on $[0,1]$.

I cheated today by using a result from the schedules for Analysis II ("a continuous function on a closed bounded interval is uniformly continuous"). I had thought about giving an easier proof by not appealing to Analysis II, but I decided it would be nicer to show you the proper full result.

The "easy" route would have been to make the stronger assumption that there exists $\lambda$ such that $|f(x)-f(y)|\leq \lambda |x-y|$ for all $x,y\in[0,1]$. This is  Lipschitz continuity. Then the proof could have been done using only Jensen's inequality:

$E\Bigl[|f(X/n)-f(x)|\Bigr] \leq \lambda E\Bigl[|X/n-x|\Bigr] \leq \lambda \sqrt{E\Bigl[(X/n-x)^2\Bigr]}=\lambda\sqrt{\displaystyle\frac{x(1-x)}{n}}\leq \displaystyle\frac{\lambda}{2\sqrt{n}}$

which $\to 0$ as $n\to\infty$. However, I like giving you the stronger result. This needed Chebyshev's inequality and illustrated a nice idea that is frequently used in probability calculations. We partitioned the sample space into $A$ and $A^c$ and reasoned differently in each. For $A$ we had $P(A)$ small and $|f(X/n)-f(x)|\leq 2$, whereas for $A^c$ we had $P(A^c)\leq 1$ and $|f(X/n)-f(x)|$ small.
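The construction in the proof is completely concrete: $B_nf(x)=E[f(X/n)]$ with $X\sim\text{Bin}(n,x)$. The following Python sketch (names mine) evaluates this for the Lipschitz function $f(x)=|x-1/2|$ with $\lambda=1$, and checks the error against the bound $\lambda/(2\sqrt{n})$:

```python
from math import comb

def bernstein(f, n, x):
    # Bernstein approximation B_n f(x) = E[f(X/n)], X ~ Binomial(n, x).
    return sum(f(k / n) * comb(n, k) * x**k * (1 - x) ** (n - k)
               for k in range(n + 1))

f = lambda x: abs(x - 0.5)   # Lipschitz with constant 1, not differentiable at 1/2
n = 400
grid = [i / 200 for i in range(201)]
max_err = max(abs(bernstein(f, n, x) - f(x)) for x in grid)
# The proof's bound gives max error <= 1/(2 sqrt(400)) = 0.025.
```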

Benford's law is surprising when first encountered. The fact that the leading digit should be "1" with a frequency of 0.301 (rather than $1/9$) has been used as a test to check for fraud in financial records. The idea is that if a person is making up numbers, he is unlikely to arrange that his distribution of first digits agrees with Benford's law. You can read more about that, and many other examples of Benford's law, in the Wikipedia article referenced above.

## Friday, 7 February 2014

### Examples Sheet 1

Most of you will have now done Examples Sheet 1. Let me make a few comments about some things I hope you learn from that sheet.

#7. (about infinite sequence of coin tosses and limsup of events) This was done in Section 4.4 of the notes. An even quicker solution can be done using the fact that $P$ is a continuous function (proved in Section 7.1 of notes).

$P(C)=P(\bigcap_{n=1}^\infty\bigcup_{k=n}^\infty A_k)=\lim_{n\to\infty}P(\bigcup_{k=n}^\infty A_k)\leq\lim_{n\to\infty}(p_n+p_{n+1}+\cdots)=0$.

#8. (sets none of which is a subset of the other) Think how this question would look if everything except the first and last sentences of its statement were removed. It would look hard! See how adding a layer of probability helps prove a theorem that seems to have nothing to do with probability. We saw this in Example 4.5 (Erdos's probabilistic method) and we will see this a couple more times in the course, next in Section 11.4 (Weierstrass approximation theorem).

#13 (Mary tosses more heads than John) and #18 (41 bags of red and blue balls). Both these questions have very beautiful solutions that should give you a WOW! feeling. Please make sure your supervisor shows you the beautiful solutions, if you have not found them yourself.

#16 Some students found this hard because they were not really sure what I was asking you to show. I have rewritten the question and posted a new version of Sheet 1. See if the new version of #16 is easier to understand. That may help your successors in IA 2015. Note that this question relates to what I did in Lecture 10 about information entropy.

#17 (goat ate the cards!) I invented this question as a sort of joke - but I am pleased that it has turned out to be very thought provoking for many students. If the "penny has dropped" for you about this question then you are on your way to developing some good intuition in the field of probability.

### Information entropy

I spoke about the notion of information entropy: $H(X)=-\sum_i p_i\log p_i$. This has already featured on Examples Sheet 1, #16. There we were asked to consider the multinomial distribution for dice rolls,

$\phi(k_1,\dotsc,k_6)=\displaystyle\frac{n!}{k_1!\cdots k_6!}\frac{1}{6^n}$, where $\sum_ik_i=n$ and $\sum_i i\,k_i=\rho n$.

Using Stirling's formula and setting $k_i=np_i$, we see that this is maximized by maximizing the entropy $-\sum_i p_i\log p_i$, subject to $\sum_i i\,p_i=\rho$. You'll find something like this at the start of the Wikipedia article on Large deviations theory.
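By a Lagrange multiplier argument, the entropy-maximizing distribution on $\{1,\dotsc,6\}$ with mean $\rho$ has the "tilted" form $p_i\propto\theta^i$ for some $\theta>0$; since the mean is increasing in $\theta$, we can find $\theta$ numerically. A sketch (the bisection scheme and function name are mine):

```python
import math

def max_entropy_dice(rho):
    # Maximize -sum_i p_i log p_i over distributions on {1,...,6}
    # subject to sum_i i*p_i = rho.  The Lagrange conditions give
    # p_i proportional to theta**i; bisect (in log scale) on theta.
    def dist(theta):
        w = [theta ** i for i in range(1, 7)]
        s = sum(w)
        return [x / s for x in w]
    lo, hi = 1e-9, 1e9
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        m = sum(i * p for i, p in zip(range(1, 7), dist(mid)))
        if m < rho:
            lo = mid
        else:
            hi = mid
    return dist(math.sqrt(lo * hi))

p = max_entropy_dice(4.5)
print([round(x, 4) for x in p])
```

For $\rho=3.5$ this recovers the uniform distribution, as it should; for $\rho\neq 3.5$ the probabilities tilt towards the appropriate end.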

Entropy is an important concept in many fields, particularly in communication/information theory. Shannon's source coding theorem says, informally,

"$N$ i.i.d. random variables each with entropy $H(X)$ can be compressed into more than $N H(X)$ bits with negligible risk of information loss, as $N$ tends to infinity; but conversely, if they are compressed into fewer than $N H(X)$ bits it is virtually certain that information will be lost."

### Lecture 10

I mentioned that there is a 318 page book completely devoted to the Cauchy-Schwarz inequality. It is The Cauchy-Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities. Its author, J. Michael Steele, is a past President of the Institute of Mathematical Statistics. One reviewer wrote "... this is one of that handful of mathematics books that you can read almost like a novel." Chapter 1 is free to download and really interesting to read. I can also recommend his highly entertaining web site, which includes material for advanced courses in probability and finance, and also Semi-Random Rants, Videos and Favorite Quotes.

There are many proofs of this inequality. Here are two in addition to the one I did in lectures.

1. Consider $E[(tX-Y)^2]=t^2E[X^2]-2t E[XY]+E[Y^2]\geq 0$, and $>0$ unless $tX=Y$. This quadratic in $t$ has at most one real root and so the discriminant must be nonpositive. This is essentially the proof that was given by Schwarz and is a textbook favorite.

2. We started the lecture with Jensen's inequality. It is natural to ask if the Cauchy-Schwarz inequality can be deduced from it. One version of the Cauchy-Schwarz inequality is

$(\sum_i a_ib_i)^2\leq \sum_ia_i^2\sum_ib_i^2$,

for all numbers $(a_1,\dotsc,a_n)$ and $(b_1,\dotsc,b_n)$. WLOG we may assume all $a_j,b_j> 0$. Consider a random variable $X$ for which $P(X=a_j/b_j)=b_j^2/\sum_ib_i^2$. Apply Jensen's inequality with the convex function $f(x)=x^2$: then $Ef(X)\geq f(EX)$ reads $\sum_ja_j^2/\sum_ib_i^2\geq\bigl(\sum_ja_jb_j/\sum_ib_i^2\bigr)^2$, which rearranges to give the result.
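As a quick numeric sanity check of this Jensen route, take $P(X=a_j/b_j)\propto b_j^2$, so that $EX=\sum_j a_jb_j/\sum_ib_i^2$ and $EX^2=\sum_j a_j^2/\sum_ib_i^2$, and verify $EX^2\geq(EX)^2$ on random instances (a sketch; the test parameters are mine):

```python
import random

random.seed(1)
for _ in range(100):
    n = random.randint(1, 10)
    a = [random.uniform(0.1, 5) for _ in range(n)]
    b = [random.uniform(0.1, 5) for _ in range(n)]
    s = sum(x * x for x in b)
    # P(X = a_j/b_j) proportional to b_j^2
    EX = sum((aj / bj) * (bj * bj / s) for aj, bj in zip(a, b))       # sum a_j b_j / s
    EX2 = sum((aj / bj) ** 2 * (bj * bj / s) for aj, bj in zip(a, b))  # sum a_j^2 / s
    # Jensen: E[X^2] >= (EX)^2, i.e. sum a^2 * sum b^2 >= (sum ab)^2
    assert EX2 >= EX * EX - 1e-12
print("Cauchy-Schwarz verified on 100 random instances")
```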

This paper presents 12 different proofs of the Cauchy-Schwarz inequality, and this web page has 10.

## Thursday, 6 February 2014

### Non-transitive dice

In a comment on Lecture 9's blog post someone has alerted me to the fact that a set of 3 non-transitive dice can be bought from Maths Gear. They have sides like this:

Die A: 333336
Die B: 222555
Die C: 144444   $P(A>B)=P(B>C)=21/36$ and $P(C>A)=25/36$.

These are optimal, in that they maximize the smallest winning probability. However, there are other dice for which the sum of the 3 winning probabilities is greater.

Die A: 114444
Die B: 333333
Die C: 222255   $P(A>B)=P(B>C)=24/36$ and $P(C>A)=20/36$.

Following some links on that page I have found this interesting article by James Grime (of the Millennium Mathematics Project) in which he says lots more about non-transitive dice. He has devised a set of 5 non-transitive dice. In another interesting article there is a video in which James discusses non-transitive dice with David Spiegelhalter.

Reading this made me wonder more about optimal designs. If dice were 2-sided (i.e. coins with a number on each side) then non-transitivity clearly is impossible. But what about 3-sided dice? Is a set of three 3-sided non-transitive dice possible? I quickly found one: A: 333, B: 225, C: 441. Is there a set of 4 non-transitive 3-sided dice? What is the greatest number of 6-sided non-transitive dice possible? Is it 5, as found by James Grime, or could we have 6 non-transitive dice?
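Claims like these are easy to check by brute force. A short sketch (the helper name is mine):

```python
from itertools import product

def p_beats(x, y):
    # P(X > Y) when each face of each die is equally likely
    wins = sum(a > b for a, b in product(x, y))
    return wins / (len(x) * len(y))

A, B, C = [3,3,3,3,3,6], [2,2,2,5,5,5], [1,4,4,4,4,4]   # the Maths Gear set
print(p_beats(A, B), p_beats(B, C), p_beats(C, A))       # 21/36, 21/36, 25/36

A3, B3, C3 = [3,3,3], [2,2,5], [4,4,1]                   # the 3-sided set above
print(p_beats(A3, B3), p_beats(B3, C3), p_beats(C3, A3))
```

The same function makes it easy to search for other non-transitive sets.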

James Grime has now added a comment below.

## Wednesday, 5 February 2014

### Expectations of sums and products

In Theorem 8.1 (expectation of a sum) and Theorem 9.2 (expectation of a product)  we proved results for a finite number of r.vs, $X_1,\dotsc,X_n$. In general, the corresponding results for a countably infinite number of r.vs are not true. You might like to think about why the proofs do not work if $n$ is replaced by $\infty$. Can you find $X_1,X_2,\dotsc$ such that the following is not true?
$E\sum_{n=1}^\infty X_n = \sum_{n=1}^\infty EX_n$
Here is a hint. There do exist special cases for which the above is true, such as when $\sum_{n=1}^\infty E|X_n|$ converges.
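If you get stuck, here is one standard construction (a sketch; the choice of $g_n$ is mine). Let $U\sim U[0,1]$, let $g_n(u)=n\,1\{u<1/n\}$, and set $X_n=g_n(U)-g_{n+1}(U)$. The partial sums telescope:
$\sum_{k=1}^n X_k=g_1(U)-g_{n+1}(U)\to 1$ almost surely,
since eventually $U\geq 1/(n+1)$. So $E\sum_{n=1}^\infty X_n=1$. But $Eg_n(U)=n\cdot\frac{1}{n}=1$ for every $n$, so $EX_n=0$ and $\sum_{n=1}^\infty EX_n=0$. Consistent with the hint, $E|X_n|=\frac{2}{n+1}$, so $\sum_n E|X_n|$ diverges.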

### Lecture 9

The account of Warren Buffett, Bill Gates and Efron's dice that I gave in lectures can be found here.

The "Names in Boxes Problem", which I talked about today, is a puzzle which I first heard described around a lunch table at AT&T Laboratories, New Jersey, around 2003. The problem was posed in a paper, with a less catchy title:

Anna Gál and Peter Bro Miltersen. The Cell Probe Complexity of Succinct Data Structures.
Proceedings of the 30th International Colloquium on Automata, Languages and Programming (ICALP) 2003, 332-344.

It won a best paper award at this ICALP conference. The solution was found by Sven Skyum, a colleague of Miltersen's at the University of Aarhus. If you would like to read more about the history of the problem, I recommend this paper: The Locker Problem (by Max Warshauer, Eugene Curtin, Michael Kleber, Ravi Vakil, The Mathematical Intelligencer, 2006, Volume 28, Issue 1, pp 28-31).

Someone asked me today about whether $1-\log 2\approx 0.30$ is the best possible success probability for the prisoners. I think this question is answered "yes" by the discussion on page 30 of The Locker Problem.

The puzzle is listed as #1 in a collection compiled by Peter Winkler called, "Seven Puzzles You Think You Must Not Have Heard Correctly". I think it is remarkable that we can work out all the needed facts about cycle lengths in a random permutation by using only the simple ideas from probability that we have encountered in our course thus far.
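The cycle-following strategy (each prisoner opens the box bearing his own number, then the box numbered by the name found inside, and so on) succeeds exactly when the random permutation of names into boxes has no cycle longer than 50. A quick Monte Carlo sketch (the sample size and seed are my own choices):

```python
import math
import random

random.seed(2014)

def longest_cycle(perm):
    # length of the longest cycle in a permutation given as a list
    seen = [False] * len(perm)
    best = 0
    for i in range(len(perm)):
        if not seen[i]:
            length, j = 0, i
            while not seen[j]:
                seen[j] = True
                j = perm[j]
                length += 1
            best = max(best, length)
    return best

n, trials = 100, 20000
wins = 0
for _ in range(trials):
    perm = list(range(n))
    random.shuffle(perm)
    # the strategy succeeds iff no cycle exceeds n/2
    wins += longest_cycle(perm) <= n // 2
print(wins / trials, 1 - math.log(2))   # both come out near 0.31
```

The estimate lands close to $1-\log 2\approx 0.31$ (for $n=100$ the exact value is slightly larger).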

Peter is a great collector of good puzzles. He has some nice ones that are geographic. How about these (answer without looking at a map):
1. If you travel due south from Miami, Florida, what part of the South American continent do you first hit?
2. What two cities, one in North America and one in Africa, are closest?
I have been told that Question 2 in "Seven Puzzles You Think You Must Not Have Heard Correctly" has been used by some colleagues as an entrance interview question. But I think that is rather tough!

## Tuesday, 4 February 2014

### Notes for Lectures 1–24

I have a pretty clear idea how the rest of the course should be put into lectures. So I have done a bit more work on the notes and am posting everything I have prepared through Lecture 24.

There are some gaps, which I will fill before we come to each lecture, and I will do some more proof-reading and cosmetics. However, even if there are occasional typos and changes to be made, these notes are close to the eventual finished product. So I think it might help you to have these now — particularly for those of you whose supervisions on Sheets 2, 3, 4 arise sooner than we reach that point in lectures.

## Monday, 3 February 2014

### Problem solving strategies

Students sometimes say that the theory in Probability IA is all very well, but that the tripos questions require ingenuity. That is sometimes true, of questions like 2004/II/12: "Planet Zog is a sphere with centre O. A number N of spaceships land at random on its surface. ... ". Of course this makes the subject fun.

But to help with this, let's compile a collection of some frequently helpful problem solving strategies. More will be added as the course progresses.
1. You are asked to find $P(A)$. For example, $A$ might be the event that at least two out of $n$ people share the same birthday. Might it be easier to calculate $P(A^c)$? The simple fact that $P(A)=1-P(A^c)$ can sometimes be remarkably useful. For example, in Lecture 1 we calculated the probability that amongst $n$ people no two have the same birthday.
2. You are asked to find $P(A)$. Is there a partition of $A$ into disjoint events, $B_1,B_2,\dotsc, B_n$, so that $P(A)=\sum_i P(B_i)$?
3. Probabilities of intersections are usually easier to calculate than probabilities of unions. Fortunately, the inclusion-exclusion formula gives us a way to convert between the two. You are asked to find $P(A)$.  Can $A$ be written as the union of some other events? Might the inclusion-exclusion formula be helpful?
4. The expectation of a sum of random variables is the sum of their expectations. The random variables need not be independent. This is particularly useful when applied to indicator random variables. You are asked to find $EN$, the expected number of times that something occurs. Can you write $N=I_1+\cdots+I_k$, where this is a sum of indicator variables, each of which is concerned with a distinct way in which $N$ can be incremented? This idea was used in Example 8.3 in lectures, where $N$ was the number of couples seated next to one another. It is the way to do the Planet Zog question, mentioned above. You can use it in Examples Sheet 2, #11.
5. You are asked to place a bound on some probability. Can you use one of the inequalities that you know? Boole, Bonferroni, Markov, Chebyshev, Jensen, Cauchy-Schwarz, AM-GM. The ones in bold are key ones.
6. You are asked something about sums of independent random variables. Might a probability generating function help? For example, suppose $X$ has the discrete uniform distribution $P(X=r)=1/(n+1)$, $r=0,1,\dotsc,n$. Is it possible for $X$ to have the same distribution as $Y_1+Y_2$, where $Y_1$, $Y_2$ are some two independent random variables with the same distribution? Hint: what would the p.g.f. of $Y_i$ have to be?
7. This is like 2 above, but with the tower property of conditional expectation. You are asked to find $EX$. Maybe there is some $Y$ so that $E[X]$ is most easily computed as $E[E[X\mid Y]]=\sum_y E[X \mid Y=y]\,P(Y=y)$.
8. It is often useful to use what you know about the relations between different standard distributions. For example, if $X\sim U[0,1]$ then $-\log X\sim \mathscr{E}(1)$. So make sure to learn well all the distributions that we cover in the course and understand the relations between them. Recall their special properties, such as the memoryless property of the geometric and exponential distributions. As you approach a question ask yourself, what distribution(s) are involved in this question?
9. You are given the joint density function of continuous random variables $X_1,\dotsc,X_n$ and want to prove that they are independent. Try to spot, by inspection, how to factor this as $f(x_1,\dotsc,x_n)=f_1(x_1)\cdots f_n(x_n)$, where each $f_i$ is a p.d.f.
10. In questions about transformation of continuous random variables there are a couple of things to look out for. Always start with a bijection between $R\subseteq\mathbb{R}^n$ and $S\subseteq\mathbb{R}^n$ and make sure that you specify $S$ correctly. If $Y_1,\dotsc,Y_n$ includes more variables than really interest you, you can always integrate the unneeded ones out at the end. In computing the Jacobian $J$, remember that it is sometimes easier to compute $1/J= \partial(y_1,\dotsc,y_n)/\partial(x_1,\dotsc,x_n)$.
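Two of these strategies are easy to sketch in code: strategy 1 on the birthday problem, and strategy 4 on a simplified variant of Example 8.3 (my own variant: $n$ couples seated at random in a row of $2n$ seats; with indicators $I_j=1\{\text{couple }j\text{ adjacent}\}$ and $P(\text{couple }j\text{ adjacent})=1/n$, we get $EN=1$, which a simulation confirms):

```python
import random

# Strategy 1: P(at least two of n people share a birthday) = 1 - P(none shared)
def birthday(n, days=365):
    p_none = 1.0
    for k in range(n):
        p_none *= (days - k) / days
    return 1 - p_none

print(round(birthday(23), 4))   # about 0.5073

# Strategy 4: Monte Carlo check that EN = 1 for n couples in a row of 2n seats
random.seed(0)
n, trials = 5, 40000
total = 0
for _ in range(trials):
    seats = list(range(2 * n))      # persons 2j and 2j+1 form couple j
    random.shuffle(seats)
    total += sum(abs(seats.index(2 * j) - seats.index(2 * j + 1)) == 1
                 for j in range(n))
print(total / trials)               # close to 1
```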

### Lecture 8

Zipf's law.
As a small digression, or "fun fact", I told you today about Zipf's law. This is a probability mass function such that $P(X=i)\propto 1/i^s$, $i=1,2,\dotsc,m$, for some $s > 0$, typically $s=1$. It's certainly mysterious why this distribution should be a good fit to the frequencies with which words are used in a language. According to the Wikipedia article linked above, "Zipf himself proposed that neither speakers nor hearers using a given language want to work any harder than necessary to reach understanding, and the process that results in approximately equal distribution of effort leads to the observed Zipf distribution". My guess is that he is thinking that retrieving the $i$th most popular word from your memory has a cost proportional to $i$ (as if you were digging down into a list), and that the average workload each word puts on the brain should be constant, so $ip(i)$ should be constant. But this is clearly hand-waving. Maybe you can think of a better explanation.

In queueing theory and other theories of congestion, delays and long waiting times can be caused by variability in service times. Suppose customers enter a queue at a constant rate of 1 every 5 minutes. They are served one at a time, and service times are independent, taking $1,2,\dotsc$ minutes with probabilities $p_1,p_2,\dotsc$, respectively.

An interesting fact (researched extensively during the past 30 years) is that there are large qualitative differences between how such queues behave when $p_i$ decays geometrically or faster (as for the Poisson distribution, or the geometric distribution, which has $p_i= q^ip$), and when it decays according to a power law, like $p_i\propto 1/i$ or $\propto 1/i^{0.9}$.
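The difference in tail behaviour is striking even before any queueing analysis. A sketch comparing $P(\text{service time} > t)$ for the two kinds of decay (the truncation point $M$ and parameter values are my own choices for illustration):

```python
# Compare tail decay of a geometric service-time distribution with a
# power-law one, both truncated at M minutes
M = 10**6
q = 0.8                                   # geometric: p_i proportional to q**i
geo = [q**i for i in range(1, M + 1)]
power = [1 / i**0.9 for i in range(1, M + 1)]

def tail(ps, t):
    # P(service time > t) for weights ps (normalized here)
    return sum(ps[t:]) / sum(ps)

for t in [10, 50, 100]:
    print(t, tail(geo, t), tail(power, t))
```

By $t=100$ the geometric tail is astronomically small, while the power-law tail is still of order one; it is this persistent mass far out in the tail that drives the qualitatively different queue behaviour.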

## Sunday, 2 February 2014

### The “sexiest job of the 21st century”

Probability lies at the foundations of statistics. In today's Sunday Times (2 February, 2014) there is an article entitled "Mystic Nate Silver bets on making statistics 100% sexy". Here are a few quotes from it
NATE SILVER, an American mathematician who predicts elections and sport results with uncanny accuracy, is poised to do what fellow number-crunchers would have thought 101% impossible: make the application of statistics “cool”.

Silver, who made his name by predicting the precise results of both of Barack Obama’s presidential wins, is turning his FiveThirtyEight blog into a fully fledged online magazine this month with backing from two of America’s biggest broadcasters.

The Harvard Business Review recently described data scientists as having the “sexiest job of the 21st century” — not bad for a subject once derided as a profession for people without the personality to become accountants. “Their sudden appearance on the business scene reflects the fact that companies are now wrestling with information that comes in varieties and volumes never encountered before,” it said.

The US Bureau of Labour Statistics predicts a 14% surge in demand for statisticians over the next eight years.

The starting salary for a graduate statistician at Google is \$125,000. Hal Varian, the internet giant’s chief economist, said: “I keep saying the sexy job in the next 10 years will be statisticians. People think I’m joking, but who would have thought computer engineering would have been the sexy job of the 1990s?”

John Harlow, Los Angeles, Published: 2 February 2014

### More lecture notes

I am now posting notes up to Lecture 12.  Most of Examples Sheet 2 is on topics we will have covered up to Lecture 9. However, in Question 9 you need the definition of covariance, and this comes in Lecture 10. Remember also that if you want to read ahead there are other notes you can access (listed on the course information page). Here are some other things.
• Corrections. I am continuing to receive help from students and others in finding small typos in the notes. By now the notes for Lectures 1-9 should be close to their perfect final form. If anyone wants to read ahead and sees any typos in Lectures 10-12, please let me know and I can correct them before we reach that point.
• My handwriting. A student asked me to write more clearly. Fair comment. Since receiving that comment I have been trying to pay attention to this. I hope my writing on the visualiser is improving.
• Visualiser (document camera). A student has written that the projection from the camera is dim and asked if it can be brighter. Have others a problem with the brightness? I will investigate. I think the visualiser is nice for a number of reasons. I can always be facing you (unlike blackboards), and it is centrally placed to give everyone a good view (unlike the two overhead projectors). Some mathematicians like chalkboards. But I find them to be very messy and in the past I have had bad experiences with dust getting in my eyes.
You can write suggestions below, anonymously if you wish (but then I cannot reply to you). Comments are moderated by me before appearing. I will "read, mark, learn and inwardly digest" anything you want to tell me about the above points, or other things. When the message is in the spirit of a 1-1 communication I will not release the post for public view.