Probability

Lecture 24

2015-03-11T12:15:00.000+00:00

The overhead slides I used for our last lecture are here (best downloaded and viewed with a pdf reader, rather than as a web page).

I showed you a bit of Large deviation theory and Chernoff's upper bound. The lower bound was quoted without proof (except that in Example 24.1 we did the lower bound for the special case of $B(1,p)$). If perhaps some time mid-summer you have a moment and are curious to see the lower bound proved (which you have learned enough to understand), and more about large deviations, see these slides from a talk I once gave, pages 5–9. This note on the Cramer-Chernoff theorem is also good. Interestingly, Chernoff says in an interview with John Bather that the Chernoff bound should really be named after someone else!

I have used large deviation theory in my research to analyse buffer overflows in queues. But I know almost nothing about the subject of Random matrices. I prepared the content for Section 24.2 because I was intrigued by the results, wanted to learn, and thought it would be something entertaining with which to conclude. I did some of my reading in Chapter 2 of An Introduction to Random Matrices, by Anderson, Guiomet and Zeotouni, and then simplified their treatment to make it suitable for IA.

I thought it was nice that in showing you these two advanced topics, I could bring into play so many of the ideas we have had in our course: Markov and Chebyshev inequalities, moment generating function, sums of Bernoulli r.vs, Stirling’s formula, normal distribution, gambler’s ruin, Dyke words, generating functions, and the Central limit theorem.

So I hope you will consider the applicable courses in your next year of study. In Appendix H I have written some things about applicable mathematics courses in IB.

Lecture 23

2015-03-09T08:17:00.000+00:00

Lévy's continuity theorem is the same thing as the continuity theorem given in today's lecture, but for characteristic functions.

Here is a little more history about the Central Limit Theorem.

Henk Tijms writes in his book, Understanding Probability: Chance Rules in Everyday Life, Cambridge: Cambridge University Press, 2004,

"The Central Limit Theorem for Bernoulli trials was first proved by Abraham de Moivre and appeared in his book, The Doctrine of Chances, first published in 1718. He used the normal distribution to approximate the distribution of the number of heads resulting from many tosses of a fair coin. This finding was far ahead of its time, and was nearly forgotten until the famous French mathematician Pierre-Simon Laplace rescued it from obscurity in his monumental work Théorie Analytique des Probabilités, which was published in 1812.

De Moivre spend his years from age 18 to 21 in prison in France because of his Protestant background. When he was released he left France for England, where he worked as a tutor to the sons of noblemen. Newton had presented a copy of his Principia Mathematica to the Earl of Devonshire. The story goes that, while de Moivre was tutoring at the Earl's house, he came upon Newton's work and found that it was beyond him. It is said that he then bought a copy of his own and tore it into separate pages, learning it page by page as he walked around London to his tutoring jobs. De Moivre frequented the coffeehouses in London, where he started his probability work by calculating odds for gamblers. He also met Newton at such a coffeehouse and they became fast friends. De Moivre dedicated his book to Newton."

The Wikipedia article on the Central limit theorem mentions two things that would be suitable for the television programme QI.

1. There is a quite interesting explanation of why the term "Central" is used.

The actual term "central limit theorem" (in German: "zentraler Grenzwertsatz") was first used by George Pólya in 1920 in the title of a paper. Pólya referred to the theorem as "central" due to its importance in probability theory. According to Le Cam, the French school of probability interprets the word central in the sense that "it describes the behaviour of the centre of the distribution as opposed to its tails".

Personally, I had always thought it was the second of these, but the first is also plausible.

2. There is a quite interesting Cambridge connection.

A proof of a result similar to the 1922 Lindeberg CLT was the subject of Alan Turing's 1934 Fellowship Dissertation for King's College at the University of Cambridge. Only after submitting the work did Turing learn it had already been proved. Consequently, Turing's dissertation was never published.

Lecture 22

2015-03-06T12:05:00.000+00:00

You might like to experiment with the bivariate normal. Here is the Mathematica code that I used to plot the joint density function. You can copy and paste this code into Mathematica and they play with viewing from different angles or changing the value of the correlation, r.

S = ParallelTable[
  Plot3D[PDF[
    MultinormalDistribution[{1, 1}, {{1, r }, {r , 1}}], {x, y}],
    {x, -3, 4}, {y, -3, 4}, PlotRange -> All, 
   MeshFunctions -> {#3 &}, PlotPoints -> 50], {r, {0, 0.6}}]

You can also get the code here, and for Kelly betting here.

Inversion of p.g.f and m.g.f.

2015-03-06T12:00:00.000+00:00

For discrete random variables working with the p.g.f. is easy. We have

Theorem 12.2. The distribution of $X$ is uniquely determined by the p.g.f. $p(z)$.

The proof is trivial because the distribution of $X$ is contained in coefficients of the power series $p(z)=p_0+p_1z+p_2z^2+\cdots$. It is also easy to tell if any given power series is a p.g.f. All we need is check is that the coefficients are nonnegative and sum to 1.

However, things are not so easy for moment generating functions. Given a power series $m(\theta)=1+m_1\theta + m_2\frac{1}{2}\theta^2 +\cdots$
it is not easy to tell if there exists a r.v. that has moments $m_1,m_2,\dotsc$.
Also we had without proof the theorem analogous to Theorem 12.2,

Theorem 21.4 The moment generating function determines the distribution of $X$, provided $m(\theta)$ is finite for all $\theta$ in some interval containing the origin.

To prove Theorem 21.4 we would need to show how to recover the p.d.f. from the m.g.f.
In fact, it is easier to use the characteristic function. Let's take a small look at that.

In Lecture 22 we saw that the moment generating function of $N(0,1)$ is $m(\theta)=e^{\theta^2/2}$. This looks a lot like the p.d.f., which is $f(x)=e^{-x^2/2}/\sqrt{2\pi}$.

The characteristic function (c.f.) $\chi$ is obtained by replacing $\theta$ with $it$, $t$ real,

$\chi(t) = \int_{-\infty}^\infty f(x)e^{itx} dx$. This is $\chi(t)=e^{-t^2/2}$ for $N(0,1)$.

We can compute nearly the same integral of $\chi$ to recover $f$:

$\int_{-\infty}^\infty \chi(t)e^{-itx} dt=2\pi e^{-x^2/2}=2\pi f(x)$.

In general, given the c.f. we can recover the p.d.f. using the inversion formula

$f(x)=\frac{1}{2\pi}\int_{-\infty}^\infty \chi(t)e^{-itx} dt$.

The m.g.f. exists only if all moments are finite. However, the c.f. exists even when moments do not exist, since $e^{itX}$ is just some point on a unit circle in the complex plane, and hence nothing is going off to infinity. The Cauchy distribution is an example of a distribution that has no m.g.f, but does have a c.f., namely $e^{-|t|}$.

Feedback

2015-03-06T10:25:00.000+00:00

On Friday March 6, I distributed the usual feedback forms. Alternatively, you can complete this equivalent on-line form, or print the paper form here and hand it to me. If you prefer, you may wait until closer to the last lecture to complete the on-line form or to hand me a paper form.

The on-line version will be sent to my email anonymously (unless you choose to sign your name). After reading the responses, I will forward printouts to the Faculty Office.

Examples Sheet 3

2015-03-06T10:20:00.000+00:00

Most of you will have now done Examples Sheet 3. Here are a few notes.

#4. You'll want to compare this with Examples sheet 4 #5.

#5. What would be the analogous question to this one, posed for continuous r.vs?

#8. You should know how to answer this two ways: by using p.g.fs, and also directly via that tower property of conditional expectation.

#16. How do we know that the answer is $a_z=a^z$, where $a$ the smallest positive root of $a=pa^3+1-p$? We have such a theorem for the extinction probability in a branching processes. How can this coin tossing problem be viewed through the lens of some branching process?

#17. If you need a hint, here is one: $P(|X+Y|>\epsilon)\leq P(|X|\geq \epsilon/2)+P(|Y|\geq \epsilon/2)$.

#18. The hint is for the third part. But it also useful in answering the second part: "Could you load the die so that the totals {2,3,4,5,6,7} are obtained with equal probabilities?"

Planet Zog problems

2015-03-06T10:14:00.002+00:00

Here are two challenging tripos questions from recent years. The first is a problem in geometric probability (as in Lecture 18). To do this question it will help to know that the surface area of a spherical cap is $2\pi R h$, where $R$ is the radius of the sphere and $h$ is the height of the cap. The surface area of the entire sphere is $4\pi R^2$.

2003/II/12
Planet Zog is a ball with centre $O$. Three spaceships $A$, $B$ and $C$ land at random on its surface, their positions being independent and each uniformly distributed on its surface. Calculate the probability density function of the angle $AOB$ formed by the lines $OA$ and $OB$.

Spaceships $A$ and $B$ can communicate directly by radio if $\angle AOB < \pi /2$, and similarly for spaceships $B$ and $C$ and spaceships $A$ and $C$. Given angle $\angle AOB =\gamma < \pi /2$, calculate the probability that $C$ can communicate directly with either $A$ or $B$. Given $\angle AOB =\gamma > \pi /2$, calculate the probability that $C$ can communicate directly with both $A$ and $B$. Hence, or otherwise, show that the probability that all three spaceships can keep in in touch (with, for example, $A$ communicating with $B$ via $C$ if necessary) is $( \pi +2)/(4\pi )$.

Spoiler alert! Here is the answer.

The second question is rather different. Here you are being tested on the idea that a sample space can be partitioned into disjoint events.

2004/II/12
Let $A_1,A_2,\dotsc,A_r$ be events such that $A_i\cap A_j = \emptyset$; for $i \neq j$. Show that the number $N$ of events that occur satisfies

$ P(N = 0) = 1 −\sum_{i=1}^rP(A_i) . $

Planet Zog is a sphere with centre $O$. A number $N$ of spaceships land at random on its surface, their positions being independent, each uniformly distributed over the surface. A spaceship at $A$ is in direct radio contact with another point $B$ on the surface if $\angle AOB < \pi/2$. Calculate the probability that every point on the surface of the planet is in direct radio contact with at least one of the $N$ spaceships.

[Hint: The intersection of the surface of a sphere with a plane through the centre of the sphere is called a great circle. You may find it helpful to use the fact that $N$ random great circles partition the surface of a sphere into $N(N − 1) + 2$ disjoint regions with probability one.]

Lecture 21

2015-03-04T12:15:00.000+00:00

I mentioned the beta distribution today. There is more about that distribution in Appendix D. I became much better acquainted with this distribution myself when I was doing some work for a reinsurance company in 2013.

Lecture 20

2015-03-02T12:15:00.000+00:00

Today we saw how to transform random variables. By this means we can discover further interesting relationships between distributions.

There is a particularly strange fact in Examples sheet 4, #19.

By the way, if (like me) you have ever been puzzled why it is that the absolute value of the Jacobian, $|J|$, gives the volume change of a $n$-dimensional parallelepiped under an linear transformation in $\mathbb{R}^n$, you might enjoy reading this essay A short thing about determinants, or “an attempt to explain volume forms, determinants and adjugates by staying in 3-d”, by Gareth Taylor.

I have also now added to the notes an Appendix C, in which I explain how under a linear transformation a sphere is mapped into an ellipsoid of a volume that greater by a factor $|J|$. I think that in previous years lecturers have said at this point "you already know this from Vector Calculus". But I am not sure that's that case, and so decided it would be nice to try to remove some of the mystery.

The convolution we need to add two Cauchy random variables is tricky. On page 81, the partial fractions method of computing this integral is to write

$\frac{1}{(1+x^2)(1+(z-x)^2)}=\frac{A+Bx}{(1+x^2)}+\frac{C+D(x-z)}{(1+(z-x)^2)}$

for $(A,B,C,D)=\frac{1}{4+z^2}(1,2/z,1,2/z)$. Upon integrating w.r.t. $x$ the terms with $B$ and $D$ give $0$ (by symmetry either side of $0$), and the terms with $A$ and $C$ provide the claimed result of $\frac{2}{\pi(4+z^2)}$.

Using the ideas in today's lectures, one can show that if $X,Y$ are independent $N(0,1)$ then $W=X^2+Y^2$ and $T=\tan^{-1}(Y/X)$ are independent $\mathscr{E}(1/2)$ and $U[-\pi/2,\pi/2]$. We can use this result in the reverse direction to generate normal random variables. Make two independent samples from $U[0,1]$, say $U$ and $V$. Let $R^2=-2\log U$ (which is $\mathscr{E}(1/2)$) and let $ \Theta=(V-\frac{1}{2})\pi$. Then $X=R\cos\Theta$ and $Y=R\sin\Theta$ are independent $N(0,1)$.

Lecture 19

2015-02-27T12:20:00.000+00:00

The fictional example in Section 19.2 about the heights of UK and Netherlands basketball players showed something striking. Can you think of any real life instances of this sort of effect?

The graphs on page 7 of this report on climate change are interesting in respect of showing how a small change in the mean can give rise to an amplified impact in the tail of a normal distribution. See Full SREX report: Managing the Risks of Extreme Events and Disasters to Advance Climate Change Adaptation, Special Report of the Intergovernmental Panel on Climate Change.

In this graph we see "hot weather" increasing by a factor of 2.

But "extreme hot weather" increases even more, by a factor of 3

Lecture 18

2015-02-25T12:15:00.000+00:00

The Wikipedia article on the Bertrand paradox has more interesting information and pictures. The is also a Wolfram demonstration project called the Random chord paradox that you can play with.

Examples sheet 4, #2 is this classic question: "A stick is broken in two places, independently uniformly distributed along its length. What is the probability that the three pieces will make a triangle?" For a more a more complicated geometric probability problem see the blog posting on Planet Zog problems.

Table of distributions

2015-02-23T13:00:00.000+00:00

Here is a table of 5 discrete and 6 continuous distributions. It is supplied to second year students studying Statistics IB, as a reminder of things they are to know from Probability IA.

For each of distribution the table shows its usual notation, p.m.f. or p.d.f., mean, variance, and p.g.f. or m.g.f.

You will want to have at your fingertips most of the information in this table (or the method to by which to produce it). You might find it helpful to make your own table, using this one as a guide.

Some of these 11 distributions are not explicitly mentioned in the schedules for Probability IA. Rather they appear as examples of other things which are mentioned in the schedules. For example, the gamma distribution arises as an example of summing i.i.d. r.vs (in this case exponential r.vs). The multivariate normal provides an example of the p.d.f. of jointly distributed continuous r.vs. The beta distribution appears as an exercise on the joint distribution of order statistics of i.i.d. uniform r.vs. We have not mentioned in a lecture the negative binomial, but this is simply the sum of i.i.d. geometric r.vs. (as done in Examples sheet 3, #5).

Lecture 17

2015-02-23T12:15:00.000+00:00

An important special case of Theorem 17.1 is when $Y=aX$, for some constant $a$.
The p.d.f. of $Y$ is then $f_Y(y)=(1/a)f_X(y/a)$.

There are many ways to define transitive ordering relations between random variables. Now you know about $EX\geq EY$ and $EX\geq_{\text{st}} EY$. Such relationships are important in applications, such as when $X$ and $Y$ are lifetimes of components and we are interested ordering their reliabilities. (Reliability engineering is concerned with determining the ability of a system or component to function under some stated conditions for a specified period of time.)

There are other ordering relations, such as "hazard rate ordering", and "convex ordering". These are described within the Wikipedia article on stochastic ordering. I explained in the lecture that $EX\geq_{\text{hr}} EY\implies EX\geq_{\text{st}} EY\implies EX\geq EY$. The first implication is from
\[
1-F(x) = \exp\left(-\int_0^x h(t)dt\right)
\]and the second from Theorem 17.4. An important order relation in statistics is "likelihood ratio ordering". If $X$ and $Y$ have p.d.fs $f$ and $g$ such that $f(t)/g(t)$ increases with $t$ then $X\geq_{\text{lr}}Y$. It is a stronger ordering than the two we have met, since $X\geq_{\text{lr}}Y\implies X\geq_{\text{st}}Y$. For example, if $X$ and $Y$ are exponential random variables with parameters $1/2$ and $1$, then $X\geq_{\text{lr}}Y$

The treatment of the inspection paradox is this lecture derived from a paper by Sheldon Ross, The inspection paradox, Probability in the Engineering and Informational Sciences, 17, 2003, 47–51. Of course the use of the word "paradox" is somewhat in appropriate, since it is not hard to understand why $X_J\geq_{\text{st}}X_1$.

There are other such inspection paradoxes. Suppose $X_1,X_2,\dotsc$ are i.i.d. $U[0,1]$ and $N$ is the least integer such that $X_1+\cdots+X_N\geq 1$. Then conditional on this,

$X_N\geq_{\text{st}}X_{N-1}\geq_{\text{st}}\cdots \geq_{\text{st}}X_1$.

I did not discuss in the lecture the coda at the end of page 69. I leave you with this puzzle, "Do girls have more brothers than boys do?" The "proof" that the answer is "yes" is false, though initially it looks almost convincing. What is wrong?

What do you think is the truth of the matter? Let $G$ and $B$ be the mean number of brothers possessed by a typical girl and typical boy, respectively.

If every family were to consists of some number of boys and just one girl then $G > B$.

If every family were to consists of only boys or only girls then $G < B$.

What are the minimal conditions under which $G=B$? Must you assume that each child is independently a boy or girl with probability $1/2$?

Probability distributions

2015-02-20T12:35:00.000+00:00

The chart I showed you, with the distributions we are studying marked in yellow, can be downloaded here. It is from the paper Univariate Distribution Relationships, L. M. Leemis and J. T. McQueston. In that paper you can find explanations of the various relationships that are indicated by the arrows.

Here also is a simpler chart about Relationships between probability distributions. This one shows some relations that the above chart misses out, such as how the binomial can come from the Poisson as $X \mid X+Y$ (see Examples sheet 2, #5).

Wikipedia has a nice List of probability distributions.

In 16.5 of the notes I mention that if $X$ and $Y$ are i.i.d. exponential r.vs then $X/(X+Y)$ is a uniformly distributed r.v. You will prove this in Examples sheet 4, #8. A general method will be given in Lecture 20. But it can also be done this way:

$P(X/(X+Y) \leq t)=P(X(1-t)/t\leq Y)=\int_{x=0}^\infty\left(\int_{y=x(1-t)/t}^{\infty}\lambda e^{-\lambda y}dy\right)\lambda e^{-\lambda x}dx = t$

which is the c.d.f. of $U[0,1]$.

Lecture 16

2015-02-20T12:30:00.000+00:00

The human life plots of force of mortality that I showed in today's lecture can be viewed on the Understanding Uncertainly page: How long are you going to live?

David Spiegelhalter (Winton Professor for the Public Understanding of Risk) makes this comment:

When 7 years old, there's only 1 in 10,000 chance of not making your 8th birthday - this is the safest that anyone has been, anywhere, ever, in the history of humanity. But the jump from 16 to 17 is the biggest relative increase in risk in your life (all this, of course, assumes you're utterly average).

But I think the most amazing thing is the old Gompertz observation of log-linearity - the log-hazard is quite extraordinarily straight over such a long range (it would be 7 to 90 if it were not for idiotic youth). So, on average, every year older means an extra 9% chance of dying before your next birthday - so your force of mortality doubles every 8 years, like a nasty compound interest. And it will get you in the end.

He makes reference to the Gompertz distribution, which is one with a linearly increasing hazard rate.

Notice that the hazard rate curve (force of mortality) has a 'bath-tub' shape - there is an early high-risk period immediately following birth, followed by long low period of annual risk, until accidents start kicking in for 17-year-old-boys, and then steadily increases. At the above link there are some tools for interactively exploring how your life expectancy changes as a function of your present age, gender and lifestyle (non-smoker, moderate alcohol, 5 a day, exercise).

Most things (such as car engines, tyres, light bulbs) have an increasing hazard rate. But some things have a decreasing hazard rate. A business that has lasted two centuries is less likely to go bankrupt next year than one that only started two years ago.

Notice that if $X\sim U[0,1]$ then $h(x)=f(x)/(1-F(x))=1/(1-x)$, which tends to $\infty$ as $x\to 1$.

Some fine fine print. I said that a continuous random variable is one that has a continuous c.d.f., and it is this that puts the "continuous" in its name. However, the fine print (which you can find in my notes near the middle of page 63) is that we actually need the c.d.f. to absolutely continuous, and this will ensure that a p.d.f. exists. The qualifier "absolutely" rules out weirdly pathological c.d.fs. For example, the Cantor distribution has a continuous c.d.f., but it has no p.d.f. Of course none of this fine print is examinable in Part IA. But I know that some of you find this type of thing interesting.

Simulation of a random walk

2015-02-18T12:35:00.000+00:00

Here is something for Lecture 15 (on random walk).

Simulating the Simple Random Walk
from the Wolfram Demonstrations Project by Heikki Ruskeepää

Click above to play with a simulation of random walk produced by tossing a fair coin. The "seed" reruns with a different set of random coin tosses. The curves are confidence intervals, within which the walk will lie with high probability. Random walks are very important in physics, biology, and finance.

There are many interesting questions that one can ask about symmetric random walk, i.e. the random paths produced by tossing a fair coin (up 1 for a head, down 1 for a tail).

How many times on-average does a walk of length $n$ cross the $x$-axis?
What is the distribution of the last crossing of the $x$-axis? (recall Lecture 1!)
What is the distribution of the terminal value $X_n$?
As you try different seeds, do you think some walks look more "typical" than others?

Lecture 15

2015-02-18T12:30:00.000+00:00

The Wikipedia page about random walk is worth a look. Today, we focused upon the Gambler's ruin problem. However, there are many more interesting things that can be said about random walks. For example, one can consider random walks on arbitrary graphs. More about random walks will be presented to you in the course Markov Chains, Part IB.

If you wish to ensure you never go bankrupt then rather than bet £1 at each go you could bet £ $fz$, where $0<f<1$. That way you come next to $(1+f)z$ or $(1-f)z$. Kelly betting, is about choosing the optimal value of $f$.

The continuous version of random walk is Brownian motion, which is presented in Stochastic Financial Models, Part II. One of its beautiful properties is that it is self-similar. The graphic below shows how Brownian motion looks when we place it under the microscope and gradually increase the magnification. It is possible to say of Brownian motion in 2 dimensions that "almost surely every arc of the path contains the complete works of Shakespeare in the handwriting of Bacon." (Peter Whittle). Just as we might say in dimension 1 that at some magnifications the path below looks like a graph of FTSE all-share index for the past 5 years.

Simulation of a branching process

2015-02-16T12:30:00.000+00:00

Here is something for Lecture 14 (on branching processes).

Readme first. You can run this demo in your browser, but you will need to install the Wolfram CDF player. It may take a few moments to load, so please be patient. Also, there seems to a problem at present with the plugin running in Chrome, so you might have to use a different browser. The best thing is to download the demo from the link on the page ("Download Demonstration as CDF") and then run the demo in Mathematica on your own computer. Here is a screenshot from the demo.

This Demonstration shows up to six simulated generations of a branching process. The distribution of the number of descendants of each member of the population can have either a Bernoulli, Poisson, or geometric distribution; the parameters of these distributions can be set with sliders. From the three figures, the left one shows the simulated branching process, the upper-right one shows the probability of ultimate extinction as a function of a parameter of the distribution, and the lower-right one shows the probabilities of the size of the population at the selected generation.

To run the demo click on Simulating the Branching Process
[from the Wolfram Demonstrations Project by Heikki Ruskeepää]

Lecture 14

2015-02-16T12:15:00.000+00:00

In this lecture you have had a taste of the theory of branching processes.The calculations we did today in Theorem 14.2 are similar to the sort of thing you need to do in answering Examples Sheet 3, #10, 14, 15. Question #16 does not look like one about branching processes, but you might think about trying to map into a version of a branching process so that you can apply theorems from this lecture.

If you would like to read something more about branching processes, I recommend to A Short Introduction to Branching Processes by Gesine Reinert (a professor of statistics at Oxford).

Branching processes have interesting applications beyond the obvious ones of population biology, e.g. in rumours and advertising. I asked Gesine if she might recommend something beyond the basics. She has written to me, "One application of branching processes is to assess the shortest path length between two randomly chosen vertices in sparse networks; but that may be a bit much for first-year students."

In talking about branching processes with more than one type of individual I was drawing on Gesine's slides 26-28, and 30-32. In these you can see the idea of a multi-type branching process. We see that it is possible to prove a theorem similar to our Theorem 14.3, but now the probability of extinction depends on the maximum eigenvalue of the matrix $M=(m_{ij})$, where $m_{ij}$ is the mean number of offspring of type $j$ produced by a parent of type $i$. The branching process becomes extinct with a probability 1 or $<1$ as the eigenvalue is $\leq1$ or $>1$.

Suppose there are two types. The key quantities are the two multi-dimensional p.g.f.s

$F_i(z_1,z_2)=E_i\Bigl[z_1^{X_{i1}}z_2^{X_{i2}}\Bigr]$, $i=1,2$,

where $X_{ij}$ is a r.v. having the distribution of the number of offspring of type $j$ produced by a parent of type $i$.

Let $q_i$ be the probability of extinction when we start with one individual of type $i$. Then, analogous to Theorem 14.3, $(q_1,q_2)$ are smallest solutions in $[0,1]^2$ to

$q_1 = F_1(q_1,q_2)$
$q_2 = F_2(q_1,q_2)$.

Examples Sheet 2

2015-02-14T10:00:00.000+00:00

Most of you will now be doing Examples Sheet 2. Let me make a few comments about some things I hope you learn from that sheet. Some of what I say below is "in brief', and you may need to fill in details or ask your supervisor.

#3,4. There is often more than one nice way to solve a problem. I think that both of these problems have at least two nice methods of solution. Have you been able to think of more than one way to solve them? For example, #4 can be solved by summing a geometric series. It can also be solved via a simple order 0 recurrence relation.

#6 (ii). You should discover that $Y$ and $X$ are independent. So $X=10$ provides no information about $Y$. What is the intuition for this? Well, we might imagine that $X$ and $Y$ are initially numbers of "detectable" and "non-detectable" misprints, created as independent samples from the Poission distribution with parameter $\lambda/2$. The page has $X+Y$ misprints (which is distributed Poisson, parameter $\lambda$). The proofreader finds the detectable ones. In doing this she discovers nothing about $Y$. This model gives the same distribution of $X$ and $Y$ as one in which there are $X+Y$ misprints and each is detected with probability $1/2$.

#7,8. The results in these questions are often proved by students in something like 2 pages of working. But they can also be set out in perhaps only 2-4 lines of working. Make sure you learn (from your supervisor) the fast way to derive these results.

#10. There are several ways to show that $Y_1,\dotsc,Y_n$ are independent. One way is via the fact that $P(Y_1=1,Y_2=1,\dotsc,Y_n=1)=1/n!=P(Y_1=1)\cdots P(Y_n=1)$. Then apply the result of question #2.

#12. The coupon collector's problem. This question is one that should lodge permanently in your brain. There are many problems in probability, design of computer science algorithms, and elsewhere, which at some point involve this problem. I think it is interesting that the expected time to collect the $n$ figures is about $n\log n$. However, the expected time needed to collect $99\%$ of the figures is only about $n\log 100$. So it is the last fraction of figures that are costly to collect.

#14. Did you notice that this method would work for simulating any probability $\alpha\in (0,1)$? For example, if we want $P(A)=1/5$, we write this in binary as $\alpha=0.0011001100...$ and then toss a coin. If we see TTHHH we would say $A$ has not occurred, because translating this to $.00111$, it differs from the binary form of $\alpha$ by representing a number $>\alpha$.

#16. This questions has been set to IA students for many years. In my experience few students are able to solve it. But the solution is so beautiful that it would be a pity if your supervisor were not given the opportunity to explain it to you.

#17,18. When I say, "This section is for enthusiasts", I am hoping that most students will think "This means me". After all, if you are reading Mathematics at Cambridge you are by definition a mathematics enthusiast. Even if you cannot solve these questions fully you should at least make some attempt. I would expect almost all students to be able to do #17 for the case $k=2$, $n=3$. It is a good general principle to tackle a difficult problem by first trying to solve a simple special case.

Discrete Fourier transforms and p.g.f.s

2015-02-14T09:45:00.000+00:00

In non-examinable Section 13.4 and 13.5 I showed you some moderately sophisticated applications of the simple ideas we are learning about p.g.f. and conditional expectation. I have added an Appendix B to the notes to say a bit more about the fast Fourier transform.

The brief story is this. To find the distribution of $Y$, the sum of two i.i.d. r.vs $X_1$ and $X_2$ taking values in $\{0,1,\dotsc,N-1\}$, we might find its p.g.f. $$p_Y(z)=p(z)^2=(p_0+p_1z+\cdots+p_{N-1}z^{N-1})^2.$$ This would take $O(N^2)$ multiplications of the form $p_ip_j$.

However, it is quicker to evaluate $p(z)$ at $2N$ complex values $z=\omega^0,\dotsc,\omega^{2N-1}$, square those values, and then recover the real coefficients of the polynomial $p(z)^2$. The calculation simplifies because by taking $\omega$ as a $2N$th root of unity, $\omega=\exp(-i\pi/N)$, the calculation of the $p(\omega^k)$, $k=0,\dotsc,2N-1$, can be done by a clever method, the discrete fast Fourier transform, which needs only $O(N\log N)$ multiplications.

Further details are in Appendix B of the notes, and the Part II course, Numerical Analysis.

Lecture 13

2015-02-13T12:15:00.000+00:00

The tower property of conditional expectation (also called the law of iterated expectations) is akin to the law of total probability that we met in Section 6.3. Another name is law of total expectation. A special case is

$E[X]=\sum_{i:P(A_i)>0} E[X \mid A_i]P(A_i)$,

where $A_1,\dotsc,A_n$ are events giving a partition of the sample space $\Omega$. Compare the above to

$E[X]=E\big[E[X \mid Y]\bigr]=\sum_y E[X \mid Y=y]P(Y=y)$.

Sometimes we might write $E\Big[E[X \mid Y]\Bigr]$ as $E_Y\Big[E[X \mid Y]\Bigr]$, just to remind ourselves that the outer expectation is being applied to the variable $Y$. Of course it is implicit in all this that $E[X]$ exists.

Remember that $P(A | B)$ is only defined for $P(B)>0$. If you want to see something that illustrates this in a cute way, then read this nice description of Borel's paradox. The paradox occurs because of a mistaken step in conditioning on an event of probability $0$, which can be reached as a limit of a sequence of events in different ways.

Lecture 12

2015-02-11T12:15:00.000+00:00

Let me remind you that you can install copies of Mathematica and Matlab software. It can be fun to use these to help with an appreciation of things we are doing. Today I used this Mathematica program to calculate some probability generating functions.

(* Roll a fair die 10 times *)
(* What is the probability the sum is 50? *)

p[z_] = (1/6) (z + z^2 + z^3 + z^4 + z^5 + z^6)
SeriesCoefficient[p[z]^10, {z, 0, 50}]
% // N

Series[1/(1 - x - x^2), {x, 0, 10}]

Table[Fibonacci[n], {n, 0, 10}]

Series[(1 - Sqrt[1 - 4 x])/(2 x), {x, 0, 10}]

Table[CatalanNumber[n], {n, 10}]

Mathematica is available here http://www.maths.cam.ac.uk/computing/software/mat...
Matlab is here http://www.maths.cam.ac.uk/undergrad/catam/softwa...

A very nice tripos question on probability generating functions is the following, 2004/2/II/9F. You should now be able to do this (and could discuss in your next supervision)

Hint

We saw today that generating functions are useful in realms beyond probability, as example being that

"the $n$th Fibonacci number, $F_n$, is the coeﬃcient of $x^n$ in the expansion of the
function $1/(1−x−x^2)$ as a power series about the origin."

That is a quote from chapter 1 of Herbert Wilf's fun book called generatingfunctionology, whose second edition is free to download. In his book you can read about many weird and wonderful things that can be done with generating functions. He starts Chapter 1 by writing

"A generating function is a clothesline on which we hang up a sequence of numbers for display."

Generating functions can encapsulate complicated things in a lovely way. For example, the coefficient of $x^n/n!$ in the power series expansion of $e^{e^x-1}$ is the Bell number $B(n)$, i.e. the number of different partitions that can be made from a set of $n$ elements. In Chapter 1 of Wilf's book you can see a derivation of this fact.

Examples Sheet 1

2015-02-11T09:15:00.000+00:00

Most of you will have now done Examples Sheet 1. Let me make a few comments about some things I hope you learn from that sheet.

#7. (about infinite sequence of coin tosses and limsup of events) This was done in Section 4.4 of the notes. An even quicker solution can be done using the fact that $P$ is a continuous function (proved in Section 7.1 of notes).

$P(C)=P(\bigcap_{n=1}^\infty\bigcup_{k=n}^\infty A_k)=\lim_{n\to\infty}P(\bigcup_{k=n}^\infty A_k)\leq\lim_{n\to\infty}(p_n+p_{n+1}+\cdots)=0$.

This was part of the tripose question 2014, 2, II9.

#8. (sets none of which is a subset of the other) Think how this question would look if everything except the first and last sentences of its statement were removed. It would look hard! See how adding a layer of probability helps prove a theorem that seems to have nothing to do with probability. We saw this in Example 4.5 (Erdos's probabilistic method) and we will see this a couple more times in the course, next in Section 11.4 (Weierstrass approximation theorem).

#13 (Mary tosses more heads than John) and #18 (41 bags of red and blue balls). Both these questions have very beautiful solutions. Please make sure your supervisor shows you the beautiful solutions, if you have not found them yourself.

#16 This question relates to what I did in Lecture 10 about information entropy.

#17 (goat ate the cards!) I invented this question as a sort of joke - but I am pleased that it has turned out to be very thought provoking for many students. If the "penny has dropped" for you about this question then you are on your way to developing some good intuition in the field of probability.

Comments are welcome if you have any ideas for good questions on this part of the course.

Lecture 11

2015-02-09T12:15:00.001+00:00

The probabilistic proof of the Weierstrass approximation theorem was to show you how probability theory can be used to prove an important theorem in a different field. We did something similar in Example 4.5 (Erdos probabilistic method in combinatorics) and in Examples Sheet 1, #8. The Wikipedia page on Bernstein polynomials has a list of Bernstein polynomials and an animation showing how they are used to approximate a continuous function on $[0,1]$.

I cheated today by using a result from the schedules for Analysis II ("a continuous function on a closed bounded interval is uniformly continuous"). I had thought about giving an easier proof by not appealing to Analysis II, but I decided it would be nicer to show you the proper full result.

The proof used Chebyshev's inequality and illustrated a nice idea that is frequently used in probability calculations. We partitioned the sample space into $A$ and $A^c$ and reasoned differently in each. For $A$ we had $P(A)$ small and $|f(X/n)-f(x)|\leq 2$, whereas for $A^c$ we had $P(A^c)\leq 1$ and $|f(X/n)-f(x)|$ small.

Benford's law is surprising when first encountered. The fact that the leading digit should be "1" with a frequency of 0.301 (rather than $1/9$) has been used as a test to check for fraud in financial records. The idea is that if a person is making up numbers, he unlikely to arrange that his distribution of first digits agree with Benford's law. You can read more about that, and many other examples of Benford's law, in the Wikipedia article referenced above.

In discussing why Benford's Law applies to leading digits of Fibonacci numbers I used a result that if $\beta=\log_{10}\alpha$ is irrational, then the sequence of fractional parts $\{]n\beta[\}_{n=1}^\infty$ is uniformly distributed. This result is certainly very plausible, but a proper proof is beyond our scope. It involves an analysis via the discrete Fourier transform of the sequence of length $N$, showing that this approaches the Fourier transform of a sequence of numbers that are uniformly distributed on $[0,1]$.