## Friday, 30 January 2015

In Lecture 7 I mentioned that $EX$ may not exist, and that for a random variable $X$ such that $X\geq 0$ it is possible for $EX=\infty$. Here's a famous example arising from the case $EX=\infty$. See also, Examples sheet 3, #19.

How much would you be willing to pay to play the following game? We will toss a fair coin repeatedly. We start you with £1 and then double this every time the coin shows a head. We stop when the first tail occurs. You keep your winnings, denoted by the random variable $X$. Now

$EX = (1/2)1+(1/4)2+(1/8)4+(1/16)8 +\cdots = \infty$.

Since the expected value of your winnings is infinite, it seems like you should be willing to pay an arbitrarily large amount to play this game! But would you really be willing to pay £1,000,000? (if you could!) Surely not. If you paid £1,000,000 for the right to play this game, the probability is 0.9749 that you will win less than you paid.

This is the St. Petersburg paradox  posed by Nicolas Bernoulli 1713.The paradox concerns what the mathematics predicts versus what a rational person might actually do in practice. One resolution is that a person's utility for money is not linear. We should really be looking at $Eu(X)$, where perhaps $u(X)=\log X$. In that case $Eu(X) < \infty$.

### Lecture 7

You might like to revisit the blog for Lecture 4. The example I do there about tossing coins for which $\sum_k p_k=\infty$ makes use of both continuity of $P$ (proved today), and independence of events (Lecture 5).

We were careful today in defining the notion of a random variable. It is a function mapping elements of the sample space $\Omega$ to some other set $\Omega_X$ (which is typically some set of real numbers). So $X$ is a function and $X(\omega)$ is value in its range. Most of the time we leave out the $\omega$ and speak of an event like $X\leq 1.45$. This really means the event $\{\omega : X(\omega)\leq 1.45\}$.

I remarked that the terminology ‘random variable’ is somewhat inaccurate, since a random variable is neither random nor a variable. However, a random variable has an associated probability distribution and one can think informally about a random variable as "a variable which takes its value according to a probability distribution".

The reason we like to define a random variable $X$ as a function on $\Omega$ is that it ties $X$ back to an underlying probability space, for which we have postulated axioms and know their consequences. If we try to define a random variable $X$ only in terms of its probability distribution, we would not have such a strong axiomatic basis from which to derive further properties of $X$.

Fine print. In Lecture 4 I gave you the definition of a probability space $(\Omega,\mathscr{F},P)$. For $T\subset\Omega_X$, we can calculate $P(X\in T)=P(\{\omega:X(\omega)\in T\})$ only if $\{\omega:X(\omega)\in T\}=X^{-1}(T)\in\mathscr{F}$. This may put restrictions on allowable $T$ and $X$. But this will not bother us in Probability IA, since we will only look at situations in which $\mathscr{F}$ consists of all subsets of $\Omega$, or $T$ is something nice like a subinterval of the reals.

## Wednesday, 28 January 2015

### Bayes's theorem

Bayes' theorem looks innocent enough, but from it emerges an entire school of statistical reasoning, Bayesian statistics. We have a hypothesis $A$, which might be true or false. A priori we think it is true with probability $p$. We collect some data and then want to update our belief that $A$ is true, that is find $P(A\mid\text{data})$. We can find the answer using $P(\text{data}\mid A)$ and Bayes' theorem.

Here is a YouTube video of students in a sing-a-long of George Box's "There's no theorem like Bayes' theorem", sung to the tune of "There's no business like show business".

There's no theorem like Bayes' theorem
Like no theorem we know
Everything about it is a wow
Let out all that a priori feeling
You've been concealing right up to now!

Further words are here.

### Lecture 6

I mentioned David Aldous's blog post Presenting probability via math puzzles is harmful.

Someone says to you: "I have two children one of whom is a boy"? To assess the probability that both are boys we must make assumptions about what the speaker really means, and what other things he might have said in other circumstances. You might like to read David Aldous's twins example:

In casual conversation with a stranger in the next airplane seat, you ask "do you have children?" and the response is "yes, two boys". Based on this, what is the chance that the two boys are twins?

Aldous argues that the probability is very close to $0$, because most people, if asked this question, would mention having twins if that were the case. We should not abandon common sense when doing maths problems. He gives this example:

"If you ask a politician in government "has unemployment gone up?" and they answer "long-term unemployment has gone down" then you can be fairly confident that short-term unemployment has indeed gone up!"

Here is some interesting reading for you about the law courts and conditional probability.
• A formula for justice
Bayes' theorem is a mathematical equation used in court cases to analyse statistical evidence. But a judge has ruled it can no longer be used. Will it result in more miscarriages of justice?
• Court of Appeal bans Bayesian probability (and Sherlock Holmes)
In a recent judgement the English Court of Appeal has not only rejected the Sherlock Holmes doctrine shown above, but also denied that probability can be used as an expression of uncertainty for events that have either happened or not.
• Prosecutor's fallacy
The prosecutor's fallacy is a fallacy of statistical reasoning, typically used by the prosecution to argue for the guilt of a defendant during a criminal trial.

I told you that an alternative name for Simpson's paradox is the Yule-Simpson effect. Udny Yule (a Fellow of St John's College, and a lecturer in statistics at Cambridge) appears to be the first person to have commented on this phenomenon (in 1903). Coincidentally, David Aldous (mentioned above) was also an undergraduate and Fellow at St John's (1970-79).

Example 6.7 was artificial, so it would be easy to see what was happening (women predominated amongst independent school applicants and women were more likely to gain admission.) In the Wikipedia article Simpson's paradox you can read about more practical examples. One of the best-known real-life examples of Simpson's paradox occurred when the University of California, Berkeley was sued for bias against women who had applied for admission to graduate schools there.

I once extracted data from Cambridge University Tripos results to show that in each of certain set of carefully selected subjects (including Engineering, English, ...) women were more likely to obtain Firsts than men, but that men were more likely to obtain Firsts when results over all subjects were combined. This sounds paradoxical. However, it happened because some subjects award greater percentages of Firsts than do others. But also some subjects are relatively more popular with women.

## Tuesday, 27 January 2015

### limsup of a sequence of sets (or events)

When seen for the first time it can be difficult to immediately grasp the meaning of  $\bigcap_{i=1}^\infty \bigcup_{k=i}^\infty A_k$.

I found that when I was a student and encountering this for the first time. This is one reason I included #7 on Examples Sheet 1. My advice is to pick it apart slowly.

$A_k$ is the event that the $k$th toss is a head.

First: The event $\bigcup_{k=i}^\infty A_k$ is true (occurs) if and only if there is at least one head somewhere amongst the tosses $i,i+1,i+2,\dots$ .

Second: When we put $\bigcap_{i=1}^\infty$ in front we create an event that is true if and only if the above is true for all $i$. So how can the result be true? Well, if and only if there are an infinite number of heads. (There might also be an infinite number of tails.)

In general, $\bigcap_{i=1}^\infty \bigcup_{k=i}^\infty E_k$ is occurs (or is true) if and only if an infinite number of the events $E_1,E_2,\dotsc$ occur.

The idea is similar to the "$\limsup$" of a sequence $\{a_1,a_2,\dots\}$, which is defined as

$\lim_{i\to\infty} \sup_{k\geq i} a_k$.

How about $\bigcup_{i=1}^\infty \bigcap_{k=i}^\infty A_k$? That is the event that eventually we see only heads. Now perhaps you can see that

$\bigcup_{i=1}^\infty \bigcap_{k=i}^\infty E_k \subseteq\bigcap_{i=1}^\infty \bigcup_{k=i}^\infty E_k$.

Here is an article on it that might also help.

## Monday, 26 January 2015

### Lecture 5

You might compare the answer to Examples Sheet 1 #10 to a Bonferroni inequality working in the reverse direction, $\geq$.

We saw by examples today that it is possible to have three events $A_1,A_2,A_3$ such that every pair of events is independent, but the three events are not mutually independent. Here is another example. Suppose the sample sample consists of 4 equally likely outcomes $\{w_1,w_2,w_3,w_4\}$. Let

$A_1=\{w_1,w_2\}$,
$A_2=\{w_1,w_3\}$,
$A_3=\{w_1,w_4\}$.

Then $P(A_i\cap A_j)=P(A_i)P(A_j)=1/4$, but $1/4=P(A_1\cap A_2\cap A_3)\neq P(A_1)P(A_2)P(A_3)=1/8$.

We might generalize this to a puzzle. Can you find five events $A_1,A_2,A_3,A_4,A_5$ such that every subset of either two, three or four events is a set of mutually independent events, but the five events are not mutually independent?

Last year a student told me this nice answer. Roll a die 4 times. Let $A_i$ be the probability that the $i$th dice roll is odd, $i=1,2,3,4$ and let $A_5$ be the event the sum of the four rolls is even.

You can find another answer at the end of this page, but I think the above is nicer.

A famous example of data that is Poisson distributed is due to von Bortkiewicz (1898), who studied the numbers of Prussian cavalryman being killed by the kick of a horse in each of 20 years. This is an example with $n$ (number of cavalryman) large, and $p$ (chance a particular cavalryman being killed) small. You can read more about this famous example here.

### A non-measureable set

In Lecture 4 we met the idea of a probability space, defined by the triple $(\Omega,\mathscr{F},P)$. It may not be possible for $\mathscr{F}$ to be all subsets of $\Omega$. Here is an example to illustrate the point.

Let $\Omega$ be the set of all points on the circumference of a unit circle. Define a equivalence relation on points in $\Omega$ such that $x\sim y$ if the angle between them is a rational fraction of $\pi$. This partitions $\Omega$ into equivalence classes (in fact an uncountable number of them). Now pick one point from each equivalence class and call the set of these points $A$. If we shift every point in $A$ around the circumference of the circle by the same rational angle $q$ we get a new set, say $A_q$. Clearly, $A$ and $A_q$ are disjoint, and it ought to be that $P(A_q)=P(A)$. Now taking the union over all rationals, $\Omega=\bigcup_q A_q$ is a union of a countable number of disjoint sets (since every point in $\Omega$ is some rational angle away from some point in $A$). But it is impossible to satisfy the probability axioms, which would require $P(\bigcup_q A_q)=1=\sum_q P(A_q)$, and also have $P(A)=P(A_q)$ for all $q$. We are forced to conclude that no probability measure $P$ can cope with a subset of $\Omega$ like $A$.

### Mozart Cafe Problem

Here is one of my favorite problems, related to the calculation of the probability of derangement in Lecture 4.

Two friends travelling independently to Vienna wish to meet for a coffee on the afternoon they arrive. Neither has been to Vienna before, but they guess it must have a Mozart Cafe. So in an email exchange they agree to meet at the Mozart Cafe. Unfortunately, upon arrival they each discover that there are in fact $m >1$ Mozart Cafes in Vienna. Assuming they have no way to communicate, what should they do?

Suppose that each does one of the following. He either (a) chooses one cafe at random and stays there all afternoon, or (b) visits all $m$ cafes in random order spending equal time in each. On average, the best outcome is achieved when one does (a) and the other does (b) — on average they will meet half way through the afternoon. However, this cannot be guaranteed since they have no way to agree who should do (a) and who (b). The worst outcome is when both do (a) (unless they are lucky and choose the same cafe).

Suppose each friend decides to choose his strategy randomly, doing (a) with probability $p$ and (b) with probability $1-p$. If they fail to meet then they repeat the same strategy on subsequent afternoons, with new random choices, until they meet.

The calculation of the probability of derangement done in Example 4.4 helps with the evaluation of what happens when they both choose (b), an event that happens with probability $(1-p)^2$.

They wish to minimize the expected time until they meet. It turns out that for $m=3$, the optimal strategy is with $p=1/3$. As $m\to\infty$ the optimal $p\to \approx 0.24749$.

Here is a completely unsolved problem to which we would love to know the answer:

Two astronauts land at random spots on a planet (which is assumed to be a uniform sphere, without any known distinguishing marks or directions). They can each move at a constant speed. How should they move so as to be within 1 kilometre of one another in the least expected time?

## Friday, 23 January 2015

### Lecture 4

The inclusion-exclusion formula is included in the list of Problem Solving Strategies.

I showed you in Example 4.5 a use of Erdos's probabilistic method in combinatorics. If you liked that, you might like to read about other examples, for instance in this link.

I made a small slip at the end of the lecture.
$$1 - \frac{1}{2!}+\frac{1}{3!}+\cdots =1-e^{-1}.$$ So the probability of derangement tends to $e^{-1}=0.368$.

In Example 4.4 we considered an infinite sequence of tosses of biased coins and saw that if $\sum_{i=1}^\infty p_i<\infty$  then $P(\text{infinite number of heads})=0.$

Now let's prove that $\sum_{i=1}^\infty p_i=\infty \implies P(\text{finite number of heads})=0.$ This is trickier.

We are assuming the coin tosses are independent (which is terminology from Lecture 5, but you can guess intuitively what this means). Let $A_k$ be the event that the $k$th toss is a head. Let $E_n=\bigcap_{k=n}^\infty A_k^c=[\text{no heads after the$(n-1)$th toss}]$. Then for $m > n$

$P(E_n)\leq P(\bigcap_{k=n}^m A_k^c)=\prod_{k=n}^m P(A_k^c) =\prod_{k=n}^m (1-p_k)\leq \prod_{k=n}^m e^{-p_k} =e^{-\sum_{k=n}^m p_k}$.

The right hand side $\to 0$ as $m\to\infty$.

Hence $P(E_n)=0$, and then using Property (v) of $P$ (continuity), we have that

$P(\text{finite number of heads})=P(\bigcup_{n=1}^\infty\bigcap_{k=n}^\infty A_k^c)=\lim_{n\to\infty}P(E_n)=0.$

These two results are examples of applications of the Borel-Cantelli lemmas, which appear in the Part II course Probability and Measure.

I did not talk through Example 4.8 (counting surjective functions), but it is another nice use of the inclusion-exclusion formula. See also Examples Sheet 1, #9.

## Wednesday, 21 January 2015

### Big O notation

I wrote Stirling's formula as
$$\log\left(\frac{n!e^n}{n^{n+1/2}}\right)=\log\sqrt{2\pi}+O(1/n)$$ and thought later that maybe you do not know what $O(1/n)$ means (which is read as "Big O of $1/n$").

Jacqueline Kennedy Onassis was known to her friends as "the Big O". That is not what we mean here.

We say $f(n)=O(g(n))$ if there exists some $B$ such that $|f(n)| \leq B|g(n)|$ for all $n$ sufficiently large. We proved that $A<d_n<A+1/(12n)$, so
$$0<\log\left(\frac{n!e^n}{n^{n+1/2}\sqrt{2\pi}}\right)<\frac{1}{12n}$$ is consistent with this, with $B=1/12$.

There is also a notation for lower bounds. We say $f(n)=\Omega(g(n))$ if there exists some $B'$ such that $|f(n)| \geq B'|g(n)|$ for all $n$ sufficiently large. This is the same as saying $g(n)=O(f(n))$.

And $g(n)=\Theta(g(n))$ if both $f(n)=O(g(n))$ and $f(n)=\Omega(g(n))$ are true. Since
$$\log\left(\frac{n!e^n}{n^{n+1/2}\sqrt{2\pi}}\right) > \frac{1}{12n+1}>\frac{1}{13n}$$ we may actually claim that
$$\log\left(\frac{n!e^n}{n^{n+1/2}\sqrt{2\pi}}\right)=\Theta(1/n).$$ Because this tells how how fast is the convergence to $0$ it is stronger than just saying
$$\lim_{n\to \infty}\log\left(\frac{n!e^n}{n^{n+1/2}\sqrt{2\pi}}\right)=0.$$ Read more about "Big O", "Big omega" and "Big theta" notations. They are very commonly used in theoretical computer science.

### Wallis's product

Here's a thought. Suppose we knew Wallis's product
$$\frac{2}{1}\frac{2}{3}\frac{4}{3}\frac{4}{5}\frac{6}{5}\frac{6}{7}\cdots =\frac{\pi}{2}.$$ Define $I_{2n}=\tfrac{1}{2}\tfrac{3}{4}\cdots \frac{2n-1}{2n}\frac{\pi}{2}$ and $I_{2n+1}=\tfrac{2}{3}\tfrac{4}{5}\cdots \frac{2n}{2n+1}$.
If we truncate the left hand side of Wallis' product after $2n$ terms we make it less, and this implies $I_{2n+1}/I_{2n} < 1$. Now we can complete the proof of Stirling's formula without ever needing to consider $I_n=\int_0^{\pi/2}\sin^n\theta\,d\theta$.

Of course a standard way to prove Wallis's product is to work with $I_n=\int_0^{\pi/2}\sin^n\theta\,d\theta=(n-1)(I_{n-2}-I_n)$,

But there are other ways to prove it that require just basic algebra, Pythagoras' theorem and the formula for the area of a circle. See here.

John Wallis  (1616-1703) was a student of Emmanuel College and a Fellow of Queens' College. He is credited with introducing the symbol $\infty$.

### Lecture 3

On the basis of today's lecture you can attempt Examples Sheet 1, #3, #4, #16. You might also look at #14 and think about how using a multinomial coefficient might help sort this one out.

The proof of Stirling's formula is not examinable. So why do we include it? Well, it would be a pity to so often use in our other work a result whose proof we did not understand. I also think it is interesting for you to see how a careful analysis of an asymptotic like this can be made.

Section 3.3 on Improved Stirling's formula (not examinable), which shows that $n!/\left(\sqrt{2\pi}n^{n+1/2}e^{-n}\right)$ lies between $e^{1/(12n+1)}$ and $e^{1/(12n)}$. This is from a paper, A remark on Stirling's formula, by Herbert Robbins.

Tim Gowers once lectured this course and wrote a blog post called Removing the magic from Stirling’s formula. He was looking for a more natural proof. He comments: It turned out to be hard to find a proof that didn’t look magic: the arguments had a “consider the following formula, do such-and-such, observe that this is less than that, bingo it drops out” sort of flavour.

I tried to explain in this lecture how we can understand the motivation behind most of the steps in this proof so that it feels less like pulling-a-rabbit-out-of-a-hat.

Why do we need to make the definition $I_n=\int_0^{\pi/2} \sin^n\theta\, d\theta$? Why not directly define $I_{2n}=\tfrac{1}{2}\tfrac{3}{4}\cdots \frac{2n-1}{2n}\frac{\pi}{2}$ and $I_{2n+1}=\tfrac{2}{3}\tfrac{4}{5}\cdots \frac{2n}{2n+1}$?

The answer is that we need to show that $I_n$ is decreasing in $n$, and this follows from the fact that $\sin^n\theta$ is decreasing in $n$ since $0<\sin\theta<1$.

Our calculation today of $I_{2n}/I_{2n+1}\to 1$ is related to a proof of Wallis's product. This is $$\frac{2}{1}\frac{2}{3}\frac{4}{3}\frac{4}{5}\frac{6}{5}\frac{6}{7}\cdots =\frac{\pi}{2}.$$ Perhaps the trickiest thing in the proof was the evaluation of the constant $\sqrt{2\pi}$. The following is a method of doing it in a less magical way. Suppose $n!\sim B n^{n+1/2}e^{-n}$. Then $p=2^{-2n}\binom{2n}{n}=(2n)!/(2^{n}n!)^2\sim \sqrt{2/n}/B$ is the probability that in $2n$ tosses of a fair coin there are $n$ heads and $n$ tails. If we knew the central limit theorem (Lecture 23), we could use it to estimate $p=1/\sqrt{n\pi}$, and from that find $B=\sqrt{2\pi}$.

## Monday, 19 January 2015

### Lecture 2

At the end of the lecture someone asked me about keys sampled without replacement and how it can be easily seen that the probability that you succeed on the $r$th attempt is $1/n$. This can be found in several ways. Either you think  about the number of ways the first $r$ keys can be seen and get
$$\frac{(n-1)(n-2)\cdots(n-r+1)1}{n(n-1)\cdots(n-r+1)} =\frac{1}{n}.$$
Or you think about the key you want being at the $r$th position and the other $(n-1)$ keys being in any of $(n-1)!$ possible orders, giving
$$\frac{(n-1)!}{n!} =\frac{1}{n}.$$
You might also think about a fixed ordering of taking the keys from you pocket, like 1,3,4,2,5. Perhaps 3 is the key you want. Notice that all cyclic permutations must be equally likely. So key 3 is equally likely to be seen as the $1,2,\dotsc,5$th key drawn from your pocket. This example serves to show that can be many ways to calculate the required probability. In the above three, we have used a different sample space $\Omega$ in each case.

There are some good Wikipedia articles on topics that we are covering. The one on the Birthday Problem is nice. You will find in that article more than ever thought it would be possible to say about such a simple problem. The article explains that the problem is not just of recreational interest — it is relevant to several hashing algorithms in computer science.

Here is a quiz to test your understanding. Complete the following sentence:

There are several ways to count the number of distinct injective functions that map elements of a set $N=\{1,\dotsc,n\}$ to elements of a set $X=\{1,\dotsc,x\}$. This is because ...

You might like to look at the Wikipedia article about the Twelvefold Way classification system for counting functions between two finite sets. I talked about a subset of four of these ways today. The paper of Robert Proctor, Let's expand Rota's twelvefold way for counting partitions! takes this systematic approach even further in an attempt to organize what he calls, "a bewildering zoo of  enumeration problems". The other counts involve some more complicated notions, such as the Bell numbers and Stirling numbers.

Please regard this as a bit of fun. If as in (ii) we say two functions $f,g:N\to X$ are the same if there exists a permutation $\pi$ for which $g(i)=f(\pi(i))$ we are defining an equivalence relation, and so this bit of work has really been about counting the number of equivalence classes for certain equivalence relations defined on the set of functions that map $N$ to $X$. However, in practice, it is often useful to work from scratch, relying on what I called the Fundamental Rule of Counting.

Try Examples sheet 1, #14. This apparently innocuous question may do your head in — or maybe you'll find it easy. There is really no systematic way to a answer question like this. Try it different ways (perhaps colouring the balls or not.) But it can help to figure out the answer for small $n$ before attempting to find the general formula.

## Friday, 16 January 2015

### Note to supervisors

I like to know who is supervising this course. I can then put you on an email circulation list, where we can discuss how students are getting on with the course and examples sheets.

It is unlikely supervisors will read this blog, so perhaps helpful students might point it out to their supervisor.

### Lecture 1

I spoke about the Problem of Points, which was puzzling people as long as 500 years ago. On pages 32-34 of Grimstead and Snell you can read translations of the letters exchanged by Fermat and Pascal about this problem in 1654. G&S also write:

The problem had been a standard problem in mathematical texts; it appeared in Fra Luca Paccioli’s book summa de Arithmetica, Geometria, Proportioni et Proportionalita, printed in Venice in 1494 in the form:
"A team plays ball such that a total of 60 points are required to win the game, and each inning counts 10 points. The stakes are 10 ducats. By some incident they cannot finish the game and one side has 50 points and the other 20. One wants to know what share of the prize money belongs to each side. In this case I have found that opinions differ from one to another but all seem to me insufficient in their arguments, but I shall state the truth and give the correct way."
In fact his "correct way" turns out to be wrong!

I mentioned classical, frequentist, and subjective approaches to probability. There are others: such as the propensity approach. Indeed, the question, "What is Probability?" can take us into the realm of philosophy. If I roll a die and do not show you the answer, what is the probability it is a six? Some people might say it is 1/6. Others might say it is 1 or 0, because it either is a six or not. (If it appeals to you to delve into philosophy, then have a look at this article on Probability interpretations.) Fortunately, in our course we work within the context of some mathematical axioms and therefore have a well-defined approach that does not depend on interpretation.

At the start of each chapter of my notes the non-examinable topics are placed between *s. The *arcsine law* that I mentioned today is the second of the three laws described here. The other arcsine laws concern the proportion of the time that the random walk is positive, and the time at which the random walk achieves its maximum. My purpose in including this in our first lecture was to show you something that is surprising and which is proved in a clever way (by the reflection principle and matching paths in a 1-1 manner).

There is a space after each post for you to write comments or ask questions using the IntenseDebate commenting system. Please feel free to do so. Other students will probably benefit.