Friday 31 January 2014

The St Petersburg Paradox

In Lecture 7 I mentioned that $EX$ may not exist, and that for a random variable $X$ such that $X\geq 0$ it is possible that $EX=\infty$. Here's a famous example arising from the case $EX=\infty$. See also Examples Sheet 3, #19.

The St Petersburg Paradox
How much would you be willing to pay to play the following game? We will toss a fair coin repeatedly. We start you with £1 and then double this every time the coin shows a head. We stop when the first tail occurs. You keep your winnings, denoted by the random variable $X$. Now

$EX = (1/2)1+(1/4)2+(1/8)4+(1/16)8 +\cdots = \infty$.

Since the expected value of your winnings is infinite, it seems that you should be willing to pay an arbitrarily large amount to play this game! But would you really be willing to pay £1,000,000 (if you could)? Surely not. If you paid £1,000,000 for the right to play this game then, since winning that much requires the first 20 tosses all to be heads, the probability is $1-2^{-20}\approx 0.999999$ that you will win less than you paid.

This is the St Petersburg paradox, posed by Nicolas Bernoulli in 1713. The paradox concerns what the mathematics predicts versus what a rational person might really do in practice. One resolution is that a person's utility for money is not linear. We should really be looking at $Eu(X)$, where perhaps $u(X)=\log X$. In that case $Eu(X) < \infty$.
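
To see what is going on numerically, here is a minimal Python sketch (the function names and the number of plays are my own choices). It estimates the sample mean of the winnings, which refuses to settle down because $EX=\infty$, and the sample mean of $\log X$, which does settle: since $X=2^k$ with probability $2^{-(k+1)}$ we have $E\log X=\sum_{k\geq 0}2^{-(k+1)}\,k\log 2=\log 2$.

```python
import random, math

def play(rng):
    """One play: start with 1 and double on each head; stop at the first tail."""
    x = 1
    while rng.random() < 0.5:   # head with probability 1/2
        x *= 2
    return x

rng = random.Random(0)
wins = [play(rng) for _ in range(10**6)]

print("sample mean of X:    ", sum(wins) / len(wins))   # unstable: EX is infinite
print("sample mean of log X:", sum(math.log(x) for x in wins) / len(wins))
print("log 2 =              ", math.log(2))
```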

Lecture 7

You might like to revisit the blog for Lecture 4. The example I do there about tossing coins for which $\sum_k p_k=\infty$ makes use of both continuity of $P$ (proved today) and independence of events (Lecture 5).

We were careful today in defining the notion of a random variable. It is a function mapping elements of the sample space $\Omega$ to some other set $\Omega_X$ (which is typically some set of real numbers). So $X$ is a function and $X(\omega)$ is a value in its range. Most of the time we leave out the $\omega$ and speak of an event like $X\leq 1.45$. This really means the event $\{\omega: X(\omega)\leq 1.45\}$.

The terminology ‘random variable’ is somewhat inaccurate, since a random variable is neither random nor a variable. However, a random variable has an associated probability distribution and one can think informally about a random variable as "a variable which takes its value according to a probability distribution".

Fine print. In Lecture 4 I gave you the definition of a probability space $(\Omega,\mathscr{F},P)$. For $T\subset\Omega_X$, we can calculate $P(X\in T)=P(\{\omega:X(\omega)\in T\})$ only if $\{\omega:X(\omega)\in T\}=X^{-1}(T)\in\mathscr{F}$. This may put restrictions on allowable $T$ and $X$. But this will not bother us in Probability IA, since we will only look at situations in which $\mathscr{F}$ consists of all subsets of $\Omega$, or $T$ is something nice like a subinterval of the reals.

Wednesday 29 January 2014

Bayes' theorem

Bayes' theorem looks innocent enough, but from it emerges an entire school of statistical reasoning, Bayesian statistics. We have a hypothesis $A$, which might be true or false. A priori we think it is true with probability $p$. We collect some data and then want to update our belief that $A$ is true, that is find $P(A\mid\text{data})$. We can find the answer using $P(\text{data}\mid A)$ and Bayes' theorem.
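
Here is a minimal numerical sketch of such an update (the prior and the likelihoods are invented purely for illustration):

```python
def posterior(prior, p_data_given_A, p_data_given_not_A):
    """Bayes' theorem: P(A | data) = P(data | A) P(A) / P(data)."""
    p_data = p_data_given_A * prior + p_data_given_not_A * (1 - prior)
    return p_data_given_A * prior / p_data

# Hypothetical numbers: prior belief 1%; the data is 95% likely if A
# is true and 5% likely if A is false.
print(posterior(0.01, 0.95, 0.05))   # posterior is about 0.16
```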

Here is a YouTube video of students in a sing-a-long of George Box's "There's no theorem like Bayes' theorem", sung to the tune of "There's no business like show business".

There's no theorem like Bayes' theorem
Like no theorem we know
Everything about it is appealing
Everything about it is a wow
Let out all that a priori feeling
You've been concealing right up to now!

Further words are here.

Lecture 6

Here is some interesting reading for you about the law courts and conditional probability.
  • A formula for justice
    Bayes' theorem is a mathematical equation used in court cases to analyse statistical evidence. But a judge has ruled it can no longer be used. Will it result in more miscarriages of justice?
  • Court of Appeal bans Bayesian probability (and Sherlock Holmes)
    In a recent judgement the English Court of Appeal has not only rejected the Sherlock Holmes doctrine shown above, but also denied that probability can be used as an expression of uncertainty for events that have either happened or not.
  • Prosecutor's fallacy
    The prosecutor's fallacy is a fallacy of statistical reasoning, typically used by the prosecution to argue for the guilt of a defendant during a criminal trial. 
I mentioned David Aldous's blog post Presenting probability via math puzzles is harmful.
Do you think it has harmed you to see the problem "I have two children, one of whom is a boy (born on a Thursday)"?

If you read the Wikipedia article on Simpson's paradox you will find that an alternative name is the Yule-Simpson effect. Udny Yule (a Fellow of St John's College, and a lecturer in statistics at Cambridge) appears to have been the first to comment on this phenomenon (in 1903).

Our Example 6.7 was artificial, so that it would be easy to see what was happening (women predominated amongst independent school applicants, and women were more likely to gain admission). In the Wikipedia article you can read about more practical examples. One of the best-known real-life examples of Simpson's paradox occurred when the University of California, Berkeley was sued for bias against women who had applied for admission to graduate schools there.

I remember once extracting data from Cambridge University Tripos results to show that in each of a certain set of carefully selected subjects (including Engineering, English, ...) women were more likely to obtain Firsts than men, but that men were more likely to obtain Firsts when results over all subjects were combined. This sounds paradoxical. Can you figure out how it can happen? Hint: some subjects award greater percentages of Firsts than do others. Some subjects are relatively more popular with women.

Monday 27 January 2014

Lecture 5

I have updated the notes to go as far as next Wednesday, Lecture 9. By the way, if you are printing these notes, you can choose the format you most prefer. I use Acrobat to print with "Fit", which gives me a large typeface on A4. But "Actual size" (to give you wide margins) or "Multiple pages" (2 per sheet) can look nice also.

I have made some small changes to Example 5.2 in the notes for Lecture 5, and one important correction to a typo in the formula for the Poisson distribution. By the way, I mentioned horse kicks. I was imperfectly recalling that a famous example of data that is Poisson distributed is due to von Bortkiewicz (1898), who studied the numbers of Prussian cavalrymen killed by the kick of a horse in each of 20 years. This is an example with $n$ (the number of cavalrymen) large and $p$ (the chance that a particular cavalryman is killed) small. You can read more about this famous example here.
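
The binomial-to-Poisson approximation at work in that example is easy to check numerically. A small sketch (the values $n=1000$, $p=0.001$ are just illustrative):

```python
import math

n, p = 1000, 0.001           # n large, p small, np = 1
lam = n * p

for k in range(5):
    binom = math.comb(n, k) * p**k * (1 - p)**(n - k)
    poisson = math.exp(-lam) * lam**k / math.factorial(k)
    print(f"k={k}: binomial {binom:.6f}  Poisson {poisson:.6f}")
```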

You might compare the answer to Examples Sheet 1 #10 to a Bonferroni inequality working in the reverse direction, $\geq$.

We saw by examples today that it is possible to have three events $A_1,A_2,A_3$ such that every pair of events is independent, but the three events are not mutually independent. Here is another example. Suppose the sample space consists of 4 equally likely outcomes $\{w_1,w_2,w_3,w_4\}$. Let

$A_1=\{w_1,w_2\}$,
$A_2=\{w_1,w_3\}$,
$A_3=\{w_1,w_4\}$.

Then $P(A_i\cap A_j)=P(A_i)P(A_j)=1/4$, but $1/4=P(A_1\cap A_2\cap A_3)\neq P(A_1)P(A_2)P(A_3)=1/8$.

I mentioned in the lecture that we might generalize this to a puzzle. Can you find four events $A_1,A_2,A_3,A_4$ such that every subset of either two or three events is a set of mutually independent events, but the four events are not mutually independent?

At the end of the lecture a student told me this nice answer. Let $A_i$ be the event that the $i$th die roll is odd, $i=1,2,3$, and let $A_4$ be the event that the sum of the three rolls is even. That works! And it looks as if we could generalize this example to give $A_1,\dotsc,A_{k+1}$ such that every $k$ or fewer of the events are mutually independent, but the $k+1$ events are not.

You can find another answer at the end of this page, but I think the above is nicer.
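
If you want to be sure the student's answer works, it can be checked by brute force over all $6^3$ equally likely outcomes. A sketch:

```python
from itertools import product, combinations
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=3))   # all 6^3 rolls, equally likely

def prob(event):
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

A = [lambda w, i=i: w[i] % 2 == 1 for i in range(3)]   # roll i is odd
A.append(lambda w: sum(w) % 2 == 0)                    # sum of rolls is even

# Every set of two or three of the events is mutually independent...
for r in (2, 3):
    for idx in combinations(range(4), r):
        inter = prob(lambda w: all(A[i](w) for i in idx))
        prod_p = Fraction(1)
        for i in idx:
            prod_p *= prob(A[i])
        assert inter == prod_p, idx

# ...but all four are not: P(intersection) = 0, not 1/16.
print(prob(lambda w: all(a(w) for a in A)))   # prints 0
```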

Friday 24 January 2014

limsup of a sequence of sets (or events)

When seen for the first time it can be difficult to immediately grasp the meaning of  $\bigcap_{i=1}^\infty \bigcup_{k=i}^\infty A_k$.

I found it difficult myself when I first encountered it as a student. This is one reason I included #7 on Examples Sheet 1. My advice is to pick it apart slowly.

$A_k$ is the event that the $k$th toss is a head.

First: The event $\bigcup_{k=i}^\infty A_k$ is true (occurs) if and only if there is at least one head somewhere amongst the tosses $i,i+1,i+2,\dots$ .

Second: When we put $\bigcap_{i=1}^\infty$ in front we create an event that is true if and only if the above is true for all $i$. So how can the result be true? Well, if and only if there are an infinite number of heads. (There might also be an infinite number of tails.)

In general, $\bigcap_{i=1}^\infty \bigcup_{k=i}^\infty E_k$ occurs (or is true) if and only if an infinite number of the events $E_1,E_2,\dotsc$ occur.

The idea is similar to the "$\limsup$" of a sequence $\{a_1,a_2,\dots\}$, which is defined as

$\lim_{i\to\infty} \sup_{k\geq i} a_k$.
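
Passing to indicator variables makes the analogy exact: if we set $a_k=1$ when $A_k$ occurs and $a_k=0$ otherwise, then

$$\limsup_{k\to\infty} a_k = 1 \iff a_k=1 \text{ for infinitely many } k \iff \bigcap_{i=1}^\infty \bigcup_{k=i}^\infty A_k \text{ occurs}.$$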

How about $\bigcup_{i=1}^\infty \bigcap_{k=i}^\infty A_k$? That is the event that eventually we see only heads. Now perhaps you can see that

$\bigcup_{i=1}^\infty \bigcap_{k=i}^\infty E_k \subseteq\bigcap_{i=1}^\infty \bigcup_{k=i}^\infty E_k $.

Here is an article on it that might also help. 

The Mozart Cafe problem

Two friends travelling independently to Vienna wish to meet for a coffee on the afternoon they arrive. Neither has been to Vienna before, but they guess it must have a Mozart Cafe. So in an email exchange they agree to meet at the Mozart Cafe. Unfortunately, upon arrival they each discover that there are in fact $m >1$ Mozart Cafes in Vienna. Assuming they have no way to communicate, what should they do?

Suppose that each does one of the following. He either (a) chooses one cafe at random and stays there all afternoon, or (b) visits all $m$ cafes in random order, spending equal time in each. The best outcome is achieved when one does (a) and the other does (b): on average they will then meet half way through the afternoon. However, this cannot be guaranteed, since they have no way to agree who should do (a) and who (b). The worst outcome is when both do (a) (unless they are lucky and choose the same cafe).

Suppose each friend decides to choose his strategy randomly, doing (a) with probability $p$ and (b) with probability $1-p$. If they fail to meet then they repeat the same strategy on subsequent afternoons, with new random choices, until they meet.

The calculation of the probability of derangement done in Example 4.4 helps with the evaluation of what happens when they both choose (b), an event that happens with probability $(1-p)^2$.

They wish to minimize the expected time until they meet. It turns out that for $m=3$ the optimal strategy has $p=1/3$. As $m\to\infty$ the optimal $p$ tends to a limit of approximately $0.24749$.
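
Here is a minimal simulation sketch of this strategy. It assumes, as a modelling convenience, that the afternoon divides into $m$ equal time slots (one per cafe visit for a searching friend), and that a sitter occupies his chosen cafe in every slot; all names are my own. With enough samples the estimate should come out smallest near $p=1/3$ when $m=3$.

```python
import random
from statistics import mean

def meeting_time(m, p, rng):
    """Return the time (in afternoons) until the friends meet.
    A meeting in slot t of afternoon d counts as time d + (t + 0.5)/m."""
    day = 0
    while True:
        schedules = []
        for _ in range(2):
            if rng.random() < p:                  # (a): wait in one random cafe
                schedules.append([rng.randrange(m)] * m)
            else:                                 # (b): tour the m cafes in random order
                order = list(range(m))
                rng.shuffle(order)
                schedules.append(order)
        for t in range(m):
            if schedules[0][t] == schedules[1][t]:
                return day + (t + 0.5) / m
        day += 1                                  # no meeting: try again tomorrow

rng = random.Random(1)
for p in (0.2, 1/3, 0.5):
    est = mean(meeting_time(3, p, rng) for _ in range(20000))
    print(f"p = {p:.3f}: estimated expected meeting time {est:.3f} afternoons")
```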

You might enjoy looking at the slides for my recreational talk Symmetric Rendezvous Search Games. Here is a completely unsolved problem to which we would love to know the answer:

Two astronauts land at random spots on a planet (which is assumed to be a uniform sphere, without any known distinguishing marks or directions). They can each move at a constant speed. How should they move so as to be within 1 kilometre of one another in the least expected time?

A non-measurable set

I remarked that it may not be possible for $\mathscr{F}$ to be all subsets of $\Omega$. Here is an example to illustrate the point.

Let $\Omega$ be the set of all points on the circumference of a unit circle. Define an equivalence relation on points in $\Omega$ such that $x\sim y$ if the angle between them is a rational fraction of $\pi$. This partitions $\Omega$ into equivalence classes (in fact an uncountable number of them). Now pick one point from each equivalence class and call the set of these points $A$. If we shift every point in $A$ around the circumference of the circle by the same nonzero rational angle $q$ we get a new set, say $A_q$. Clearly, $A$ and $A_q$ are disjoint, and by rotational symmetry it ought to be that $P(A_q)=P(A)$. Now taking the union over all rationals, $\Omega=\bigcup_q A_q$ is a union of a countable number of disjoint sets (since every point in $\Omega$ is some rational angle away from some point in $A$). But it is impossible to satisfy the probability axioms, which would require $P(\bigcup_q A_q)=1=\sum_q P(A_q)$, and also have $P(A)=P(A_q)$ for all $q$. We are forced to conclude that no probability measure $P$ can cope with a subset of $\Omega$ like $A$.

Lecture 4

I have added to the list of Problem Solving Strategies a note about the inclusion-exclusion formula.

I showed you in Example 4.5 a use of the probabilistic method in combinatorics. If you liked that, you might like to read about other examples, for instance in this link.

In Example 4.4 we considered an infinite sequence of tosses of biased coins and saw that if $\sum_{i=1}^\infty p_i<\infty$  then $P(\text{infinite number of heads})=0.$

Now let's prove that $\sum_{i=1}^\infty p_i=\infty \implies P(\text{finite number of heads})=0.$ This is trickier.

We are assuming the coin tosses are independent (which is terminology from Lecture 5, but you can guess intuitively what this means). Let $A_k$ be the event that the $k$th toss is a head. Let $E_n=\bigcap_{k=n}^\infty A_k^c=[\text{no heads after the $(n-1)$th toss}]$. Then for $m > n$

$P(E_n)\leq P(\bigcap_{k=n}^m A_k^c)=\prod_{k=n}^m P(A_k^c) =\prod_{k=n}^m (1-p_k)\leq \prod_{k=n}^m e^{-p_k} =e^{-\sum_{k=n}^m p_k}$.

The right hand side $\to 0$ as $m\to\infty$, because $\sum_{k=1}^\infty p_k=\infty$.

Hence $P(E_n)=0$, and then using Property (v) of $P$ (continuity), we have that

$P(\text{finite number of heads})=P(\bigcup_{n=1}^\infty\bigcap_{k=n}^\infty A_k^c)=\lim_{n\to\infty}P(E_n)=0.$

These two results are examples of applications of the Borel-Cantelli lemmas, which appear in the Part II course Probability and Measure.
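
You can illustrate (though of course not prove) this dichotomy by simulation. Here is a minimal Python sketch: with $p_k=1/k^2$ (summable) the total number of heads settles down, while with $p_k=1/k$ (non-summable) it keeps creeping upward, consistent with there being infinitely many heads.

```python
import random

def count_heads(p, n, rng):
    """Count heads in n independent tosses, toss k having P(head) = p(k)."""
    return sum(rng.random() < p(k) for k in range(1, n + 1))

rng = random.Random(0)
for n in (10**2, 10**4, 10**6):
    print(f"n = {n}:",
          "heads with p_k = 1/k^2:", count_heads(lambda k: 1 / k**2, n, rng),
          "| heads with p_k = 1/k:", count_heads(lambda k: 1 / k, n, rng))
```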

I did not have time to talk through Example 4.8, but it is another nice use of the inclusion-exclusion formula.

Wednesday 22 January 2014

More on Wallis' product

Here's a thought. Suppose we knew Wallis' product
$$ \frac{2}{1}\frac{2}{3}\frac{4}{3}\frac{4}{5}\frac{6}{5}\frac{6}{7}\cdots =\frac{\pi}{2}. $$ Define $I_{2n}=\tfrac{1}{2}\tfrac{3}{4}\cdots \frac{2n-1}{2n}\frac{\pi}{2}$ and $I_{2n+1}=\tfrac{2}{3}\tfrac{4}{5}\cdots \frac{2n}{2n+1}$.
If we truncate the left hand side of Wallis' product after $2n$ terms we make it smaller, so $\frac{2}{1}\frac{2}{3}\cdots\frac{2n}{2n-1}\frac{2n}{2n+1}<\frac{\pi}{2}$. But this truncated product is exactly $(\pi/2)\,I_{2n+1}/I_{2n}$, so $I_{2n+1}/I_{2n}<1$. Now we can complete the proof of Stirling's formula without ever needing to consider $I_n=\int_0^{\pi/2}\sin^n\theta\,d\theta$.

Of course a standard way to prove Wallis' product is to work with $I_n=\int_0^{\pi/2}\sin^n\theta\,d\theta=(n-1)(I_{n-2}-I_n)$, but there are other ways to prove it that require just basic algebra, Pythagoras' theorem and the formula for the area of a circle. See here.
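
If you like, you can also watch the product converge numerically. A tiny sketch (the convergence is slow, with error of order $1/n$):

```python
from math import pi

prod = 1.0
for k in range(1, 10**6 + 1):                  # kth pair: (2k/(2k-1)) * (2k/(2k+1))
    prod *= (2 * k) ** 2 / ((2 * k - 1) * (2 * k + 1))
print(prod, "vs", pi / 2)
```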

Examples Sheet 1 (Lectures 1–6)

Some of you have keen supervisors who are already giving supervisions on Examples Sheet 1. This sheet is intended for Lectures 1–6, and Lecture 6 is next Wednesday. To help you work ahead I am now posting my lecture notes through Lecture 6. I may subsequently make small changes to these, but they might help you as you work on the questions about conditional probability, such as #11 and #12.

Monday 20 January 2014

Poll: give your view about printed notes

The notes for Lectures 1–4 are now available.

Here is a poll for you to complete. What is your view about printed notes? Tick statements with which you are in sympathy. Additional comments are welcome below (if not anonymous and from a valid @cam address.)

Lecture 2

There are some good Wikipedia articles on topics that we are covering. The one on the Birthday Problem is nice. You will find in that article more than you ever thought it would be possible to say about such a simple problem. The article explains that the problem is not just of recreational interest: it is relevant to several hashing algorithms in computer science.
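
For example, a few lines of Python reproduce the standard computation that among 23 people a shared birthday is more likely than not (assuming 365 equally likely birthdays):

```python
def p_all_distinct(n, days=365):
    """Probability that n people all have different birthdays."""
    p = 1.0
    for i in range(n):
        p *= (days - i) / days
    return p

for n in (10, 23, 50):
    print(n, "P(some shared birthday) =", 1 - p_all_distinct(n))
```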

Here is a quiz to test your understanding. Complete the following sentence:

There are several ways to count the number of distinct injective functions that map elements of a set $N=\{1,\dotsc,n\}$ to elements of a set $X=\{1,\dotsc,x\}$. This is because ...

You might like to look at the Wikipedia article about the Twelvefold Way classification system for counting functions between two finite sets. I talked about a subset of four of these ways today. The paper of Robert Proctor, Let's expand Rota's twelvefold way for counting partitions!, takes this systematic approach even further in an attempt to organize what he calls "a bewildering zoo of enumeration problems". The other counts involve some more complicated notions, such as the Bell numbers and Stirling numbers.

Please regard this as a bit of fun. If, as in (ii), we say two functions $f,g:N\to X$ are the same if there exists a permutation $\pi$ for which $g(i)=f(\pi(i))$, then we are defining an equivalence relation, and so this bit of work has really been about counting the number of equivalence classes for certain equivalence relations defined on the set of functions that map $N$ to $X$. However, in practice, it is often useful to work from scratch, relying on what I called the Fundamental Rule of Counting.

Try Examples Sheet 1, #14. This apparently innocuous question may do your head in, or maybe you'll find it easy. There is really no systematic way to answer a question like this. Try it different ways (perhaps colouring the balls, or not). But it can help to figure out the answer for small $n$ before attempting to find the general formula.

A student asked me afterwards why I choose to talk about $f:N\to X$ and not the other way around. This is because $f(i)$ is to be the item placed in the $i$th list position. If we sample the items with replacement then we can have $f(i)=f(j)$ for some $i\neq j$. I was imagining here that each list position (or box) is to take just one item.

If we are sequentially treating $n$ patients with a drug and recording whether it is a success, a failure or inconclusive we might take $X=\{$success, failure, inconclusive $\}$ and then sample $n$ times from $X$ with replacement. Here $f(i)$ is the result for the $i$th patient. The number of different lists of results we can produce is $x^n=3^n$. The number of different totals ($n_1$ successes, $n_2$ failures, $n_3$ inconclusives) is $\binom{n+x-1}{x-1}=(n+2)(n+1)/2$.
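
Both of these counts are small enough to confirm by enumeration, say for $n=4$ patients. A sketch (the names are mine):

```python
from itertools import product
from math import comb

n, x = 4, 3
lists = set(product(range(x), repeat=n))         # all functions f: N -> X
totals = {tuple(sorted(f)) for f in lists}       # lists counted up to reordering

print(len(lists), "=", x**n)                     # 81 = 3^4
print(len(totals), "=", comb(n + x - 1, x - 1))  # 15 = C(6, 2)
```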

Friday 17 January 2014

Printed notes: Good or Bad?

The lecture notes for Monday's  Lecture 2 (Combinatorial analysis) are now available. I will always make notes available shortly before the lecture.

You should already be able to do Questions 1, 2 on Examples Sheet 1. These questions require no more than use of the definition of classical probability.

I give you course notes so that you can dispense with some tedious copying-out, you can listen better, and you are guaranteed to have an accurate account. But I know there is a danger that this might induce complacency. Writing during a lecture helps to keep you engaged, and so I hope you will still do some of that. I think Tom Korner makes a good point when he writes "many mathematicians find it easier to learn from lectures than from books ... We learn more by watching a house being built than by inspecting it afterwards." (See page 3 of his treatise In Praise of Lectures.) In Lecture 1 I told you how Probability IA can prepare you for a well-paid career, mentioned my treasonable conversation about the Lottery with HM, and emphasised repeatedly what I most wanted you to remember: the definition of classical probability. Can you complete this sentence? Classical probability is concerned with situations in which ...

I hope you experience the building of the theory of Probability and come to understand why Probability is a subject I enjoy. I always intend to say some things that are not in the notes, and to elaborate explanations of things that are.

For most of us, learning mathematics requires repeated exposure to ideas. So I hope that a combination of reading, listening, writing and solving problems will work well for you in this course.

Lecture 1

I spoke about the Problem of Points, which was puzzling people 500 years ago. On pages 32-34 of Grinstead and Snell you can read translations of the letters exchanged by Fermat and Pascal about this problem in 1654. G&S also write:

The problem had been a standard problem in mathematical texts; it appeared in Fra Luca Paccioli's book Summa de Arithmetica, Geometria, Proportioni et Proportionalita, printed in Venice in 1494, in the form:
    "A team plays ball such that a total of 60 points are required to win the game, and each inning counts 10 points. The stakes are 10 ducats. By some incident they cannot finish the game and one side has 50 points and the other 20. One wants to know what share of the prize money belongs to each side. In this case I have found that opinions differ from one to another but all seem to me insufficient in their arguments, but I shall state the truth and give the correct way."
In fact his "correct way" turns out to be wrong!

I mentioned classical, frequentist, and subjective approaches to probability. There are others: such as the propensity approach. Indeed, the question, "What is Probability?" can take us into the realm of philosophy. If I roll a die and do not show you the answer, what is the probability it is a six? Some people might say it is 1/6. Others might say it is 1 or 0, because it either is a six or not. (If it appeals to you to delve into philosophy, then have a look at this article on Probability interpretations.) Fortunately, in our course we work within the context of some mathematical axioms and therefore have a well-defined approach that does not depend on interpretation.

The arcsine law that I mentioned today is the second of the three laws described here. The other arcsine laws concern the proportion of the time that the random walk is positive, and the time at which the random walk achieves its maximum. My purpose in including this in our first lecture was to show you something that is surprising and which is proved in a clever way (by the reflection principle and matching paths in a 1-1 manner).
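
If you would like to see the surprise for yourself, here is a small simulation sketch (the convention for when a step counts as "positive" is one of several reasonable choices). It estimates the distribution of the fraction of time a 1000-step simple random walk spends positive; the histogram comes out U-shaped, in line with the arcsine law, rather than peaked at 1/2 as intuition might suggest.

```python
import random
from collections import Counter

def frac_positive(n, rng):
    """Fraction of n steps that a simple random walk spends positive.
    Convention: step k counts as positive if S_k > 0, or if S_k = 0
    and the previous position was positive."""
    s = pos = 0
    for _ in range(n):
        prev = s
        s += rng.choice((-1, 1))
        if s > 0 or (s == 0 and prev > 0):
            pos += 1
    return pos / n

rng = random.Random(0)
hist = Counter(min(9, int(10 * frac_positive(1000, rng))) for _ in range(5000))
for d in range(10):   # histogram over deciles of [0, 1]; expect a U-shape
    print(f"[{d/10:.1f}, {(d+1)/10:.1f}): {hist[d]}")
```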

There is a space after each post for you to write comments or ask questions using the IntenseDebate commenting system. Please feel free to do so. Other students will probably benefit.

Thursday 9 January 2014

Course starts Friday, 17 January

This is the first post on the blog page for my IA Probability course, starting 17 January 2014.

Sometimes upon leaving a lecture I think, "I wish I had said ...". This blog gives me a place to say it. Or I can use this space to tell you about some extra interesting things, or to talk about a question that a student has asked during a lecture. Or I might comment on an examples sheet question.

Mathematics in this blog is written in LaTeX and rendered by MathJax. You need JavaScript on your viewing device. These pages should look great on a PC, Mac or a tablet like an iPad or Nexus. If it is working properly for you then \$(a+b)^2=a^2+2ab+b^2\$ should be rendered as: $(a+b)^2=a^2+2ab+b^2$.

If you would like to learn more about how I installed MathJax in these pages, see here.