Wednesday, 12 March 2014

Final lecture notes

The notes have been a work in progress, and many small corrections have been made along the way. You should make sure now to obtain an up-to-date copy.

I appreciate your patience. Students in another year will have perfect notes from the start, but you have had to cope with changing notes and temporary typos. Perhaps it has been some compensation that you have had the fun of seeing a course being designed, modified along the way, and delivered fresh for the first time.

I think that as of now, March 12, 12.30pm, very few typos remain. But if, during your reading this vacation, you find any further typos, please let me know! If you find some part hard to understand because it is too cryptic, also let me know. Perhaps I can make a small improvement. Obviously, I could include more details, and more worked examples, if I allowed myself to write more than 4 pages per lecture, but I think that page limit is a good one. I am trying to cover everything, without being boringly verbose. For lengthier expositions you can go to textbooks.

I have appreciated the comments that some people have made, either in the blog, or privately to me in email. If any of you come across something in your reading that you think would make a nice addition to this course, then do please send me an email about it. For instance, someone has added a comment to Lecture 23 about something recently on the BBC web site about the Zipf distribution.

Non-examinable bits are put in * * just after the title of each lecture. But in most cases the non-examinable bits have some bearing on the examinable topics. So although you would not want to learn the * * bits for regurgitation in exams, I have been hoping that they will help you develop a better overall appreciation of some of the examinable topics.

Lecture 24

The overhead slides I used for our last lecture are here (best downloaded and viewed with a pdf reader, rather than as a web page).

I showed you a bit of large deviation theory and Chernoff's upper bound. The lower bound was quoted without proof (except that in Example 24.1 we did the lower bound for the special case of $B(1,p)$). If perhaps some time mid-summer you have a moment and are curious to see the lower bound proved (which you have learned enough to understand), and more about large deviations, see these slides from a talk I once gave, pages 5–9. This note on the Cramér–Chernoff theorem is also good. Interestingly, Chernoff says in an interview with John Bather that the Chernoff bound should really be named after someone else!
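If you are curious how tight the upper bound is, here is a hypothetical Python sketch (mine, not from the course) comparing Chernoff's bound with the exact tail probability for a sum of Bernoulli r.vs, using the rate function $I(a)=a\log(a/p)+(1-a)\log\frac{1-a}{1-p}$:

```python
import math

# Hypothetical sketch (not from the course notes): Chernoff's upper
# bound for S_n, a sum of n Bernoulli(p) r.vs, says that for a > p,
# P(S_n >= na) <= e^{-n I(a)}, with rate function
# I(a) = a log(a/p) + (1-a) log((1-a)/(1-p)).

def binom_tail(n, p, k0):
    """Exact P(S_n >= k0) for S_n ~ B(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k0, n + 1))

def chernoff_bound(n, p, a):
    rate = a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))
    return math.exp(-n * rate)

n, p, a = 100, 0.5, 0.6
exact = binom_tail(n, p, math.ceil(n * a))
bound = chernoff_bound(n, p, a)
print(exact, bound)  # exact ≈ 0.028, bound ≈ 0.13
assert exact <= bound
```

The bound is loose in absolute terms, but it decays at the correct exponential rate in $n$, which is the point of the theory.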

I have used large deviation theory in my research to analyse buffer overflows in queues. But I know almost nothing about the subject of random matrices. I prepared the content for Section 24.2 because I was intrigued by the results, wanted to learn, and thought it would be something entertaining with which to conclude. I did some of my reading in Chapter 2 of An Introduction to Random Matrices, by Anderson, Guionnet and Zeitouni, and then simplified their treatment to make it suitable for IA.

I thought it was nice that in showing you these two advanced topics, I could bring into play so many of the ideas we have had in our course: Markov and Chebyshev inequalities, moment generating functions, sums of Bernoulli r.vs, Stirling’s formula, the normal distribution, gambler’s ruin, Dyck words, generating functions, and the Central limit theorem.

Feedback. I am pleased that over 80% have said that the course has been interesting (4) or very interesting (5). So I hope you will consider the applicable courses in your next year of study. In Appendix H I have written some things about applicable mathematics courses in IB.

Monday, 10 March 2014

Lecture 23

On the way home today, I realised that I had forgotten to tell you why the Central limit theorem has that name. See below. First, here is a little more history about this important theorem.

Henk Tijms writes in his book, Understanding Probability: Chance Rules in Everyday Life, Cambridge: Cambridge University Press, 2004,

"The Central Limit Theorem for Bernoulli trials was first proved by Abraham de Moivre and appeared in his book, The Doctrine of Chances, first published in 1718. He used the normal distribution to approximate the distribution of the number of heads resulting from many tosses of a fair coin. This finding was far ahead of its time, and was nearly forgotten until the famous French mathematician Pierre-Simon Laplace rescued it from obscurity in his monumental work Théorie Analytique des Probabilités, which was published in 1812.

De Moivre spent his years from age 18 to 21 in prison in France because of his Protestant background. When he was released he left France for England, where he worked as a tutor to the sons of noblemen. Newton had presented a copy of his Principia Mathematica to the Earl of Devonshire. The story goes that, while de Moivre was tutoring at the Earl's house, he came upon Newton's work and found that it was beyond him. It is said that he then bought a copy of his own and tore it into separate pages, learning it page by page as he walked around London to his tutoring jobs. De Moivre frequented the coffeehouses in London, where he started his probability work by calculating odds for gamblers. He also met Newton at such a coffeehouse and they became fast friends. De Moivre dedicated his book to Newton."

I was looking at the Wikipedia article on the Central limit theorem. It mentions two things that I did not already know, but which would be suitable for the television programme QI.

1. There is a quite interesting explanation of why the term "Central" is used.

The actual term "central limit theorem" (in German: "zentraler Grenzwertsatz") was first used by George Pólya in 1920 in the title of a paper. Pólya referred to the theorem as "central" due to its importance in probability theory. According to Le Cam, the French school of probability interprets the word central in the sense that "it describes the behaviour of the centre of the distribution as opposed to its tails".

Personally, I had always thought it was the second of these, but the first is also plausible.

2. There is a quite interesting Cambridge connection.

A proof of a result similar to the 1922 Lindeberg CLT was the subject of Alan Turing's 1934 Fellowship Dissertation for King's College at the University of Cambridge. Only after submitting the work did Turing learn it had already been proved. Consequently, Turing's dissertation was never published.

Friday, 7 March 2014

Lecture 22

You might like to experiment with the bivariate normal. Here is the Mathematica code that I used to plot the joint density function. You can copy and paste this code into Mathematica and then play with viewing it from different angles or changing the value of the correlation, r.

S = ParallelTable[
  Plot3D[PDF[MultinormalDistribution[{1, 1}, {{1, r}, {r, 1}}], {x, y}],
    {x, -3, 4}, {y, -3, 4}, PlotRange -> All,
    MeshFunctions -> {#3 &}, PlotPoints -> 50], {r, {0, 0.6}}]

Inversion of probability and moment generating functions

For discrete random variables, working with the p.g.f. is easy. We have

Theorem 12.2 The distribution of $X$ is uniquely determined by the p.g.f. $p(z)$.

The proof is trivial because the distribution of $X$ is contained in the coefficients of the power series $p(z)=p_0+p_1z+p_2z^2+\cdots$. It is also easy to tell if a given power series is a p.g.f.: all we need to check is that the coefficients are nonnegative and sum to 1.
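As a toy illustration (a hypothetical Python sketch of mine, not from the notes), for $X\sim B(2,\frac12)$ the p.g.f. is $p(z)=(\frac12+\frac12 z)^2$, and expanding the square recovers the distribution from the coefficients:

```python
# Hypothetical sketch: a p.g.f. determines the distribution via its
# power-series coefficients. For X ~ B(2, 1/2), p(z) = (1/2 + z/2)^2,
# and expanding gives P(X=0), P(X=1), P(X=2) directly.

def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

half = [0.5, 0.5]            # coefficients of 1/2 + z/2
pgf = poly_mul(half, half)   # coefficients of (1/2 + z/2)^2

print(pgf)                   # [0.25, 0.5, 0.25]
assert abs(sum(pgf) - 1) < 1e-12 and all(c >= 0 for c in pgf)
```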

However, things are not so easy for moment generating functions. Given a power series $m(\theta)=1+m_1\theta + m_2\frac{\theta^2}{2!} +\cdots$
it is not easy to tell whether there exists a r.v. that has moments $m_1,m_2,\dotsc$.
Also, we had without proof the theorem analogous to Theorem 12.2:

Theorem 21.4 The moment generating function determines the distribution of $X$, provided $m(\theta)$ is finite for all $\theta$ in some interval containing the origin.

To prove Theorem 21.4 we would need to show how to recover the p.d.f. from the m.g.f.
In fact, it is easier to use the characteristic function. Let's take a small look at that.

Today we saw that the moment generating function of $N(0,1)$ is $m(\theta)=e^{\theta^2/2}$. This looks a lot like the p.d.f., which is $f(x)=e^{-x^2/2}/\sqrt{2\pi}$.

The characteristic function (c.f.) $\chi$ is obtained by replacing $\theta$ with $it$, $t$ real,

$\chi(t) = \int_{-\infty}^\infty f(x)e^{itx} dx$. This is $\chi(t)=e^{-t^2/2}$ for $N(0,1)$.

We can compute nearly the same integral of $\chi$ to recover $f$:

$\int_{-\infty}^\infty \chi(t)e^{-itx} dt=\sqrt{2\pi}\,e^{-x^2/2}=2\pi f(x)$.

In general, given the c.f. we can recover the p.d.f. using the inversion formula

$f(x)=\frac{1}{2\pi}\int_{-\infty}^\infty \chi(t)e^{-itx} dt$.

The m.g.f. exists only if all moments are finite. However, the c.f. exists even when moments do not exist, since $e^{itX}$ is just some point on the unit circle in the complex plane, and hence nothing is going off to infinity. The Cauchy distribution is an example of a distribution that has no m.g.f., but does have a c.f., namely $e^{-|t|}$.
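The inversion formula can be checked numerically for $N(0,1)$. Here is a hypothetical Python sketch (mine, not part of the course) that approximates the integral by a Riemann sum and compares it with the p.d.f.:

```python
import cmath
import math

# Hypothetical sketch (not from the notes): numerically verify the
# inversion formula f(x) = (1/2 pi) * integral of chi(t) e^{-itx} dt
# for N(0,1), where chi(t) = e^{-t^2/2}.

def inv_density(x, T=10.0, n=4000):
    """Midpoint Riemann sum for the inversion integral over [-T, T]."""
    dt = 2 * T / n
    total = 0.0
    for k in range(n):
        t = -T + (k + 0.5) * dt
        total += (math.exp(-t * t / 2) * cmath.exp(-1j * t * x)).real * dt
    return total / (2 * math.pi)

def normal_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

print(inv_density(1.0), normal_pdf(1.0))  # both ≈ 0.2420
```

The truncation to $[-10,10]$ is harmless because $\chi(t)=e^{-t^2/2}$ is negligible beyond that range.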

Examples sheet 3

Most of you will have now done Examples Sheet 3. Here are a few notes.

 #3. I have amended this question to read "Prove that for $t > 0$, ...." Why?

 #4. You'll want to compare this with Examples sheet 4 #5.

 #5. What would be the analogous question to this one, posed for continuous r.vs?

 #8. You should know how to answer this in two ways: by using p.g.fs, and also directly via the tower property of conditional expectation.

 #16. How do we know that the answer is the smallest positive root of $a=pa^3+1-p$? We have such a theorem for the extinction probability in a branching process. How can this coin tossing problem be viewed through the lens of some branching process?

 #18. The hint is for the third part. But it is also useful in answering the second part: "Could you load the die so that the totals {2,3,4,5,6,7} are obtained with equal probabilities?"

Wednesday, 5 March 2014

Lecture 21

For Lectures 21–23 I plan to focus only on examinable topics. Lecture 24 should have space for some digressions, my favourite mathematical joke, and some wrapping up.

I mentioned the beta distribution today. There is more about that distribution in Appendix D. I became much better acquainted with this distribution myself when I was doing some work for a reinsurance company last year.

Today I handed out the feedback forms. If you were not at the lecture and would like to give feedback you can do it on-line as described here: Feedback request.

Last few lectures

I have just done some reorganization of the remaining lectures, so that we will finish all examinable material by the end of Lecture 23.

In Lecture 24 I will be free to chat about some other things and wrap up.

Tuesday, 4 March 2014

Feedback request

In tomorrow's lecture, Wednesday March 5, I will distribute the usual feedback forms. Alternatively, in the comfort of your room you can complete this equivalent on-line form, or print the paper form here and hand it to me. If you prefer, you may wait until closer to the last lecture to complete the on-line form or hand me a paper form.

The on-line version will be sent to my email anonymously (unless you choose to sign your name). After reading the responses, I will forward printouts to the Faculty Office.

New to the course this year were
  • printed notes,
  • blog,
  • examples sheet "puzzles" section,
  • follow-up blogs about the examples sheets,
  • problem solving strategies page,
  • non-examinable "fun" topics (arcsine law, Zipf's law, Benford's distribution, information entropy, actuarial mathematics in insurance, stochastic bin-packing, Kelly criterion, random matrices, and so on).
It will be interesting for both me, and perhaps other lecturers, to hear what you thought about those things - and whatever else you wish to say.

I could have skipped all the non-examinable stuff and used the extra time to write out more careful lecture notes. Instead, I have chosen to put (eventually ...) perfect lecture notes on-line and use part of each hour to chat off-schedule about some unconventional things which would not usually be included in a first (or even second) course in Probability, but which I think make interesting connections with our core syllabus.

Monday, 3 March 2014

Lecture 20

Today we saw how to transform random variables. By this means we can discover further interesting relationships between distributions.

I mentioned in this lecture that if $X,Y$ are independent $N(0,1)$ then $W=X^2+Y^2$ and $T=\tan^{-1}(Y/X)$ are independent $\mathscr{E}(1/2)$ and $U[-\pi/2,\pi/2]$. We can use this result in the reverse direction to generate normal random variables. Make two independent samples from $U[0,1]$, say $U$ and $V$. Let $R^2=-2\log U$ (which is $\mathscr{E}(1/2)$) and let $\Theta=2\pi(V-\frac{1}{2})$, so that $\Theta$ is uniform over a full circle (if $\Theta$ were restricted to $(-\pi/2,\pi/2)$ we would only ever obtain $X\geq 0$). Then $X=R\cos\Theta$ and $Y=R\sin\Theta$ are independent $N(0,1)$.
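This is the Box–Muller method for generating normal samples. A hypothetical Python sketch (mine, not from the notes), checking the sample mean and variance:

```python
import math
import random

# Hypothetical sketch (not from the notes): the Box-Muller method.
# Two independent U[0,1] samples U, V give two independent N(0,1)
# samples via R^2 = -2 log U ~ E(1/2) and Theta uniform on a full circle.

def box_muller(rng):
    u, v = rng.random(), rng.random()
    r = math.sqrt(-2.0 * math.log(u))   # R^2 ~ E(1/2)
    theta = 2.0 * math.pi * v           # Theta uniform over a circle
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(1)
samples = [x for _ in range(20000) for x in box_muller(rng)]
mean = sum(samples) / len(samples)
var = sum(x * x for x in samples) / len(samples) - mean ** 2
print(mean, var)  # should be close to 0 and 1
```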

There is a particularly strange fact in Examples sheet 4, #19.

By the way, if (like me) you have ever been puzzled why it is that the absolute value of the Jacobian, $|J|$, gives the volume change of an $n$-dimensional parallelepiped under a linear transformation in $\mathbb{R}^n$, you might enjoy reading this essay A short thing about determinants, or “an attempt to explain volume forms, determinants and adjugates by staying in 3-d”, by Gareth Taylor.

I have also now added to the notes an Appendix C, in which I explain how under a linear transformation a sphere is mapped into an ellipsoid whose volume is greater by a factor of $|J|$. I think that in previous years lecturers have said at this point "you already know this from Vector Calculus". But I am not sure that's the case, and so decided it would be nice to try to remove some of the mystery.
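Here is a hypothetical 2-d illustration of the same fact (my own sketch, not Appendix C): a linear map $A$ sends the unit disc to an ellipse of area $|\det A|\times\pi$, which we can estimate by Monte Carlo:

```python
import math
import random

# Hypothetical 2-d illustration (not Appendix C): the linear map A
# sends the unit disc to an ellipse of area |det A| * pi. We estimate
# the ellipse's area by Monte Carlo sampling over the box [-4, 4]^2.

A = [[2.0, 1.0], [0.5, 1.5]]
detA = A[0][0] * A[1][1] - A[0][1] * A[1][0]        # = 2.5
Ainv = [[A[1][1] / detA, -A[0][1] / detA],
        [-A[1][0] / detA, A[0][0] / detA]]

def in_image(y0, y1):
    """y is in the image of the unit disc iff A^{-1} y is in the disc."""
    x0 = Ainv[0][0] * y0 + Ainv[0][1] * y1
    x1 = Ainv[1][0] * y0 + Ainv[1][1] * y1
    return x0 * x0 + x1 * x1 <= 1.0

rng = random.Random(0)
L, n = 4.0, 200000
hits = sum(in_image(rng.uniform(-L, L), rng.uniform(-L, L))
           for _ in range(n))
area = hits / n * (2 * L) ** 2
print(area, abs(detA) * math.pi)  # both ≈ 7.85
```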

The convolution needed to add two Cauchy random variables is tricky. On page 81, the partial fractions method of computing this integral is to write

$\frac{1}{(1+x^2)(1+(z-x)^2)}=\frac{A+Bx}{1+x^2}+\frac{C+D(z-x)}{1+(z-x)^2}$

for $(A,B,C,D)=\frac{1}{4+z^2}(1,2/z,1,2/z)$. Upon integrating w.r.t. $x$, the terms with $B$ and $D$ give $0$ (by symmetry either side of $0$), and the terms with $A$ and $C$ provide the claimed result of $\frac{2}{\pi(4+z^2)}$.
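A hypothetical Python sketch (mine, not from the notes) that checks this result numerically, computing the convolution integral by a Riemann sum:

```python
import math

# Hypothetical sketch (not from the notes): check numerically that the
# sum of two independent standard Cauchy r.vs has density 2/(pi(4+z^2)),
# by a midpoint Riemann sum for the convolution integral.

def cauchy(x):
    """Density of the standard Cauchy distribution."""
    return 1.0 / (math.pi * (1.0 + x * x))

def conv_density(z, L=200.0, n=40000):
    """Approximate the convolution f*f at z over the range [-L, L]."""
    dx = 2 * L / n
    total = 0.0
    for k in range(n):
        x = -L + (k + 0.5) * dx
        total += cauchy(x) * cauchy(z - x) * dx
    return total

z = 1.0
print(conv_density(z), 2 / (math.pi * (4 + z * z)))  # both ≈ 0.1273
```

The wide truncation range is needed because the Cauchy tails decay only polynomially.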