# Infinite Monkey Theorem and the Borel-Cantelli Lemmas

After submitting another two fellowship applications to survive the next year, I was relaxing a little bit by randomly searching the web, when I came across this thing called the “infinite monkey theorem.’’ The theorem says that if an infinite number of monkeys randomly punch on a typewriter, the probability that one of them will write Hamlet (or all of Shakespeare’s work) is equal to one. There are in fact many versions of this theorem, and I’m not sure which one is the real thing, but we’ll see that not only one monkey but an infinite number of monkeys will write Hamlet with probability equal to one.

The theorem is quite amusing not only because of the imagine of a monkey punching on a typewriter but also because it highlights a counter-intuitive result that arises from dealing with infinite things. After all, while it seems plausible that one monkey would eventually succeed, an infinite number of monkeys succeeding in writing Hamlet is a totally different story.

The statement itself is a straightforward corollary of the second Borel-Cantelli Lemma. And, fortunately, I had written a note on the lemma in the past. I recall starting to write it when I had difficulties with understanding limit suprema/infima of sequences of sets. This was when I was reading my very first textbook in mathematical statistics (I think I have chosen the wrong book back then…). Afterwards, the note was updated several times, as new theorems related to limit suprema and infima of sets were encountered. So, here is an edited version of the note, with an extra section on the infinite monkey theorem at the end.

### Suprema and Infima

Let us start with the definition of supremum, $\sup$, and infimum, $\inf$. Consider a partially ordered set $(\Omega, \le)$ and let $A \subset \Omega$ be a subset that is bounded from above. For example, the real line is a partially ordered set with respect to the relation “less than or equal to.’’ An upper bound of $A$ is an element $u\in \Omega$ such that $a \le u$ for all $a\in A$. Further, given a set $U_A$ of upper bounds of $A$, the **supremum** or the **least upper bound** of $A$, denoted by $\sup A$, is the element $\bar u \in U_A$ such that $\bar u \le u$ for all $u\in U_A$. In other words, the supremum is the smallest element of all the upper bounds of $A$. The **infimum** or the **greatest lower bound** is analogously defined as the element $\bar l\in \Omega$ that satisfies $\bar l \ge l$ for all $l \in L_A$, where $L_A$ is the set of all lower bounds of $A$.

A natural question that arises when dealing with upper bounds is “why do we not simply use the maximum of the set $A$ rather than thinking in terms of suprema?” The problem is that, often, the maximum of a set does not exist. A simple example is the open interval $A = \{a \in \mathbb R \, \vert \, x < a < y\}$. For every number $a \in A$, we can always find another number $a’ = (y + a)/2$ that is an element of $A$ but is greater than $a$. Thus, there is no largest number in $A$. On the other hand, the least upper bound is well-defined and is equal to $y$ (and it turns out that every bounded subset $A$ of real numbers has an infimum and supremum, which is quite an important result in mathematical analysis.)

### Limit Superior and Limit Inferior of sequences in $\mathbb R$

We can also think about suprema and infima of sequences of real numbers. Let $\{x_n\}_{n=1}^\infty$ be a bounded sequence in $\mathbb R$. Then the **limit superior** and **limit inferior** of the sequence is defined as

To see what these expressions mean, pick any $n \ge 1$ and think of the *tail* of that sequence, $A_n = \{x_n, x_{n+1},…\}$. For every $n’ > n$, we have

As $A_{n’}$ is a subset of $A_n$ for $n’\ge n$, it follows that the least upper bound of $A_{n’}$ cannot be greater than that of $A_n$, i.e., $\sup A_{n’} \le \sup A_n$. Note further that $\{\sup A_n\}_{n=1}^\infty$ is itself a sequence of real numbers. But as $\sup A_{n’} \le \sup A_{n}$ for all $n’\le n$, we see that $\{\sup A_n\}$ is monotonically decreasing in $n$. Lastly, all bounded monotonic sequences of real numbers have a finite limit. Therefore, it must be the case that the limit of $\{\sup A_n\}_{n}$ is well-defined and finite. It is precisely this limit, i.e., $\lim_{n\rightarrow\infty} \{\sup A_n\}$, which is the limit superior of the sequence $\{x_n\}_{n=1}^\infty$. Intuitively speaking, the further we go down the tail of $\{x_n\}$, the less elements the remaining tail contains, and the least upper bound of the tail can only become smaller until it reaches a stable limit.

We might consider a quite trivial example, which, however, will be helpful in understanding what follows. Suppose that the sequence $\{x_n\}$ does not converge to single real number, but that for all $n \ge N$, elements of the tail $A_N = \{x_N, x_{N+1}, …\}$ take on values only in the set $\{0,1,2\}$, e.g., $A_n = \{0,2,1,0,1,0,…\}$. Notice that if there are *only a finite number* of $2$s in $A_N$, then the $\limsup$ of the sequence cannot be equal to $2$. To see this, let $k$ be the last occurrence of $2$ in the sequence $\{x_n\}$. Then $\sup_{n > k} x_n \ne 2$, as there are no $2$s anymore. On the other hand, if $1$ appears an infinite number of times in the tail of the sequence, there is no natural number $k’$ after which $1$ does not appear anymore. Regardless of how large we choose $k’$, there exists a $k’’ > k’$ such that $x_{k’’} = 1$. Thus even though 0 might appear an infinite number of times in the tail sequences as well, and the sequence does not reach a stable limit, the upper bound of the tail sequence will stay at $1$ after all $2$s are used up, and thus $\limsup_{n\rightarrow \infty} x_n = 1$.

### Limit Superior and Limit Inferior of Sequences of Sets in General

Next, let $\Omega$ be a set and $\mathcal P(\Omega)$ its power set, i.e., the collection of all subsets of $\Omega$. We can construct a partial order on $\mathcal P(\Omega)$ by the relation of set inclusion. Similarly to real numbers, where we write $a\le b$ to mean that $a$ is “smaller or equal’’ to $b$, we might say say that $A$ is “smaller or equal’’ to $B$ if all elements of $A$ are also elements of $B$, which we denote by $A\subseteq B$. Notice that all subsets of $\Omega$ are bounded below by the empty set and from above by $\Omega$ with respect to $\subseteq$, as $\varnothing \subseteq A \subseteq \Omega$ for all $A\in \mathcal P(A)$.

Now, let $\{X_1,X_2,…\}$ be a sequence in $\mathcal P(\Omega)$, i.e., a sequence of subsets of $\Omega$. Then, the limit superior and inferior of the sequence are defined, respectively, as

Boom. These expressions didn’t make any sense to me when I first encountered them.

To understand how these expressions arise, we notice that $\limsup$ and $\liminf$ of sequence sets have a similar structure to that of real sequences. Using the $\limsup$ as the main example, we see that we are concerned with a tail sequences of the form $\{X_k\}_{k=n}^\infty$ for some $n$, and that we are taking the “limit,’’ here the union or intersection, with respect to $n$. So, we would need to understand why the limit is expressed in terms of unions and intersections.

Consider $\bigcup_{n=1}^\infty X_n$. In terms of set inclusion, the “smallest’’ set that contains all elements in the sequence $\{X_n\}_{n=1}^\infty$ is, by definition, their union. Hence $\bigcup X_n$ is the supremum of the sequence $\{X_n\}$. Similarly, $\bigcup_{k=n}^\infty X_k$ is the supremum of the tail sequence $\mathcal Y_n = \{X_n, X_{n+1},…\}$. On the other hand, the “largest” set that is a subset of all $X_n \in \mathcal Y_n$ is the intersection $\bigcap_{k=n}^\infty X_n$, i.e., the infimum of the tail sequence.

Next, we have to think about limits of set sequences. To do so, it is convenient to think of increasing and decreasing sequences, where these terms are, again, defined with respect to set inclusion. A sequence $\{X_1,X_2,…\}$ is said to be **increasing** if $X_k \subseteq X_{k+1}$ for all $k$. Similarly, it is said to be **decreasing** if $X_k \supseteq X_{k+1}$. Suppose we have a finite sequence $\{W_k\}_{k=1}^n$. If this sequence is increasing, then $W_k$ is a subset of $W_n$ for all $k\le n$ and, thus, $\bigcup_{k=1}^n W_k =W_n$. This leads to the definition of a limit as

On the other hand, for a finite decreasing sequence $\{V_k\}_{k=1}^n$, we have that all elements in $V_n$ are also elements of $V_k$ for all $k \le n$. Thus $\bigcap_{k=1}^n V_k = V_n$ and we can define

which consists of the elements in $\Omega$ that appear in all $V_k$. This shows why the limit for increasing/decreasing sequences of sets is expressed in terms of unions and intersections.

Lastly, we combine these these two results about tail sequences and limits. Consider the sequence $\{X_n\}$, and let $\mathcal Y_n$ be the tail sequence starting at $n$, i.e., $\mathcal Y_n = \{X_n, X_{n+1},…\}$. The least upper bound of the tail sequence is $\sup_{k\ge n} \mathcal Y_n = \bigcup_{k=n}^\infty X_k$ and $\{\sup_{k\ge n} \mathcal Y_n\}_{n=1}^\infty$ is itself a sequence of sets, which is decreasing in $n$. That is,

But as we have just shown, for a decreasing sequence of sets, the limit is expressed in terms of intersections. Hence, we have

We might interpret this as follows. Consider an element $\omega \in \bigcup_{k=n}^\infty X_k$. This element is a member of at least one set in the tail sequence $\{X_n, X_{n+1},…\}$. If $\omega\in \bigcap_{n=1}^\infty \bigcup_{k = n}^\infty X_k$, we further require that this is true for all $n\ge 1$. In other words, we require $\omega$ to be an element of $X_1\cup X_2\cup \cdots$, $X_2\cup X_3\cup \cdots$, $X_{n}\cup X_{n+1} \cup \cdots $ and so on. No matter how far down the sequence we go, $\omega$ must be a member of *at least one* of the sets. But as there are infinitely many tail sequences, this implies that $\omega$ has to belong to infinitely many $X_n$’s.

An important thing to notice here is that there are not only infinitely many tail sequences but that each tail sequence consists of infinitely many sets. This implies that there might be also an infinite number of sets to which $\omega$ does *not* belong. For example, suppose $\omega$ is an element of exactly one and only one set of each tail sequence. Then, $\omega$ would belong to infinitely many sets (as there are infinitely many tail sequences). At the same time, $\omega$ would not be an element of the infinitely many sets that constitute each tail sequence.

Next, consider the limit inferior. The infimum of the tail sequence $\mathcal Y_n$ is $\inf_{k\ge n} \mathcal Y_n = \bigcap_{k=n}^\infty X_n$. The sequence $\{\inf_{k\ge n} \mathcal Y_n\}_{n=1}^\infty$ is an increasing sequence in $n$, as

Hence,

Again, to get an intuitive understanding, fix $n$ for the moment. An element $\omega \in \bigcap_{k=n}^\infty X_k$ is a member of *all* $X_n$ in the tail sequence $\mathcal Y_n = \{X_n, X_{n+1},…\}$. There is only a finite number of sets up to $X_n$, but there are infinitely many sets $X_k$ with $k\ge n$; and $\omega$ has to belong to all of them. This means that $\omega$ is a member of *all but a finite number* of sets in the sequence $\{X_n\}$. The set $\liminf X_n$ consists of all $\omega \in \Omega$ that meet this condition.

Notice that an element in $\liminf X_n$ has to be also an element of $\limsup X_n$. To see this, suppose that $\omega \in \liminf_{n\rightarrow\infty} X_n$. Then $\omega \in \bigcup_{n=1}^\infty \bigcap_{k=n}^\infty X_k$, which implies that there exists a natural number $N$ such that for all $n\ge N$, $\omega$ is a member of $X_n$. That is, from the $N$th term onwards in the sequence $\{X_n\}$, $\omega$ is an element to all of the remaining sets, $\{X_{N}, X_{N+1},…\}$. Hence, $\omega$ must be an element of every tail sequence $\mathcal Y_n$, $n\ge 1$, and, accordingly, a member of their intersection. Thus, $\omega \in \bigcap_{n=1}^\infty \mathcal Y_n = \bigcap_{n=1}^\infty \bigcup_{k=n}^\infty X_k= \limsup_{n\rightarrow\infty} X_n.$ So, we have

Intuitively, the same results can be also seen by the fact that if $\omega$ is a member of all but infinitely many $X_n$ ($\liminf$), then it has to be a member of infinitely many $X_n$ ($\limsup$).

Another result which can be shown using DeMorgan’s Laws is

and similarly

### Continuity of Probability

It turns out that limit inferior and superior of sets is extremely important to understand probability, especially asymptotic results.

Consider the probability space $(\Omega, \mathcal F, P)$, where $\Omega$ is the sample space, $\mathcal F$ a sigma algebra of events and $P:\mathcal F \rightarrow [0,1]$ a probability measure. When we say that a function is continuous, we roughly mean that the function preserves limiting processes. As we have considered sequences of sets above, let us think about how this idea can be applied to probabilities of sequences of events.

First, consider a pairwise disjoint sequence of sets, $\{D_n\} \in \mathcal F$. By the countable additivity of probability measure, we have

Next, let $\{A_n\}\in \mathcal F$ be an *increasing* sequence of events. Define $D_1 = A_1$ and $D_n = A_n\setminus A_{n-1}$. As $\{A_n\}$ is an increasing sequence, we have that $A_{n-1} \subseteq A_{n}$. So $D_n$ is a set of those elements in $A_n$ that are not in $A_{n-1}$ or any set preceding $A_{n-1}$ in the sequence. So, the sequence $\{D_n\}$ consists of pairwise disjoint sets. Further $\bigcup_{n=1}^\infty D_n = \bigcup_{n=1}^\infty A_n$ by construction. Hence, by countable additivity of $P$, we have

But, $D_n = A_n \setminus A_{n-1}$ implies that $P(D_n) = P(A_n) - P(A_{n-1})$ for all $n$. Hence, for a fixed $k$, $\sum_{n=1}^k P(D_n) = P(A_k)$. It follows that $\lim_{k\rightarrow\infty} \sum_{n=1}^k P(D_k) = \lim_{k\rightarrow\infty} P(A_k)$. Putting these together, we get

for increasing sequences $\{A_n\}$. Notice how the limit operator moves in and out of the probability function. Being able to do so is what we mean by **continuity of probability**.

For decreasing sequences of sets $\{A_n\}$ we can use the following trick. If $A\subseteq B$, then $A^C \supseteq B^C$. Hence, if $\{A_n\}$ is an decreasing sequence, then $\{A_n^C\}$ is an increasing sequence. So,

Lastly, let $(A_n)$ be an arbitrary sequence of events, not necessarily monotonic. We note that $\bigcup_{k=1}^n A_k$ is an increasing sequence in $n$. Hence,

Similarly, as $\bigcap_{k=1}^n A_k$ is decreasing in $n$, it follows that

Also, we have that

The first step is just the definition of limit superior. The second step follows from the fact that $Y_n = \bigcup_{k=n}^\infty A_k$ is a decreasing sequence, such that $\bigcap_{n=1}^\infty Y_n = \lim_{n\rightarrow\infty} Y_n $, and the last step from the continuity of probability. Analogously, as $\bigcap_{k=n} A_k$ is a increasing sequence, we have

### Borel-Cantelli Lemmas

Once we have understood limit inferior/superior of sequence of sets and the continuity property of probability measure, proving the Borel-Cantelli Lemmas is straightforward. So, here are the lemmas and their proof.

**Theorem**(*First Borel-Cantelli Lemma*)
Let $(\Omega, \mathcal F, P)$ be a probability space. Suppose $\{A_n\}_{n=1}^\infty$ is a sequence of events. If $\sum_{k=1}^\infty P(A_k) < \infty$ then $P(\limsup_{n\rightarrow\infty}A_n) = 0$.

Again, we might ask ourselves what this Lemma is telling us. To start from scratch, each element $\omega\in \Omega$ is an outcome from some random process. An event $A_n$ is a collection of outcomes, and we say that event $A_n$ has occurred if we observe an $\omega \in A_n$ from the process. The probability of the event $A_n$, $P(A_n)$, is thus the probability that we observe an outcome $\omega$ that belongs to $A_n$. Now, the set $\limsup A_n$ is the set of $\omega \in \Omega$ that belong to infinitely many sets in the sequence $\{A_n\}$. That is $\limsup_n A_n = \{\omega \in \Omega: \omega \in A_n \text{ for infinitely many } n\}$. As a sigma-algebra is closed under countable unions and intersections, $\limsup_n A_n$ is an element of $\mathcal F$ as well and, therefore, an event. If the event $\limsup_n A_n$ “occurs,’’ which means that we observe some $\omega \in \limsup_n A_n$, then infinitely many events in $\{A_n\}$ are occurring. Hence, $P(\limsup_n A_n)$ is the probability that infinitely many of the $A_n$ in $\{A_n\}$ occur.

What the first Borel-Cantelli Lemma is saying is, therefore, the following: if the sum of the probabilities of events in the sequence $\{A_n\}$ is finite (read “small’’ or “not too large’’), then the probability that infinitely many of the $A_n$ occur must be zero, i.e., only finitely many of the $A_n$ can occur with positive probability.

So, here is the proof of the lemma.

By the countable subadditivity of $P$, we have

for all $n$. But, by assumption, $\sum_{k=1}^\infty P(A_k) < \infty$. Hence,

Here’s the second Borel-Cantelli Lemma:

**Theorem**(*Second Borel-Cantelli Lemma*)
Let $\{A_n\}_{n=1}^\infty$ be a sequence of independent events. If $\sum_{n=1}^\infty P(A_n) = \infty$ then $P(\limsup_{n\rightarrow\infty} A_n) = 1$.

That is, if the sum of probabilities of a sequence of *independent* events is infinite, then the probability that infinite events from the sequence occur is equal to one.

Note that $1 - P(A_n) \le \exp[-P(A_n)]$ for all $n$ (see, Excursion on Gibbs Inequality section in this post). Further, we have

But by hypothesis the events are independent. It follows that

as, by assumption, $\sum_{k=1}^\infty P(A_k) = \infty$. Hence,

A small note: the definition of independent events is that for any *finite* sub-collection $J$ of the events $\{A_n\}$, the following holds: $P(\cap_{n\in J}A_n)= \prod_{n\in J}P(A_n)$. The generalization to an infinite collection of events follows from the continuity property of $P$.

### So, What About the Monkeys?

The first Borel-Cantelli Lemma is often used in proving the Strong Law of Large Numbers. The Second Lemma is a direct proof of the Infinite Monkey Theorem that was introduced at the start of the post.

Recall that the theorem says that if an infinite number of monkeys randomly punch on a typewriter, one of them will write Hamlet with probability one. We don’t even need that they punch on the typewriter for an infinite amount of time, it is sufficient that each of them punch more characters than the number of characters in Hamlet.

To see the connection to the Borel-Cantelli Lemma, let $A_n$ be the event that monkey $n$ writes Hamlet and let $k$ be the number of characters in the tragedy. Suppose that typewriter has $q$ keys. The probability of the event $A_n$ is then equal to the probability of sampling an ordered set of size $k$ (that is the characters in Hamlet) from a pool of size $q$, were we are sampling with replacement. Hence, $P(A_n) = q^{-k}$, which will be extremely small for reasonably large $q$ and $k$ but still strictly positive. It follows that $\sum_{n=1}^\infty P(A_n) = \infty$, and by the second Borel-Cantelli Lemma, we have $P(\limsup_{n\rightarrow\infty} A_n) = 1$. An infinite number of monkeys will write Hamlet *almost surely*, that is, with probability equal to one.