Posted by: atri | February 22, 2009

## Notes for Lecture 4

The polished version of the scribed notes for Lecture 4 are now online. Thanks to Willmert for his great effort.

Thanks to Luca Trevisan‘s program to convert LaTeX to wordpress suitable format, the notes are also produced below. (Citations are missing.)

===============================================================================

Note: These extraordinarily detailed note are almost entirely due to Willmert. The actual lecture was worth only one page in these notes. Enjoy! –Atri.

1. Counting and Probability

This lecture reviews elementary combinatorics and probability theory. We begin by first reviewing elementary results in counting theory, including standard formulas for counting permutations and combinations. Then, the axioms of probability and basic facts concerning probability distributions are presented.

2. Counting

Counting theory tries to answer the question “How many?” or “How many orderings of ${n}$ distinct elements are there?” In this section, we review the elements of counting theory. A set of items that we wish to count can sometimes be expressed as a union of disjoint sets or as a Cartesian product of sets. The rule of sum says that the number of ways to choose an element from one of two disjoint sets is the sum of the cardinalities of the sets. That is, if ${A}$ and ${B}$ are two finite sets with no members in common, then ${|A\cup B|=|A|+|B|}$. The rule of product says that the number of ways to choose an ordered pair is the number of ways to choose the first element times the number of ways to choose the second element. That is, if ${A}$ and ${B}$ are two finite sets, then ${|A\times B|=|A|\cdot |B|}$.

A string over a finite set ${S}$ is a sequence of elements of ${S}$. We sometimes call a string of length ${k}$ a k-string. A substring ${s'}$ of a string ${s}$ is an ordered sequence of consecutive elements of ${s}$. A k-substring of a string is a substring of length ${k}$. For example, ${010}$ is a ${3}$-substring of ${01101001}$ (the ${3}$-substring that begins in position ${4}$), but ${111}$ is not a substring of ${01101001}$. A ${k}$-string over a set ${S}$ can be viewed as an element of the Cartesian product ${S^{k}}$ of ${k}$-tuples; thus, there are ${|S|^{k}}$ strings of length ${k}$. For example, the number of binary ${k}$-strings is ${2^{k}}$. Intuitively, to construct a ${k}$-string over an ${n}$-set, we have ${n}$ ways to pick the first element; for each of these choices, we have ${n}$ ways to pick the second element; and so forth ${k}$ times. This construction leads to the ${k}$-fold product ${n\cdot n\cdots n=n^{k}}$ as the number of ${k}$-strings.

A permutation of a finite set ${S}$ is an ordered sequence of all the elements of ${S}$, with each element appearing exactly once. For example, if ${S=\{a,b,c\}}$, there are ${6}$ permutations of ${S}$ :

$\displaystyle abc,acb,bac,bca,cab,cba.$

There are ${n!}$ permutations of a set of ${n}$ elements, since the first element of the sequence can be chosen in ${n}$ ways, the second in ${n-1}$ ways, the third in ${n-2}$ ways, and so on \cite{Algorithms}.

A k-permutation of ${S}$ is an ordered sequence of ${k}$ elements of ${S}$, with no element appearing more than once in the sequence. Thus, an ordinary permutation is just an ${n}$-permutation of an ${n}$-set. The twelve ${2}$-permutations of the set ${\{a,b,c,d\}}$ are

$\displaystyle ab,ac,ad,ba,bc,bd,ca,cb,cd,da,db,dc,$

where we have used the shorthand of denoting the ${2}$-set ${\{a,b\}}$ by ${ab}$, and on on. The number of ${k}$-permutations of an ${n}$-set is

since there are ${n}$ ways of choosing the first element, ${n-1}$ ways of choosing the second element, and so on until ${k}$ elements are selected, the last being a selection from ${n-k+1}$ elements.

A k-combination of an ${n}$-set ${S}$ is simply a ${k}$-subset of ${S}$. There are six ${2}$-combinations of the ${4}$-set ${\{a,b,c,d\}}$:

$\displaystyle ab,ac,ad,bc,bd,cd.$

We can construct a ${k}$-combination of an ${n}$-set by choosing ${k}$ distinct elements from the ${n}$-set. The number of ${k}$-combinations of an ${n}$-set can be expressed in terms of the number of ${k}$-permutations of an ${n}$-set. For every ${k}$-combination, there are exactly ${k!}$ permutations of its elements, each of which is a distinct ${k}$-permutation of the ${n}$-set. Thus the number of ${k}$-combinations of an ${n}$-set is the number of ${k}$-permutations divided by ${k!}$; from Equation (1), this quantity is

For ${k=0}$, this formula tells us that the number of ways to choose ${0}$ elements from an ${n}$-set is ${1}$ (not ${0}$), since ${0!=1}$.

We use the notation ${\binom{n}{k}}$ (read “${n}$ choose ${k}$“) to denote the number of ${k}$-combinations of an ${n}$-set. From Equation (2), we have

This formula is symmetric in ${k}$ and ${n-k}$:

These two numbers are known as binomial coefficients, due to their appearance in the binomial expansion:

A special case of the binomial expansion occurs when ${x=y=1}$:

This formula corresponds to counting the ${2^n}$ binary ${n}$-strings by the number of ${1}$‘s they contain: there are ${\binom{n}{k}}$ binary ${n}$-strings containing exactly ${k}$ ${1}$‘s, since there are ${\binom{n}{k}}$ ways to choose ${k}$ out of the ${n}$ positions in which to place the ${1}$‘s.

We sometimes need to bound the size of a binomial coefficient. For ${1\leq k \leq n}$, we have the lower bound

$\displaystyle \binom{n}{k}=\frac{n(n-1)\cdots (n-k+1)}{k(k-1)\cdots 1}$

$\displaystyle =\left(\frac{n}{k}\biggr) \biggl(\frac{n-1}{k-1}\right) \cdots \left(\frac{n-k+1}{1}\right)$

Taking advantage of the inequality ${k!\geq (k/e)^{k}}$, we obtain the upper bounds

$\displaystyle \binom{n}{k}=\frac{n(n-1)\cdots (n-k+1)}{k(k-1)\cdots 1}$

For all ${0\leq k\leq n}$, we can use induction to prove the bound

where for convenience we assume that ${0^{0}=1}$. For ${k=\lambda n}$, where ${0\leq \lambda \leq 1}$, this bound can be rewritten as

$\displaystyle \binom{n}{\lambda n}\leq \frac{n^{n}}{(\lambda n)^{\lambda n} ((1-\lambda)n)^{(1-\lambda)n}}$

where

is the (binary) entropy function and where, for convenience, we assume that ${0\cdot \lg{(0)}=0}$, so that ${H(0)=H(1)=0}$.

3. Probability

This section reviews basic probability theory. We define probability in terms of a sample space ${S}$, which is a set whose elements are called elementary events. Each elementary event can be viewed as a possible outcome of an experiment. For the experiment of flipping two distinguishable coins, we can view the sample space as consisting of the set of all possible ${2}$-strings over ${\{H,T\}}$:

$\displaystyle S=\{HH,HT,TH,TT\}.$

An event is a subset\footnote{For a general probability distribution, there may be some subsets of the sample space ${S}$ that are not considered to be events. This situation usually arises when the sample space is uncountably infinite. The main requirement is that the set of events of a sample space be closed under the operations of taking the complement of an event, forming the union of a finite or countable number of events, and taking the intersection of a finite or countable number of events.} of the sample space ${S}$. For example, in the experiment of flipping two coins, the event of obtaining one head and one tail is ${\{HT,TH\}}$. The event ${S}$ is called the certain event, and the event ${\emptyset}$ is called the null event. We say that two events ${A}$ and ${B}$ are mutually exclusive if ${A\cap B=\emptyset}$. We sometimes treat an elementary event ${s\in S}$ as the event ${\{s\}}$. By definition, all elementary events are mutually exclusive.

A probability distribution ${Pr\left[\cdot\right]}$ on a sample space ${S}$ is a mapping from events of ${S}$ to real numbers such that the following probability axioms are satisfied:

1. ${Pr\left[A\right]\geq 0}$ for any event ${A}$.
2. ${Pr\left[S\right]=1}$.
3. ${Pr\left[A\cup B\right]=Pr\left[A\right]+Pr\left[B\right]}$ for any two mutually exclusive events ${A}$ and ${B}$. More generally, for any (finite or countably infinite) sequence of events ${A_{1},A_{2},\ldots}$ that are pairwise mutually exclusive,

$\displaystyle Pr\left[\bigcup_{i}A_{i}\right]=\sum_{i}Pr\left[A_{i}\right].$

We call ${Pr\left[A\right]}$ probability of the event ${A}$. We note here that axiom ${2}$ is a normalization requirement: there is really nothing fundamental about choosing ${1}$ as the probability of the certain event, except that it is natural and convenient. Several results follow immediately from these axioms and basic set theory. The null event ${\emptyset}$ has probability ${Pr\left[\emptyset\right]=0}$. If ${A\subseteq B}$, then ${Pr\left[A\right]\leq Pr\left[B\right]}$. Using ${\overline{A}}$ to denote the event ${S-A}$ (the complement of ${A}$), we have ${Pr\left[\;\overline{A}\;\right]=1-Pr\left[A\right]}$. For any two events ${A}$ and ${B}$,

The generalization of Equation (12) is also known as the union bound and it is given by:

$\displaystyle Pr\left[\bigcup_{i=1}^{n} A_{i}\right]\leq \sum_{i=1}^{n}Pr\left[A_{i}\right].$

4. Markov’s inequality:

Theorem 1 For any nonnegative random variable ${X}$ and any ${t>0}$,

$\displaystyle Pr\left[X\geq t\right]\leq \frac{\mathbb{E}\left[X\right]}{t}.$

Proof:

$\displaystyle \mathbb{E}\left[X\right]=\int_{0}^{\infty}x f_{X}(x) dx \geq \int_{t}^{\infty}x f_{X}(x) dx \geq t \int_{t}^{\infty} f_{X}(x) dx =t\cdot Pr\left[X\ge t\right].$

$\Box$

5. Suppose a random variable ${X}$ takes values ${X=i}$, where ${i=0,1,2,\ldots}$ with probabilities

$\displaystyle Pr\left[X=i\right]=P_{X}(i).$

Define

$\displaystyle u(n-k)= 1 \text{ if }n\geq k,$

$\displaystyle 0 \text{ otherwise}.$

It follows that

$\displaystyle Pr\left[X\geq k\right]=\sum_{n=k}^{\infty}P_{X}(n) =\sum_{n=0}^{\infty}P_{X}(n)u(n-k) \leq\sum_{n=0}^{\infty}P_{X}(n)e^{t(n-k)} \text{ for }\:t\geq 0.$

The last line follows from the fact that

$\displaystyle e^{t(n-k)}\geq u(n-k)\text{ for }t\geq 0.$

Next, let ${\theta_{X}(t)=P_{X}(n)e^{tn}}$ and note that

$\displaystyle \sum_{n=0}^{\infty}P_{X}(n)e^{t(n-k)}=e^{-tk} P_{X}(n)e^{tn} =e^{-tk} \theta_{X}(t).$

Hence, we have established the important result

$\displaystyle Pr\left[X\geq k\right]\leq e^{-tk} \theta_{X}(t).$

The Chernoff bound is determined by minimizing ${e^{-tk} \theta_{X}(t)}$:

$\displaystyle Pr\left[X\geq k\right]\leq \underset{t\geq 0}{\min}\: e^{-tk} \theta_{X}(t).$

Going back to the coin-flipping example, suppose that each of the four elementary events has probability ${1/4}$. Then the probability of getting at least one head is

$\displaystyle Pr\left[HH,HT,TH\right]=Pr\left[HH\right]+Pr\left[HT\right]+ Pr\left[TH\right]=\frac{3}{4}.$

Alternatively, since the probability of getting strictly less than one head is ${Pr\left[TT\right]=1/4}$, the probability of getting at least one head is ${1-1/4=3/4}$. A probability distribution is discrete if it is defined over a finite or countably infinite sample space. Let ${S}$ be the sample space. Then for any event ${A}$,

$\displaystyle Pr\left[A\right]=\sum_{s\in A} Pr\left[s\right],$

since elementary events, specifically those in ${A}$, are mutually exclusive. If ${S}$ is finite and every elementary event ${s\in S}$ has probability

$\displaystyle Pr\left[s\right]=\frac{1}{|S|},$

then we have the uniform probability distribution on ${S}$. In such case the experiment is often described as “picking an element of ${S}$ at random.” As an example, consider the process of flipping a fair coin, one for which the probability of obtaining head is the same as the probability of obtaining a tail, that is, ${1/2}$. If we flip the coin ${n}$ times, we have the uniform probability distribution defined on the sample space ${S=\{H,T\}^{n}}$, a set of size ${2^{n}}$. Each elementary event in ${S}$ can be represented as a string of length ${n}$ over ${\{H,T\}}$, and each occurs with probability ${1/2^{n}}$. The event

$\displaystyle A=\{\text{exactly }k \text{ heads and exactly } n-k \text{ tails occur}\}$

is a subset of ${S}$ of size ${|A|=\binom{n}{k}}$, since there are ${\binom{n}{k}}$ strings of length ${n}$ over ${\{H,T\}}$ that contain exactly ${k}$ ${H}$‘s. The probability of event ${A}$ is thus ${Pr\left[A\right]=\binom{n}{k}/2^{n}}$.

Sometimes we have some prior partial knowledge about the outcome of an experiment. For example, suppose that a friend has flipped two fair coins and has told you that at least one of the coins showed a head. what is the probability that both coins are heads? The information given eliminates the possibility of two tails. the three remaining elementary events are equally likely, so we infer that each occurs with probability ${1/3}$. Since only one of these elementary events shows two heads, the answer to our question is ${1/3}$.

Conditional probability formalizes the notion of having prior partial knowledge of the outcome of an experiment. The conditional probability of an event ${A}$ given that another event ${B}$ occurs is defined to be

whenever ${Pr\left[B\right]\neq 0}$. We read “${Pr\left[A|B\right]}$” as “the probability of ${A}$ given ${B}$.” Intuitively, since we are given that event ${B}$ occurs, the event that ${A}$ also occurs is ${A\cap B}$. That is, ${A\cap B}$ is the set of outcomes in which both ${A}$ and ${B}$ occur. Since the outcome is one of the elementary events in ${B}$, we normalize the probabilities of all the elementary events in ${B}$ by dividing them by ${Pr\left[B\right]}$, so that they sum to ${1}$. The conditional probability of ${A}$ given ${B}$ is, therefore, the ratio of the probability of event ${A\cap B}$ to the probability of event ${B}$. In the example above, ${A}$ is the event that both coins are heads, and ${B}$ is the event that at least one coin is a head. Thus

$\displaystyle Pr\left[A|B\right]=(1/4)/(3/4)=1/3.$

Two events are “independent” if

$\displaystyle Pr\left[A\cap B\right]=Pr\left[A\right]Pr\left[B\right],$

which is equivalent, if ${Pr\left[B\right]\neq 0}$, to the condition

$\displaystyle Pr\left[A|B\right]=Pr\left[A\right].$

For example, suppose that two fair coins are flipped and that the outcomes are independent. Then the probability of two heads is ${(1/2)(1/2)=1/4}$. Now suppose that one event is that the first coin comes up heads and the other event is that the coins come up differently. Each of these events occurs with probability ${1/2}$, and the probability that both events occur is ${1/4}$; thus, according to the definition of independence, the events are independent–even though one might think that both events depend on the first coin. Finally, suppose that the coins are welded together so that they both fall heads or both fall tails and that the two possibilities are equally likely. The the probability that each coin comes up heads is ${1/2}$, but the probability that they both come up heads is ${1/2\neq (1/2)(1/2)}$. Consequently, the event that one comes up heads and the event that the other comes up heads are not independent.

A collection ${A_{1},A_{2},\ldots,A_{n}}$ of events is said to be pairwise independent if

$\displaystyle Pr\left[A_{i}\cap A_{j}\right]=Pr\left[A_{i}\right]Pr\left[A_{j}\right]$

for all $1\leq i. We say that they are (mutually) independent if every ${k}$-subset ${A_{i_{1}},A_{i_{2}},\ldots,A_{i_{n}}}$ of the collection, where ${2\leq k\leq n}$ and ${1\leq i_{1} < i_{2} < \cdots < i_{k} \leq n}$, satisfies

$\displaystyle Pr\left[A_{i_{1}}\cap A_{i_{2}} \cap \cdots \cap A_{i_{k}}\right]= Pr\left[A_{i_{k}}\right]Pr\left[A_{i_{2}}\right]\cdots Pr\left[A_{i_{k}}\right].$

For example, suppose we flip two fair coins. Let ${A_{1}}$ be the event that the first coin is heads, let ${A_{2}}$ be the event that the second coin is heads, and let ${A_{3}}$ be the event that the two coins are different. We have

$\displaystyle Pr\left[A_{1}\right]=1/2,$

$\displaystyle Pr\left[A_{2}\right]=1/2,$

$\displaystyle Pr\left[A_{3}\right]=1/2,$

$\displaystyle Pr\left[A_{1}\cap A_{2}\right]=1/4,$

$\displaystyle Pr\left[A_{1}\cap A_{3}\right]=1/4,$

$\displaystyle Pr\left[A_{2}\cap A_{3}\right]=1/4,$

$\displaystyle Pr\left[A_{1}\cap A_{2}\cap A_{3}\right]=0.$

Since for $1\leq i , we have $Pr[A_i\cap A_j]=Pr[A_i]Pr[A_j]=1/4$, the events $A_1,A_2$ and $A_3$ are pairwise independent. The events are not mutually independent, however, because $Pr[A_1\cap A_2\cap A_3]=0$ and $Pr[A_1]Pr[A_2]Pr[A_3]=1/8\neq 0$. From definition of conditional probability (13), it follows that for two events ${A}$ and ${B}$, each with nonzero probability,

Solving for ${Pr\left[A|B\right]}$, we obtain

which is known as Bayes’s theorem. The denominator ${Pr\left[B\right]}$ is a normalizing constant that we can express as follows. Since ${B=(B\cap A)\cup (B\cap \overline{A}\;)}$ and ${B\cap A}$ and ${B\cap \overline{A}}$ are mutually exclusive events,

$\displaystyle Pr\left[B\right]=Pr\left[B\cap A\right]+Pr\left[B\cap \overline{A}\;\right]$

$\displaystyle =Pr\left[A\right]Pr\left[B|A\right]+Pr\left[\;\overline{A}\;\right] Pr\left[B|\;\overline{A}\;\right].$

Substituting into Equation (15), we obtain an equivalent form of Bayes’s theorem:

$\displaystyle Pr\left[A|B\right]=\frac{Pr\left[A\right]Pr\left[B|A\right]}{Pr\left[A\right] Pr\left[B|A\right]+ Pr\left[\;\overline{A}\;\right]Pr\left[B|\;\overline{A}\;\right]}.$

Bayes’s theorem can simplify the computing of conditional probabilities. For example, suppose that we have a fair coin and a biased coin that always comes up heads. We run an experiment consisting of three independent events: one of the two coins is chosen at random, the coin is flipped once, and then it is flipped again. Suppose that the chosen coin comes up heads both times. What is the probability that is is biased? We solve this problem using Bayes’s theorem. Let ${A}$ be the event that the biased coin is chosen, and let ${B}$ be the event that the coin comes up heads both times. We wish to determine ${Pr\left[A|B\right]}$. We have ${Pr\left[A\right]=1/2}$, ${Pr\left[B|A\right]=1}$, ${Pr\left[\;\overline{A}\;\right]=1/2}$, and ${Pr\left[B|\overline{A}\;\right]=1/4}$; hence,

$\displaystyle Pr\left[A|B\right]=\frac{(1/2)\cdot 1}{(1/2)\cdot 1+(1/2)\cdot (1/4)} =4/5.$

4. Discrete random variables

A (discrete) random variable ${X}$ is a function from a finite or countably infinite sample space ${S}$ to the real numbers. It associates a real number with each possible outcome of an experiment, which allows us to work with the probability distribution induced on the resulting set of numbers. Random variables can also be defined for uncountably infinite sample spaces. For our purposes, we shall assume that random variables are discrete.

For a random variable ${X}$ and a real number ${x}$, we define the event ${X=x}$ to be ${\{s\in S : X(s)=x\}}$; thus,

$\displaystyle Pr\left[X=x\right]=\sum_{\{s\in S : X(s)=x\}}Pr\left[s\right].$

The function

$\displaystyle f(x)=Pr\left[X=x\right]$

is the probability density function of the random variable ${X}$. From the probability axioms, ${Pr\left[X=x\right]\geq 0}$ and ${\sideset{}{_x}\sum Pr\left[X=x\right]=1}$. As an example, consider the experiment of rolling a pair of ordinary ${6}$-sided dice. There are ${36}$ possible elementary events in the sample space. We assume that the probability distribution is uniform, so that each elementary event ${s\in S}$ is equally likely: ${Pr\left[s\right]=1/36}$. Define the random variable ${X}$ to be the maximum of the two values showing on the dice. We have ${Pr\left[X=3\right]=5/36}$, since ${X}$ assigns a value of ${3}$ to ${5}$ of the ${36}$ possible elementary events, namely ${(1,3)}$, ${(2,3)}$, ${(3,3)}$, ${(3,2)}$, and ${(3,1)}$. It is common for several random variables to be defined on the same sample space. If ${X}$ and ${Y}$ are random variables, the function

$\displaystyle f(x,y)=Pr\left[X=x\text{ and }Y=y\right]$

is the joint probability density function of ${X}$ and ${Y}$. For a fixed value y,

$\displaystyle Pr\left[Y=y\right]=\sum_{x}Pr\left[X=x\text{ and }Y=y\right],$

and similarly, for a fixed value ${x}$,

$\displaystyle Pr\left[X=x\right]=\sum_{y}Pr\left[X=x\text{ and }Y=y\right].$

Using the definition (13) of conditional probability, we have

$\displaystyle Pr\left[X=x|Y=y\right]=\frac{Pr\left[X=x\text{ and }Y=y\right]}{Pr\left[Y=y\right]}.$

We define two random variables ${X}$ and ${Y}$ to be independent if for all ${x}$ and ${y}$, the events ${X=x}$ and ${Y=y}$ are independent or, equivalently, if for all ${x}$ and ${y}$, we have ${Pr\left[X=x\text{ and }Y=y\right]=Pr\left[X=x\right]Pr\left[Y=y\right]}$.

Given a set of random variables defined over the same sample space, one can define new random variables as sums, products, or other functions of the original variables. The simplest and most useful summary of the distribution of a random variable is the “average” of the values it takes on. The expected value (or, synonymously, expectation or mean) of a discrete random variable ${X}$ is

which is well defined if the sum is finite or converges absolutely. Sometimes the expectation of ${X}$ is denoted by ${\mu_{X}}$ or, when the random variable is apparent from context, simply by ${\mu}$.

Consider a game in which you flip two fair coins. You earn $${3}$ for each head but lose$${2}$ for each tail. The expected value of the random variable ${X}$ representing your earnings is

$\displaystyle \mathbb{E}\left[X\right]=6\cdot Pr\left[2 H\right]+ 1\cdot Pr\left[1 H, 1 T\right] -4\cdot Pr\left[2 T\right] =6(1/4)+1(1/2)-4(1/4) =1.$

The expectation of the sum of two random variables is the sum of their expectations, that is,

whenever ${\mathbb{E}[X]}$ and ${\mathbb{E}[Y]}$ are defined. This property extends to finite and absolutely convergent summations of expectations, and it is called linearity of expectation:

$\displaystyle \mathbb{E}\left[\sum_{i=1}^{n}X_{i}\right]=\mathbb{E}\left[X_{1}\right]+ \mathbb{E}\left[X_{2}\right]+\cdots +\mathbb{E}\left[X_{n}\right].$

If ${X}$ is any random variable, any function ${g(X)}$ defines a new random variable ${g(X)}$. If the expectation of ${g(X)}$ is defined, then

$\displaystyle \mathbb{E}[g(x)]=\sum_{x}g(x)Pr\left[X=x\right].$

Letting ${g(x)=ax}$, we have for any constant ${a}$,

Consequently, expectations are linear: for any two random variables ${X}$ and ${Y}$ and any constant ${a}$,

When two random variables ${X}$ and ${Y}$ are independent and each has a defined expectation,

$\displaystyle \mathbb{E}[XY]=\sum_{x}\sum_{y} xy Pr\left[X=x\text{ and }Y=y\right]$

$\displaystyle =\sum_{x}\sum_{y} xy Pr\left[X=x\right] Pr\left[Y=y\right]$

$\displaystyle =\left(\sum_{x} x Pr\left[X=x\right]\right) \left(\sum_{y} y Pr\left[Y=y\right]\right)$

$\displaystyle =\mathbb{E}[X]\mathbb{E}[Y].$

In general, when ${n}$ random variables ${X_{1},X_{2},\ldots,X_{n}}$ are mutually independent,

When a random variable ${X}$ takes on values from the natural numbers ${\mathbb{N}=\{0,1,2,\ldots\}}$, there is a nice formula for its expectation:

$\displaystyle \mathbb{E}[X]=\sum_{i=0}^{\infty} i Pr\left[X=i\right]$

$\displaystyle =\sum_{i=0}^{\infty} i \left(Pr\left[X\geq i\right]- Pr\left[X\geq i+1\right]\right)$

since each term ${Pr\left[X\geq i\right]}$ is added in ${i}$ times and subtracted out ${i-1}$ times (except ${Pr\left[X\geq 0\right]}$, which is added in ${0}$ times and not subtracted out at all). The variance of a random variable ${X}$ with mean ${\mathbb{E}[X]}$ is

$\displaystyle Var[X]=\mathbb{E}\bigl[(X-\mathbb{E}[X])^{2}\bigr]$

$\displaystyle =\mathbb{E}\bigl[X^{2}-2 X \mathbb{E}[X]+\mathbb{E}^{2}[X]\bigr]$

$\displaystyle =\mathbb{E}\bigl[X^{2}\bigr]- 2 \mathbb{E}[X \mathbb{E}[X]]+\mathbb{E}^{2}[X]$

The justification for the equalities ${\mathbb{E}\left[\mathbb{E}^{2}[X]\right]=\mathbb{E}^{2}[X]}$ and ${\mathbb{E}[X \mathbb{E}[X]]=\mathbb{E}^{2}[X]}$ is that ${\mathbb{E}[X]}$ is not a random variable but simply a real number, which means that Equation (18) applies (with ${a=\mathbb{E}[X]}$). Equation (22) can be rewritten to obtain an expression for the expectation of the square of a random variable:

The variance of a random variable ${X}$ and the variance of ${aX}$ are related:

$\displaystyle Var[a X]=a^{2} Var[X].$

When ${X}$ and ${Y}$ are independent random variables,

$\displaystyle Var[X+Y]=Var[X]+Var[Y].$

In general, if ${n}$ random variables ${X_{1},X_{2},\ldots,X_{n}}$ are pairwise independent, then

The standard deviation of a random variable X is the positive square root of the variance of ${X}$. The standard deviation of a random variable ${X}$ is sometimes denoted ${\sigma_{X}}$ or simply ${\sigma}$ when the random variable ${X}$ is understood from context. With this notation, the variance of ${X}$ is denoted ${\sigma^{2}}$.

5. The geometric and binomial distributions

A coin flip is an instance of a Bernoulli trial, which is defined as an experiment with only two possible outcomes: success, which occurs with probability ${p}$, and failure, which occurs with probability ${q=1-p}$. When we speak of Bernoulli trials collectively, we mean that the trials are mutually independent and, unless we specifically say otherwise, that each has the same probability ${p}$ for success. Two important distributions arise from Bernoulli trials: the geometric distribution and the binomial distribution.

Suppose we have a sequence of Bernoulli trials, each with a probability ${p}$ of success and a probability ${q=1-p}$ of failure. How many trials occur before we obtain a success? Let the random variable ${X}$ be the number of trials needed to obtain a success. Then ${X}$ has values in the range ${\{1,2,\ldots\}}$, and for ${k\geq 1}$,

since we have ${k-1}$ failures before the one success. A probability distribution satisfying Equation (25) is said to be a geometric distribution.

Assuming ${p<1}$, the expectation of a geometric distribution can be calculated using

$\displaystyle \mathbb{E}[X]=\sum_{k=1}^{\infty}k q^{k-1} p$

$\displaystyle =\frac{p}{q}\sum_{k=0}^{\infty}k q^{k}$

$\displaystyle =\frac{p}{q}\cdot\frac{q}{(1-q)^{2}}$

Thus, on average, it takes ${1/p}$ trials before we obtain a success, an intuitive result. The variance, which can be calculated similarly, is

As an example, suppose we repeatedly roll two dice until we obtain either a seven or an eleven. Of the ${36}$ possible outcomes, ${6}$ yield a seven and ${2}$ yield an eleven. Thus, the probability of success is ${p=8/36=2/9}$, and we must roll ${1/p=9/2=4.5}$ times on average to obtain a seven or eleven. How many successes occur during ${n}$ Bernoulli trials, where a success occurs with probability ${p}$ and a failure with probability ${q=1-p}$? Define the random variable ${X}$ to be the number of successes in ${n}$ trials. Then ${X}$ has values in the range ${\{0,1,\ldots,n\}}$, and for ${k=0,\ldots,n}$,

since there are ${\binom{n}{k}}$ ways to pick which ${k}$ of the ${n}$ trials are successes, and the probability that each occurs is ${p^{k}q^{n-k}}$. A probability distribution satisfying Equation (28) is said to be a binomial distribution. For convenience, we define the family of binomial distributions using the notation

The name “binomial” comes from the fact that Equation (29) is the ${k}$th term of the expansion of ${(p+q)^{n}}$. Consequently, since ${p+q=1}$,

as is required by axiom ${2}$ of the probability axioms. We can compute the expectation of a random variable having a binomial distribution from Equation (30). Let ${X}$ be a random variable that follows the binomial distribution ${b(k;n,p)}$, and let ${q=1-p}$. By the definition of expectation, we have

$\displaystyle \mathbb{E}[X]=\sum_{k=0}^{n}k b(k;n,p)$

$\displaystyle =\sum_{k=0}^{n} k \binom{n}{k} p^{k} q^{n-k}$

$\displaystyle =np\sum_{k=0}^{n} \binom{n-1}{k-1} p^{k-1} q^{n-k}$

$\displaystyle =np\sum_{k=0}^{n-1} \binom{n-1}{k} p^{k} q^{(n-1)-k}$

By using the linearity of expectation, we obtain the same result with substantially less algebra. Let ${X_{i}}$ be the random variable describing the number of successes in the ${i}$th trial. Then ${\mathbb{E}[X_{i}]=p\cdot 1+q\cdot 0=p}$, and by linearity of expectation Equation (19), the expected number of successes for ${n}$ trials is

$\displaystyle \mathbb{E}[X]=\mathbb{E}\left[\sum_{i}^{n} X_{i} \right] =\sum_{i=1}^{n} \mathbb{E}[X_{i}]$

$\displaystyle =\sum_{i=1}^{n} p =np.$

The same approach can be used to calculate the variance of the distribution. Using Equation (22), we have ${Var[X_{i}]=\mathbb{E}\left[X_{i}^{2}\right]-\mathbb{E}^{2}[X_{i}]}$. Since ${X_{i}}$ only takes on the values ${0}$ and ${1}$, we have ${\mathbb{E}\left[X_{i}^{2}\right]=\mathbb{E}[X_{i}]=p}$, and hence

To compute the variance of ${X}$, we take advantage of the independence of the ${n}$ trials; thus, by Equation (32),

$\displaystyle Var[X]=Var{\left[\sum_{i=1}^{n} X_{i}\right]}$

The binomial distribution ${b(k;n,p)}$ increases as ${k}$ runs from ${0}$ to ${n}$ until it reaches the mean ${np}$, and then it decreases. We can prove that the distribution always behaves in this manner by looking at the ratio of successive terms:

$\displaystyle \frac{b(k;n,p)}{b(k-1;n,p)}=\frac{\binom{n}{k}p^{k}q^{n-k}} {\binom{n}{k-1}p^{k-1}q^{n-k+1}}$

This ratio is greater than ${1}$ precisely when ${(n+1)p-k}$ is positive. Consequently, ${b(k;n,p)>b(k-1;n,p)}$ for ${k<(n+1)p}$ (the distribution increases), and ${b(k;n,p)(n+1)p}$ (the distribution decreases). If ${k=(n+1)p}$ is an integer, then ${b(k;n,p)=b(k-1;n,p)}$, so the distribution has two maxima: at ${k=(n+1)p}$ and at ${k-1=(n+1)p-1=np-q}$. Otherwise, it attains a maximum at the unique integer ${k}$ that lies in the range ${np-q<(n+1)p}$. The following lemma provides an upper bound on the binomial distribution.

Lemma 2 Let ${n\geq 0}$, let $latex {0 <1}&fg=000000$, let ${q=1-p}$, and let ${0\leq k \leq n}$. Then

$\displaystyle b(k;n,p)\leq \left(\frac{np}{k}\right)^{k} \left(\frac{nq}{n-k}\right)^{n-k}.$

Proof: Using Equation (9), we have

$\displaystyle b(k;n,p)=\binom{n}{k}p^{k} q^{n-k}$

$\displaystyle \leq\left(\frac{n}{k}\right)\left(\frac{n}{n-k}\right)^{n-k}p^{k} q^{n-k}$

$\displaystyle =\left(\frac{np}{k}\right)^{k}\left(\frac{nq}{n-k}\right)^{n-k}.$

$\Box$

6. The Probabilistic Method

The probabilistic method is a very powerful method in combinatorics which can be used to show the existence of objects that satisfy certain properties. (For more, see the book by Alon and Spencer \cite{AS92}.) In this course, we will use the probabilistic method to prove existence of a code ${\mathcal{C}}$ with certain property ${\mathcal{P}}$. Towards that end, we define a distribution ${\mathcal{D}}$ over all possible codes ${\Omega}$ and prove that

$\displaystyle \underset{\mathcal{C}\in_{\mathcal{D}} \Omega}{Pr}\left[\mathcal{C}\:\text{has property}\:\mathcal{P} \right]>0 \text{ or equivalently } {Pr}\left[\mathcal{C}\:\text{doesn't have property}\:\mathcal{P} \right]<1.$

Note that the above inequity proves the existence of ${\mathcal{C}}$ with property ${\mathcal{P}}$. The typical approach will be to define ${P_{1}, \ldots, P_{m}}$ such that ${\mathcal{P}=P_{1}\wedge P_{2} \wedge P_{3} \ldots \wedge P_{m}}$ and show that for every ${1\le i\le m}$:

$\displaystyle Pr\left[\mathcal{C}\:\text{doesn't have property}\:P_i \right]<\frac{1}{m}.$

Finally, by the union bound, the above will prove that ${{Pr}\left[\mathcal{C}\:\text{doesn't have property}\:\mathcal{P} \right]<1}$, as desired.

7. Summary of Important Results

We now summarize the results from probability theory that we are going to use in this course.

1. Linearity of Expectation. Given random variables ${X_1,\dots,X_n}$:

$\displaystyle \mathbb{E}\left[\sum_{i=1}^{n}X_{i}\right]= \mathbb{E}\left[X_{1}\right]+ \mathbb{E}\left[X_{2}\right]+ \cdots +\mathbb{E}\left[X_{n}\right]. \ \ \ \ \ (35)$

2. Union Bound. Given events ${A_1,\dots,A_n}$:

$\displaystyle Pr\left[\bigcup_{i=1}^{n} A_{i}\right]\leq \sum_{i=1}^{n}Pr\left[A_{i}\right]. \ \ \ \ \ (36)$

3. Markov’s Inequality. Given a non-zero random variable ${X}$ and ${t>0}$:

$\displaystyle Pr\left[X\geq t\right]\leq \frac{\mathbb{E}\left[X\right]}{t}. \ \ \ \ \ (37)$

4. Chernoff Bound. Let ${X_1,\dots,X_n}$ be binary random variables and define ${X=\sum X_i}$. Then

$\displaystyle Pr\left[|X-\mathbb{E}(X)|>\epsilon \mathbb{E}(X)\right]</p>