The polished version of the scribed notes for Lecture 4 is now online. Thanks to Willmert for his great effort.

Thanks to Luca Trevisan's program for converting LaTeX to a WordPress-suitable format, the notes are also reproduced below. (Citations are missing.)

===============================================================================

**Note:** *These extraordinarily detailed notes are almost entirely due to Willmert. The actual lecture was worth only one page in these notes. Enjoy! –Atri.*

**1. Counting and Probability**

This lecture reviews elementary combinatorics and probability theory. We begin by reviewing elementary results in counting theory, including standard formulas for counting permutations and combinations. Then, the axioms of probability and basic facts concerning probability distributions are presented.

**2. Counting**

Counting theory tries to answer the question “How many?” or “How many orderings of $latex n$ distinct elements are there?” In this section, we review the elements of counting theory. A set of items that we wish to count can sometimes be expressed as a union of disjoint sets or as a Cartesian product of sets. The **rule of sum** says that the number of ways to choose an element from one of two *disjoint* sets is the sum of the cardinalities of the sets. That is, if $latex A$ and $latex B$ are two finite sets with no members in common, then $latex |A \cup B| = |A| + |B|$. The **rule of product** says that the number of ways to choose an ordered pair is the number of ways to choose the first element times the number of ways to choose the second element. That is, if $latex A$ and $latex B$ are two finite sets, then $latex |A \times B| = |A| \cdot |B|$.

A **string** over a finite set $latex S$ is a sequence of elements of $latex S$. We sometimes call a string of length $latex k$ a **k-string**. A **substring** $latex s'$ of a string $latex s$ is an ordered sequence of consecutive elements of $latex s$. A **k-substring** of a string is a substring of length $latex k$. For example, $latex 010$ is a $latex 3$-substring of $latex 01101001$ (the $latex 3$-substring that begins in position $latex 4$), but $latex 111$ is not a substring of $latex 01101001$. A $latex k$-string over a set $latex S$ can be viewed as an element of the Cartesian product $latex S^k$ of $latex k$-tuples; thus, there are $latex |S|^k$ strings of length $latex k$. For example, the number of binary $latex k$-strings is $latex 2^k$. Intuitively, to construct a $latex k$-string over an $latex n$-set, we have $latex n$ ways to pick the first element; for each of these choices, we have $latex n$ ways to pick the second element; and so forth $latex k$ times. This construction leads to the $latex k$-fold product $latex n \cdot n \cdots n = n^k$ as the number of $latex k$-strings.
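The counting argument for strings can be checked by brute force. The following is a minimal sketch; the helper names `k_strings` and `is_substring` are hypothetical and chosen only for this illustration.

```python
from itertools import product

def k_strings(alphabet, k):
    """Return every k-string over `alphabet` (one entry per k-tuple)."""
    return ["".join(t) for t in product(alphabet, repeat=k)]

def is_substring(sub, s):
    """True if `sub` occurs as consecutive elements of `s`."""
    return sub in s

# The number of k-strings over an n-set is n**k.
print(len(k_strings("01", 3)))   # binary 3-strings
print(len(k_strings("abc", 4)))  # 4-strings over a 3-set
```

Enumerating is only feasible for small $latex n$ and $latex k$, which is exactly why the closed form $latex n^k$ is useful.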

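A **permutation** of a finite set $latex S$ is an ordered sequence of all the elements of $latex S$, with each element appearing exactly once. For example, if $latex S = \{a, b, c\}$, there are $latex 6$ permutations of $latex S$:

$latex \displaystyle abc,\ acb,\ bac,\ bca,\ cab,\ cba.$

There are $latex n!$ permutations of a set of $latex n$ elements, since the first element of the sequence can be chosen in $latex n$ ways, the second in $latex n-1$ ways, the third in $latex n-2$ ways, and so on \cite{Algorithms}.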

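A **k-permutation** of $latex S$ is an ordered sequence of $latex k$ elements of $latex S$, with no element appearing more than once in the sequence. Thus, an ordinary permutation is just an $latex n$-permutation of an $latex n$-set. The twelve $latex 2$-permutations of the set $latex \{a, b, c, d\}$ are

$latex \displaystyle ab,\ ac,\ ad,\ ba,\ bc,\ bd,\ ca,\ cb,\ cd,\ da,\ db,\ dc,$

where we have used the shorthand of denoting the ordered pair $latex (a, b)$ by $latex ab$, and so on. The number of $latex k$-permutations of an $latex n$-set is

$latex \displaystyle n(n-1)(n-2)\cdots(n-k+1) = \frac{n!}{(n-k)!}, \ \ \ \ \ (1)$

since there are $latex n$ ways of choosing the first element, $latex n-1$ ways of choosing the second element, and so on until $latex k$ elements are selected, the last being a selection from $latex n-k+1$ elements.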

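A **k-combination** of an $latex n$-set $latex S$ is simply a $latex k$-subset of $latex S$. There are six $latex 2$-combinations of the $latex 4$-set $latex \{a, b, c, d\}$:

$latex \displaystyle ab,\ ac,\ ad,\ bc,\ bd,\ cd.$

We can construct a $latex k$-combination of an $latex n$-set by choosing $latex k$ distinct elements from the $latex n$-set. The number of $latex k$-combinations of an $latex n$-set can be expressed in terms of the number of $latex k$-permutations of an $latex n$-set. For every $latex k$-combination, there are exactly $latex k!$ permutations of its elements, each of which is a distinct $latex k$-permutation of the $latex n$-set. Thus the number of $latex k$-combinations of an $latex n$-set is the number of $latex k$-permutations divided by $latex k!$; from Equation (1), this quantity is

$latex \displaystyle \frac{n!}{k!\,(n-k)!}. \ \ \ \ \ (2)$

For $latex k = 0$, this formula tells us that the number of ways to choose $latex 0$ elements from an $latex n$-set is $latex 1$ (not $latex 0$), since $latex 0! = 1$.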
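As a sanity check on the two counting formulas, the sketch below enumerates the $latex 2$-permutations and $latex 2$-combinations of a $latex 4$-set with the standard library and compares the counts against $latex n!/(n-k)!$ and $latex n!/(k!(n-k)!)$. The variable names are illustrative only; `math.perm` and `math.comb` need Python 3.8 or later.

```python
from itertools import combinations, permutations
from math import comb, factorial, perm

S = "abcd"
n, k = len(S), 2

k_perms = ["".join(p) for p in permutations(S, k)]  # ordered selections
k_combs = ["".join(c) for c in combinations(S, k)]  # unordered selections

# Number of k-permutations: n!/(n-k)!
assert len(k_perms) == factorial(n) // factorial(n - k) == perm(n, k)
# Each k-combination corresponds to k! distinct k-permutations.
assert len(k_combs) == len(k_perms) // factorial(k) == comb(n, k)

print(k_perms)
print(k_combs)
```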

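We use the notation $latex \binom{n}{k}$ (read “$latex n$ choose $latex k$”) to denote the number of $latex k$-combinations of an $latex n$-set. From Equation (2), we have

$latex \displaystyle \binom{n}{k} = \frac{n!}{k!\,(n-k)!}.$

This formula is symmetric in $latex k$ and $latex n-k$:

$latex \displaystyle \binom{n}{k} = \binom{n}{n-k}.$

These numbers are known as **binomial coefficients**, due to their appearance in the **binomial expansion**:

$latex \displaystyle (x+y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}.$

A special case of the binomial expansion occurs when $latex x = y = 1$:

$latex \displaystyle 2^n = \sum_{k=0}^{n} \binom{n}{k}.$

This formula corresponds to counting the $latex 2^n$ binary $latex n$-strings by the number of $latex 1$'s they contain: there are $latex \binom{n}{k}$ binary $latex n$-strings containing exactly $latex k$ $latex 1$'s, since there are $latex \binom{n}{k}$ ways to choose $latex k$ out of the $latex n$ positions in which to place the $latex 1$'s.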
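The counting interpretation of this special case can be verified directly for small $latex n$. The sketch below groups all binary $latex n$-strings by their number of ones; `ones_histogram` is a hypothetical helper name.

```python
from collections import Counter
from itertools import product
from math import comb

def ones_histogram(n):
    """Map k -> number of binary n-strings containing exactly k ones."""
    return Counter(sum(bits) for bits in product((0, 1), repeat=n))

n = 6
hist = ones_histogram(n)
# Each group has size C(n, k), and the groups together cover all 2**n strings.
assert all(hist[k] == comb(n, k) for k in range(n + 1))
assert sum(hist.values()) == 2 ** n
```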

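We sometimes need to bound the size of a binomial coefficient. For $latex 1 \le k \le n$, we have the lower bound

$latex \displaystyle \binom{n}{k} = \frac{n(n-1)\cdots(n-k+1)}{k(k-1)\cdots 1} = \left(\frac{n}{k}\right)\left(\frac{n-1}{k-1}\right)\cdots\left(\frac{n-k+1}{1}\right) \ge \left(\frac{n}{k}\right)^k.$

Taking advantage of the inequality $latex k! \ge (k/e)^k$, we obtain the upper bounds

$latex \displaystyle \binom{n}{k} = \frac{n(n-1)\cdots(n-k+1)}{k(k-1)\cdots 1} \le \frac{n^k}{k!} \le \left(\frac{en}{k}\right)^k.$

For all $latex 0 \le k \le n$, we can use induction to prove the bound

$latex \displaystyle \binom{n}{k} \le \frac{n^n}{k^k (n-k)^{n-k}}, \ \ \ \ \ (9)$

where for convenience we assume that $latex 0^0 = 1$. For $latex k = \lambda n$, where $latex 0 \le \lambda \le 1$, this bound can be rewritten as

$latex \displaystyle \binom{n}{\lambda n} \le \frac{n^n}{(\lambda n)^{\lambda n} ((1-\lambda) n)^{(1-\lambda) n}} = \left( \left(\frac{1}{\lambda}\right)^{\lambda} \left(\frac{1}{1-\lambda}\right)^{1-\lambda} \right)^n = 2^{n H(\lambda)},$

where

$latex \displaystyle H(\lambda) = -\lambda \log_2 \lambda - (1-\lambda) \log_2 (1-\lambda)$

is the **(binary) entropy function** and where, for convenience, we assume that $latex 0 \log_2 0 = 0$, so that $latex H(0) = H(1) = 0$.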
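The bounds on binomial coefficients are easy to spot-check numerically. The sketch below assumes the standard forms $latex (n/k)^k \le \binom{n}{k} \le (en/k)^k$ and $latex \binom{n}{k} \le 2^{nH(k/n)}$; the function names and the small floating-point tolerance are choices made for this illustration, not part of the notes.

```python
from math import comb, e, log2

def entropy(lam):
    """Binary entropy H(lam), using the convention 0 * log2(0) = 0."""
    if lam in (0, 1):
        return 0.0
    return -lam * log2(lam) - (1 - lam) * log2(1 - lam)

def check_bounds(n, k):
    """Check the three bounds on C(n, k) for 1 <= k <= n."""
    c = comb(n, k)
    lower = (n / k) ** k
    upper = (e * n / k) ** k
    entropy_upper = 2 ** (n * entropy(k / n))
    tol = 1 + 1e-9  # slack for floating-point rounding
    return lower <= c * tol and c <= upper * tol and c <= entropy_upper * tol

print(all(check_bounds(n, k) for n in range(1, 40) for k in range(1, n + 1)))
```

A numerical check over a finite range is of course not a proof; it only guards against misreading the direction or form of an inequality.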

**3. Probability**

This section reviews basic probability theory. We define probability in terms of a **sample space** $latex S$, which is a set whose elements are called **elementary events**. Each elementary event can be viewed as a possible outcome of an experiment. For the experiment of flipping two distinguishable coins, we can view the sample space as consisting of the set of all possible $latex 2$-strings over $latex \{H, T\}$:

$latex \displaystyle S = \{HH, HT, TH, TT\}.$

An **event** is a subset of the sample space $latex S$. (For a general probability distribution, there may be some subsets of the sample space that are not considered to be events. This situation usually arises when the sample space is uncountably infinite. The main requirement is that the set of events of a sample space be closed under the operations of taking the complement of an event, forming the union of a finite or countable number of events, and taking the intersection of a finite or countable number of events.) For example, in the experiment of flipping two coins, the event of obtaining one head and one tail is $latex \{HT, TH\}$. The event $latex S$ is called the **certain event**, and the event $latex \emptyset$ is called the **null event**. We say that two events $latex A$ and $latex B$ are **mutually exclusive** if $latex A \cap B = \emptyset$. We sometimes treat an elementary event $latex s \in S$ as the event $latex \{s\}$. By definition, all elementary events are mutually exclusive.

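A **probability distribution** $latex \Pr\{\cdot\}$ on a sample space $latex S$ is a mapping from events of $latex S$ to real numbers such that the following **probability axioms** are satisfied:

- $latex \Pr\{A\} \ge 0$ for any event $latex A$.
- $latex \Pr\{S\} = 1$.
- $latex \Pr\{A \cup B\} = \Pr\{A\} + \Pr\{B\}$ for any two mutually exclusive events $latex A$ and $latex B$. More generally, for any (finite or countably infinite) sequence of events $latex A_1, A_2, \ldots$ that are pairwise mutually exclusive, $latex \Pr\{\bigcup_i A_i\} = \sum_i \Pr\{A_i\}$.

We call $latex \Pr\{A\}$ the **probability** of the event $latex A$. We note here that the second axiom is a normalization requirement: there is really nothing fundamental about choosing $latex 1$ as the probability of the certain event, except that it is natural and convenient. Several results follow immediately from these axioms and basic set theory. The null event $latex \emptyset$ has probability $latex \Pr\{\emptyset\} = 0$. If $latex A \subseteq B$, then $latex \Pr\{A\} \le \Pr\{B\}$. Using $latex \bar{A}$ to denote the event $latex S - A$ (the **complement** of $latex A$), we have $latex \Pr\{\bar{A}\} = 1 - \Pr\{A\}$. For any two events $latex A$ and $latex B$,

$latex \displaystyle \Pr\{A \cup B\} = \Pr\{A\} + \Pr\{B\} - \Pr\{A \cap B\} \le \Pr\{A\} + \Pr\{B\}. \ \ \ \ \ (12)$

The generalization of Equation (12) is also known as the **union bound** and it is given by:

$latex \displaystyle \Pr\Big\{\bigcup_i A_i\Big\} \le \sum_i \Pr\{A_i\}.$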
**Markov’s inequality**:

**Theorem 1** *For any nonnegative random variable $latex X$ and any $latex t &gt; 0$,*

$latex \displaystyle \Pr\{X \ge t\} \le \frac{E[X]}{t}.$

*Proof:* - Suppose a random variable takes values , where with probabilities
Define

It follows that

The last line follows from the fact that

Next, let and note that

Hence, we have established the important result

The **Chernoff bound** is determined by minimizing :

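Going back to the coin-flipping example, suppose that each of the four elementary events has probability $latex 1/4$. Then the probability of getting at least one head is

$latex \displaystyle \Pr\{HH, HT, TH\} = \Pr\{HH\} + \Pr\{HT\} + \Pr\{TH\} = 3/4.$

Alternatively, since the probability of getting strictly less than one head is $latex \Pr\{TT\} = 1/4$, the probability of getting at least one head is $latex 1 - 1/4 = 3/4$.

A probability distribution is **discrete** if it is defined over a finite or countably infinite sample space. Let $latex S$ be the sample space. Then for any event $latex A$,

$latex \displaystyle \Pr\{A\} = \sum_{s \in A} \Pr\{s\},$

since elementary events, specifically those in $latex A$, are mutually exclusive. If $latex S$ is finite and every elementary event $latex s \in S$ has probability

$latex \displaystyle \Pr\{s\} = 1/|S|,$

then we have the **uniform probability distribution** on $latex S$. In such a case the experiment is often described as “picking an element of $latex S$ at random.”

As an example, consider the process of flipping a **fair coin**, one for which the probability of obtaining a head is the same as the probability of obtaining a tail, that is, $latex 1/2$. If we flip the coin $latex n$ times, we have the uniform probability distribution defined on the sample space $latex S = \{H, T\}^n$, a set of size $latex 2^n$. Each elementary event in $latex S$ can be represented as a string of length $latex n$ over $latex \{H, T\}$, and each occurs with probability $latex 1/2^n$. The event

$latex \displaystyle A = \{\text{exactly } k \text{ heads and exactly } n-k \text{ tails occur}\}$

is a subset of $latex S$ of size $latex |A| = \binom{n}{k}$, since there are $latex \binom{n}{k}$ strings of length $latex n$ over $latex \{H, T\}$ that contain exactly $latex k$ $latex H$'s. The probability of event $latex A$ is thus $latex \Pr\{A\} = \binom{n}{k} / 2^n$.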

Sometimes we have some prior partial knowledge about the outcome of an experiment. For example, suppose that a friend has flipped two fair coins and has told you that at least one of the coins showed a head. What is the probability that both coins are heads? The information given eliminates the possibility of two tails. The three remaining elementary events are equally likely, so we infer that each occurs with probability $latex 1/3$. Since only one of these elementary events shows two heads, the answer to our question is $latex 1/3$.
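The reasoning in this example (discard the outcomes ruled out by the information, then count among those that remain) can be sketched as follows. It assumes a uniform distribution on the four outcomes; `conditional_uniform` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import product

# Uniform sample space for two fair, distinguishable coins.
sample_space = {"".join(s) for s in product("HT", repeat=2)}

def conditional_uniform(a, b):
    """Fraction of the outcomes in b that also lie in a (uniform case only)."""
    return Fraction(len(a & b), len(b))

both_heads = {s for s in sample_space if s == "HH"}
at_least_one_head = {s for s in sample_space if "H" in s}

print(conditional_uniform(both_heads, at_least_one_head))  # 1/3
```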

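Conditional probability formalizes the notion of having prior partial knowledge of the outcome of an experiment. The **conditional probability** of an event $latex A$ given that another event $latex B$ occurs is defined to be

$latex \displaystyle \Pr\{A \mid B\} = \frac{\Pr\{A \cap B\}}{\Pr\{B\}} \ \ \ \ \ (13)$

whenever $latex \Pr\{B\} \ne 0$. We read “$latex \Pr\{A \mid B\}$” as “the probability of $latex A$ given $latex B$.” Intuitively, since we are given that event $latex B$ occurs, the event that $latex A$ also occurs is $latex A \cap B$. That is, $latex A \cap B$ is the set of outcomes in which both $latex A$ and $latex B$ occur. Since the outcome is one of the elementary events in $latex B$, we normalize the probabilities of all the elementary events in $latex B$ by dividing them by $latex \Pr\{B\}$, so that they sum to $latex 1$. The conditional probability of $latex A$ given $latex B$ is, therefore, the ratio of the probability of event $latex A \cap B$ to the probability of event $latex B$. In the example above, $latex A$ is the event that both coins are heads, and $latex B$ is the event that at least one coin is a head. Thus

$latex \displaystyle \Pr\{A \mid B\} = \frac{1/4}{3/4} = \frac{1}{3}.$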

Two events $latex A$ and $latex B$ are **independent** if

$latex \displaystyle \Pr\{A \cap B\} = \Pr\{A\} \Pr\{B\},$

which is equivalent, if $latex \Pr\{B\} \ne 0$, to the condition

$latex \displaystyle \Pr\{A \mid B\} = \Pr\{A\}.$

For example, suppose that two fair coins are flipped and that the outcomes are independent. Then the probability of two heads is $latex (1/2)(1/2) = 1/4$. Now suppose that one event is that the first coin comes up heads and the other event is that the coins come up differently. Each of these events occurs with probability $latex 1/2$, and the probability that both events occur is $latex 1/4$; thus, according to the definition of independence, the events are independent, even though one might think that both events depend on the first coin. Finally, suppose that the coins are welded together so that they both fall heads or both fall tails and that the two possibilities are equally likely. Then the probability that each coin comes up heads is $latex 1/2$, but the probability that they both come up heads is $latex 1/2 \ne (1/2)(1/2)$. Consequently, the event that one comes up heads and the event that the other comes up heads are not independent.
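Both coin examples can be checked mechanically by comparing the probability of the joint event with the product of the individual probabilities. This is a minimal sketch over uniform sample spaces; the names `prob`, `independent`, and `welded` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

fair = list(product("HT", repeat=2))  # two independent fair coins
welded = [("H", "H"), ("T", "T")]     # coins forced to agree, equally likely

def prob(event, space):
    """Probability of `event` (a predicate on outcomes) under the uniform distribution."""
    return Fraction(sum(1 for s in space if event(s)), len(space))

def independent(a, b, space):
    """True when the joint probability equals the product of the two probabilities."""
    joint = prob(lambda s: a(s) and b(s), space)
    return joint == prob(a, space) * prob(b, space)

first_heads = lambda s: s[0] == "H"
second_heads = lambda s: s[1] == "H"
differ = lambda s: s[0] != s[1]

print(independent(first_heads, differ, fair))          # True
print(independent(first_heads, second_heads, welded))  # False
```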

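A collection $latex A_1, A_2, \ldots, A_n$ of events is said to be **pairwise independent** if

$latex \displaystyle \Pr\{A_i \cap A_j\} = \Pr\{A_i\} \Pr\{A_j\}$

for all $latex 1 \le i &lt; j \le n$. We say that they are **(mutually) independent** if every $latex k$-subset $latex A_{i_1}, A_{i_2}, \ldots, A_{i_k}$ of the collection, where $latex 2 \le k \le n$ and $latex 1 \le i_1 &lt; i_2 &lt; \cdots &lt; i_k \le n$, satisfies

$latex \displaystyle \Pr\{A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_k}\} = \Pr\{A_{i_1}\} \Pr\{A_{i_2}\} \cdots \Pr\{A_{i_k}\}.$

For example, suppose we flip two fair coins. Let $latex A_1$ be the event that the first coin is heads, let $latex A_2$ be the event that the second coin is heads, and let $latex A_3$ be the event that the two coins are different. We have

$latex \displaystyle \Pr\{A_1\} = \Pr\{A_2\} = \Pr\{A_3\} = 1/2, \quad \Pr\{A_1 \cap A_2\} = \Pr\{A_1 \cap A_3\} = \Pr\{A_2 \cap A_3\} = 1/4, \quad \Pr\{A_1 \cap A_2 \cap A_3\} = 0.$

Since for $latex 1 \le i &lt; j \le 3$, we have $latex \Pr\{A_i \cap A_j\} = \Pr\{A_i\} \Pr\{A_j\} = 1/4$, the events $latex A_1$, $latex A_2$, and $latex A_3$ are pairwise independent. The events are not mutually independent, however, because $latex \Pr\{A_1 \cap A_2 \cap A_3\} = 0$ and $latex \Pr\{A_1\} \Pr\{A_2\} \Pr\{A_3\} = 1/8 \ne 0$.

From the definition of conditional probability (13), it follows that for two events $latex A$ and $latex B$, each with nonzero probability,

$latex \displaystyle \Pr\{A \mid B\} = \frac{\Pr\{A\} \Pr\{B \mid A\}}{\Pr\{B\}}, \ \ \ \ \ (15)$

which is known as **Bayes’s theorem**. The denominator $latex \Pr\{B\}$ is a normalizing constant that we can express as follows. Since $latex B = (B \cap A) \cup (B \cap \bar{A})$ and $latex B \cap A$ and $latex B \cap \bar{A}$ are mutually exclusive events,

$latex \displaystyle \Pr\{B\} = \Pr\{B \cap A\} + \Pr\{B \cap \bar{A}\} = \Pr\{A\} \Pr\{B \mid A\} + \Pr\{\bar{A}\} \Pr\{B \mid \bar{A}\}.$

Substituting into Equation (15), we obtain an equivalent form of Bayes’s theorem:

$latex \displaystyle \Pr\{A \mid B\} = \frac{\Pr\{A\} \Pr\{B \mid A\}}{\Pr\{A\} \Pr\{B \mid A\} + \Pr\{\bar{A}\} \Pr\{B \mid \bar{A}\}}.$

Bayes’s theorem can simplify the computing of conditional probabilities. For example, suppose that we have a fair coin and a biased coin that always comes up heads. We run an experiment consisting of three independent events: one of the two coins is chosen at random, the coin is flipped once, and then it is flipped again. Suppose that the chosen coin comes up heads both times. What is the probability that it is biased? We solve this problem using Bayes’s theorem. Let $latex A$ be the event that the biased coin is chosen, and let $latex B$ be the event that the coin comes up heads both times. We wish to determine $latex \Pr\{A \mid B\}$. We have $latex \Pr\{A\} = 1/2$, $latex \Pr\{B \mid A\} = 1$, $latex \Pr\{\bar{A}\} = 1/2$, and $latex \Pr\{B \mid \bar{A}\} = 1/4$; hence,

$latex \displaystyle \Pr\{A \mid B\} = \frac{(1/2) \cdot 1}{(1/2) \cdot 1 + (1/2) \cdot (1/4)} = \frac{4}{5}.$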
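The biased-coin calculation can be reproduced with exact arithmetic. The sketch below takes the prior of choosing the biased coin as $latex 1/2$, the chance of two heads as $latex 1$ for the biased coin and $latex 1/4$ for the fair one, and applies Bayes’s theorem with the denominator expanded over the event and its complement. The function name `bayes` and its parameter names are hypothetical.

```python
from fractions import Fraction

def bayes(prior_a, b_given_a, b_given_not_a):
    """Pr{A | B}, with Pr{B} expanded as Pr{A}Pr{B|A} + Pr{not A}Pr{B|not A}."""
    numerator = prior_a * b_given_a
    evidence = numerator + (1 - prior_a) * b_given_not_a
    return numerator / evidence

# A: the biased (always-heads) coin was chosen; B: two heads were observed.
posterior = bayes(Fraction(1, 2), Fraction(1), Fraction(1, 4))
print(posterior)  # 4/5
```

Observing two heads raises the probability of the biased coin from $latex 1/2$ to $latex 4/5$; each further head would raise it again.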

**4. Discrete random variables**

A **(discrete) random variable** is a function from a finite or countably infinite sample space to the real numbers. It associates a real number with each possible outcome of an experiment, which allows us to work with the probability distribution induced on the resulting set of numbers. Random variables can also be defined for uncountably infinite sample spaces. For our purposes, we shall assume that random variables are discrete.

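For a random variable $latex X$ and a real number $latex x$, we define the event $latex X = x$ to be $latex \{s \in S : X(s) = x\}$; thus,

$latex \displaystyle \Pr\{X = x\} = \sum_{s \in S : X(s) = x} \Pr\{s\}.$

The function

$latex \displaystyle f(x) = \Pr\{X = x\}$

is the **probability density function** of the random variable $latex X$. From the probability axioms, $latex \Pr\{X = x\} \ge 0$ and $latex \sum_x \Pr\{X = x\} = 1$.

As an example, consider the experiment of rolling a pair of ordinary $latex 6$-sided dice. There are $latex 36$ possible elementary events in the sample space. We assume that the probability distribution is uniform, so that each elementary event $latex s \in S$ is equally likely: $latex \Pr\{s\} = 1/36$. Define the random variable $latex X$ to be the *maximum* of the two values showing on the dice. We have $latex \Pr\{X = 3\} = 5/36$, since $latex X$ assigns a value of $latex 3$ to $latex 5$ of the $latex 36$ possible elementary events, namely $latex (1,3)$, $latex (2,3)$, $latex (3,3)$, $latex (3,2)$, and $latex (3,1)$.

It is common for several random variables to be defined on the same sample space. If $latex X$ and $latex Y$ are random variables, the function

$latex \displaystyle f(x, y) = \Pr\{X = x \text{ and } Y = y\}$

is the **joint probability density function** of $latex X$ and $latex Y$. For a fixed value $latex y$,

$latex \displaystyle \Pr\{Y = y\} = \sum_x \Pr\{X = x \text{ and } Y = y\},$

and similarly, for a fixed value $latex x$,

$latex \displaystyle \Pr\{X = x\} = \sum_y \Pr\{X = x \text{ and } Y = y\}.$

Using the definition (13) of conditional probability, we have

$latex \displaystyle \Pr\{X = x \mid Y = y\} = \frac{\Pr\{X = x \text{ and } Y = y\}}{\Pr\{Y = y\}}.$

We define two random variables $latex X$ and $latex Y$ to be **independent** if for all $latex x$ and $latex y$, the events $latex X = x$ and $latex Y = y$ are independent or, equivalently, if for all $latex x$ and $latex y$, we have $latex \Pr\{X = x \text{ and } Y = y\} = \Pr\{X = x\} \Pr\{Y = y\}$.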
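For the two-dice example, the density of the maximum and a joint density can be tabulated by enumeration. The sketch below assumes the uniform distribution on the $latex 36$ outcomes; `density` is a hypothetical helper, and pairing the maximum with the first die is just one convenient choice of a second random variable.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

def density(random_variable):
    """Map each value x to Pr{X = x}, where X = random_variable(outcome)."""
    counts = Counter(random_variable(s) for s in rolls)
    return {x: Fraction(c, len(rolls)) for x, c in counts.items()}

pdf_max = density(max)                       # X = maximum of the two dice
joint = density(lambda s: (max(s), s[0]))    # (X, Y) with Y = first die

# Summing the joint density over y recovers the density of X.
marginal_3 = sum(p for (x, _), p in joint.items() if x == 3)

print(pdf_max[3])   # 5/36
print(marginal_3)   # 5/36
```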

Given a set of random variables defined over the same sample space, one can define new random variables as sums, products, or other functions of the original variables. The simplest and most useful summary of the distribution of a random variable is the “average” of the values it takes on. The **expected value** (or, synonymously, **expectation** or **mean**) of a discrete random variable $latex X$ is

$latex \displaystyle E[X] = \sum_x x \Pr\{X = x\},$

which is well defined if the sum is finite or converges absolutely. Sometimes the expectation of $latex X$ is denoted by $latex \mu_X$ or, when the random variable is apparent from context, simply by $latex \mu$.

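Consider a game in which you flip two fair coins. You earn $3 for each head but lose $2 for each tail. The expected value of the random variable $latex X$ representing your earnings is

$latex \displaystyle E[X] = 6 \cdot \Pr\{2 \text{ heads}\} + 1 \cdot \Pr\{1 \text{ head}, 1 \text{ tail}\} - 4 \cdot \Pr\{2 \text{ tails}\} = 6(1/4) + 1(1/2) - 4(1/4) = 1.$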

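The expectation of the sum of two random variables is the sum of their expectations, that is,

$latex \displaystyle E[X + Y] = E[X] + E[Y],$

whenever $latex E[X]$ and $latex E[Y]$ are defined. This property extends to finite and absolutely convergent summations of expectations, and it is called **linearity of expectation**:

$latex \displaystyle E\Big[\sum_{i=1}^{n} X_i\Big] = \sum_{i=1}^{n} E[X_i].$

If $latex X$ is any random variable, any function $latex g(x)$ defines a new random variable $latex g(X)$. If the expectation of $latex g(X)$ is defined, then

$latex \displaystyle E[g(X)] = \sum_x g(x) \Pr\{X = x\}.$

Letting $latex g(x) = ax$, we have for any constant $latex a$,

$latex \displaystyle E[aX] = a E[X]. \ \ \ \ \ (18)$

Consequently, expectations are linear: for any two random variables $latex X$ and $latex Y$ and any constant $latex a$,

$latex \displaystyle E[aX + Y] = a E[X] + E[Y]. \ \ \ \ \ (19)$

When two random variables $latex X$ and $latex Y$ are independent and each has a defined expectation,

$latex \displaystyle E[XY] = \sum_x \sum_y xy \Pr\{X = x \text{ and } Y = y\} = \sum_x \sum_y xy \Pr\{X = x\} \Pr\{Y = y\} = \Big(\sum_x x \Pr\{X = x\}\Big) \Big(\sum_y y \Pr\{Y = y\}\Big) = E[X] E[Y].$

In general, when $latex n$ random variables $latex X_1, X_2, \ldots, X_n$ are mutually independent,

$latex \displaystyle E[X_1 X_2 \cdots X_n] = E[X_1] E[X_2] \cdots E[X_n].$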

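When a random variable $latex X$ takes on values from the natural numbers $latex \{0, 1, 2, \ldots\}$, there is a nice formula for its expectation:

$latex \displaystyle E[X] = \sum_{i=0}^{\infty} i \Pr\{X = i\} = \sum_{i=0}^{\infty} i \left(\Pr\{X \ge i\} - \Pr\{X \ge i+1\}\right) = \sum_{i=1}^{\infty} \Pr\{X \ge i\},$

since each term $latex \Pr\{X \ge i\}$ is added in $latex i$ times and subtracted out $latex i-1$ times (except $latex \Pr\{X \ge 0\}$, which is added in $latex 0$ times and not subtracted out at all).

The **variance** of a random variable $latex X$ with mean $latex E[X]$ is

$latex \displaystyle \mathrm{Var}[X] = E[(X - E[X])^2] = E[X^2 - 2X E[X] + E^2[X]] = E[X^2] - 2E[X E[X]] + E^2[X] = E[X^2] - 2E^2[X] + E^2[X] = E[X^2] - E^2[X]. \ \ \ \ \ (22)$

The justification for the equalities $latex E[E^2[X]] = E^2[X]$ and $latex E[X E[X]] = E^2[X]$ is that $latex E[X]$ is not a random variable but simply a real number, which means that Equation (18) applies (with $latex a = E[X]$). Equation (22) can be rewritten to obtain an expression for the expectation of the square of a random variable:

$latex \displaystyle E[X^2] = \mathrm{Var}[X] + E^2[X].$

The variance of a random variable $latex X$ and the variance of $latex aX$ are related:

$latex \displaystyle \mathrm{Var}[aX] = a^2 \mathrm{Var}[X].$

When $latex X$ and $latex Y$ are independent random variables,

$latex \displaystyle \mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y].$

In general, if $latex n$ random variables $latex X_1, X_2, \ldots, X_n$ are pairwise independent, then

$latex \displaystyle \mathrm{Var}\Big[\sum_{i=1}^{n} X_i\Big] = \sum_{i=1}^{n} \mathrm{Var}[X_i].$

The **standard deviation** of a random variable $latex X$ is the positive square root of the variance of $latex X$. The standard deviation of a random variable $latex X$ is sometimes denoted $latex \sigma_X$ or simply $latex \sigma$ when the random variable is understood from context. With this notation, the variance of $latex X$ is denoted $latex \sigma^2$.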
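As a worked illustration of expectation, variance, and standard deviation, the sketch below takes the maximum of two fair dice as the random variable and computes its moments exactly. It also checks the identity $latex \mathrm{Var}[X] = E[X^2] - E^2[X]$ and linearity of expectation on the sum of the two dice. The helper name `expectation` is an assumption of this example.

```python
from fractions import Fraction
from itertools import product
from math import sqrt

rolls = list(product(range(1, 7), repeat=2))  # uniform two-dice sample space
p = Fraction(1, len(rolls))

def expectation(g):
    """E[g(s)] for an outcome s drawn uniformly from `rolls`."""
    return sum(g(s) * p for s in rolls)

mean = expectation(max)
second_moment = expectation(lambda s: max(s) ** 2)
variance = expectation(lambda s: (max(s) - mean) ** 2)

# Var[X] = E[X^2] - E^2[X]
assert variance == second_moment - mean ** 2
# Linearity: E[first + second] = E[first] + E[second]
assert expectation(sum) == expectation(lambda s: s[0]) + expectation(lambda s: s[1])

std_dev = sqrt(variance)  # positive square root of the variance
print(mean, variance, std_dev)
```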

**5. The geometric and binomial distributions**

A coin flip is an instance of a **Bernoulli trial**, which is defined as an experiment with only two possible outcomes: **success**, which occurs with probability $latex p$, and **failure**, which occurs with probability $latex q = 1 - p$. When we speak of **Bernoulli trials** collectively, we mean that the trials are mutually independent and, unless we specifically say otherwise, that each has the same probability $latex p$ for success. Two important distributions arise from Bernoulli trials: the geometric distribution and the binomial distribution.

Suppose we have a sequence of Bernoulli trials, each with a probability $latex p$ of success and a probability $latex q = 1 - p$ of failure. How many trials occur before we obtain a success? Let the random variable $latex X$ be the number of trials needed to obtain a success. Then $latex X$ has values in the range $latex \{1, 2, \ldots\}$, and for $latex k \ge 1$,

$latex \displaystyle \Pr\{X = k\} = q^{k-1} p, \ \ \ \ \ (25)$

since we have $latex k - 1$ failures before the one success. A probability distribution satisfying Equation (25) is said to be a **geometric distribution**.
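A short numerical sketch of the geometric distribution follows. It evaluates the mass function $latex q^{k-1}p$ and approximates the mean and variance by truncating the infinite sums, which can be compared with the standard closed forms $latex 1/p$ and $latex q/p^2$. The function names, the truncation length, and the illustrative value $latex p = 2/9$ (the chance of rolling a seven or an eleven with two dice) are choices made for this example.

```python
from fractions import Fraction

def geometric_pmf(k, p):
    """Pr{X = k} = q**(k-1) * p for k >= 1, where q = 1 - p."""
    return (1 - p) ** (k - 1) * p

def truncated_moments(p, terms=2000):
    """Approximate (mean, variance) using the first `terms` values of k."""
    pf = float(p)
    mean = sum(k * geometric_pmf(k, pf) for k in range(1, terms + 1))
    second = sum(k * k * geometric_pmf(k, pf) for k in range(1, terms + 1))
    return mean, second - mean ** 2

p = Fraction(2, 9)
mean, var = truncated_moments(p)
print(mean, float(1 / p))            # both close to 4.5
print(var, float((1 - p) / p ** 2))  # both close to 15.75
```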

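Assuming $latex q &lt; 1$, the expectation of a geometric distribution can be calculated using

$latex \displaystyle E[X] = \sum_{k=1}^{\infty} k q^{k-1} p = \frac{p}{q} \sum_{k=0}^{\infty} k q^k = \frac{p}{q} \cdot \frac{q}{(1-q)^2} = \frac{1}{p}.$

Thus, on average, it takes $latex 1/p$ trials before we obtain a success, an intuitive result. The variance, which can be calculated similarly, is

$latex \displaystyle \mathrm{Var}[X] = q / p^2.$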

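As an example, suppose we repeatedly roll two dice until we obtain either a seven or an eleven. Of the $latex 36$ possible outcomes, $latex 6$ yield a seven and $latex 2$ yield an eleven. Thus, the probability of success is $latex p = 8/36 = 2/9$, and we must roll $latex 1/p = 9/2 = 4.5$ times on average to obtain a seven or eleven.

How many successes occur during $latex n$ Bernoulli trials, where a success occurs with probability $latex p$ and a failure with probability $latex q = 1 - p$? Define the random variable $latex X$ to be the number of successes in $latex n$ trials. Then $latex X$ has values in the range $latex \{0, 1, \ldots, n\}$, and for $latex k = 0, \ldots, n$,

$latex \displaystyle \Pr\{X = k\} = \binom{n}{k} p^k q^{n-k}, \ \ \ \ \ (28)$

since there are $latex \binom{n}{k}$ ways to pick which $latex k$ of the $latex n$ trials are successes, and the probability that each occurs is $latex p^k q^{n-k}$. A probability distribution satisfying Equation (28) is said to be a **binomial distribution**. For convenience, we define the family of binomial distributions using the notation

$latex \displaystyle b(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k}. \ \ \ \ \ (29)$

The name “binomial” comes from the fact that Equation (29) is the $latex k$th term of the expansion of $latex (p + q)^n$. Consequently, since $latex p + q = 1$,

$latex \displaystyle \sum_{k=0}^{n} b(k; n, p) = 1, \ \ \ \ \ (30)$

as is required by the second probability axiom. We can compute the expectation of a random variable having a binomial distribution from Equation (30). Let $latex X$ be a random variable that follows the binomial distribution $latex b(k; n, p)$, and let $latex q = 1 - p$. By the definition of expectation, we have

$latex \displaystyle E[X] = \sum_{k=0}^{n} k\, b(k; n, p) = \sum_{k=1}^{n} k \binom{n}{k} p^k q^{n-k} = np \sum_{k=1}^{n} \binom{n-1}{k-1} p^{k-1} q^{n-k} = np \sum_{k=0}^{n-1} \binom{n-1}{k} p^k q^{(n-1)-k} = np \sum_{k=0}^{n-1} b(k; n-1, p) = np.$

By using the linearity of expectation, we obtain the same result with substantially less algebra. Let $latex X_i$ be the random variable describing the number of successes in the $latex i$th trial. Then $latex E[X_i] = p \cdot 1 + q \cdot 0 = p$, and by linearity of expectation Equation (19), the expected number of successes for $latex n$ trials is

$latex \displaystyle E[X] = E\Big[\sum_{i=1}^{n} X_i\Big] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} p = np.$

The same approach can be used to calculate the variance of the distribution. Using Equation (22), we have $latex \mathrm{Var}[X_i] = E[X_i^2] - E^2[X_i]$. Since $latex X_i$ only takes on the values $latex 0$ and $latex 1$, we have $latex E[X_i^2] = E[X_i] = p$, and hence

$latex \displaystyle \mathrm{Var}[X_i] = p - p^2 = pq.$

To compute the variance of $latex X$, we take advantage of the independence of the $latex n$ trials; thus, by Equation (32),

$latex \displaystyle \mathrm{Var}[X] = \mathrm{Var}\Big[\sum_{i=1}^{n} X_i\Big] = \sum_{i=1}^{n} \mathrm{Var}[X_i] = \sum_{i=1}^{n} pq = npq.$

The binomial distribution $latex b(k; n, p)$ increases as $latex k$ runs from $latex 0$ to $latex n$ until it reaches the mean $latex np$, and then it decreases. We can prove that the distribution always behaves in this manner by looking at the ratio of successive terms:

$latex \displaystyle \frac{b(k; n, p)}{b(k-1; n, p)} = \frac{\binom{n}{k} p^k q^{n-k}}{\binom{n}{k-1} p^{k-1} q^{n-k+1}} = \frac{(n-k+1)p}{kq} = 1 + \frac{(n+1)p - k}{kq}.$

This ratio is greater than $latex 1$ precisely when $latex (n+1)p - k$ is positive. Consequently, $latex b(k; n, p) &gt; b(k-1; n, p)$ for $latex k &lt; (n+1)p$ (the distribution increases), and $latex b(k; n, p) &lt; b(k-1; n, p)$ for $latex k &gt; (n+1)p$ (the distribution decreases). If $latex k = (n+1)p$ is an integer, then $latex b(k; n, p) = b(k-1; n, p)$, so the distribution has two maxima: at $latex k = (n+1)p$ and at $latex k - 1 = (n+1)p - 1 = np - q$. Otherwise, it attains a maximum at the unique integer $latex k$ that lies in the range $latex np - q &lt; k &lt; (n+1)p$. The following lemma provides an upper bound on the binomial distribution.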
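Before turning to the lemma, the shape of the binomial distribution can be checked with exact arithmetic. The sketch below uses the mass function $latex \binom{n}{k} p^k (1-p)^{n-k}$ and locates its maxima; the names `binomial_pmf` and `modes` and the two parameter choices are assumptions of this illustration.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """C(n, k) * p**k * (1 - p)**(n - k): probability of k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def modes(n, p):
    """All values of k at which the binomial mass function is largest."""
    pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
    peak = max(pmf)
    return [k for k, value in enumerate(pmf) if value == peak]

# (n + 1) * p = 5 is an integer here, so there are two maxima.
print(modes(9, Fraction(1, 2)))
# (n + 1) * p = 11/3 is not an integer, so the maximum is unique.
print(modes(10, Fraction(1, 3)))
```

Using `Fraction` keeps the comparison `value == peak` exact, which matters when two adjacent terms are equal.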

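**Lemma 2** *Let $latex n \ge 0$, let $latex 0 &lt; p &lt; 1$, let $latex q = 1 - p$, and let $latex 0 \le k \le n$. Then*

$latex \displaystyle b(k; n, p) \le \left(\frac{np}{k}\right)^k \left(\frac{nq}{n-k}\right)^{n-k}.$

*Proof:* Using Equation (9), we have

$latex \displaystyle b(k; n, p) = \binom{n}{k} p^k q^{n-k} \le \left(\frac{n}{k}\right)^k \left(\frac{n}{n-k}\right)^{n-k} p^k q^{n-k} = \left(\frac{np}{k}\right)^k \left(\frac{nq}{n-k}\right)^{n-k}.$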

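**6. The Probabilistic Method**

The probabilistic method is a very powerful method in combinatorics which can be used to show the existence of objects that satisfy certain properties. (For more, see the book by Alon and Spencer \cite{AS92}.) In this course, we will use the **probabilistic method** to prove the existence of a code $latex C$ with a certain property $latex P$. Towards that end, we define a distribution $latex \mathcal{D}$ over all possible codes and prove that

$latex \displaystyle \Pr_{C \sim \mathcal{D}}\{C \text{ has property } P\} &gt; 0.$

Note that the above inequality proves the existence of $latex C$ with property $latex P$. The typical approach will be to define $latex P_1, \ldots, P_m$ such that $latex P = P_1 \wedge P_2 \wedge \cdots \wedge P_m$ and show that for every $latex 1 \le i \le m$:

$latex \displaystyle \Pr\{C \text{ does not have property } P_i\} &lt; \frac{1}{m}.$

Finally, by the union bound, the above will prove that $latex \Pr\{C \text{ does not have property } P\} &lt; 1$, as desired.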

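**7. Summary of Important Results**

We now summarize the results from probability theory that we are going to use in this course.

- **Linearity of Expectation.** Given random variables $latex X_1, \ldots, X_n$: $latex E\big[\sum_{i=1}^{n} X_i\big] = \sum_{i=1}^{n} E[X_i]$.
- **Union Bound.** Given events $latex A_1, \ldots, A_n$: $latex \Pr\big\{\bigcup_{i=1}^{n} A_i\big\} \le \sum_{i=1}^{n} \Pr\{A_i\}$.
- **Markov’s Inequality.** Given a nonnegative random variable $latex X$ and $latex t &gt; 0$: $latex \Pr\{X \ge t\} \le E[X]/t$.
- **Chernoff Bound.** Let $latex X_1, \ldots, X_n$ be binary random variables and define $latex X = \sum_{i=1}^{n} X_i$. Then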
