Probability Review

From time to time, we need to revisit the basic definitions and concepts of probability.

This post collects a review of those concepts so you can come back to it whenever you need a refresher.

To treat probability rigorously, we define a sample space $S$ whose elements are the possible outcomes of some process or experiment. For example, the sample space might be the outcomes of the roll of a die, or flips of a coin. To each element $x$ of the sample space we assign a probability, a non-negative number between 0 and 1 denoted by $\mathbb{P}(x)$. We require that

$$\sum_{x \in S} \mathbb{P}(x) = 1$$

Let $A$ be an event with non-zero probability. We define the probability of an event $B$ conditioned on event $A$, denoted by $\mathbb{P}(B | A)$, to be

$$\mathbb{P}(B | A) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(A)}$$

If we have two events $A$ and $B$, we say that they are independent if the probability that both happen is the product of the probability that the first happens and the probability that the second happens, that is, if

$$\mathbb{P}(A \cap B) = \mathbb{P}(A) \cdot \mathbb{P}(B)$$

Bayes’s rule. For any two events $A$ and $B$ with $\mathbb{P}(A) > 0$, one has

$$\mathbb{P}(B | A) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(A)} = \frac{\mathbb{P}(A | B)\, \mathbb{P}(B)}{\mathbb{P}(A)}$$

As a worked example, suppose a test for a disease has a false positive rate of

$$\mathbb{P}(\text{positive test} | \text{no disease}) = \frac{1}{30}$$

and a false negative rate of

$$\mathbb{P}(\text{negative test} | \text{disease}) = \frac{1}{10}$$

The incidence of the disease is $1/1000$.

Let’s define $A$ as the event that the test is positive and $B$ as the event that the patient has the disease, so that

$$\mathbb{P}(B) = 0.001$$

Then, since $\mathbb{P}(A | B) = 1 - \mathbb{P}(\text{negative test} | \text{disease}) = 9/10$ and $\mathbb{P}(A | \neg B) = 1/30$, we have

$$ \begin{aligned} \mathbb{P}(B | A) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(A)} & = \frac{\mathbb{P}(A | B)\, \mathbb{P}(B)}{\mathbb{P}(A)} \\ & = \frac{\mathbb{P}(A | B)\, \mathbb{P}(B)}{\mathbb{P}(A | B)\, \mathbb{P}(B) + \mathbb{P}(A | \neg B)\, \mathbb{P}(\neg B)} \\ & \approx 0.0263 \end{aligned} $$
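As a quick sanity check (a minimal Python sketch, not part of the derivation above), we can plug the numbers in directly:

```python
# Posterior probability of disease given a positive test, via Bayes' rule.
p_disease = 0.001                      # P(B), incidence of the disease
p_pos_given_no_disease = 1 / 30        # false positive rate, P(A | not B)
p_neg_given_disease = 1 / 10           # false negative rate, P(not A | B)
p_pos_given_disease = 1 - p_neg_given_disease   # P(A | B) = 9/10

# Law of total probability for P(A), then Bayes' rule for P(B | A).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_no_disease * (1 - p_disease))
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_disease_given_pos, 4))   # 0.0263
```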

Inclusion–exclusion principle. In combinatorics, a branch of mathematics, the inclusion–exclusion principle is a counting technique which generalizes the familiar method of obtaining the number of elements in the union of two finite sets; symbolically expressed as

$$|A \cup B| = |A| + |B| - |A \cap B|$$

The inclusion–exclusion principle generalizes this two-set case; it is perhaps most clearly seen for three sets $A$, $B$, and $C$, where it reads

$$|A \cup B \cup C| = |A| + |B| + |C| - |A \cap B| - |A \cap C| - |B \cap C| + |A \cap B \cap C|$$

Figure 1. Illustration of the inclusion–exclusion principle.

In its general form, the principle of inclusion–exclusion states that for finite sets $A_1, \ldots, A_n$, one has the identity

$$\left|\bigcup_{i=1}^{n} A_i\right| = \sum_{i=1}^{n} |A_i| - \sum_{1 \leqslant i < j \leqslant n} |A_i \cap A_j| + \sum_{1 \leqslant i < j < k \leqslant n} |A_i \cap A_j \cap A_k| - \cdots + (-1)^{n+1} \left|A_1 \cap \cdots \cap A_n\right|$$
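Inclusion–exclusion is easy to verify mechanically. Below is a small Python sketch (the helper name `union_size_by_inclusion_exclusion` is mine) that evaluates the alternating sum over all non-empty intersections and compares it against the directly computed union:

```python
from itertools import combinations

def union_size_by_inclusion_exclusion(sets):
    """|A_1 U ... U A_n| via the alternating sum over all non-empty intersections."""
    total = 0
    for k in range(1, len(sets) + 1):
        sign = (-1) ** (k + 1)
        for combo in combinations(sets, k):
            total += sign * len(set.intersection(*combo))
    return total

A, B, C = {1, 2, 3, 4}, {3, 4, 5}, {4, 5, 6, 7}
print(union_size_by_inclusion_exclusion([A, B, C]))   # 7
print(len(A | B | C))                                 # 7, computed directly
```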

Expectation. Informally, the expectation of a random variable with a countable set of possible outcomes is the weighted average of all possible outcomes, where the weights are given by the probabilities of realizing each value. That is,

$$\operatorname{E}[X] = \sum_{i=1}^{\infty} x_i \, p_i,$$

where $x_1, x_2, \ldots$ are the possible outcomes of the random variable $X$ and $p_1, p_2, \ldots$ are their corresponding probabilities.

Now consider a random variable $X$ which has a probability density function given by a function $f$ on the real number line. This means that the probability of $X$ taking on a value in any given open interval is given by the integral of $f$ over that interval. The expectation of $X$ is then given by the integral

$$\operatorname{E}[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$$
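Both forms are straightforward to check numerically. The following Python sketch (the example distributions are my own choices: a fair die for the discrete case and the uniform density on $[0, 1]$ for the continuous case) computes each expectation:

```python
import numpy as np

# Discrete: E[X] = sum of x_i * p_i for a fair six-sided die.
outcomes = np.arange(1, 7)
probs = np.full(6, 1 / 6)
print(outcomes @ probs)              # 3.5

# Continuous: E[X] = integral of x * f(x) dx, approximated by a Riemann sum
# for the uniform density f(x) = 1 on [0, 1].
xs = np.linspace(0.0, 1.0, 100_000, endpoint=False)
dx = 1.0 / 100_000
density = np.ones_like(xs)
print(np.sum(xs * density) * dx)     # ~0.5
```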

Indicator function. The indicator variable (or indicator function) of a subset $A$ of a set $X$ is a function

$$\mathbf{I}_A \colon X \to \{0, 1\}$$

defined as

$$\mathbf{I}_A(x) := \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{if } x \notin A. \end{cases}$$

At first glance, there might not seem like much point in using such a simple random variable. However, it can be very useful, especially in conjunction with the following important fact.
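Before stating that fact, note one basic property (not spelled out above, but immediate from the definition of expectation): the expectation of an indicator equals the probability of the event, $\operatorname{E}[\mathbf{I}_A] = \mathbb{P}(A)$. A minimal numpy sketch, using the event "a fair die shows an even number" as an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)     # simulated fair-die rolls

indicator = (rolls % 2 == 0).astype(int)     # I_A for A = "roll is even"
print(indicator.mean())                      # ~0.5, i.e. approximately P(A)
```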

Linearity of expectation. The expected value operator (or expectation operator) $\operatorname{E}[\cdot]$ is linear in the sense that, for any random variables $X$ and $Y$, and a constant $a$,

$$ \begin{aligned} \operatorname{E}[X+Y] &= \operatorname{E}[X] + \operatorname{E}[Y], \\ \operatorname{E}[aX] &= a \operatorname{E}[X], \end{aligned} $$

whenever the right-hand side is well-defined. When $X$ and $Y$ are independent, we also have

$$\mathrm{E}(X \cdot Y) = \mathrm{E}(X)\, \mathrm{E}(Y)$$

To see this, note that if $X$ and $Y$ are independent, their joint distribution factorizes:

$$p(x, y) = p(x)\, p(y) \quad \text{for all} \quad x \in \mathcal{X},\, y \in \mathcal{Y}$$

Applying this to the expected value for discrete random variables, we have

$$ \begin{aligned} \mathrm{E}(X\,Y) &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} (x \cdot y) \cdot f_{X,Y}(x,y) \\ &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} (x \cdot y) \cdot \left( f_X(x) \cdot f_Y(y) \right) \\ &= \sum_{x \in \mathcal{X}} x \cdot f_X(x) \sum_{y \in \mathcal{Y}} y \cdot f_Y(y) \\ &= \sum_{x \in \mathcal{X}} x \cdot f_X(x) \cdot \mathrm{E}(Y) \\ &= \mathrm{E}(X) \, \mathrm{E}(Y) \end{aligned} $$

And applying it to the expected value for continuous random variables, we have

$$ \begin{aligned} \mathrm{E}(X\,Y) &= \int_{\mathcal{X}} \int_{\mathcal{Y}} (x \cdot y) \cdot f_{X,Y}(x,y) \, \mathrm{d}y \, \mathrm{d}x \\ &= \int_{\mathcal{X}} \int_{\mathcal{Y}} (x \cdot y) \cdot \left( f_X(x) \cdot f_Y(y) \right) \, \mathrm{d}y \, \mathrm{d}x \\ &= \int_{\mathcal{X}} x \cdot f_X(x) \int_{\mathcal{Y}} y \cdot f_Y(y) \, \mathrm{d}y \, \mathrm{d}x \\ &= \int_{\mathcal{X}} x \cdot f_X(x) \cdot \mathrm{E}(Y) \, \mathrm{d}x \\ &= \mathrm{E}(X) \, \mathrm{E}(Y) \end{aligned} $$
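A quick simulation makes the product rule concrete. In the numpy sketch below (the particular distributions are arbitrary choices, with $\mathrm{E}(X) = 2$ and $\mathrm{E}(Y) = 3$), the sample mean of the product matches the product of the sample means:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=1_000_000)    # independent samples of X, E(X) = 2
y = rng.exponential(3.0, size=1_000_000)    # independent samples of Y, E(Y) = 3

print(np.mean(x * y))                       # ~6.0
print(np.mean(x) * np.mean(y))              # ~6.0
```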

Variance and Covariance. The variance of a random variable $X$ is the expected value of the squared deviation from the mean of $X$, $\mu = \operatorname{E}[X]$:

$$\operatorname{Var}(X) = \operatorname{E}\left[(X - \mu)^2\right].$$

This definition encompasses random variables that are generated by processes that are discrete, continuous, neither, or mixed. The variance can also be thought of as the covariance of a random variable with itself:

$$\operatorname{Var}(X) = \operatorname{Cov}(X, X).$$

For two jointly distributed real-valued random variables $X$ and $Y$ with finite second moments, the covariance is defined as the expected value (or mean) of the product of their deviations from their individual expected values:

$$\operatorname{cov}(X, Y) = \operatorname{E}\big[(X - \operatorname{E}[X])(Y - \operatorname{E}[Y])\big]$$

where $\operatorname{E}[X]$ is the expected value of $X$, also known as the mean of $X$. The covariance is also sometimes denoted $\sigma_{XY}$ or $\sigma(X, Y)$, in analogy to variance.

The expression for the variance can be expanded as follows:

$$ \begin{aligned} \operatorname{Var}(X) &= \operatorname{E}\left[(X - \operatorname{E}[X])^2\right] \\ &= \operatorname{E}\left[X^2 - 2X\operatorname{E}[X] + \operatorname{E}[X]^2\right] \\ &= \operatorname{E}\left[X^2\right] - 2\operatorname{E}[X]\operatorname{E}[X] + \operatorname{E}[X]^2 \\ &= \operatorname{E}\left[X^2\right] - \operatorname{E}[X]^2 \end{aligned} $$

In other words, the variance of $X$ is equal to the mean of the square of $X$ minus the square of the mean of $X$. This equation should not be used for computations using floating point arithmetic, because it suffers from catastrophic cancellation if the two components of the equation are similar in magnitude.
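To make the warning concrete, here is a small numpy sketch (the constants are arbitrary): with a large mean and unit variance, the two-pass definition stays accurate while the shortcut formula loses essentially all of its precision.

```python
import numpy as np

rng = np.random.default_rng(2)
x = 1e9 + rng.normal(0.0, 1.0, size=100_000)   # huge mean, unit variance
mu = x.mean()

# Two-pass definition E[(X - mu)^2]: numerically stable.
print(np.mean((x - mu) ** 2))                  # ~1.0

# Shortcut E[X^2] - E[X]^2: both terms are ~1e18, so their difference
# suffers catastrophic cancellation and can be wildly off (even negative).
print(np.mean(x ** 2) - mu ** 2)
```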

For the covariance, using the linearity of expectation, the definition can likewise be simplified to the expected value of the product minus the product of the expected values:

$$ \begin{aligned} \operatorname{cov}(X, Y) &= \operatorname{E}\left[\left(X - \operatorname{E}\left[X\right]\right)\left(Y - \operatorname{E}\left[Y\right]\right)\right] \\ &= \operatorname{E}\left[XY - X\operatorname{E}\left[Y\right] - \operatorname{E}\left[X\right]Y + \operatorname{E}\left[X\right]\operatorname{E}\left[Y\right]\right] \\ &= \operatorname{E}\left[XY\right] - \operatorname{E}\left[X\right]\operatorname{E}\left[Y\right] - \operatorname{E}\left[X\right]\operatorname{E}\left[Y\right] + \operatorname{E}\left[X\right]\operatorname{E}\left[Y\right] \\ &= \operatorname{E}\left[XY\right] - \operatorname{E}\left[X\right]\operatorname{E}\left[Y\right] \end{aligned} $$

As with the variance, this form is susceptible to catastrophic cancellation.

Covariance of independent random variables. Let $X$ and $Y$ be independent random variables. Then the covariance of $X$ and $Y$ is zero:

$$X, Y \; \text{independent} \quad \Rightarrow \quad \mathrm{Cov}(X, Y) = 0$$

The covariance can be expressed in terms of expected values as

$$\mathrm{Cov}(X, Y) = \mathrm{E}(X\,Y) - \mathrm{E}(X)\, \mathrm{E}(Y)$$

For independent random variables, the expected value of the product is equal to the product of the expected values:

$$\mathrm{E}(X\,Y) = \mathrm{E}(X)\, \mathrm{E}(Y)$$

Therefore,

$$ \begin{aligned} \mathrm{Cov}(X, Y) &= \mathrm{E}(X\,Y) - \mathrm{E}(X)\, \mathrm{E}(Y) \\ &= \mathrm{E}(X)\, \mathrm{E}(Y) - \mathrm{E}(X)\, \mathrm{E}(Y) \\ &= 0 \end{aligned} $$

Variance of the sum of two random variables. The variance of the sum of two random variables equals the sum of the variances of those random variables, plus two times their covariance:

$$\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\, \mathrm{Cov}(X, Y)$$

Using the linearity of expectation, we can prove this as follows:

$$ \begin{aligned} \mathrm{Var}(X+Y) &= \mathrm{E}\left[ ((X+Y) - \mathrm{E}(X+Y))^2 \right] \\ &= \mathrm{E}\left[ ([X - \mathrm{E}(X)] + [Y - \mathrm{E}(Y)])^2 \right] \\ &= \mathrm{E}\left[ (X - \mathrm{E}(X))^2 + (Y - \mathrm{E}(Y))^2 + 2\,(X - \mathrm{E}(X))(Y - \mathrm{E}(Y)) \right] \\ &= \mathrm{E}\left[ (X - \mathrm{E}(X))^2 \right] + \mathrm{E}\left[ (Y - \mathrm{E}(Y))^2 \right] + \mathrm{E}\left[ 2\,(X - \mathrm{E}(X))(Y - \mathrm{E}(Y)) \right] \\ &= \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\, \mathrm{Cov}(X, Y) \end{aligned} $$
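This identity is easy to check on simulated data. A numpy sketch (the correlated pair below is an arbitrary construction; `bias=True` makes `np.cov` use the same $1/N$ normalization as `np.var`):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=1_000_000)
y = 0.5 * x + rng.normal(size=1_000_000)       # deliberately correlated with x

lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y, bias=True)[0, 1]
print(lhs, rhs)                                # the two agree up to floating-point error
```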

Variance of the sum of multiple random variables. For random variables $X_i$ we have:

$$\text{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n}\sum_{j=1}^{n} \text{Cov}\left(X_i, X_j\right)$$

Using the decomposition formula for the variance derived above, we have

$${\rm Var}\left( \sum_{i=1}^{n} X_i \right) = E\left( \left[ \sum_{i=1}^{n} X_i \right]^2 \right) - \left[ E\left( \sum_{i=1}^{n} X_i \right) \right]^2$$

Now note that $\left(\sum_{i=1}^{n} a_i\right)^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j$, which is clear if you think about what you’re doing when you calculate

$$(a_1 + \cdots + a_n) \cdot (a_1 + \cdots + a_n),$$

by hand. Therefore,

$$E\left( \left[ \sum_{i=1}^{n} X_i \right]^2 \right) = E\left( \sum_{i=1}^{n} \sum_{j=1}^{n} X_i X_j \right) = \sum_{i=1}^{n} \sum_{j=1}^{n} E(X_i X_j)$$

similarly,

$$\left[ E\left( \sum_{i=1}^{n} X_i \right) \right]^2 = \left[ \sum_{i=1}^{n} E(X_i) \right]^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} E(X_i)\, E(X_j)$$

So,

$$ \begin{aligned} {\rm Var}\left( \sum_{i=1}^{n} X_i \right) &= \sum_{i=1}^{n} \sum_{j=1}^{n} \big( E(X_i X_j) - E(X_i)\, E(X_j) \big) \\ &= \sum_{i=1}^{n} \sum_{j=1}^{n} {\rm cov}(X_i, X_j) \end{aligned} $$
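Numerically, the double sum on the right is just the sum of all entries of the covariance matrix. A numpy sketch (the three-dimensional Gaussian below, with its covariance matrix, is an arbitrary example of mine):

```python
import numpy as np

rng = np.random.default_rng(5)
samples = rng.multivariate_normal(
    mean=[0.0, 0.0, 0.0],
    cov=[[1.0, 0.3, 0.0],
         [0.3, 2.0, -0.5],
         [0.0, -0.5, 1.5]],
    size=1_000_000,
)

print(np.var(samples.sum(axis=1)))          # Var(X_1 + X_2 + X_3)
print(np.cov(samples.T, bias=True).sum())   # sum of all covariance-matrix entries
```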

Pairwise independence. In general:

$$\text{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n}\sum_{j=1}^{n} \text{Cov}\left(X_i, X_j\right)$$

If $X_i$ and $X_j$ are independent then $\mathbb{E}X_i X_j = \mathbb{E}X_i\, \mathbb{E}X_j$ and consequently

$$\text{Cov}\left(X_i, X_j\right) := \mathbb{E}X_i X_j - \mathbb{E}X_i\, \mathbb{E}X_j = 0$$

This leads to:

$$\text{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \text{Var}\, X_i$$
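A simulation with independent Bernoulli variables (an arbitrary example of mine; $n = 5$, $p = 0.3$, so the sum has variance $np(1-p) = 1.05$) shows the two sides agreeing:

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 5, 1_000_000
x = rng.binomial(1, 0.3, size=(trials, n))   # n independent Bernoulli(0.3) columns

print(np.var(x.sum(axis=1)))                 # ~1.05, variance of the sum
print(np.var(x, axis=0).sum())               # ~1.05, sum of the variances
```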

Binomial distribution. In probability theory and statistics, the binomial distribution with parameters $n$ and $p$ is the discrete probability distribution of the number of successes in a sequence of $n$ independent experiments, each asking a yes–no question, and each with its own Boolean-valued outcome: success (with probability $p$) or failure (with probability $q = 1 - p$).

The probability of getting exactly $k$ successes in $n$ independent Bernoulli trials is given by the probability mass function:

$$f(k, n, p) = \binom{n}{k} p^k (1-p)^{n-k}$$
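The PMF is straightforward to compute directly; a short Python sketch (the helper `binomial_pmf` and the parameters are my own illustration):

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(binomial_pmf(3, 10, 0.5))                             # 0.1171875
print(sum(binomial_pmf(k, 10, 0.5) for k in range(11)))     # 1.0, the PMF sums to one
```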