Develop Some Fluency in Probabilistic Thinking (Part II)

The foundation of machine learning and data science is probability theory. In this post, we will develop some fluency in probabilistic thinking with different examples, which prepares data scientists well for the sexiest job of the 21st century.

This is the second part of the series, as the title states. We will continue to study the material from Cameron Musco’s course - Algorithms for Data Science (Musco, 2022). This section assumes readers know what a hash table is. If not, just check out this video from MIT.

When we use a random hash function h(x) to map keys to hash values, we could design a fully random one with the following property:

\mathbb{P}(h(x) = i) = \frac{1}{n}

However, with the above property, we would need to store a table of x values and their hash values. This would take at least O(m) space and up to O(m) query time to look up, making our whole quest for O(1) query time pointless. Therefore, we need a weaker property that still keeps collisions rare while allowing O(1) query time.

2-Universal Hash Function (low collision probability). A random hash function h: U \to [n] is 2-universal if, for all x \neq y \in U:

\mathbb{P}[h(x) = h(y)] \leq \frac{1}{n} \tag{1}

One hash function that is often used in practice is

h(x) = ((a x + b) \bmod p) \bmod n

where p is a prime number larger than the size of the universe U, and a \in \{1, \dots, p-1\}, b \in \{0, \dots, p-1\} are chosen uniformly at random.

Pairwise Independent Hash Function. A random hash function h: U \to [n] is pairwise independent if, for all x \neq y \in U and all i, j \in [n]:

\mathbb{P}[h(x) = i \cap h(y) = j] = \frac{1}{n^2} \tag{2}

We can show that pairwise independent hash functions are 2-universal. A collision h(x) = h(y) occurs exactly when h(x) = i and h(y) = i for some i \in [n], so we sum over the n possible values of i:

\begin{aligned} \mathbb{P}[h(x) = h(y)] & = \sum_{i=1}^n \mathbb{P}[h(x) = i \cap h(y) = i] \\ & = n \cdot \frac{1}{n^2} = \frac{1}{n} \end{aligned}
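To make this concrete, here is a minimal Python sketch (my own illustration, not code from the course; the prime p and the test keys are arbitrary choices) of the hash family above, with an empirical check that the collision probability for a fixed pair of keys is roughly 1/n.

```python
import random

def make_hash(n, p=2_147_483_647):
    """Draw a random member of the family h(x) = ((a*x + b) mod p) mod n.

    p is a prime larger than the key universe; a and b are chosen uniformly at random.
    """
    a = random.randint(1, p - 1)
    b = random.randint(0, p - 1)
    return lambda x: ((a * x + b) % p) % n

n = 100
x, y = 12345, 67890          # any two distinct keys
trials = 100_000
collisions = 0
for _ in range(trials):
    h = make_hash(n)         # draw a fresh random hash function each trial
    if h(x) == h(y):
        collisions += 1
print(f"empirical collision rate: {collisions / trials:.4f}, 1/n = {1 / n:.4f}")
```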

Example 5: Randomized load balancing

We have n requests randomly assigned to k servers. How many requests must each server handle? Let R_i be the number of requests assigned to server i. The expected load is

\begin{aligned} \mathbb{E}[R_i] & = \sum_{j=1}^n \mathbb{E}\left[\mathbb{I}_{\text{request } j \text{ assigned to } i}\right] \\ & = \sum_{j=1}^n \mathbb{P}(j \text{ assigned to } i) \times 1 \\ & = \frac{n}{k} \tag{3} \end{aligned}

If we provision each server to handle twice the expected load, what is the probability that a server is overloaded? We can bound it with Markov’s inequality:

\mathbb{P}[R_i > 2\,\mathbb{E}[R_i]] \leq \frac{\mathbb{E}[R_i]}{2\,\mathbb{E}[R_i]} = \frac{1}{2}

Chebyshev’s inequality

With a very simple twist, Markov’s inequality can be made much more powerful. For any random variable X and any value t > 0, we have:

\mathbb{P}(|X| \geq t) = \mathbb{P}(X^2 \geq t^2) \leq \frac{\mathbb{E}[X^2]}{t^2}

Now, by plugging in the random variable X - \mathbb{E}[X], we have Chebyshev’s inequality.

Chebyshev’s inequality. For any random variable X and t > 0,

\mathbb{P}(|X - \mathbb{E}[X]| \geq t) \leq \frac{\mathrm{Var}[X]}{t^2} \tag{4}

With Chebyshev’s inequality, we can bound the probability that X falls at least s standard deviations from its mean:

\mathbb{P}\left(|X - \mathbb{E}[X]| \geq s \cdot \sqrt{\mathrm{Var}[X]}\right) \leq \frac{\mathrm{Var}[X]}{s^2\,\mathrm{Var}[X]} = \frac{1}{s^2} \tag{5}

Equation (5) gives us a quick estimate of how far a random variable is likely to deviate from its mean. It also shows that the sample average concentrates around the mean given enough samples n (the law of large numbers).

Consider drawing independent identically distributed (i.i.d.) random variables X_1, \cdots, X_n with mean μ and variance σ². How well does the sample average

S = \frac{1}{n} \sum_{i=1}^n X_i

approximate the true mean μ?

First, we need to calculate the variance of the sample average:

\mathrm{Var}[S] = \mathrm{Var}\left[\frac{1}{n} \sum_{i=1}^n X_i\right] = \frac{1}{n^2} \sum_{i=1}^n \mathrm{Var}[X_i] = \frac{1}{n^2} \cdot n \cdot \sigma^2 = \frac{\sigma^2}{n}

By Chebyshev’s inequality, for any fixed value ε > 0, we have

\mathbb{P}(|S - \mu| \geq \epsilon) \leq \frac{\mathrm{Var}[S]}{\epsilon^2} = \frac{\sigma^2}{n \epsilon^2}

This bound can be used to estimate the sample size we need for a given level of accuracy. Measuring the deviation in units of σ, i.e., replacing ε with ε ⋅ σ, we get

\mathbb{P}(|S - \mu| \geq \epsilon \cdot \sigma) \leq \frac{\mathrm{Var}[S]}{\epsilon^2 \sigma^2} = \frac{1}{n \epsilon^2} \tag{6}

We can use equation (6) to quickly estimate the sample size needed for a given accuracy. Figure 1 shows that the bound on the probability of the sample mean deviating by more than 0.1σ (ε = 0.1) drops below 1 once the sample size exceeds 100 and keeps shrinking as n grows, whereas for a much tighter accuracy (ε = 0.01) the bound is still larger than 1 at these sample sizes, which means a larger sample is needed.

Figure 1. Plot of the bound in equation (6) against sample size. Remark: probabilities are bounded by 1, so the bound is not meaningful where the dotted line exceeds 1.
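As a quick numerical companion to Figure 1 (my own sketch, not the original figure’s code; the ε values and sample sizes are arbitrary choices), the bound 1/(nε²) from equation (6) can be tabulated directly:

```python
def chebyshev_bound(n, eps):
    """Bound on P(|S - mu| >= eps * sigma) from equation (6), capped at 1."""
    return min(1.0, 1.0 / (n * eps ** 2))

for eps in (0.1, 0.01):
    for n in (100, 1_000, 10_000, 100_000):
        print(f"eps = {eps:>4}, n = {n:>7}: bound = {chebyshev_bound(n, eps):.4f}")
```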

Now, back to our load balancing example: we assumed that each server can handle twice the expected load, where the expected load is

\mathbb{E}[R_i] = \frac{n}{k}

as derived in equation (3), and we want to find the probability that a server is overloaded. We can bound it with Chebyshev’s inequality, but first we need the variance. Let R_{i,j} be 1 if request j is assigned to server i and 0 otherwise. Then

\begin{aligned} \mathrm{Var}[R_{i,j}] & = \mathbb{E}\left[(R_{i,j} - \mathbb{E}[R_{i,j}])^2\right] \\ & = \mathbb{P}(R_{i,j} = 1) \cdot (1 - \mathbb{E}[R_{i,j}])^2 + \mathbb{P}(R_{i,j} = 0) \cdot (0 - \mathbb{E}[R_{i,j}])^2 \\ & = \frac{1}{k} \cdot \left(1 - \frac{1}{k}\right)^2 + \left(1 - \frac{1}{k}\right) \cdot \left(0 - \frac{1}{k}\right)^2 \\ & = \frac{1}{k} - \frac{1}{k^2} \\ & \leq \frac{1}{k} \tag{7} \end{aligned}

Now, since the R_{i,j} are independent, their variances add, giving

\mathrm{Var}[R_i] = \sum_{j=1}^n \mathrm{Var}[R_{i,j}] \leq \frac{n}{k}

Now, with \mathbb{E}[R_i] = \frac{n}{k} and \mathrm{Var}[R_i] \leq \frac{n}{k}, we can apply Chebyshev’s inequality (recall that R_i is the number of requests sent to server i):

\begin{aligned} \mathbb{P}\left(R_i \geq \frac{2n}{k}\right) & = \mathbb{P}\left(R_i - \frac{n}{k} \geq \frac{n}{k}\right) \\ & \leq \mathbb{P}\left(\left|R_i - \frac{n}{k}\right| \geq \frac{n}{k}\right) \quad \text{(take absolute value)} \\ & = \mathbb{P}\left(\left|R_i - \mathbb{E}[R_i]\right| \geq \frac{n}{k}\right) \\ & \leq \frac{\mathrm{Var}[R_i]}{(n/k)^2} \\ & \leq \frac{k}{n} \tag{8} \end{aligned}

Equation (8) shows that the overload probability is very small when k ≪ n. It may seem counterintuitive that the bound gets worse as k grows (more servers do not help), but when k is large, the number of requests each server sees in expectation is very small, so the law of large numbers does not ‘kick in’.

Maximum server load

We bounded the probability that one particular server is overloaded in equation (8); now we are interested in the probability that the maximum server load exceeds capacity. Not just one server but possibly many of them could be overloaded, so we consider all servers at once. The maximum load on any server can be written as the random variable:

S = \max_{i \in \{1, \cdots, k\}} R_i

How can we express the event S ≥ 2n/k in terms of the R_i? Start from the null case in which no R_i is greater than or equal to 2n/k; then S < 2n/k. Conversely, if at least one R_i is greater than or equal to 2n/k, then S ≥ 2n/k (we do not need all of them to exceed the threshold). So the event S ≥ 2n/k is exactly the ‘or’ of the individual overload events:

\begin{aligned} \mathbb{P}\left(S \geq \frac{2n}{k}\right) & = \mathbb{P}\left(\left[R_1 \geq \frac{2n}{k}\right] \ \text{or} \ \left[R_2 \geq \frac{2n}{k}\right] \ \text{or} \ \cdots \ \text{or} \ \left[R_k \geq \frac{2n}{k}\right]\right) \\ & = \mathbb{P}\left(\cup_{i=1}^k \left[R_i \geq \frac{2n}{k}\right]\right) \end{aligned}

Union Bound. For any random events A_1, A_2, \cdots, A_k, we have

\mathbb{P}(A_1 \cup A_2 \cup \cdots \cup A_k) \leq \mathbb{P}(A_1) + \mathbb{P}(A_2) + \cdots + \mathbb{P}(A_k) \tag{9}

With the union bound, it is easy to show

\begin{aligned} \mathbb{P}\left(S \geq \frac{2n}{k}\right) & = \mathbb{P}\left(\cup_{i=1}^k \left[R_i \geq \frac{2n}{k}\right]\right) \\ & \leq \sum_{i=1}^k \mathbb{P}\left(R_i \geq \frac{2n}{k}\right) \\ & \leq \sum_{i=1}^k \frac{k}{n} \\ & = \frac{k^2}{n} \end{aligned}

This means that as long as k ≪ √n, the maximum server load will, with good probability, stay within a constant factor of the expected load.
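The following simulation is a rough sketch of this whole example (my own code; the values of n, k, and the number of trials are arbitrary choices). It assigns requests to servers uniformly at random and compares the observed overload frequencies with the Chebyshev bound k/n and the union bound k²/n.

```python
import random

def simulate(n, k, trials=2_000):
    """Estimate P(server 0 is overloaded) and P(max load >= 2n/k) by simulation."""
    threshold = 2 * n / k
    one_over = max_over = 0
    for _ in range(trials):
        loads = [0] * k
        for _ in range(n):
            loads[random.randrange(k)] += 1   # assign each request to a uniform server
        one_over += loads[0] >= threshold
        max_over += max(loads) >= threshold
    return one_over / trials, max_over / trials

n, k = 10_000, 50
p_one, p_max = simulate(n, k)
print(f"P(one server overloaded) ~ {p_one:.4f}  (Chebyshev bound k/n   = {k / n:.4f})")
print(f"P(max load overloaded)   ~ {p_max:.4f}  (union bound     k^2/n = {k * k / n:.4f})")
```

For these parameter choices, the empirical frequencies come out far below the bounds, which illustrates how conservative Chebyshev and the union bound can be.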

Example 6: flipping coins

You might think this example is easy, since we have all seen coin flips since elementary school. However, deriving the numbers carefully is instructive. Suppose we flip n = 100 independent coins, each landing heads with probability 1/2 and tails with probability 1/2. Let H be the number of heads; then:

\mathbb{E}[H] = \frac{n}{2} = 50; \quad \mathrm{Var}[H] = \frac{n}{4} = 25 \tag{10}

When you look at equation (10), it seems very straightforward. However, to get numbers like 50 and 25 from first principles takes a longer derivation. Let’s use the definition of expectation. I have to admit that I found it very unnatural to write down the expectation formula for our example, and I had an almost irresistible impulse to review the definition of expectation again.

Expected value
Informally, the expected value is the weighted arithmetic mean of a large number of independently selected outcomes of a random variable. This means we need two ingredients: a random variable and a set of possible outcomes. For example, take human body weight, which is roughly normally distributed: the random variable is a person’s weight, and the set of possible outcomes is the set of all possible weight values (some people are fit, some are slim, some are overweight). Finally, we take the weighted arithmetic mean over these outcomes.

For the coin-flipping example, the random variable is H (the number of heads), and the countable set of possible outcomes is the set of values h ∈ {0, 1, …, n}. Since H follows the binomial distribution, we can write down the formula for the expected value:

\mathbb{E}[H] = \sum_{h=0}^n h \cdot \mathbb{P}(h) = \sum_{h=0}^n h \cdot \binom{n}{h} p^h (1-p)^{n-h} \tag{11}

I hope you feel relieved when you compare equations (11) and (10); it is natural to have an ‘unnatural’ reaction at first. Now we will derive the expected value. There are different ways to do it (see all proofs). Using the binomial theorem, it can be shown that

\mathbb{E}[H] = np

Here we will instead leverage the linearity of expectation (which holds even without independence). We can treat H as the sum of discrete random variables h_i, each the outcome of a Bernoulli trial:

H = \sum_{i=1}^n h_i \tag{12}

Remark: be careful with the indices. In equation (11), h is not a random variable; it is just a summation index counting the possible numbers of heads (which could be 0). In equation (12), however, h_i is a random variable that follows a Bernoulli trial. We run n independent trials and are interested in the sum of these discrete random variables.

Therefore,

\begin{aligned} \mathbb{E}[H] & = \mathbb{E}\left[\sum_{i=1}^n h_i\right] = \sum_{i=1}^n \mathbb{E}[h_i] = np \tag{13} \end{aligned}

Bernoulli random variables and indicator variables are two aspects of the same concept. As a review, a random variable I is called an indicator variable for an event A if I = 1 when A occurs and I = 0 if A does not occur; then \mathbb{P}(I = 1) = \mathbb{P}(A) and \mathbb{E}[I] = \mathbb{P}(A). Indicator random variables are Bernoulli random variables with p = \mathbb{P}(A). For instance, using indicator variables, the expectation can be calculated as follows:

\mathbb{E}[H] = \sum_{i=1}^n \mathbb{E}\left[\mathbb{I}_{h_i = \text{head}}\right] = \sum_{i=1}^n \mathbb{P}(h_i = \text{head}) = np

Using the same trick as in equation (13), and since the trials are independent (variance is additive for independent random variables), we can derive the variance of H:

\mathrm{Var}[H] = \mathrm{Var}\left[\sum_{i=1}^n h_i\right] = \sum_{i=1}^n \mathrm{Var}[h_i] = np(1-p) \tag{14}

Now, with the values in equation (10), we can test how tight Markov’s inequality and Chebyshev’s inequality are:

\begin{aligned} & \mathbb{P}(X \geq a) \leq \frac{\mathbb{E}[X]}{a} \\ & \mathbb{P}(|X - \mathbb{E}[X]| \geq a) \leq \frac{\mathrm{Var}[X]}{a^2} \end{aligned}

Figure 2. Plot of different inequality bounds, which shows that the concentration improves a lot from Markov's inequality to Chebyshev's inequality, but there is still room to improve.
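To reproduce the spirit of Figure 2 (a sketch under my own choice of thresholds; it uses scipy's binomial distribution for the exact tail), we can compare both bounds with the true probability P(H ≥ a):

```python
from scipy.stats import binom

n, p = 100, 0.5
mean, var = n * p, n * p * (1 - p)               # E[H] = 50, Var[H] = 25

for a in (60, 70, 80, 90):
    exact = binom.sf(a - 1, n, p)                # P(H >= a)
    markov = min(1.0, mean / a)                  # Markov:    P(H >= a) <= E[H] / a
    cheby = min(1.0, var / (a - mean) ** 2)      # Chebyshev: P(|H - E[H]| >= a - E[H])
    print(f"a = {a}: exact = {exact:.2e}, Markov = {markov:.3f}, Chebyshev = {cheby:.3f}")
```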

Example 7: Hoeffding’s inequality

Suppose that Z has a finite mean and that \mathbb{P}(Z \geq 0) = 1. Then, for any ε > 0,

\mathbb{E}(Z) = \int_0^{\infty} z \, dP(z) \geq \int_\epsilon^{\infty} z \, dP(z) \geq \epsilon \int_\epsilon^{\infty} dP(z) = \epsilon\, \mathbb{P}(Z > \epsilon) \tag{15}

which yields Markov’s inequality:

\mathbb{P}(Z > \epsilon) \leq \frac{\mathbb{E}(Z)}{\epsilon}

Equation (15) is very powerful: applying it to a monotonic transformation of Z preserves the bound structure, which yields other inequalities such as the Chernoff bound and Bernstein’s inequality.

Figure 2 shows that although Chebyshev’s inequality is closer to the true probability, it does not decay very fast. Let’s check the formula:

\mathbb{P}(|X - \mathbb{E}[X]| \geq \epsilon) \leq \frac{\mathrm{Var}[X]}{\epsilon^2}

We know that \mathrm{Var}[S] = \sigma^2/n if the X_i are i.i.d. with mean μ and variance σ², where S is the sample average, so we have

\mathbb{P}(|S - \mu| \geq \epsilon) \leq \frac{\sigma^2}{n \epsilon^2}

While this inequality is useful, it does not decay exponentially fast as n increases. To improve the inequality, we use Chernoff’s method: for any t > 0,

\mathbb{P}(Z > \epsilon) = \mathbb{P}(e^{Z} > e^{\epsilon}) = \mathbb{P}(e^{tZ} > e^{t\epsilon}) \leq \frac{\mathbb{E}(e^{tZ})}{e^{t\epsilon}} \tag{16}

This bounds the tail probability in terms of the moment-generating function.

The moment-generating function is so named because it can be used to find the moments of the distribution. The series expansion of e^{tX} is

e^{tX} = 1 + tX + \frac{t^2 X^2}{2!} + \frac{t^3 X^3}{3!} + \cdots + \frac{t^n X^n}{n!} + \cdots

Hence

\begin{aligned} M_X(t) = \mathbb{E}\left(e^{tX}\right) & = 1 + t\,\mathbb{E}(X) + \frac{t^2 \mathbb{E}(X^2)}{2!} + \frac{t^3 \mathbb{E}(X^3)}{3!} + \cdots + \frac{t^n \mathbb{E}(X^n)}{n!} + \cdots \\ & = 1 + t m_1 + \frac{t^2 m_2}{2!} + \frac{t^3 m_3}{3!} + \cdots + \frac{t^n m_n}{n!} + \cdots \end{aligned}

where m_n is the nth moment. Differentiating M_X(t) i times with respect to t and setting t = 0, we obtain the ith moment about the origin, m_i.
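A small symbolic sketch (using sympy; taking a Bernoulli(p) variable as my own example, which is not spelled out in the text) shows how differentiating the MGF at t = 0 recovers the moments:

```python
import sympy as sp

t, p = sp.symbols('t p')

# MGF of a Bernoulli(p) random variable: M_X(t) = E[e^{tX}] = (1 - p) + p * e^t
M = (1 - p) + p * sp.exp(t)

m1 = sp.diff(M, t, 1).subs(t, 0)   # first moment:  E[X]   = p
m2 = sp.diff(M, t, 2).subs(t, 0)   # second moment: E[X^2] = p
print(sp.simplify(m1), sp.simplify(m2), sp.simplify(m2 - m1**2))  # p, p, p*(1 - p)
```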

To learn more about probability concentration, please read these notes.

Before we present Hoeffding’s inequality, we will introduce a new concept: the symmetric random variable.

A random variable Z is symmetric if Z and −Z have the same distribution. A simple example of a symmetric random variable is the well-known symmetric Bernoulli, which takes the values −1 and +1 with equal probability 1/2 each, i.e.,

\mathbb{P}[Z = 1] = \mathbb{P}[Z = -1] = 1/2

For a usual Bernoulli random variable X with the two possible values 1 and 0 and parameter 1/2, we can transform it into a symmetric Bernoulli random variable by setting

Z = 2X - 1

Hoeffding’s Inequality. Let X_1, \cdots, X_N be independent symmetric Bernoulli random variables, and let a = (a_1, \cdots, a_N) \in \mathbb{R}^N. Then, for any t \geq 0, we have

\mathbb{P}\left[\sum_{i=1}^N a_i X_i \geq t\right] \leq \exp\left(-\frac{t^2}{2\|a\|_2^2}\right) \tag{17}

Remark: sometimes you will see the above formula stated as a two-sided inequality with an absolute value. For the proof, see here or the book by Vershynin.

Now, to use this inequality, we have to be careful: our random variables are the individual coin flips X_i, not the total count H from equation (10) and Figure 2.

Let X_i be the random variable mapping heads and tails of a coin flip to 1 and 0; we are interested in the number of heads in N = 100 trials. To use equation (17), we need to transform each X_i into a symmetric random variable by setting

Y_i = 2\left(X_i - \frac{1}{2}\right)

Then we can have

\begin{aligned} \mathbb{P}\left[\sum_{i=1}^N a_i X_i > t\right] &= \mathbb{P}\left[\sum_{i=1}^N a_i \left(\frac{Y_i}{2} + \frac{1}{2}\right) > t\right] \\ &= \mathbb{P}\left[\sum_{i=1}^N Y_i > 2t - N\right] \quad (a_i = 1 \ \forall i \in [N]) \\ & \leq \exp\left(-\frac{(2t-N)^2}{2N}\right) \quad (\|a\|_2^2 = N) \end{aligned}

Figure 3. Plot of different inequality bounds, which shows that Hoeffding's bound follows the true tail probability (from the binomial distribution) much more closely.
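The bound derived above is easy to evaluate next to the exact binomial tail (my own sketch; the thresholds t are arbitrary and must satisfy t ≥ N/2 for the exponent to be meaningful):

```python
import math
from scipy.stats import binom

N = 100

def hoeffding_bound(t, N):
    """Bound on P(number of heads >= t) for N fair coins: exp(-(2t - N)^2 / (2N))."""
    return math.exp(-((2 * t - N) ** 2) / (2 * N))

for t in (60, 70, 80, 90):
    exact = binom.sf(t - 1, N, 0.5)              # P(H >= t)
    print(f"t = {t}: exact = {exact:.2e}, Hoeffding = {hoeffding_bound(t, N):.2e}")
```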

Example 8: Bernstein inequality

Bernstein’s Inequality. Consider independent, mean-zero random variables X_1, \cdots, X_N (if the mean is not zero, we can center them), all falling in [−a, a], with \mathbb{E}[X_i^2] = \sigma^2. Then

\mathbb{P}\left[\left|\sum_{i=1}^N X_i\right| \geq t\right] \leq 2 \exp\left(-\frac{t^2}{2N\sigma^2 + \frac{2}{3} a t}\right) \tag{18}

Now, applying equation (18) to our coin-flipping example, we have

\begin{aligned} \mathbb{P}\left[\sum_{i=1}^N X_i \geq t\right] & = \mathbb{P}\left[\sum_{i=1}^N \left(\frac{Y_i}{2} + \frac{1}{2}\right) \geq t\right] \\ & = \mathbb{P}\left[\sum_{i=1}^N Y_i \geq (2t - N)\right] \quad (\mu = \mathbb{E}[Y_i] = 0) \\ & \leq \exp\left(-\frac{(2t-N)^2}{2N + \frac{2}{3} \cdot (2t - N)}\right) \quad (a = 1; \ \sigma^2 = 1) \end{aligned}

Remark: we drop the factor of 2 because we are dealing with a one-sided inequality.

Figure 4. Plot of different inequality bounds, which shows that Hoeffding's bound follows the true tail probability very closely, while Bernstein's bound gives a larger (weaker) value than Hoeffding's in this case.
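To close, here is a small sketch (my own code, mirroring the derivation above with a = 1 and σ² = 1 for the symmetric Y_i; the thresholds are arbitrary) that puts the one-sided Hoeffding and Bernstein bounds side by side, as in Figure 4:

```python
import math

N = 100

def hoeffding(t, N):
    """One-sided Hoeffding bound for N fair coins: exp(-(2t - N)^2 / (2N))."""
    return math.exp(-((2 * t - N) ** 2) / (2 * N))

def bernstein(t, N):
    """One-sided Bernstein bound with a = 1 and sigma^2 = 1 for the Y_i."""
    s = 2 * t - N                                 # threshold for the sum of the Y_i
    return math.exp(-(s ** 2) / (2 * N + (2 / 3) * s))

for t in (60, 70, 80, 90):
    print(f"t = {t}: Hoeffding = {hoeffding(t, N):.2e}, Bernstein = {bernstein(t, N):.2e}")
```

With these parameters, Bernstein's bound indeed comes out larger than Hoeffding's, consistent with Figure 4.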

To learn when Hoeffding is stronger than Bernstein and when Bernstein is stronger than Hoeffding, please read this post.

Summary

These inequalities provide the theoretical basis for many statistical machine learning methods. To learn more about the related theory, please visit the following:

  1. Musco, C. (2022). Randomized Methods, Sketching & Streaming. https://people.cs.umass.edu/~cmusco/CS514F22/schedule.html