Develop Some Fluency in Probabilistic Thinking (Part II)
The foundation of machine learning and data science is probability theory. In this post, we will develop some fluency in probabilistic thinking through different examples, which prepares data scientists well for the sexiest job of the 21st century.
This is the second part of this series of posts, as the title states. We will continue to study the material from Cameron Musco’s course - Algorithms for Data Science (Musco, 2022). This section assumes readers know what a hash table is. If not, just check out this video from MIT.
- Example 5: Randomized load balancing
- Example 6: flipping coins
- Example 7: Hoeffding’s inequality
- Example 8: Bernstein inequality
When we use a random hash function to map keys into hash values, we could design a fully random one, which has the following property: for every key $x$, $h(x)$ is distributed uniformly, i.e. $\Pr[h(x) = i] = \frac{1}{n}$ for all $i \in [n]$, and the hash values of different keys are mutually independent.
However, with the above property, we would need to store a table of keys and their hash values. This would take at least linear space and, without a hash table already in hand, more than constant query time to look up, making our whole quest for $O(1)$ query time pointless. Therefore we need to find another property that avoids collisions while keeping $O(1)$ query time.
2-Universal Hash Function (low collision probability). A random hash function $h: U \rightarrow [n]$ is 2-universal if, for all $x \neq y \in U$:
$$\Pr[h(x) = h(y)] \leq \frac{1}{n}$$
One of the functions that is often used is the following one:
$$h(x) = \big((ax + b) \bmod p\big) \bmod n$$
where $p$ is a prime larger than any key and $a \in \{1, \dots, p-1\}$, $b \in \{0, \dots, p-1\}$ are chosen uniformly at random.
Pairwise Independent Hash Function. A random hash function $h: U \rightarrow [n]$ is pairwise independent if, for all $x \neq y \in U$ and all $i, j \in [n]$:
$$\Pr[h(x) = i \cap h(y) = j] = \frac{1}{n^2}$$
We can show that pairwise independent hash functions are 2-universal. For the collision event $h(x) = h(y)$, among the $n^2$ possible pairs of hash values $(i, j)$, exactly $n$ pairs have $i = j$, therefore:
$$\Pr[h(x) = h(y)] = \sum_{i=1}^{n} \Pr[h(x) = i \cap h(y) = i] = n \cdot \frac{1}{n^2} = \frac{1}{n}$$
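To make these definitions concrete, here is a minimal sketch of the classic $((ax + b) \bmod p) \bmod n$ family that is commonly used as a 2-universal hash function, together with an empirical collision check. This is my own illustration, not code from the course: the prime, the number of buckets, and the test keys are arbitrary choices.

```python
import random

# A sketch of the classic 2-universal family h(x) = ((a*x + b) mod p) mod n.
# The prime p, the number of buckets, and the test keys below are illustrative
# choices, not values from the original post.
P = 2_147_483_647          # a Mersenne prime, assumed larger than all keys
N_BUCKETS = 100            # number of hash buckets

def make_hash(n_buckets=N_BUCKETS, p=P):
    """Draw one random member of the family: x -> ((a*x + b) % p) % n_buckets."""
    a = random.randint(1, p - 1)   # a != 0
    b = random.randint(0, p - 1)
    return lambda x: ((a * x + b) % p) % n_buckets

# Empirically estimate the collision probability for one fixed pair of keys.
x, y = 12345, 67890
trials = 100_000
collisions = 0
for _ in range(trials):
    h = make_hash()
    if h(x) == h(y):
        collisions += 1

print(f"empirical collision rate: {collisions / trials:.4f}")
print(f"2-universal bound (1/n):  {1 / N_BUCKETS:.4f}")
```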
Example 5: Randomized load balancing
We have $n$ requests randomly assigned to $k$ servers. How many requests must each server handle? Let $R_i$ be the number of requests assigned to server $i$. The expectation tells us
$$\mathbb{E}[R_i] = \sum_{j=1}^{n} \Pr[\text{request } j \text{ is assigned to server } i] = \frac{n}{k} \tag{3}$$
If we provision each server to be able to handle twice the expected load, what is the probability that a given server is overloaded? We can calculate it based on Markov’s inequality:
$$\Pr\left[R_i \geq \frac{2n}{k}\right] \leq \frac{\mathbb{E}[R_i]}{2n/k} = \frac{n/k}{2n/k} = \frac{1}{2}$$
Chebyshev’s inequality
With a very simple twist, Markov’s inequality can be made much more powerful. For any random variable $X$ and any value $t > 0$, we have:
$$\Pr[|X| \geq t] = \Pr[X^2 \geq t^2] \leq \frac{\mathbb{E}[X^2]}{t^2}$$
Now, by plugging in the centered random variable $X - \mathbb{E}[X]$, we can have Chebyshev’s inequality.
Chebyshev’s inequality. For any random variable $X$ and $t > 0$,
$$\Pr\big[|X - \mathbb{E}[X]| \geq t\big] \leq \frac{\mathrm{Var}[X]}{t^2}$$
With Chebyshev’s inequality, we can bound the probability that $X$ falls $s$ standard deviations from its mean as:
$$\Pr\big[|X - \mathbb{E}[X]| \geq s \cdot \sigma\big] \leq \frac{1}{s^2} \tag{5}$$
where $\sigma = \sqrt{\mathrm{Var}[X]}$.
Equation (5) gives us a quick estimate of how far our estimates are likely to deviate. It also shows that the sample average concentrates around the mean given enough samples (the law of large numbers).
Consider drawing $n$ independent identically distributed (i.i.d.) random variables $X_1, \dots, X_n$ with mean $\mu$ and variance $\sigma^2$. How well does the sample average
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
approximate the true mean $\mu$?
First, we need to calculate the variance of the sample average:
$$\mathrm{Var}[\bar{X}] = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}[X_i] = \frac{\sigma^2}{n}$$
By Chebyshev’s inequality, for any fixed value $t > 0$, we have
$$\Pr\big[|\bar{X} - \mu| \geq t\big] \leq \frac{\mathrm{Var}[\bar{X}]}{t^2} = \frac{\sigma^2}{n t^2} \tag{6}$$
Equation (6) can be used to estimate the sample size we need for a given level of accuracy. Instead of using $t$, we will use $t = s \cdot \sigma$; then we can have
$$\Pr\big[|\bar{X} - \mu| \geq s \cdot \sigma\big] \leq \frac{1}{n s^2}$$
We can quickly estimate the sample size we need to achieve a given level of accuracy based on equation (6). Figure 1 shows that the bound on the probability of the sample mean deviating by 10% of a standard deviation becomes small once the sample size is large enough, whereas the bound is still greater than 1 (i.e., vacuous) for small samples when we demand higher accuracy (a smaller $s$), which means one needs a larger sample in that case.
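As a quick back-of-the-envelope companion to Figure 1, the sketch below (my own, not the code behind the original figure) solves the bound $\frac{1}{n s^2} \leq \delta$ for the sample size $n$ at a few accuracy levels $s$; the target failure probability $\delta$ and the values of $s$ are arbitrary choices.

```python
import math

# Chebyshev: Pr[|sample_mean - mu| >= s * sigma] <= 1 / (n * s**2).
# Solving 1 / (n * s**2) <= delta for n gives the sample size below.
def chebyshev_sample_size(s, delta):
    """Smallest n such that the Chebyshev bound is at most delta."""
    return math.ceil(1.0 / (delta * s ** 2))

delta = 0.05                      # target failure probability (illustrative)
for s in (1.0, 0.5, 0.1):         # deviation measured in standard deviations
    n = chebyshev_sample_size(s, delta)
    print(f"deviation {s:>4} * sigma  ->  n >= {n}")
```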

Now, back to our load balancing example. We have assumed that each server can handle twice the expected load, and the expected load is $\mathbb{E}[R_i] = \frac{n}{k}$, as derived in equation (3). We want to find the probability that a server is overloaded. We can use Chebyshev’s inequality to calculate it; however, we need to calculate the variance first. Let $R_{i,j} = 1$ if request $j$ is assigned to server $i$ and $R_{i,j} = 0$ otherwise, so that $R_i = \sum_{j=1}^{n} R_{i,j}$ and
$$\mathrm{Var}[R_{i,j}] = \mathbb{E}[R_{i,j}^2] - \mathbb{E}[R_{i,j}]^2 = \frac{1}{k} - \frac{1}{k^2} \leq \frac{1}{k}$$
Now, since the requests are assigned independently, the variance of the sum is the sum of the variances:
$$\mathrm{Var}[R_i] = \sum_{j=1}^{n} \mathrm{Var}[R_{i,j}] \leq \frac{n}{k}$$
Now, with $\mathbb{E}[R_i] = \frac{n}{k}$ and $\mathrm{Var}[R_i] \leq \frac{n}{k}$, we can apply Chebyshev’s inequality ($R_i$ is the number of requests sent to server $i$):
$$\Pr\left[R_i \geq \frac{2n}{k}\right] \leq \Pr\left[\big|R_i - \mathbb{E}[R_i]\big| \geq \frac{n}{k}\right] \leq \frac{\mathrm{Var}[R_i]}{(n/k)^2} \leq \frac{n/k}{(n/k)^2} = \frac{k}{n} \tag{8}$$
Equation (8) shows that the overload probability is extremely small when $n \gg k$, which seems counterintuitive, as it means the bound gets worse as $k$ grows (more servers do not help). When $k$ is large, the number of requests each server sees in expectation is very small, so the law of large numbers does not ‘kick in’.
Maximum server load
We have bounded the probability that one server is overloaded in equation (8); now we are interested in the probability that the maximum server load exceeds its capacity. To reason about the maximum server load, we need to consider the set of servers that are overloaded: not just one server but possibly many of them may be overloaded. The maximum load on any server can be written as the random variable:
$$S = \max_{i \in \{1, \dots, k\}} R_i$$
How could we express a probability for $S$ in terms of the $R_i$? Let’s start from the null case, in which no $R_i$ is greater than or equal to $\frac{2n}{k}$. In that case, $S < \frac{2n}{k}$. Therefore, for $S \geq \frac{2n}{k}$ we need at least one $R_i$ to be greater than or equal to $\frac{2n}{k}$. On the other side, if all of the $R_i$ are greater than or equal to $\frac{2n}{k}$, then certainly $S \geq \frac{2n}{k}$. However, we do not need all of them: a single overloaded server suffices. We can write the event $S \geq \frac{2n}{k}$ as an ‘or’ over the servers:
$$\Pr\left[S \geq \frac{2n}{k}\right] = \Pr\left[R_1 \geq \frac{2n}{k} \;\cup\; R_2 \geq \frac{2n}{k} \;\cup\; \cdots \;\cup\; R_k \geq \frac{2n}{k}\right]$$
Union Bound. For any random events $A_1, \dots, A_k$, we have
$$\Pr[A_1 \cup A_2 \cup \cdots \cup A_k] \leq \sum_{i=1}^{k} \Pr[A_i]$$
With the union bound, it is easy to show
$$\Pr\left[S \geq \frac{2n}{k}\right] \leq \sum_{i=1}^{k} \Pr\left[R_i \geq \frac{2n}{k}\right] \leq k \cdot \frac{k}{n} = \frac{k^2}{n}$$
This means that as long as $k = O(\sqrt{n})$, with good probability, the maximum server load will be small (compared to the expected load).
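To sanity check both bounds, here is a small simulation sketch of my own (the values of $n$, $k$, and the number of trials are arbitrary choices): it estimates the probability that a fixed server, and the most loaded server, exceed twice the expected load, and prints the Chebyshev bound and the union bound next to them. Both bounds hold but are quite loose, which is exactly what the discussion above predicts.

```python
import random

# n requests, k servers, capacity = twice the expected load n/k.
# The values of n, k, and the number of trials are illustrative choices.
n, k = 5_000, 50
trials = 500
threshold = 2 * n // k

overloaded_one = 0     # how often a fixed server (server 0) exceeds capacity
overloaded_max = 0     # how often the most loaded server exceeds capacity
for _ in range(trials):
    loads = [0] * k
    for _ in range(n):
        loads[random.randrange(k)] += 1
    overloaded_one += loads[0] >= threshold
    overloaded_max += max(loads) >= threshold

print(f"Pr[one server overloaded] ~ {overloaded_one / trials:.3f}   (Chebyshev bound k/n = {k / n:.3f})")
print(f"Pr[max load overloaded]   ~ {overloaded_max / trials:.3f}   (union bound k^2/n   = {k**2 / n:.3f})")
```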
Example 6: flipping coins
You might think this example will be super easy, as we probably learned it in elementary school. However, suppose we flip $n$ independent coins, each coming up heads with probability $\frac{1}{2}$ and tails with probability $\frac{1}{2}$. Let $H$ be the number of heads; then we could have:
$$\mathbb{E}[H] = \frac{n}{2}, \qquad \mathrm{Var}[H] = \frac{n}{4} \tag{10}$$
When you look at equation (10), it seems very straightforward. However, to get numbers like $\frac{n}{2}$, we need a longer derivation. Let’s use the definition of expectation. I have to admit that I found it very unnatural to write down the formula of expectation for our example; I had an almost irresistible impulse to review the definition of expectation again.
Expected value
Informally, the expected value is the weighted arithmetic mean of a large number of independently selected outcomes of a random variable. This means we need: a random variable and a countable set of possible outcomes. For example, the weight of a human being roughly follows a normal distribution: the random variable is the weight of a person, and the set of possible outcomes is the set of all possible weight values (some people are fit, some are slim, some are overweight). At last, we need to calculate the weighted arithmetic mean.
For the example of flipping coins, the random variable is $H$ (the number of heads), and the countable set of possible outcomes is the set of all possible values $\{0, 1, \dots, n\}$. We know $H$ follows the binomial distribution, so we can write down the formula for the expected value:
$$\mathbb{E}[H] = \sum_{i=0}^{n} i \cdot \Pr[H = i] = \sum_{i=0}^{n} i \binom{n}{i} p^i (1-p)^{n-i} \tag{11}$$
with $p = \frac{1}{2}$ for a fair coin.
I hope you feel relieved when you compare equations (11) and (10); it is natural for anyone to find the definition-based formula unnatural at first. Now we will derive the expected value. There are different ways to do it (see all proofs). Using the binomial theorem, it can be shown that
$$\sum_{i=0}^{n} i \binom{n}{i} p^i (1-p)^{n-i} = np$$
Instead, we will leverage the linearity of expectation again. We can treat $H$ as the sum of discrete random variables that each follow a Bernoulli trial; this means
$$H = \sum_{j=1}^{n} X_j, \qquad X_j \sim \mathrm{Bernoulli}\left(\tfrac{1}{2}\right) \tag{12}$$
Remark: Be careful about the index. In equation (11), $i$ is not a random variable; it is just a mathematical symbol for us to count the number of heads (which could be $0, 1, \dots, n$). However, in equation (12), $X_j$ is a random variable which follows a Bernoulli trial. We run $n$ independent trials and we are interested in the sum of those discrete random variables.
Therefore,
$$\mathbb{E}[H] = \mathbb{E}\left[\sum_{j=1}^{n} X_j\right] = \sum_{j=1}^{n} \mathbb{E}[X_j] = n \cdot \frac{1}{2} = \frac{n}{2} \tag{13}$$
Bernoulli random variables and indicator variables are two aspects of the same concept. As a review, a random variable $I_A$ is called an indicator variable for an event $A$ if $I_A = 1$ when $A$ occurs and $I_A = 0$ if $A$ does not occur. $\Pr[I_A = 1] = \Pr[A]$ and $\mathbb{E}[I_A] = \Pr[A]$. Indicator random variables are Bernoulli random variables, with $p = \Pr[A]$. For instance, using indicator variables, the expectation can be calculated as follows:
$$\mathbb{E}[H] = \mathbb{E}\left[\sum_{j=1}^{n} \mathbb{1}[\text{flip } j \text{ is heads}]\right] = \sum_{j=1}^{n} \Pr[\text{flip } j \text{ is heads}] = \frac{n}{2}$$
Using the same trick as in equation (13), we can derive the variance of $H$ (here we also use the independence of the trials, since the variance of a sum equals the sum of the variances only for independent, or at least uncorrelated, variables):
$$\mathrm{Var}[H] = \sum_{j=1}^{n} \mathrm{Var}[X_j] = n \cdot \frac{1}{2}\left(1 - \frac{1}{2}\right) = \frac{n}{4}$$
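As a small numerical check of equations (12), (13), and the variance above, here is a sketch of my own that simulates $H$ as a sum of indicator variables; $n$ and the number of trials are arbitrary choices.

```python
import random

# Simulate H as a sum of n Bernoulli(1/2) indicator variables and check
# E[H] = n/2 and Var[H] = n/4 empirically.
n = 200
trials = 10_000

samples = [sum(random.random() < 0.5 for _ in range(n)) for _ in range(trials)]

mean = sum(samples) / trials
var = sum((h - mean) ** 2 for h in samples) / trials
print(f"empirical E[H]   = {mean:6.2f}   (theory: n/2 = {n / 2})")
print(f"empirical Var[H] = {var:6.2f}   (theory: n/4 = {n / 4})")
```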
Now, with the results in equation (11), we can test how tight the bounds from Markov’s inequality and Chebyshev’s inequality are:

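To reproduce the spirit of the comparison in Figure 2, here is a sketch of my own; the threshold $\Pr[H \geq \frac{3n}{4}]$ and the value of $n$ are arbitrary illustrative choices rather than the ones behind the figure. It puts both bounds next to the exact binomial tail.

```python
from math import comb

# Compare Markov and Chebyshev against the exact binomial tail for
# Pr[H >= 3n/4]; n and the threshold are illustrative choices.
n = 100
threshold = 3 * n // 4          # 75 heads out of 100 flips

exact = sum(comb(n, i) for i in range(threshold, n + 1)) / 2 ** n
markov = (n / 2) / threshold                      # E[H] / t
chebyshev = (n / 4) / (threshold - n / 2) ** 2    # Var[H] / (t - E[H])^2

print(f"exact tail Pr[H >= {threshold}] = {exact:.3e}")
print(f"Markov bound                = {markov:.3f}")
print(f"Chebyshev bound             = {chebyshev:.3f}")
```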
Example 7: Hoeffding’s inequality
Suppose that $X$ has a finite mean and that $X \geq 0$. Then, for any $t > 0$, we have Markov’s inequality:
$$\Pr[X \geq t] \leq \frac{\mathbb{E}[X]}{t} \tag{15}$$
Equation (15) is very powerful, as applying it to monotonic transformations of $X$ preserves the bound, which gives us other inequalities, such as the Chernoff bound and the Bernstein inequality.
Figure 2 shows that although Chebyshev’s inequality is closer to the true probability value, it does not decay very fast. Let’s check the formula.
We know that if $X_1, \dots, X_n$ are i.i.d. with mean $\mu$ and variance $\sigma^2$, then we have
$$\Pr\left[\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| \geq t\right] \leq \frac{\sigma^2}{n t^2}$$
While this inequality is useful, it does not decay exponentially fast as $n$ increases. To improve the inequality, we use Chernoff’s method: for any $\lambda > 0$,
$$\Pr[X \geq t] = \Pr\left[e^{\lambda X} \geq e^{\lambda t}\right] \leq \frac{\mathbb{E}\left[e^{\lambda X}\right]}{e^{\lambda t}}$$
This gives a bound in terms of the moment generating function $\mathbb{E}\left[e^{\lambda X}\right]$.
The moment-generating function is so named because it can be used to find the moments of the distribution. The series expansion of $e^{tX}$ is
$$e^{tX} = 1 + tX + \frac{t^2 X^2}{2!} + \frac{t^3 X^3}{3!} + \cdots$$
Hence
$$M_X(t) = \mathbb{E}\left[e^{tX}\right] = 1 + t m_1 + \frac{t^2 m_2}{2!} + \frac{t^3 m_3}{3!} + \cdots$$
where $m_i$ is the $i$-th moment. Differentiating $M_X(t)$ $i$ times with respect to $t$ and setting $t = 0$, we obtain the $i$-th moment about the origin, $m_i$.
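As a quick illustration of this moment-generating trick, here is a small sympy sketch of my own that recovers the moments of a Bernoulli($\frac{1}{2}$) variable by differentiating its MGF at $t = 0$; the choice of distribution is an arbitrary example.

```python
import sympy as sp

# Recover moments by differentiating the MGF at t = 0, using a Bernoulli(1/2)
# random variable as an illustrative example.
t = sp.symbols('t')
p = sp.Rational(1, 2)

M = (1 - p) + p * sp.exp(t)      # MGF of Bernoulli(p): E[e^{tX}] = (1-p) + p*e^t

for i in range(1, 4):
    m_i = sp.diff(M, t, i).subs(t, 0)
    print(f"m_{i} = {m_i}")      # every moment of Bernoulli(1/2) equals 1/2
```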
To learn more about concentration of probability, please read these notes.
Before we present Hoeffding’s inequality, we will introduce a new concept: the symmetric random variable.
A random variable $X$ is symmetric if $X$ and $-X$ have the same distribution. A simple example of a symmetric random variable is the well-known symmetric Bernoulli, which takes the values $-1$ and $1$ with equal probability each, i.e.,
$$\Pr[X = -1] = \Pr[X = 1] = \frac{1}{2}$$
For a usual Bernoulli random variable $X$ with the two possible values $\{0, 1\}$ and parameter $p = \frac{1}{2}$, we can transform it into a symmetric Bernoulli random variable by setting
$$X_{\mathrm{sym}} = 2X - 1$$
Hoeffding’s Inequality. Let $X_1, \dots, X_n$ be independent symmetric Bernoulli random variables, and let $S = \sum_{i=1}^{n} X_i$. Then, for any $t \geq 0$, we have
$$\Pr[S \geq t] \leq \exp\left(-\frac{t^2}{2n}\right) \tag{17}$$
Remark: sometimes you will see the above formula as a two-sided inequality with an absolute value (and an extra factor of 2 on the right-hand side). For the proof, you can find it here or in the book by Vershynin.
Now, to use this inequality, we have to be careful: our random variables are the individual coin flips $X_j$, instead of $H$ in equation (10) and Figure 2.
Let $X_j$ be the random variable mapping the outcome (tail or head) of the $j$-th coin flip to $\{0, 1\}$; we are interested in $H$, the number of heads in $n$ trials. To use equation (17), we need to transform each $X_j$ into a symmetric one by setting
$$Y_j = 2X_j - 1, \qquad \text{so that} \qquad \sum_{j=1}^{n} Y_j = 2H - n$$
Then we can have
$$\Pr\left[H \geq \frac{n}{2} + t\right] = \Pr\left[\sum_{j=1}^{n} Y_j \geq 2t\right] \leq \exp\left(-\frac{(2t)^2}{2n}\right) = \exp\left(-\frac{2t^2}{n}\right)$$

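A small numerical check of this Hoeffding bound against the exact binomial tail (a sketch of my own; $n$ and the deviation $t$ are arbitrary illustrative values):

```python
from math import comb, exp

# Check the Hoeffding bound Pr[H >= n/2 + t] <= exp(-2 t^2 / n) against the
# exact binomial tail; n and t are illustrative choices.
n = 100
t = 15

exact = sum(comb(n, i) for i in range(n // 2 + t, n + 1)) / 2 ** n
hoeffding = exp(-2 * t ** 2 / n)

print(f"exact     Pr[H >= {n // 2 + t}] = {exact:.3e}")
print(f"Hoeffding bound            = {hoeffding:.3e}")
```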
Example 8: Bernstein inequality
Bernstein Inequality. Consider symmetric, independent random variables $X_1, \dots, X_n$, all falling in $[-1, 1]$ (if not, we can standardize them), with $\mathbb{E}[X_i] = 0$. Let $\sigma^2 = \sum_{i=1}^{n} \mathrm{Var}[X_i]$. Then, for any $t \geq 0$,
$$\Pr\left[\left|\sum_{i=1}^{n} X_i\right| \geq t\right] \leq 2\exp\left(-\frac{t^2}{2\sigma^2 + \frac{2}{3}t}\right) \tag{18}$$
Now, applying equation (18) to our coin-flipping example, with $\sigma^2 = \sum_{j=1}^{n} \mathrm{Var}[Y_j] = n$ for the symmetric Bernoulli variables $Y_j$, we could have
$$\Pr\left[H \geq \frac{n}{2} + t\right] = \Pr\left[\sum_{j=1}^{n} Y_j \geq 2t\right] \leq \exp\left(-\frac{(2t)^2}{2n + \frac{2}{3}\cdot 2t}\right) = \exp\left(-\frac{4t^2}{2n + \frac{4}{3}t}\right)$$
Remark: we drop the factor of 2 because we are dealing with a one-sided inequality.

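To see how the two bounds compare on the coin example, here is a sketch of my own that evaluates both expressions for a few deviations $t$ (the values of $n$ and $t$ are arbitrary); it also previews the point of the post linked below.

```python
from math import exp

# Evaluate the Hoeffding and Bernstein bounds on Pr[H >= n/2 + t] for a few
# deviations t; n and the t values are illustrative choices.
n = 100

def hoeffding(t):
    return exp(-2 * t ** 2 / n)

def bernstein(t):
    # sigma^2 = n for the +/-1 variables Y_j, and the deviation of sum(Y_j) is 2t.
    return exp(-(2 * t) ** 2 / (2 * n + (2 / 3) * (2 * t)))

for t in (5, 10, 20, 40):
    print(f"t = {t:>2}:  Hoeffding = {hoeffding(t):.3e}   Bernstein = {bernstein(t):.3e}")

# For a fair coin the variance is as large as possible, so Hoeffding is tighter
# here; Bernstein pays off when the variance is much smaller than the range.
```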
To learn when Hoeffding is stronger than Bernstein and when Bernstein is stronger than Hoeffding, please read this post.
Summary
These inequalities provide the theoretical basis for many statistical machine learning methods. To learn more about the related theory, please visit these sites:
- Musco, C. (2022). Randomized Methods, Sketching & Streaming. https://people.cs.umass.edu/~cmusco/CS514F22/schedule.html