Develop Some Fluency in Probabilistic Thinking (Part I)

The foundation of machine learning and data science is probability theory. In this post, we will develop some fluency in probabilistic thinking through different examples, which prepares data scientists well for the sexiest job of the 21st century.

When I was studying probability theory, it was difficult for me to see its applications in real life, until we entered a new era with all the potential of machine learning and deep learning models. This post explains how those basic theories (especially the ones related to inequalities) can help us solve real-world problems. Personally, I benefited a lot from these examples because they showed me what probabilistic thinking is and how powerful it is. Much of the content of this post is based on Cameron Musco's course - Algorithms for Data Science (Musco, 2022). If you want to review your knowledge of probability theory, you can check my post - Probability Review. However, I suggest reading it only when this post refers to it, as it is more efficient and much easier to understand with the context.

This post is example-driven, meaning we will use different examples to develop our fluency in probabilistic thinking.

Example 1: verify a claim

You have contracted with a new company to provide CAPTCHAS for your website. They claim that they have a database of 1,000,000 unique CAPTCHAS. A random one is chosen for each security check. You want to independently verify this claimed database size. You could make test checks until you see all unique CAPTCHAS, which would take more than 1 million checks.

Instead, we can draw a much smaller random sample of m CAPTCHAS (independently, with replacement) from the claimed database of n CAPTCHAS and count duplicates. For each pair of draws i, j, let D_{i,j} be the indicator random variable that equals 1 if the two draws show the same CAPTCHA and 0 otherwise. Then the number of pairwise duplicates is

D = \sum_{i, j \in [m], i < j} D_{i, j} \tag{1}

By linearity of expectation, we have

\mathbb{E} [D] = \sum_{i, j \in [m], i < j} \mathbb{E} [D_{i, j}] \tag{2}

For any pair i, j \in [m], i < j, we have:

\mathbb{E} [D_{i, j}] = \mathbb{P}(D_{i, j} = 1) = \frac{1}{n} \tag{3}

because i and j were drawn independently with replacement.

This gives us

\mathbb{E} [D] = \sum_{i, j \in [m], i < j} \mathbb{E} [D_{i, j}] = \binom{m}{2} \frac{1}{n} = \frac{m(m-1)}{2n} \tag{4}

For a set of m elements, if we pick any two distinct elements i and j, exactly one of i < j or j < i holds. Therefore, the sum \sum_{i, j \in [m], i < j} runs over exactly \binom{m}{2} pairs.

Connection to the birthday paradox. If there are 130 people in this room, each of whose birthdays we assume to be a uniformly random day among the 365 days of the year, how many pairwise duplicate birthdays do we expect?

According to formula (4), it should be

\mathbb{E} [D] = \binom{m}{2} \frac{1}{n} = \frac{m!}{2!(m-2)!} \cdot \frac{1}{n} = \frac{130 \times (130 - 1)}{2} \cdot \frac{1}{365} \approx 22.97
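
To make formula (4) concrete, here is a minimal Python sketch (the helper name expected_duplicates is mine, not from the course) that evaluates the expectation for both the birthday case and the CAPTCHA case:

```python
from math import comb

def expected_duplicates(m, n):
    """Expected number of pairwise duplicates when drawing m items uniformly
    at random, with replacement, from a pool of n distinct items (formula (4))."""
    return comb(m, 2) / n

print(expected_duplicates(130, 365))         # birthday case: ~22.97 pairs
print(expected_duplicates(1000, 1_000_000))  # CAPTCHA case: ~0.4995 pairs
```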

A demo figure
Figure 1. Plot of the number of pairs sharing a birthday as a function of the population size; notice that the trend grows 'exponentially'.

Now, we will plot a similar graph for our CAPTCHA checking problem. As shown in Figure 2, we do not expect to see many duplicates when we draw a random sample of size 2000.

A demo figure
Figure 2. Plot of the expected number of duplicates for different sample sizes with n = 1,000,000.

With m = 1000, if you see 10 pairwise duplicates, you can suspect that something is up. But how confident can you be in your test?

Markov’s Inequality

If X is a nonnegative random variable and a > 0, then the probability that X is at least a is at most the expectation of X divided by a:

\mathbb{P}(X \geq a) \leq \frac{\mathbb{E} [X]}{a} \tag{5}

It is easy to prove this. For a discrete random variable, we have

\begin{aligned} \mathbb{E} [X] = \sum_{x} \mathbb{P}(X = x) \cdot x & \geq \sum_{x \geq a} \mathbb{P}(X = x) \cdot x \\ & \geq \sum_{x \geq a} \mathbb{P}(X = x) \cdot a \\ & = a \cdot \mathbb{P}(X \geq a) \end{aligned}

A demo figure
Figure 3. Plot of Markov's inequality with n = 1,000,000 and a fixed sample size m = 1000.

With formula (5), we can measure how confident we are when deciding whether the service provider really has a large CAPTCHA database. As shown in Figure 3, the probability of seeing 10 or more duplicate pairs with m = 1000 is at most 0.04995, which is around 5%. Therefore, we can state that we are '95%' right when we reject the service provider's claim that their database is of size n = 1,000,000.
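
A quick sketch of the calculation behind Figure 3, in plain Python:

```python
# Markov bound on seeing at least 10 duplicate pairs, assuming the database
# really has n = 1,000,000 CAPTCHAS and we draw m = 1,000 of them.
m, n = 1_000, 1_000_000
expected_duplicates = m * (m - 1) / (2 * n)  # formula (4): 0.4995
markov_bound = expected_duplicates / 10      # formula (5) with a = 10
print(markov_bound)                          # 0.04995, i.e. about 5%
```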

Example 2: approximate counting

We are using this example to explain the difference between deterministic models and stochastic models, which is relevant for example 4 below. This section is also connected to deterministic and stochastic optimization problems.

Approximate counting algorithms are techniques that allow us to count a large number of events using a very small amount of memory. The technique was invented by Robert Morris in 1977 and published in his paper Counting Large Numbers of Events in Small Registers. The algorithm uses probabilistic techniques to increment the counter; although it does not guarantee exactness, it does provide a fairly good estimate of the true value while inducing a minimal and fairly constant relative error.

Here is the problem: we want to count a very large number of events, but the register available for storing the count is too small to hold the count n directly.

The idea is very simple: instead of counting one by one, we could, for instance, add 1 to the counter only after every 10 events, or store \log_2(1+n) instead of n. The latter means we have a function that maps n to X_n, like the following one:

X_n = \log_2(1 + n)

The tradeoff is that we lose some accuracy in exchange for being able to count a large number:

Therefore, the accuracy of the above algorithm depends on how we round our errors. We could design a rounding rule as follows:

This kind of algorithm is called deterministic, as the rule is determined before running and the error is also introduced deterministically. However, we could also introduce the error stochastically, as a random variable following some distribution.

Now, again, the value X_n stored in the counter is

X_n = \log_2(1 + n)

where n is the number of events that have occurred so far. The plus one is to make sure that n = 0 \equiv X_n = 0. At any given time, our best estimate of n is

\hat{\theta}_n = 2^{X_n} - 1

\hat{\theta}_n is the estimate. When the next event arrives, the new count should be

\hat{\theta}_n + 1 = 2^{X_n}

Suppose that, in the midst of counting, we have the value X_n stored in the register. Whenever we get another event, we attempt to modify the contents of the register in the most appropriate way. The question is: what is the most appropriate way to do so?

To increment the counter, we compute

X_{n+1} = \log_2(1 + 2^{X_n})

However, the values we calculate with this map are generally not integers. If we round them, we might accumulate errors in our approximation. Instead, what we can do is the following (by relying on probability).

First, initialize X_0 = 0. Then, for each new event, compute

p = \frac{1}{2^{X_n}}, \quad r \sim \text{Uniform}(0, 1)

Then the value of X_{n+1} is

X_{n+1} = \begin{cases} X_n + 1 & \text{if } p > r \\ X_n & \text{otherwise} \end{cases}
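
The update rule above translates directly into a few lines of Python; this is a minimal sketch (the helper name morris_count is mine, not from Morris' paper or the course):

```python
import random

def morris_count(n_events, seed=None):
    """One run of Morris' approximate counter over n_events events.
    Returns the estimate 2**X - 1 of the number of events counted."""
    rng = random.Random(seed)
    x = 0                          # X_0 = 0
    for _ in range(n_events):
        p = 1 / 2 ** x             # p = 1 / 2**X_n
        if rng.random() < p:       # equivalent to p > r with r ~ Uniform(0, 1)
            x += 1                 # increment the register
    return 2 ** x - 1              # estimator 2**X_n - 1

print(morris_count(1000))          # varies run to run: e.g. 255, 511, 1023
```

Note that the only values this estimator can return are of the form 2^k - 1, which is exactly why a single run can deviate a lot from the true count.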

Before we analyze Morris' algorithm, let's compare the counts produced by the two methods, the deterministic one and the stochastic one.

A demo figure
Figure 4. Plot of the two kinds of counting algorithms: notice that the deterministic one always gives the same result, whereas the stochastic one follows a different path on each run (compare the left and right panels).

As shown in Figure 4, every time we run Morris' approximate counting, the result is different, as it depends on the values of r. However, we can show that the expectation of the approximation is n:

\begin{aligned} \hat{\theta}_n & = 2^{X_n} - 1 \\ \mathbb{E} [\hat{\theta}_n] & = n \end{aligned}

For the base case, when n = 0, we have \mathbb{E} [2^{X_0} - 1] = 0.

Now, we will prove the above expectation by induction. Given that the base case is true, assume that

\mathbb{E} [\hat{\theta}_n] = n

Then

\begin{aligned} \mathbb{E}\left[\hat{\theta}_{n+1}\right] & \triangleq \mathbb{E}\left[2^{X_{n+1}}-1\right] \\ & =\sum_{j=1}^{\infty} \mathbb{P}\left(X_n=j\right) \mathbb{E}\left[2^{X_{n+1}} \mid X_n=j\right]-1 \\ & =\sum_{j=1}^{\infty} \mathbb{P}\left(X_n=j\right)\Bigl[\underbrace{\frac{1}{2^j} \cdot 2^{j+1}}_{\text{increment}}+\underbrace{\left(1-\frac{1}{2^j}\right) \cdot 2^j}_{\text{no change}}\Bigr]-1 \\ & =\sum_{j=1}^{\infty} \mathbb{P}\left(X_n=j\right)\left(2^j+1\right)-1 \\ & =\sum_{j=1}^{\infty} \mathbb{P}\left(X_n=j\right) 2^j+\sum_{j=1}^{\infty} \mathbb{P}\left(X_n=j\right)-1 \\ & =\mathbb{E}\left[2^{X_n}\right] \\ & =n+1 . \end{aligned}

Although a single run of Morris' algorithm can deviate a lot from the true value, its expectation is still equal to the true value. The deterministic one has a constant deviation, but its error accumulates for sure, and we might end up over-counting or under-counting when n grows very large.
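
To convince yourself of this unbiasedness numerically, here is a quick Monte Carlo check that reuses the morris_count sketch from the previous block:

```python
import statistics

# Empirical check that E[2**X_n - 1] = n: average many independent runs.
n_events, n_runs = 1000, 5000
estimates = [morris_count(n_events) for _ in range(n_runs)]
print(statistics.mean(estimates))   # close to 1000, up to Monte Carlo noise
```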

Example 3: unit testing

Unit tests are typically automated tests written and run by software developers to ensure that a section of an application (known as the “unit”) meets its design and behaves as intended. In procedural programming, a unit could be an entire module, but it is more commonly an individual function or procedure.

Let's say that you have a dataset with millions of rows and hundreds of variables as columns. Now, you need to write a function that goes through each row and does some transformation column-wise. For example, suppose you need to write a function to separate year and month from publn_date in Table 1. As you can see, the format differs between rows. Since the dataset is big, there is no way to go through every cell, check the format, and then write the separation function. What should we do? We rely on probabilistic reasoning and sampling again.

Table 1. A dataset example
pat_publn_id publn_kind appln_id publn_date publn_first_grant
277148936 A1 156990 2008-05-14 N
437725522 B1 156990 2015-04-08 Y
437725789 A2 156990 2018,04,08 Y
437723467 B1 156990 19, July 2020 Y

Let's reframe this problem as a probabilistic one:

Scenario 1

For scenario 1, we could model it as a hypergeometric distribution problem. Before we solve our problem, let's review a classical example first. If we have an urn filled with w white and b black balls, then drawing m balls out of the urn with replacement yields a Binomial distribution Bin(m, w/(w+b)) for the number of white balls obtained in m trials, since the draws are independent Bernoulli trials, each with probability w/(w+b) of success (Blitzstein & Hwang, 2015).

Remark: when I tried to model the unit testing problem, I was confused about whether I should treat each unit test as a Bernoulli trial or not. The way to think about it is to ask:

For our unit testing, it is true that each unit test is a trial with a success rate and a failure rate. However, those rates are not known to us, and they are not fixed either. Therefore, we cannot model it as a Bernoulli trial.

Now, back to our white and black ball problem: if we instead sample without replacement, the number of white balls no longer follows a Binomial distribution. Therefore, we need to do probabilistic reasoning. The population size is N = w + b. Let's count the number of possible ways to draw exactly k white balls (assuming k < w) and m - k black balls from the urn (without distinguishing between different orderings for getting the same set of balls):

\binom{w}{k} \binom{b}{m-k}

Think about it this way: you have 3 ways of cooking food A and 4 ways of cooking food B, so by the multiplication rule there are 3 \times 4 = 12 ways of cooking the meal AB.

For the total number of ways of drawing m balls without differentiating colors, we have \binom{N}{m}. Since all samples are equally likely, the naive definition of probability gives us

\mathbb{P}(X = k) = \frac{\binom{w}{k} \binom{b}{m-k}}{\binom{N}{m}}

For our unit testing problem, the population size is N, the number of anomalies is A (treat them as black balls), and the number of normal cells is N - A. If we draw m out of N, we want to know the probability of having 1, 2, \cdots, k anomalies (with k < m and k < A):

\mathbb{P}(X = k) = \frac{\binom{A}{k} \binom{N-A}{m-k}}{\binom{N}{m}} \tag{6}

Based on equation (6), we will run the simulation with the following parameters:

Since k has to be smaller than m and A, we will set m \in [2, 99) and

k = \begin{cases} \lfloor 0.5 \times m \rfloor & \text{if } 0.5 \times m < A \\ A & \text{otherwise} \end{cases}

This means we are simulating the probability of having half of m as anomalies whenever 0.5 \times m < A, and otherwise the probability of having all A anomalies in the sample.
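
Here is a small sketch of that simulation with scipy; the parameters N = 1000 and A = 10 are my reading of Figure 5, so treat them as assumptions:

```python
from scipy.stats import hypergeom

N, A = 1000, 10                      # population size and number of anomalies
probs = {}
for m in range(2, 99):
    k = min(m // 2, A)               # half the sample, capped at A
    probs[m] = hypergeom.pmf(k, N, A, m)   # P(X = k) from equation (6)

print(probs[98])                     # ~5.3e-11, the value quoted in Figure 5
```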

A demo figure
Figure 5. Simulation for case 1, which shows that the probability of having either half of the sample as anomalies or all anomalies (k = 10) drops dramatically, down to 5.32e-11.

Figure 5 shows that the probability of having all anomalies in our sample is very small (5.32e-11), which means no unit test could cover all anomalies at once unless you test the whole population. Figure 6 tells the same story in terms of trend, but the probability is higher as the share of anomalies is higher too.

hypergeom simulation case2
Figure 6. Simulation for case 2, which shows the same trend, but the probability is higher with a larger share of anomalies.

Our simulation tells us that no one should try to come up with a unit test that covers all anomalies unless one is willing to test everything. If we test based on a sample, then the optimal size is 3, which gives a probability of having 1 anomaly of around 3% when the share of anomalies is 1%, and around 25% when the share of anomalies is 10%.

Scenario 2

Now, we will fix the sample size and study the probability of having exactly 1 anomaly in our sample when the other parameters of interest, N and A, are changed.

m Probability
5 0.048
5 0.329
5 0.411
25 0.201
25 0.198
25 0.023
hypergeom scenario 2 simulation
Figure 6. The hypergeometric simulation when m is fixed and N and A are updated

Figure 6 shows that the probability of having exactly one anomaly in our test increases first and then drops. The optimal testing size m also shifts to the left when the share of anomalies in the population is higher, meaning we don't have to test as much to see an anomaly in our sample.

Bernoulli trial

Now, we can treat each run of the unit test as a Bernoulli trial with a binary outcome. Let's assume that N = 1000, A = 10, m = 5, which means the probability of having exactly one anomaly in our sample is around 0.048, as shown in Figure 6. Now, the probability of having no anomaly is:

\begin{aligned} \mathbb{P}(k=0) & = 1 - \mathbb{P}(k=1) - \mathbb{P}(k=2) - \cdots - \mathbb{P}(k=5) \\ & \approx 1 - 0.049 = 0.951 \end{aligned}

Binomial trials

Therefore, we can set p = 0.049 and q = 0.951, and if we run our unit test three times or five times, the probability that exactly one of those runs contains an anomaly is

\mathbb{P}(u=3) = \binom{3}{1} p (1-p)^2 \approx 0.132; \quad \mathbb{P}(u=5) = \binom{5}{1} p (1-p)^4 \approx 0.20
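
These two numbers can be checked with a couple of lines of Python, using the per-run probability p = 0.049 from the text:

```python
from math import comb

p = 0.049                      # probability that one run flags exactly 1 anomaly
for u in (3, 5):
    prob = comb(u, 1) * p * (1 - p) ** (u - 1)   # binomial: exactly one success
    print(u, round(prob, 3))   # u=3 -> ~0.133, u=5 -> ~0.2
```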

If you look at the table in Figure 6, running the unit test 5 times with sample size m = 5 is not much different from running it once with sample size m = 25.

relationship of distributions
Figure 7. The relationship of different distributions (as a reflection of example 3); figure was taken from the website of John D. Cook, PhD.

Example 4: hash tables

This section assumes readers know what a hash table is. If not, just check out this video from MIT.

The hash function h: U \to [n] maps the elements of a universe or population to indices 1, \cdots, n of an array. The data structure dict in Python is built on hash tables, as this kind of map makes it easy to extract information.

A demo figure
Figure 8. Illustration of hash table for dict structure of Python.

Why couldn't we just use arrays to build the dictionary: one array for keys and one array for values, with both arrays sharing the same index? If we implement this idea, then we need to figure out a way of extracting the value based on the key, say dict_foo["B"]. If all keys are numbers, then a dictionary is equivalent to an array. However, very often keys are strings. How can we achieve O(1) performance for finding dict_foo["B"]? A hash table can do that by mapping keys to the indices of an array.
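
To illustrate the idea (this is only a sketch of the key-to-index mapping, not how CPython's dict is actually implemented):

```python
def toy_bucket(key, n_slots):
    """Map an arbitrary key to an array index; toy_bucket is a made-up helper."""
    return hash(key) % n_slots        # built-in hash(), reduced modulo n_slots

n_slots = 8
values = [None] * n_slots
values[toy_bucket("B", n_slots)] = 42     # store the value for key "B"
print(values[toy_bucket("B", n_slots)])   # O(1) lookup -> 42
```

Python's built-in hash() is salted per process, so the bucket index can differ between runs, but it is stable within one run, which is all the lookup needs.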

A hashing function can be either deterministic (this is what Python implements) or stochastic. We will focus on the stochastic one.

For a big dataset with |U| >> n, we need to rely on probability to map elements from the universe to the array, as it is economically efficient. For instance, big firms like Google or Facebook are always trying to use fewer servers or databases to serve their customers. In this case the population size is much larger than the number of servers or databases.

A demo figure
Figure 9. Illustration of hash table for IP addresses, figure was taken from Musco's course (2022).

As we have explained, it costs a lot to have an index for each element, as shown in Figure 9. To use fewer indices in our hash table, we need to find a way to avoid collisions.

Remark: the application of hashing in data science and in a data structure like dict is slightly different. For building a data structure like dict, the focus is on achieving O(1) performance when finding the value based on the key, whereas for managing a large number of visits to servers, the focus is more on mapping between m and n dynamically.

Now, we store m items from a large universe in a hash table with n positions. When we insert m items into the hash table, we may have to store multiple items in the same location (typically as a linked list).

Let’s frame our problem as a probabilistic one:

Remark: In example 1, the population size n = 1,000,000 is fixed, whereas the sample size m is updated dynamically every time we draw a sample; in this example, the hash table size n is fixed too, whereas the total number of stored items is updated dynamically. It is important to know when and how to set up the population and the sample space. It might be counterintuitive when you see that we set the hash table as the population.

General principle for setting up population and sample: anything that is fixed should be treated as the population, whereas anything that is updated dynamically should be treated as the sample space that can be mapped with a random variable.

With the right setup, and letting C_{i,j} indicate that items i and j hash to the same position (with C the total number of pairwise collisions), we have the following expectation:

\mathbb{E} [C] = \sum_{i, j \in [m], i < j} \mathbb{E} [C_{i, j}] = \binom{m}{2} \frac{1}{n} = \frac{m(m-1)}{2n} \tag{7}

For n = 4m^2 (the size of the hash table is much larger than the sample size), we have

\mathbb{E} [C] = \frac{m(m-1)}{8m^2} < \frac{m^2}{8m^2} = \frac{1}{8} \tag{8}

This means we could have a collision-free hash table (with good probability) if the hash table is large enough. This completely makes sense. However, we do not gain any economic benefit, as we are using O(m^2) space to store m elements.

Applying Markov's inequality in this case, we have

\begin{aligned} \mathbb{P} (C \geq 1) & \leq \frac{\mathbb{E} [C]}{1} = \frac{1}{8} \\ \mathbb{P}(C = 0) & = 1 - \mathbb{P} (C \geq 1) \geq \frac{7}{8} \end{aligned} \tag{9}
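
Here is a quick Monte Carlo sketch of the bound in equation (9), hashing m items uniformly at random into n = 4m^2 buckets:

```python
import random

m = 100
n = 4 * m ** 2                          # hash table much larger than the sample
trials = 50_000
runs_with_collision = 0
for _ in range(trials):
    buckets = [random.randrange(n) for _ in range(m)]
    if len(set(buckets)) < m:           # at least one pairwise collision
        runs_with_collision += 1
print(runs_with_collision / trials)     # ~0.12, below the Markov bound of 1/8
```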

Two Level Hashing. Now, if we want to preserve O(1) query time with O(m) space, we can do two level hashing, as shown in Figure 10.

A demo figure
Figure 10. Illustration of two level hashing; figure was taken from Musco's course (2022).

In equation (9), we have shown that a random hashing function guarantees that collisions are avoided with probability at least 7/8. Therefore, we can just use a random hashing function at level two. Now, let's set up the scene to do probabilistic thinking:

What is the expected space usage for two level hashing? Writing s_i for the number of items hashed to position i at the first level (so the second-level table for position i uses s_i^2 space), it is quite straightforward to calculate the total space usage:

\mathbb{E} [S] = n + \sum_{i=1}^n \mathbb{E} [s_i^2] \tag{10}

Before we calculate \mathbb{E} [s_i^2], let's go through the two scenarios for each item x_j, namely h(x_j) = i or h(x_j) \neq i (as always, we assumed m >> n). We encode this with the indicator:

\mathbb{I}_{h(x_j) = i} := \begin{cases} 1 & h(x_j) = i \\ 0 & h(x_j) \neq i \end{cases}

Therefore, we could have

\mathbb{E}[s_i^2] = \mathbb{E} \left[ \left( \sum_{j=1}^m \mathbb{I}_{h(x_j)=i} \right)^2 \right] \tag{11}

Equation (11) says that s_i is the sum of the indicators over all elements x_j, j \in [m], that are assigned to position i. Now, let's expand equation (11):

\begin{aligned} \mathbb{E}[s_i^2] & = \mathbb{E} \left[ \left( \sum_{j=1}^m \mathbb{I}_{h(x_j)=i} \right)^2 \right] \\ & = \mathbb{E} \left[ \left( \sum_{j=1}^m \mathbb{I}_{h(x_j)=i} \right) \left( \sum_{k=1}^m \mathbb{I}_{h(x_k)=i} \right) \right] \\ & = \mathbb{E} \left[ \sum_{j, k \in [m]} \mathbb{I}_{h(x_j)=i} \cdot \mathbb{I}_{h(x_k)=i} \right] \\ & = \sum_{j, k \in [m]} \mathbb{E} \Bigl[ \mathbb{I}_{h(x_j)=i} \cdot \mathbb{I}_{h(x_k)=i} \Bigr] \tag{12} \end{aligned}

The middle steps of equation (12) simply expand the square into a double sum over pairs (j, k), and the last step follows from the linearity of expectation.

With equation (12), we need to calculate the expectation

\sum_{j, k \in [m]} \mathbb{E} \Bigl[ \mathbb{I}_{h(x_j)=i} \cdot \mathbb{I}_{h(x_k)=i} \Bigr]

We will do this by calculating two cases:

For j = k, the derivation is simple:

\mathbb{E} \Bigl[ \mathbb{I}_{h(x_j)=i} \cdot \mathbb{I}_{h(x_k)=i} \Bigr] = \mathbb{E} \left[ \left( \mathbb{I}_{h(x_j)=i} \right)^2 \right] = \mathbb{P}[h(x_j) = i] = \frac{1}{n}

For j \neq k, we have

\mathbb{E} \Bigl[ \mathbb{I}_{h(x_j)=i} \cdot \mathbb{I}_{h(x_k)=i} \Bigr] = \mathbb{P}[(h(x_j) = i) \cap (h(x_k) = i)] = \frac{1}{n^2}

Therefore, according to the linearity of expectation, we have

\begin{aligned} \mathbb{E}[s_i^2] & = \sum_{j, k \in [m]} \mathbb{E} \Bigl[ \mathbb{I}_{h(x_j)=i} \cdot \mathbb{I}_{h(x_k)=i} \Bigr] \\ & = m \cdot \frac{1}{n} + 2 \cdot \binom{m}{2} \frac{1}{n^2} \\ & = \frac{m}{n} + \frac{m(m-1)}{n^2} \tag{13} \end{aligned}

When we set n = m in equation (13), we get

\mathbb{E}[s_i^2] = \frac{m}{n} + \frac{m(m-1)}{n^2} < 2 \tag{14}

Now, with this bound, we can substitute equation (14) into equation (10) and obtain our final result:

\mathbb{E} [S] = n + \sum_{i=1}^n \mathbb{E} [s_i^2] \leq n + \sum_{i=1}^n 2 = 3n = 3m \tag{15}
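
A small simulation (assuming a uniformly random first-level hash, which is what the analysis above assumes) reproduces this bound:

```python
import random

# Monte Carlo sketch of equation (15): total space of two level hashing
# with n = m first-level buckets.
m = n = 1000
trials = 200
total_space = 0
for _ in range(trials):
    sizes = [0] * n
    for _ in range(m):
        sizes[random.randrange(n)] += 1           # first-level bucket sizes s_i
    total_space += n + sum(s * s for s in sizes)  # space = n + sum_i s_i**2
print(total_space / trials)                       # roughly 3n = 3000
```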

This means we achieved near-optimal space with O(1) query time!

  1. Blitzstein, J. K., & Hwang, J. (2015). Introduction to probability. CRC Press/Taylor & Francis Group Boca Raton, FL, USA.
  2. Musco, C. (2022). Randomized Methods, Sketching & Streaming. https://people.cs.umass.edu/~cmusco/CS514F22/schedule.html