The Johnson-Lindenstrauss Lemma
In the era of AI, the Johnson-Lindenstrauss lemma provides the mathematical foundation for random-projection-based dimensionality reduction, a tool used throughout machine learning and deep learning, including in the large-scale systems behind applications such as ChatGPT.
Very high-dimensional vectors are ubiquitous in science, engineering, and machine learning. Representing data as vectors or matrices lets us analyze it with the standard tools of linear algebra and probability.
For genetic data sets, images, time series (e.g. audio), and texts, the number of features is typically large, so such data quickly becomes high-dimensional.
What do we want to do with such high-dimensional vectors? Cluster them, use them in regression analysis, feed them into machine learning algorithms. Even more basically, all of these tasks require being able to determine whether one vector is similar to another. However, working directly in high dimensions is hard: the complexity and computing cost of many problems on large, high-dimensional data sets can grow exponentially with the dimension (the "curse of dimensionality").
Therefore, we want to map high-dimensional vectors into a lower-dimensional space and ask: how much of the structure of the high-dimensional data is preserved if we map it into lower dimensions?
The Johnson-Lindenstrauss lemma gives the answer. This blog post is based on course notes from Anupam Gupta, Daniel Raban, and MIT, and on the book by Blum et al. (Blum et al., 2020).
- Markov’s inequality and the Chernoff bound
- The Johnson-Lindenstrauss lemma
- Comments about the Johnson-Lindenstrauss lemma
Markov’s inequality and the Chernoff bound
Before we give the Johnson Lindenstrauss lemma and prove it, let’s review some key inequalities in probability.
(Markov's inequality). Let $X$ be a nonnegative random variable. Then for any $t > 0$,
$$\Pr[X \ge t] \le \frac{\mathbb{E}[X]}{t}.$$
Based on Markov's inequality, for any $\lambda > 0$ we have
$$\Pr[X \ge t] = \Pr\big[e^{\lambda X} \ge e^{\lambda t}\big] \le \frac{\mathbb{E}\big[e^{\lambda X}\big]}{e^{\lambda t}}.$$
This is the famous Chernoff bound.
Now, let's see an example of this bound in action. Let $X \sim N(0,1)$ be a standard normal random variable. Then $e^{\lambda X}$ follows the log-normal distribution, and we can compute
$$\mathbb{E}\big[e^{\lambda X}\big] = e^{\lambda^2/2}.$$
Plugging the above equation into the Chernoff bound gives
$$\Pr[X \ge t] \le e^{\lambda^2/2 - \lambda t}.$$
The exponent
$$\frac{\lambda^2}{2} - \lambda t$$
attains its minimum at $\lambda = t$, which means we can conclude
$$\Pr[X \ge t] \le e^{-t^2/2}.$$
We get the same bound if we apply the argument to $-X$, so adding the two cases together gives
$$\Pr\big[|X| \ge t\big] \le 2e^{-t^2/2}.$$
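As a quick numerical sanity check (not part of the original derivation), the exact two-sided Gaussian tail $\Pr[|X| \ge t] = \operatorname{erfc}(t/\sqrt{2})$ can be compared against the bound $2e^{-t^2/2}$; the following sketch uses only the Python standard library:

```python
import math

# Compare the exact two-sided tail of a standard normal,
# P(|X| >= t) = erfc(t / sqrt(2)), against the bound 2 * exp(-t^2 / 2).
for t in [0.5, 1.0, 2.0, 3.0, 4.0]:
    exact = math.erfc(t / math.sqrt(2.0))
    bound = 2.0 * math.exp(-t * t / 2.0)
    print(f"t = {t:.1f}  exact tail = {exact:.3e}  bound = {bound:.3e}")
```

The bound is loose for small $t$, but both quantities decay at the same exponential rate as $t$ grows.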
Now, we construct a random variable $S_n$ that is the average of $n$ i.i.d. random variables $X_1, \dots, X_n \sim N(0,1)$:
$$S_n = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
Then $S_n$ follows the normal distribution with mean $0$ and variance $1/n$, so $\sqrt{n}\,S_n \sim N(0,1)$. Applying the Chernoff bound above to $\sqrt{n}\,S_n$, we have
$$\Pr\big[|S_n| \ge t\big] = \Pr\big[|\sqrt{n}\,S_n| \ge \sqrt{n}\,t\big] \le 2e^{-n t^2/2}.$$
This type of inequality is called a concentration inequality. It tells us that the probability of a random variable being far away from its mean decays at an exponential rate, i.e., the variable concentrates around its mean.
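To see this exponential concentration of the average numerically, here is a small Monte Carlo sketch; the sample size $n$, the threshold $t$, and the number of trials are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, trials = 50, 0.4, 200_000

# S is the average of n i.i.d. standard normal variables, so S ~ N(0, 1/n).
S = rng.standard_normal((trials, n)).mean(axis=1)

empirical = np.mean(np.abs(S) >= t)
bound = 2.0 * np.exp(-n * t**2 / 2.0)
print(f"empirical P(|S_n| >= {t}) = {empirical:.4f}, bound 2*exp(-n*t^2/2) = {bound:.4f}")
```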
An inequality of this type is exactly what we need to prove the Johnson-Lindenstrauss lemma.
We know that a random variable $Y$ follows the chi-squared distribution with $k$ degrees of freedom if
$$Y = \sum_{i=1}^{k} X_i^2,$$
where $X_1, \dots, X_k \sim N(0,1)$ are i.i.d. random variables.
We can calculate the first and second moments of each $X_i^2$ as
$$\mathbb{E}\big[X_i^2\big] = 1, \qquad \mathrm{Var}\big(X_i^2\big) = \mathbb{E}\big[X_i^4\big] - \big(\mathbb{E}\big[X_i^2\big]\big)^2 = 3 - 1 = 2.$$
This means that for the average $Y/k$ we have
$$\mathbb{E}\left[\frac{Y}{k}\right] = 1, \qquad \mathrm{Var}\left(\frac{Y}{k}\right) = \frac{2}{k}.$$
Running the same Chernoff-bound argument as above, now with the moment generating function of the chi-squared distribution ($\mathbb{E}\big[e^{\lambda X_i^2}\big] = (1-2\lambda)^{-1/2}$ for $\lambda < 1/2$), gives the following tail bound: for any $0 < \epsilon < 1/2$,
$$\Pr\left[\left|\frac{Y}{k} - 1\right| \ge \epsilon\right] \le 2e^{-k\epsilon^2/8}.$$
Or, written in terms of $Y$ itself, we have
$$\Pr\big[\,|Y - k| \ge \epsilon k\,\big] \le 2e^{-k\epsilon^2/8},$$
which is equivalent to
$$\Pr\big[(1-\epsilon)k \le Y \le (1+\epsilon)k\big] \ge 1 - 2e^{-k\epsilon^2/8}.$$
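The chi-squared tail bound can be checked numerically the same way; the values of $k$, $\epsilon$, and the number of trials below are again arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
k, eps, trials = 100, 0.3, 200_000

# Y is a sum of k squared standard normals, i.e. Y ~ chi-squared with k degrees of freedom.
Y = (rng.standard_normal((trials, k)) ** 2).sum(axis=1)

empirical = np.mean(np.abs(Y / k - 1.0) >= eps)
bound = 2.0 * np.exp(-k * eps**2 / 8.0)
print(f"empirical P(|Y/k - 1| >= {eps}) = {empirical:.4f}, bound = {bound:.4f}")
```

The bound is loose, but it captures the key feature we need: the deviation probability decays exponentially in $k$.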
Now, we are ready to state and prove the Johnson-Lindenstrauss lemma.
The Johnson-Lindenstrauss lemma
Given any set $X = \{x_1, \dots, x_n\}$ of $n$ points in $\mathbb{R}^d$, and any $0 < \epsilon < 1/2$, there exists a linear map $\Pi : \mathbb{R}^d \to \mathbb{R}^k$ with $k = O\!\left(\frac{\log n}{\epsilon^2}\right)$ such that
$$(1-\epsilon)\,\|x_i - x_j\|_2^2 \;\le\; \|\Pi(x_i) - \Pi(x_j)\|_2^2 \;\le\; (1+\epsilon)\,\|x_i - x_j\|_2^2$$
for all $x_i, x_j \in X$.
- Since $\Pi$ is a linear map, it is just multiplication by a matrix; let us call this matrix $A$, so that $\Pi(x) = Ax$ (up to a scaling factor fixed below).
- Intuitively, the above statement tells us that there is a mapping of $d$-dimensional vectors to $k$-dimensional vectors, with $k = O(\epsilon^{-2}\log n)$, that keeps the $\ell_2$-norm distances (approximately) the same.
- Note that this mapping is linear, which is a very useful property in many applications.
- Observe that $k$ does not depend on the ambient dimension $d$, only on the number of points $n$ and the accuracy $\epsilon$.
- Additionally, even though, in principle, the matrix $A$ in the statement above could depend on the vectors $x_1, \dots, x_n$, we will construct it in a way that is completely oblivious to what these vectors are.
- However, since one can show that there is no single matrix $A$ that works for all collections of $n$ vectors, we will make the construction of $A$ probabilistic and prove that, for any fixed set of vectors, it works with high probability.
- Specifically, we will make each entry of $A$ be sampled independently from a Gaussian distribution $N(0,1)$, where the density of a Gaussian random variable $N(\mu, \sigma^2)$ is given by
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$
Now, let $A$ be a $k \times d$ matrix such that every entry of $A$ is filled with an i.i.d. draw from a standard normal $N(0,1)$ distribution (a.k.a. the "Gaussian" distribution). For $x \in \mathbb{R}^d$, define
$$\Pi(x) := \frac{1}{\sqrt{k}}\, A x.$$
Now, we will establish a certain norm-preservation property of $\Pi$. Specifically, it suffices to prove that, for any fixed vector $x \in \mathbb{R}^d$, we have
$$\Pr\Big[\,\big|\|\Pi(x)\|_2^2 - \|x\|_2^2\big| > \epsilon\,\|x\|_2^2\,\Big] \le 2e^{-k\epsilon^2/8}.$$
(Applying this to each difference vector $x_i - x_j$ and taking a union bound over all pairs will then give the lemma.)
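Before going through the proof, here is a minimal NumPy sketch of the construction and of the quantity being bounded; the dimensions $d$, $k$ and the test vector are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
d, k = 10_000, 400

# A is k x d with i.i.d. N(0, 1) entries; Pi(x) = A x / sqrt(k).
A = rng.standard_normal((k, d))

def Pi(x):
    return (A @ x) / np.sqrt(k)

x = rng.standard_normal(d)
print("||x||^2      =", float(x @ x))
print("||Pi(x)||^2  =", float(Pi(x) @ Pi(x)))  # close to ||x||^2 with high probability
```

With $k = 400$, the squared norm is typically preserved up to a relative error on the order of $\sqrt{2/k}$, consistent with the bound we are about to prove.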
To prove this bound, let us fix $x$ and consider the distribution of the $i$-th coordinate of the (unscaled) mapping of $x$:
$$Y_i := (Ax)_i = \sum_{j=1}^{d} A_{ij}\, x_j.$$
Each random variable $Y_i$ is a sum of independent Gaussian random variables. Specifically, we have that
$$Y_i = \sum_{j=1}^{d} A_{ij}\, x_j \sim N\big(0, \|x\|_2^2\big),$$
where we used the crucial property of Gaussian distributions that they are closed under taking sums of independent terms. More precisely, for any two independent Gaussian random variables $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$, we have that
$$X_1 + X_2 \sim N\big(\mu_1 + \mu_2,\; \sigma_1^2 + \sigma_2^2\big).$$
Consequently, we can conclude that
$$\mathbb{E}\big[\|\Pi(x)\|_2^2\big] = \frac{1}{k}\sum_{i=1}^{k} \mathbb{E}\big[Y_i^2\big] = \frac{1}{k}\cdot k\,\|x\|_2^2 = \|x\|_2^2.$$
This means the norm is preserved in expectation, but what about the concentration around this expectation? We need a tail bound.
Recall that each $Y_i$ is distributed as a Gaussian random variable with variance $\|x\|_2^2$. We know that Gaussian variables are very concentrated around their expectation: the tail vanishes at an exponential rate. For instance, a Gaussian random variable is within two standard deviations of its expectation with probability at least $0.95$.
This strong concentration is quantified via Chernoff bounds (for chi-squared distributions).
We have shown above that each coordinate satisfies $Y_i \sim N(0, \|x\|_2^2)$,
and that
$$\|\Pi(x)\|_2^2 = \frac{1}{k}\sum_{i=1}^{k} Y_i^2, \qquad \mathbb{E}\big[\|\Pi(x)\|_2^2\big] = \|x\|_2^2.$$
Now, we standardize the random variables as follows:
$$Z_i := \frac{Y_i}{\|x\|_2} \sim N(0,1), \qquad Z := \sum_{i=1}^{k} Z_i^2 = \frac{k\,\|\Pi(x)\|_2^2}{\|x\|_2^2}.$$
This means that $Z$ follows the chi-squared distribution with $k$ degrees of freedom.
Now, we can apply the Chernoff bound for the chi-squared distribution to the random variable $Z$ to obtain the following bound:
$$\Pr\Big[\,\big|\|\Pi(x)\|_2^2 - \|x\|_2^2\big| \ge \epsilon\,\|x\|_2^2\,\Big] = \Pr\big[\,|Z - k| \ge \epsilon k\,\big] \le 2e^{-k\epsilon^2/8}.$$
Now, if we set the value of $k$ as follows:
$$k = \left\lceil \frac{24\,\ln n}{\epsilon^2} \right\rceil,$$
then for each fixed vector $x$ we have
$$\Pr\Big[\,\big|\|\Pi(x)\|_2^2 - \|x\|_2^2\big| \ge \epsilon\,\|x\|_2^2\,\Big] \le 2e^{-3\ln n} = \frac{2}{n^3}.$$
Now, applying this bound to each of the $\binom{n}{2}$ difference vectors $x_i - x_j$ and taking a union bound over all pairs, we obtain
$$\Pr\big[\text{some pair } (x_i, x_j) \text{ is distorted by more than } \epsilon\big] \le \binom{n}{2}\cdot\frac{2}{n^3} \le \frac{1}{n}.$$
Hence the random map $\Pi$ satisfies the lemma with probability at least $1 - 1/n$. This concludes the proof.
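To see the whole argument in action, here is an end-to-end sketch: sample $n$ arbitrary points, project with $k = \lceil 24\ln n/\epsilon^2 \rceil$ (the convenient constant used above), and measure the worst pairwise distortion. All sizes are illustrative choices.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
n, d, eps = 50, 5_000, 0.25
k = int(np.ceil(24 * np.log(n) / eps**2))   # target dimension from the proof

X = rng.standard_normal((n, d))             # n arbitrary points in R^d
A = rng.standard_normal((k, d))             # i.i.d. N(0, 1) entries
Y = X @ A.T / np.sqrt(k)                    # projected points, Pi(x) = A x / sqrt(k)

worst = 0.0
for i, j in combinations(range(n), 2):
    orig = np.sum((X[i] - X[j]) ** 2)
    proj = np.sum((Y[i] - Y[j]) ** 2)
    worst = max(worst, abs(proj / orig - 1.0))

print(f"k = {k}, worst squared-distance distortion = {worst:.3f} (target eps = {eps})")
```

With high probability, the reported worst distortion stays below the target $\epsilon$, even though the projection was drawn without looking at the points.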
Comments about the Johnson-Lindenstrauss Lemma
- The above proof shows not only the existence of a good map: we also get that a random map as above works with high probability! In other words, we have a Monte Carlo randomized algorithm for dimension reduction. (Since we can efficiently check that the distances are preserved to within the prescribed bounds, we can convert this into a Las Vegas algorithm.)
- The algorithm (at least the Monte Carlo version) does not even look at the set of points $X$: it works for any set with high probability. Hence, we can pick this map before the points in $X$ arrive.
- Two issues: the matrix $A$ we sample above has real-valued entries, and it is very dense, i.e., each entry is non-zero (with probability 1). This is not very practical.
- To deal with the first issue, we can replace the matrix $A$ with a random matrix with i.i.d. entries from a simpler distribution and then normalize the columns of $A$ to have unit norm. (The construction with i.i.d. Gaussian entries is called the Gaussian random projection.)
- To deal with the second issue, we can use a sparse random matrix $A$; this is called the sparse random projection. The paper by Kane and Nelson (2014) shows that there is a way of sampling $A$ which ensures that there are only $O(\epsilon^{-1}\log n)$ non-zero entries per column. (The distribution here cannot correspond to simple entry-wise independent sampling anymore.) (Kane & Nelson, 2014).
- The paper by Achlioptas (2003) shows that sampling each entry of $A$ independently from a simple discrete distribution also works well: for instance, $A_{ij} = +\sqrt{3}$ with probability $1/6$, $A_{ij} = 0$ with probability $2/3$, and $A_{ij} = -\sqrt{3}$ with probability $1/6$. This is called the Bernoulli random projection (Achlioptas, 2003); a small sketch of this construction appears after this list.
- Natural question: Can we have such a strong dimensionality-reduction phenomenon also for other distances, e.g., the $\ell_1$ norm?
- Unfortunately, no. It turns out that the $\ell_2$ norm is very special in this regard.
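As a rough illustration of the discrete alternative mentioned in the list above, here is a sketch of the Achlioptas-style projection, with entries $+\sqrt{3}$, $0$, $-\sqrt{3}$ taken with probabilities $1/6$, $2/3$, $1/6$; the dimensions and test vector are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 5_000, 500

# Achlioptas-style entries: sqrt(3) w.p. 1/6, 0 w.p. 2/3, -sqrt(3) w.p. 1/6.
# Each entry has mean 0 and variance 1, just like the Gaussian entries.
A = rng.choice([np.sqrt(3.0), 0.0, -np.sqrt(3.0)], size=(k, d), p=[1/6, 2/3, 1/6])

x = rng.standard_normal(d)
y = A @ x / np.sqrt(k)
print("||x||^2 =", float(x @ x), " ||Ax/sqrt(k)||^2 =", float(y @ y))
```

About two thirds of the entries of $A$ are zero, so the projection is cheaper to store and apply, while the squared norm is still approximately preserved.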
- Achlioptas, D. (2003). Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4), 671–687.
- Blum, A., Hopcroft, J., & Kannan, R. (2020). Foundations of data science. Cambridge University Press.
- Kane, D. M., & Nelson, J. (2014). Sparser Johnson-Lindenstrauss transforms. Journal of the ACM, 61(1), 1–23.