The Johnson-Lindenstrauss Lemma

In the era of AI, the Johnson-Lindenstrauss lemma provides a mathematical foundation for many applications in machine learning and deep learning, such as the models behind ChatGPT.

Very high-dimensional vectors are ubiquitous in science, engineering, and machine learning. They give a simple way of representing data as vectors or matrices, which can then be analyzed with the tools of linear algebra and probability.

For genetic data sets, images, time series (e.g. audio), and text, the number of features is typically large, so these data sets quickly become high-dimensional.

What do we want to do with such high-dimensional vectors? Cluster them, use them in regression analysis, feed them into machine learning algorithms. At a more basic level, all of these tasks require being able to determine whether one vector is similar to another. However, high-dimensional data is hard to work with directly: the complexity and computing cost of solving problems on large, high-dimensional data sets can grow exponentially with the dimension.

Therefore, we want to map high-dimensional vectors into a lower-dimensional space and ask: how much of the structure of the high-dimensional data is preserved if we map it into lower dimensions?

The Johnson-Lindenstrauss lemma gives the answer. This blog is based on course notes from Anupam Gupta, Daniel Raban, and MIT, and the book by Blum et al. (2020).

Markov’s inequality and the Chernoff bound

Before we state the Johnson-Lindenstrauss lemma and prove it, let's review some key inequalities in probability.

(Markov's inequality). Let $X \geq 0$ be a nonnegative random variable. Then for any $a > 0$,

$$\mathbb{P}(X \geq a) \leq \frac{\mathbb{E}[X]}{a}$$

Based on Markov's inequality, for any $t > 0$ we have

$$\begin{aligned} \mathbb{P}(X \geq a) & = \mathbb{P}(tX \geq ta) \\ & = \mathbb{P}(e^{tX} \geq e^{ta}) \\ & \leq e^{-ta} \mathbb{E}[e^{tX}]. \end{aligned}$$

This is the famous Chernoff bound.

Now, let's see an example of this bound in action. Let $X \sim \mathcal{N}(\mu, \sigma^2)$. Then $e^{X}$ follows the log-normal distribution, and we can compute

$$\mathbb{E}[e^{t(X-\mu)}] = e^{\frac{t^2\sigma^2}{2}}$$

Plugging the above equation into the Chernoff bound gives

$$\mathbb{P}(X- \mu \geq a) \leq e^{-ta} e^{\frac{t^2\sigma^2}{2}} = \exp \left( \frac{t^2 \sigma^2}{2} - ta \right)$$

The exponent

$$\frac{t^2 \sigma^2}{2} - ta$$

attains its minimum at $t = a/\sigma^2$, so plugging in this value of $t$ gives

$$\begin{aligned} \mathbb{P}(X- \mu \geq a) \leq e^{-ta} e^{\frac{t^2\sigma^2}{2}} & = \exp \left( \frac{t^2 \sigma^2}{2} - ta \right) \\ & = \exp \left( \frac{a^2}{2 \sigma^2} - \frac{a^2}{\sigma^2} \right) \\ & = \exp \left( - \frac{a^2}{2\sigma^2} \right) \end{aligned}$$

We get the same bound if we apply the argument to $-X$, so adding the two cases together gives

$$\mathbb{P}(|X-\mu| \geq a) \leq 2 \exp \left( - \frac{a^2}{2\sigma^2} \right) \tag{1}$$
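As a quick sanity check, here is a small numerical comparison of the exact two-sided Gaussian tail probability with the bound (1). This is a minimal sketch assuming NumPy and SciPy are available; the values of $a$ are arbitrary examples.

```python
# Compare the exact Gaussian tail P(|X - mu| >= a) with the bound (1).
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0
for a in [0.5, 1.0, 2.0, 3.0]:
    exact = 2 * norm.sf(a, loc=0.0, scale=sigma)   # exact P(|X - mu| >= a)
    bound = 2 * np.exp(-a**2 / (2 * sigma**2))     # right-hand side of (1)
    print(f"a={a:.1f}  exact={exact:.4f}  bound={bound:.4f}")
```

The bound is loose for small $a$ but captures the exponential decay of the tail.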

Now, let $\bar{X}$ be the average of i.i.d. random variables $X_1, \cdots, X_n \sim \mathcal{N}(\mu, \sigma^2)$:

$$\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i.$$

Then $\bar{X}$ follows the normal distribution with mean $\mu$ and variance $\frac{\sigma^2}{n}$. Applying the bound (1) to $\bar{X}$, we have

$$\mathbb{P}(|\bar{X}-\mu| \geq a) \leq 2 \exp \left( - \frac{n a^2}{2\sigma^2} \right)$$

This type of inequality is called a concentration inequality. It tells us that the probability of a random variable deviating far from its mean decays at an exponential rate; in other words, the random variable concentrates around its mean.
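To see this concentration in action, here is a small Monte Carlo experiment, a sketch assuming NumPy; the parameter values are arbitrary examples. It compares the empirical deviation probability of the sample mean with the bound above.

```python
# Monte Carlo check: P(|Xbar - mu| >= a) vs. 2 * exp(-n * a^2 / (2 * sigma^2)).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, a, trials = 1.0, 2.0, 100, 0.5, 100_000

samples = rng.normal(mu, sigma, size=(trials, n))   # trials x n Gaussian draws
xbar = samples.mean(axis=1)                         # sample mean of each trial
empirical = np.mean(np.abs(xbar - mu) >= a)         # empirical deviation probability
bound = 2 * np.exp(-n * a**2 / (2 * sigma**2))      # concentration bound
print(f"empirical={empirical:.5f}  bound={bound:.5f}")
```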

Using this type of inequality, we can prove the Johnson-Lindenstrauss lemma.

We know that a random variable $X$ follows the chi-squared distribution with $m$ degrees of freedom if $X = \sum_{i=1}^m Z_i^2$, where $Z_i \sim \mathcal{N}(0, 1)$ are i.i.d. random variables.

We can calculate the mean and variance of $X$ as

$$\begin{aligned} \mathbb{E}[X] & = \sum_{i=1}^m \mathbb{E}[Z_i^2] = m, \\ \mathbb{V}[X] & = \sum_{i=1}^m \mathbb{V}[Z_i^2] = 2m. \end{aligned}$$

This means that for the average $\bar{X} = \frac{1}{m} X = \frac{1}{m} \sum_{i=1}^m Z_i^2$, we have

$$\begin{aligned} \mathbb{E}[\bar{X}] & = \frac{1}{m} \sum_{i=1}^m \mathbb{E}[Z_i^2] = \frac{1}{m} \cdot m = 1, \\ \mathbb{V}[\bar{X}] & = \frac{1}{m^2} \sum_{i=1}^m \mathbb{V}[Z_i^2] = \frac{1}{m^2} \cdot 2m = \frac{2}{m}. \end{aligned}$$

Strictly speaking, $\bar{X}$ is not Gaussian, so we cannot simply substitute $\mu = 1$ and $\sigma^2 = 2/m$ into the bound (1). However, applying the same Chernoff-bound technique directly to the moment generating function of $Z_i^2$ yields an inequality of the same exponential form, with a slightly weaker constant in the exponent: for $0 < a < 1$,

$$\mathbb{P}(|\bar{X}-1| \geq a) \leq 2 \exp \left( - \frac{m a^2}{8} \right)$$

Or, written in terms of the variables $Z_i$, we have

$$\mathbb{P} \left( \left| \frac{1}{m} \sum_{i=1}^m Z_i^2 - 1 \right| \geq a \right) \leq 2 \exp \left( - \frac{m a^2}{8} \right), \tag{2}$$

which is equivalent to

$$\mathbb{P} \left( \left| \sum_{i=1}^m Z_i^2 - m \right| \geq ma \right) \leq 2 \exp \left( - \frac{m a^2}{8} \right). \tag{3}$$
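The bound (2) is easy to check empirically. The sketch below (assuming NumPy; $m$, $a$, and the number of trials are arbitrary example values) estimates the deviation probability by simulation and compares it with the right-hand side.

```python
# Empirical check of (2): P(|(1/m) * sum Z_i^2 - 1| >= a) vs. 2 * exp(-m * a^2 / 8).
import numpy as np

rng = np.random.default_rng(0)
m, a, trials = 200, 0.3, 50_000

Z = rng.standard_normal(size=(trials, m))
xbar = (Z**2).mean(axis=1)                  # (1/m) * sum of Z_i^2 per trial
empirical = np.mean(np.abs(xbar - 1) >= a)  # empirical deviation probability
bound = 2 * np.exp(-m * a**2 / 8)           # right-hand side of (2)
print(f"empirical={empirical:.5f}  bound={bound:.5f}")
```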

Now, we are ready to prove the Johnson-Lindenstrauss lemma.

The Johnson-Lindenstrauss lemma

Given any set of points $X = \{ x_1, x_2, \cdots, x_n \}$ in $\mathbb{R}^d$ and any $\epsilon \in (0, 1/2]$, there exists a linear map $\Pi: \mathbb{R}^d \to \mathbb{R}^m$ with $m = O\left( \frac{\log n}{\epsilon^{2}} \right)$ such that

$$1 - \epsilon \leq \frac{||\Pi(x_i) - \Pi(x_j)||_2^2}{||x_i - x_j||_2^2} \leq 1 + \epsilon \tag{4}$$

for any $i$ and $j$.

The construction uses Gaussian random variables; recall the density of the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$:

$$\mathcal{N}(\mu, \sigma^2, x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( - \frac{(x-\mu)^2}{2 \sigma^2} \right)$$

Now, let $A$ be an $m \times d$ matrix whose entries are i.i.d. draws from a Gaussian (normal) distribution with mean $0$ and variance $1/m$, i.e. $a_{i, j} \sim \mathcal{N}(0, 1/m)$. For $x \in \mathbb{R}^d$, define

$$\Pi(x) = A x. \tag{5}$$

Figure 1. Illustration of the random projection process with matrix A.
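As a minimal sketch of this construction (assuming NumPy; the dimensions $d$ and $m$ and the random seed are arbitrary example values), we can draw such a matrix $A$ and apply the map (5) to a vector:

```python
# Sketch of the random projection (5): an m x d matrix A with i.i.d.
# N(0, 1/m) entries, applied to a vector x in R^d.
import numpy as np

rng = np.random.default_rng(42)
d, m = 10_000, 400                                            # example dimensions

A = rng.normal(loc=0.0, scale=np.sqrt(1.0 / m), size=(m, d))  # a_ij ~ N(0, 1/m)
x = rng.standard_normal(d)                                    # a test vector

Pi_x = A @ x                                                  # Pi(x) = A x
print(np.linalg.norm(x)**2, np.linalg.norm(Pi_x)**2)          # squared norms should be close
```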

Now, we will establish a norm-preservation property of $A$. Specifically, it suffices to prove that, for any fixed vector $x \in \mathbb{R}^d$, we have

$$\mathbb{P} \bigg[ (1- \epsilon) ||x||_2^2 \leq ||A x||_2^2 \leq (1+\epsilon) ||x||_2^2 \bigg] \geq 1 - \frac{2\epsilon}{n^2} \tag{6}$$

To prove equation (6), let us fix $x \in \mathbb{R}^d$ and consider the distribution of the $i$-th coordinate $Y_i$ of the image $Ax$ of $x$:

$$\begin{aligned} & Y_i := (Ax)_i = \sum_{j=1}^d A_{i, j} x_j, \\ & Ax = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_m \end{bmatrix} \end{aligned} \tag{7}$$

Each random variable $Y_i$ is a sum of independent, scaled Gaussian random variables. Specifically, we have

$$Y_i \sim \sum_{j=1}^d x_j \, \mathcal{N} \left(0, \frac{1}{m} \right) \sim \sum_{j=1}^d \mathcal{N} \left(0, \frac{x_j^2}{m} \right) \sim \mathcal{N} \left(0, \frac{||x||_2^2}{m} \right), \tag{8}$$

where we used the crucial property that Gaussian distributions are closed under taking sums of independent variables. More precisely, for any two independent Gaussian random variables $Z_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $Z_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$, we have

$$Z_1 + Z_2 \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2).$$

Consequently, we can conclude that

$$\mathbb{E} \big[ ||Ax||_2^2 \big] = \mathbb{E} \left[ \sum_{i=1}^m Y_i^2 \right] = \sum_{i=1}^m \mathbb{E}[Y_i^2] = m \cdot \mathrm{Var}(Y_i) = m \cdot \frac{||x||_2^2}{m} = ||x||_2^2. \tag{9}$$
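Equation (9) is easy to verify numerically. The sketch below (assuming NumPy; the dimensions and number of trials are arbitrary example values) averages $||Ax||_2^2$ over many independent draws of $A$ for a fixed $x$:

```python
# Numerical check of (9): the average of ||A x||^2 over many draws of A
# should be close to ||x||^2.
import numpy as np

rng = np.random.default_rng(0)
d, m, trials = 500, 50, 5_000

x = rng.standard_normal(d)
estimates = [
    np.linalg.norm(rng.normal(0.0, np.sqrt(1.0 / m), size=(m, d)) @ x) ** 2
    for _ in range(trials)
]
print(f"average of ||Ax||^2: {np.mean(estimates):.2f}")
print(f"||x||^2:             {np.linalg.norm(x)**2:.2f}")
```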

This means the $\ell_2$ norm is preserved in expectation, but how concentrated is $||Ax||_2^2$ around this expectation? We need a tail bound.

Recall that each coordinate $Y_i$ of $Ax$ is a Gaussian random variable, so $||Ax||_2^2$ is a sum of squared Gaussians. We know that Gaussian variables are very concentrated around their expectation: the tail vanishes at an exponential rate. For instance, $\mathcal{N}(\mu, \sigma^2)$ is within $2\sigma$ of its expectation with probability at least $0.95$.

This strong concentration is quantified via Chernoff bounds (for chi-squared distributions).

In equation (9), we have shown

$$\mathbb{E} \big[ ||Ax||_2^2 \big] = \mathbb{E} \left[ \sum_{i=1}^m Y_i^2 \right] = ||x||_2^2 \tag{10}$$

And in equation (8), we have shown

$$Y_i \sim \mathcal{N} \left(0, \frac{||x||_2^2}{m} \right). \tag{11}$$

Now, we standardize the random variable $Y_i$ as follows:

$$\frac{\sqrt{m}}{||x||_2} Y_i \sim \mathcal{N} \left(0, 1 \right). \tag{12}$$

Since the rows of $A$ are independent, the $Y_i$ are independent, and therefore $\sum_{i=1}^m \left( \frac{\sqrt{m}}{||x||_2} Y_i \right)^2 = \frac{m}{||x||_2^2} \sum_{i=1}^m Y_i^2$ follows the chi-squared distribution with $m$ degrees of freedom.

Now, we can apply the chi-squared concentration bound (3) to these standardized variables to obtain the following bound:

$$\begin{aligned} \mathbb{P} \left[ \bigg| ||Ax||_2^2 - \mathbb{E}[ ||Ax||_2^2] \bigg| \geq \epsilon ||x||_2^2 \right] & = \mathbb{P} \left[ \bigg| ||Ax||_2^2 - ||x||_2^2 \bigg| \geq \epsilon ||x||_2^2 \right] \\ & = \mathbb{P} \left[ \bigg| \sum_{i=1}^m Y_i^2 - ||x||_2^2 \bigg| \geq \epsilon ||x||_2^2 \right] \\ & = \mathbb{P} \left[ \left| \sum_{i=1}^m \left( \frac{\sqrt{m}}{||x||_2} Y_i \right)^2 - m \right| \geq \epsilon m \right] \\ & \leq 2 \exp \left( - \frac{\epsilon^2 m}{8} \right) \end{aligned} \tag{13}$$
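We can also check (13) empirically. The sketch below (assuming NumPy; $d$, $m$, $\epsilon$, and the number of trials are arbitrary example values) estimates the probability that $||Ax||_2^2$ deviates from $||x||_2^2$ by more than $\epsilon ||x||_2^2$ and compares it with the bound:

```python
# Empirical check of (13): fraction of draws of A with
# | ||Ax||^2 - ||x||^2 | >= eps * ||x||^2, vs. 2 * exp(-eps^2 * m / 8).
import numpy as np

rng = np.random.default_rng(1)
d, m, eps, trials = 500, 200, 0.2, 2_000

x = rng.standard_normal(d)
x_norm2 = np.linalg.norm(x)**2

deviations = 0
for _ in range(trials):
    A = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, d))
    y_norm2 = np.linalg.norm(A @ x)**2
    deviations += abs(y_norm2 - x_norm2) >= eps * x_norm2

empirical = deviations / trials
bound = 2 * np.exp(-eps**2 * m / 8)
print(f"empirical={empirical:.5f}  bound={bound:.5f}")
```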

Now, if we set the value of $m$ as follows:

$$m \geq 16 \cdot \epsilon^{-2} \log \left( \frac{n}{\sqrt{\epsilon}} \right) \tag{14}$$

then we have

$$\begin{aligned} \mathbb{P} \left[ \bigg| ||Ax||_2^2 - ||x||_2^2 \bigg| \geq \epsilon ||x||_2^2 \right] & \leq 2 \exp \left( - 2 \log (n/\sqrt{\epsilon}) \right) \\ & = 2 \left( \frac{n}{\sqrt{\epsilon}} \right)^{-2} \\ & = \frac{2\epsilon}{n^2} \end{aligned}$$

Now, since $A$ is linear, $A(x_i) - A(x_j) = A(x_i - x_j)$, so we can apply the bound above to each difference vector $x_i - x_j$ and use the union bound over all ${n \choose 2}$ pairs:

$$\begin{aligned} & \mathbb{P} \left[ \exists\, i \neq j : \bigg| ||A(x_i) - A(x_j)||_2^2 - ||x_i - x_j||_2^2 \bigg| \geq \epsilon ||x_i - x_j||_2^2 \right] \\ & \quad \quad \quad \quad = \mathbb{P} \left[ \exists\, i \neq j : \left| \frac{||A(x_i) - A(x_j)||_2^2}{||x_i - x_j||_2^2} - 1 \right| \geq \epsilon \right] \\ & \quad \quad \quad \quad \leq \sum_{i < j} \mathbb{P} \left[ \bigg| ||A(x_i) - A(x_j)||_2^2 - ||x_i - x_j||_2^2 \bigg| \geq \epsilon ||x_i - x_j||_2^2 \right] \\ & \quad \quad \quad \quad \leq {n \choose 2} \cdot \frac{2\epsilon}{n^2} \\ & \quad \quad \quad \quad \leq \epsilon \end{aligned}$$

Thus, with probability at least $1 - \epsilon \geq 1/2$, the random map $\Pi(x) = Ax$ satisfies (4) for all pairs simultaneously; in particular, a map with the desired property exists. This concludes the proof. $\square$
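To wrap up, here is an end-to-end sketch of the lemma in action (assuming NumPy and SciPy; $n$, $d$, $\epsilon$, and the random seed are arbitrary example values): we draw $n$ points in $\mathbb{R}^d$, project them with a Gaussian matrix whose target dimension follows (14), and measure the distortion of all pairwise squared distances as in (4).

```python
# End-to-end sketch: random projection of n points from R^d to R^m,
# with m chosen according to (14), and a check of the distortion in (4).
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(7)
n, d, eps = 100, 10_000, 0.25

m = int(np.ceil(16 * np.log(n / np.sqrt(eps)) / eps**2))      # from (14)
print(f"target dimension m = {m}")

X = rng.standard_normal((n, d))                               # n points in R^d
A = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, d))            # projection matrix
Y = X @ A.T                                                   # projected points in R^m

orig = pdist(X, "sqeuclidean")                                # original squared distances
proj = pdist(Y, "sqeuclidean")                                # projected squared distances
ratios = proj / orig

print(f"min ratio = {ratios.min():.3f}, max ratio = {ratios.max():.3f}")
print(f"all within [1 - eps, 1 + eps]? {bool(np.all(np.abs(ratios - 1) <= eps))}")
```

Note that the target dimension $m$ in (14) depends only on $n$ and $\epsilon$, not on the original dimension $d$, which is the key practical takeaway of the lemma.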

Comments about the Johnson-Lindenstrauss Lemma

The Gaussian construction above is not the only option: Achlioptas (2003) showed that random $\pm 1$ (binary) entries also work, and Kane & Nelson (2014) constructed much sparser projection matrices.

  1. Achlioptas, D. (2003). Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4), 671–687.
  2. Blum, A., Hopcroft, J., & Kannan, R. (2020). Foundations of data science. Cambridge University Press.
  3. Kane, D. M., & Nelson, J. (2014). Sparser Johnson-Lindenstrauss transforms. Journal of the ACM (JACM), 61(1), 1–23.