Locality Sensitive Hashing (LSH)

LSH is one of the key breakthroughs behind large-scale similarity search, with great impact in many fields of computer science including computer vision, databases, information retrieval, machine learning, and signal processing.

In 2012, minwise hashing and locality sensitive hashing (LSH) were recognized with the ACM Paris Kanellakis Theory and Practice Award. The inventors were honored for “their groundbreaking work on locality-sensitive hashing that has had great impact in many fields of computer science including computer vision, databases, information retrieval, machine learning, and signal processing”.

In this post, we will learn the brilliant ideas behind LSH. Before we proceed, let’s have a short prelude on probabilistic reasoning.

Probabilistic thinking again

I have written several posts discussing the essence of probabilistic reasoning. If you have not read them, I recommend reading them first, as this post is the climax of probabilistic thinking.

When we talk about probability, we are talking about the following three things: the sample space Ω (the population), the field of events F (the samples), and the probability measure P (a function mapping events to [0, 1]).

When we think in the way of probability, our logic follows the order as such

\text{sample space} \rightarrow \text{field} \rightarrow \text{probability}; \quad \Omega \to \mathcal{F} \to \mathbb{P}

When we think in the way of statistics, our logic follows the order as such

\text{likelihood (model)} \rightarrow \text{samples (data)} \rightarrow \text{population}; \quad \mathbb{P} \to \mathcal{F} \to \Omega

One follows the data generating process, the other follows the data inference process.

In data science, those two logics are very often entwined. However, if we keep the key concepts in mind, we will not get lost. The key concepts in data science are: population (cardinality), samples (events), and probability (function mapping).

When you are dealing with big data, I encourage you to think dynamically, which means you should think hard about how to transform a static problem into a random process (data generating) problem.

Okay, enough talk about the philosophy of probability; let’s see some real-world problems.

Similarity issue

In the field of data science, many problems come down to calculating similarities between items (such as documents or firm names). For instance: given two homework assignments (reports), how can a computer detect whether one is likely to have been plagiarized from the other without understanding the content?

One way to measure similarity is to calculate the share of the intersection between two sets.

Consider two sets A = {0, 1, 2, 5, 6} and B = {0, 2, 3, 5, 7, 9}. How similar are A and B? The Jaccard similarity is defined as

\begin{aligned} \mathrm{JS}(A, B) & = \frac{|A \cap B|}{|A \cup B|} \\ & = \frac{|\{0,2,5\}|}{|\{0,1,2,3,5,6,7,9\}|} = \frac{3}{8} = 0.375 \end{aligned}
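A minimal sketch of this definition in Python, using the two sets above:

def jaccard_similarity(a, b):
    """Share of common elements: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

A = {0, 1, 2, 5, 6}
B = {0, 2, 3, 5, 7, 9}
print(jaccard_similarity(A, B))  # 0.375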

Suppose we want to find similar documents or names; the Jaccard similarity is a natural choice. However, we need to construct the sample space Ω and the samples first:

Naive method

Now, suppose we have four documents as four sets

\begin{aligned} S_1 & = \{1, 2, 5\} \\ S_2 & = \{3\} \\ S_3 & = \{2, 3, 4, 6\} \\ S_4 & = \{1, 4, 6\} \end{aligned}

and we will construct a one-hot vector for each document:

| Index | Element | S_1 | S_2 | S_3 | S_4 |
| --- | --- | --- | --- | --- | --- |
| 1 | hel | 1 | 0 | 0 | 1 |
| 2 | ell | 1 | 0 | 1 | 0 |
| 3 | llo | 0 | 1 | 1 | 0 |
| 4 | o w | 0 | 0 | 1 | 1 |
| 5 | wor | 1 | 0 | 0 | 0 |
| 6 | orl | 0 | 0 | 1 | 1 |

When both sets have a 1 in the same row, that row belongs to their intersection. It is very intuitive, and it looks like we do not need a hash function at all. However, consider what we actually did: we read every document, constructed its n-grams, took the union of all n-grams as the sample space, encoded each document as a one-hot vector over that union, and compared the vectors pairwise.
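To make this concrete, here is a minimal sketch that rebuilds the table above as a 6 × 4 one-hot matrix and reads off one pairwise similarity (the helper name is my own):

import numpy as np

# rows = n-gram indices 1..6, columns = S_1 .. S_4, exactly as in the table above
one_hot = np.array([
    [1, 0, 0, 1],  # 1: hel
    [1, 0, 1, 0],  # 2: ell
    [0, 1, 1, 0],  # 3: llo
    [0, 0, 1, 1],  # 4: o w
    [1, 0, 0, 0],  # 5: wor
    [0, 0, 1, 1],  # 6: orl
], dtype=bool)

def jaccard_from_columns(matrix, i, j):
    """Jaccard similarity of documents i and j from their one-hot columns."""
    intersection = np.logical_and(matrix[:, i], matrix[:, j]).sum()
    union = np.logical_or(matrix[:, i], matrix[:, j]).sum()
    return intersection / union

print(jaccard_from_columns(one_hot, 0, 2))  # JS(S_1, S_3) = 1/6 ≈ 0.167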

Let’s look at the running time of this process: we load M documents and construct their n-gram sets, encode each set as a one-hot vector of length N (the size of the union), and then compare all M documents pairwise over those length-N vectors.

You can see that it requires lots of calculations. Furthermore, reading all the n-grams and constructing the matrix could already take a huge amount of time if you have a big dataset. Therefore, we need a random process to reduce the running time and space.

Here is the pseudocode for the naive method of finding similar items.

import numpy as np

# construct samples and save them into a dict
doc_dict = {}
for i in range(number_of_files):
    doc = read_file(f"file_name_{i}")
    n_grams_i = construct_n_grams(doc)   # returns the set of n-grams of the document
    doc_dict[f"file_{i}"] = n_grams_i

# sample space (population): the union of all n-gram sets
doc_union = set()
for sample in doc_dict.values():
    doc_union |= sample
doc_union = sorted(doc_union)            # fix an ordering for the matrix rows

# construct the matrix: one row per n-gram, one column per document
n = len(doc_union)
m = len(doc_dict)
jaccard_matrix = np.zeros((n, m), dtype=bool)

# loop over all samples (sets) and do the one-hot encoding
for idx, s in enumerate(doc_dict.values()):
    jaccard_matrix[:, idx] = np.isin(doc_union, list(s))

# calculate the similarity for every pair of documents
for i in range(m):
    for j in range(i + 1, m):
        intersection = np.logical_and(jaccard_matrix[:, i], jaccard_matrix[:, j]).sum()
        union = np.logical_or(jaccard_matrix[:, i], jaccard_matrix[:, j]).sum()
        share = intersection / union if union else 0.0
# hope you can see how naive this method is

It could run for days if you have a large number of documents, and even years if you have a truly big dataset.

Min hashing algorithm

The idea of min hashing is to do everything we did in the above section with one pass, which means we want to get everything at once when we load the dataset.

Instead of calculating the full one-hot vector, we use k hash functions to create a signature for each document and then compare those signatures pairwise to find similar documents. The intuition behind this idea is similar to that of finding the distinct values (read this post).

Note that the naive method needs two passes: one to construct the sample space (the union of all n-grams) and one to encode each document. Is it possible to calculate Jaccard similarities with only one pass? YES, it is! To do this, we need our favorite random process.

Figure 1. Illustration of the minimum hashing (MinHash) process.

As shown in Figure 1, we create n-grams for each document as we read it in and then hash them with k hash functions. For each hash function, we take the minimum hash value over the document’s n-grams and save it into a signature vector of length k.

Once we have the signature vectors, we can estimate the Jaccard similarity pairwise from the share of matching positions.
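Here is a rough sketch of this process (the seeded use of Python’s built-in hash as the family of k hash functions is my own simplification, not necessarily the exact scheme in Figure 1):

import random

def make_hash_functions(k, seed=2023):
    """Create k hash functions; each maps an n-gram string to an integer."""
    rng = random.Random(seed)
    seeds = [rng.getrandbits(32) for _ in range(k)]
    return [lambda x, s=s: hash((s, x)) for s in seeds]

def minhash_signature(n_grams, hash_funcs):
    """For each hash function, keep the minimum hash value over the n-grams."""
    return [min(h(g) for g in n_grams) for h in hash_funcs]

def estimate_jaccard(sig_x, sig_y):
    """The share of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_x, sig_y)) / len(sig_x)

hash_funcs = make_hash_functions(k=100)
sig_1 = minhash_signature({"hel", "ell", "wor"}, hash_funcs)         # S_1 from the table
sig_3 = minhash_signature({"ell", "llo", "o w", "orl"}, hash_funcs)  # S_3 from the table
print(estimate_jaccard(sig_1, sig_3))  # should be close to JS(S_1, S_3) = 1/6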

Compared to the naive method, what did we gain from the MinHash algorithm?

|  | naive method | MinHash method |
| --- | --- | --- |
| load the dataset | M | M |
| construct n-gram sets | n | n |
| encode each set as a one-hot vector | n × N | 0 |
| calculate Jaccard similarities pairwise | M² · N | M² · k |

As you can see, we save a lot of space by skipping the one-hot encoding, and we also save running time when calculating the Jaccard similarities. Since N is the population size (usually far larger than k), the savings are substantial.

However, the running time for calculating the shares of intersection still grows as O(M²). Can we do better with a smarter algorithm? The answer is YES!

We can, if we use the Locality Sensitive Hashing (LSH) algorithm. Wouldn’t it be amazing if we could find the similar pairs in roughly O(M · k) time? I hope you can appreciate the invention of LSH and understand now why its inventors won the ACM Paris Kanellakis Theory and Practice Award.

Locality sensitive hashing

In Figure 1, we ended up with an M × k signature matrix. Now we will design a band structure to filter (select) possible candidates before calculating Jaccard similarities pairwise. The idea is very similar to two-level hashing. In the end, we want to work with a grid of b × r where b ≪ M and r < k.

Recall that a hash table has some probability of collision (two different values being hashed to the same numeric value). But do not get confused by collisions here: when we build two-level hashing, we want to avoid collisions with the optimal number of hash functions; here, we are trying to find the optimal number of hash functions so that similar items collide.

For x and y with Jaccard similarity J(x, y) = p ∈ [0, 1], the defining property of MinHash is

\mathrm{Pr}\left[ \text{MinHash}(x) = \text{MinHash}(y) \right] = J(x, y) = p \in [0, 1] \tag{1}

Now, with r hash functions, the probability that x and y have matching signature values on all r of them is

\prod_{i=1}^r \mathrm{Pr}\left[ \text{MinHash}_i(x) = \text{MinHash}_i(y) \right] = p^r \tag{2}

Recall that in Figure 1 we compared signature vectors of length k because we have k hash functions, whereas here we use only r of them within a band. When two items have identical signature vectors, they are almost certainly identical; when their signatures largely agree, they are similar.

Then the probability that x and y do not match on such a band is

1 - p^r

Now we divide our signature matrix into b bands of r columns each, which is the key idea behind LSH. Figure 2 gives an illustration.

Figure 2. Illustration of LSH.

In Equations (1) and (2) we only compare x and y. However, when we search for similar items, we need to compare x with all other members of our sample space. Now, with b bands, the probability that none of the b bands match is given by

(1-p^r)^b

Therefore, the probability that we will have a match from at least one band is

\mathrm{Pr}(\text{at least one band matches}) = 1 - (1-p^r)^b \tag{3}

Now, we are interested in the optimal values of b and r, as we want to use as few blocks as possible to find those similar items (instead of using the full M × k matrix). So, how do we find the optimal b and r? Let’s think about the boundary cases first: with b = 1 and r = k, a pair is selected only if its entire signature matches, which happens with probability p^k and is far too strict; with b = k and r = 1, a single matching hash value is enough, which happens with probability 1 - (1 - p)^k and is far too lenient.

Therefore, we would like to find b and r that give us at least one match with reasonably high probability, so that we do not have to iterate over all M rows. Think of this problem as throwing many balls into different bins and asking for the probability that similar balls land in the same bin.

Again, we have two dimensions to play with: the number of bands b and the number of hash functions r within each band.

Now, let’s assume that J(x, y) = 0.4; we want to find out how likely x and y are to land in the same bucket for different values of b and r.

Figure 3. Illustration of LSH with different band sizes and r = 3 for both lines.

As shown in Figure 3, the more bands you have, the higher the chance of getting at least one matching band, which is aligned with our common sense. It is worth mentioning that we do not need so many bands when two items are very similar. Considering we are only using r = 3 hash functions per band, this works quite well.
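The numbers behind Figure 3 can be reproduced with a one-line formula; p = 0.4 and r = 3 come from the text above, while the band counts below are arbitrary picks:

def match_probability(p, r, b):
    """Probability that at least one of b bands (r hash functions each) fully matches."""
    return 1 - (1 - p ** r) ** b

p, r = 0.4, 3
for b in (1, 5, 10, 20, 50):
    print(b, round(match_probability(p, r, b), 3))
# roughly 0.064, 0.282, 0.484, 0.734, 0.963: more bands, higher chance of a match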

Since in practice we do not know the probability p over the sample space, we need to tune b and r based on the threshold we set for Pr(at least one band matches).

Now, we will set the threshold Pr = 0.8 and plot the relationship between p and Pr(at least one band matches) for different choices of (b, r).

Figure 4. Illustration of LSH with different values of b and r; notice that the curve shifts in different directions as b and r change.
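One way to read Figure 4 numerically (a sketch based on Equation (3)) is to invert 1 - (1 - p^r)^b = 0.8 for p, which gives the similarity level at which a pair becomes likely to be selected as a candidate; the (b, r) pairs below are arbitrary examples:

def similarity_at_threshold(b, r, threshold=0.8):
    """Jaccard similarity p at which Pr(at least one band matches) reaches the threshold."""
    return (1 - (1 - threshold) ** (1 / b)) ** (1 / r)

for b, r in [(5, 3), (20, 3), (20, 5), (50, 5)]:
    print(b, r, round(similarity_at_threshold(b, r), 3))
# increasing b shifts the curve to the left (more lenient); increasing r shifts it to the right (stricter)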

With LSH, we will not compare every pair of signature vectors. Instead, we first select possible candidates from the buckets and then compute similarities only for those candidates to find the similar items.
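To close, here is a minimal sketch of the banding step, assuming signatures of length k = b · r have already been computed (the toy signatures and the dictionary-based bucketing are illustrative choices, not the only way to implement LSH):

from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    """Return pairs of documents whose signatures agree on at least one band."""
    buckets = defaultdict(list)  # (band index, band content) -> list of doc ids
    for doc_id, sig in signatures.items():
        assert len(sig) == b * r
        for band in range(b):
            key = (band, tuple(sig[band * r:(band + 1) * r]))
            buckets[key].append(doc_id)
    candidates = set()
    for doc_ids in buckets.values():
        candidates.update(combinations(sorted(doc_ids), 2))
    return candidates

# toy signatures of length k = 6 (b = 3 bands, r = 2 rows) just to illustrate
signatures = {
    "doc_a": [3, 7, 1, 4, 9, 2],
    "doc_b": [3, 7, 5, 8, 9, 2],  # shares the first and the last band with doc_a
    "doc_c": [6, 0, 5, 8, 4, 4],  # shares the middle band with doc_b
}
print(lsh_candidate_pairs(signatures, b=3, r=2))
# -> {('doc_a', 'doc_b'), ('doc_b', 'doc_c')} (print order may vary)

In practice, the signatures come from the MinHash step above, and only the returned candidate pairs need an exact (or signature-based) Jaccard check.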