Dive into Latent Dirichlet Allocation Model

Not many models balance complexity and effectiveness as well as LDA. I like this model so much because it is perhaps the best model to start with when you want to learn about machine learning and deep learning. Why? I will explain in this post.

What do we talk about when we talk about machine learning and deep learning? It is hard to say, as there are so many things going on right now. They say one should follow a saint in turbulent times. Who should we follow, and where should we start? I think we should start with the basics and follow our intuition and logic.

So, let’s quote ‘the Godfather’ of deep learning, Geoffrey Hinton, who said: “The deep learning revolution has transformed the field of machine learning over the last decade. It was inspired by attempts to mimic the way the brain learns but it is grounded in basic principles of statistics, information theory, decision theory and optimization.”

To illustrate the above quote, I will use a simple model called Latent Dirichlet Allocation (LDA) to explain the basic principles of machine learning and deep learning. I think that after understanding LDA, one can grasp those basic principles much more easily.

This post is based on the paper by the author of LDA, David Blei, titled Probabilistic Topic Models (Blei, 2012). I recommend reading that paper after this post. Here is the roadmap of this post: first the big picture, then the probability distributions behind LDA, and finally the estimation of its parameters.

The big picture

Please look at the following figure. Notice that words are highlighted in different colors. This is because we want to show that words are grouped into different topics.

Figure 1. The illustration of LDA from the paper by David Blei (2012).

In the article above, titled Seeking Life’s Bare (Genetic) Necessities, four topics are highlighted: topic 1 is about genetics, topic 2 is about evolutionary biology, topic 3 is about data analysis, and topic 4 is about everything else.

| genetics | evolution | data analysis | others |
|---|---|---|---|
| gene 0.04 | life 0.02 | data 0.02 | brain 0.04 |
| dna 0.02 | evolve 0.01 | neuron 0.02 | number 0.02 |

This is how LDA works: it groups words into topics, and it assumes that each document is a mixture of topics. This is the big picture of LDA, and it really aligns with our common sense and intuition. When we read an article or a newspaper, we always try to identify the thesis (or the theme) and the topics of the article. If the theme delivers the key message, then the topics are the supporting arguments or supporting evidence.

The probability distributions

After understanding the big picture of LDA, let’s dive into the details. The first thing we need to understand is the probability distributions. There are three probability distributions in LDA: the document-topic distribution, the topic-word distribution and the word distribution.

If you look at Figure 1, you will see what we have at hand: a corpus of documents, where each document is a collection of words.

Our goal is to extract the topics from the corpus and assign each document to one or more topics. But how could we find the topics? Or, another question worth thinking about: how many documents do we need to find a set of topics? Do you think an algorithm could find the topics by reading a single document? Why? Could a human find the topics by reading a single document? Why?

For a human being, it is easy to find the topics by reading one document if she or he is educated and familiar with the subject. But for an algorithm, it is not easy to find the topics by reading one document. Why? Because the algorithm does not have any prior knowledge about the topics. So, we need to give the algorithm many documents to find the topics, which means learning from the data.

To let an algorithm learn from the data, we have to design a model and an algorithm for it to learn with. To do this, we need to think about how a document is generated.

Figure 2. The illustration of LDA in terms of generating a document.

In Figure 2, assume we have $M$ documents making up the corpus. Each document is a mixture of $K$ topics. Each topic is a mixture of $V$ words. Notice that a document does not have to use every word in the dictionary.

Now, we define a document $w$ as a finite sequence of $N$ words:

$$ w = \{w_1, w_2, \cdots, w_N\} $$

and denote a corpus $D$ of $M$ documents:

$$ D = \{w_1, w_2, \cdots, w_M\} $$

We assume there are $k$ topics in the corpus, and for a document $w_i$ in the corpus with length $N_i$, we have the following generative process:

  1. for the length of the document, we assume $N_i \sim \text{Poisson}(\lambda)$

  2. for a vector $\theta_i \in \mathbb{R}^{k}$ that follows a $k$-dimensional Dirichlet distribution with parameter $\alpha$, sample $\theta_i \sim \text{Dirichlet}(\alpha)$

  3. for each word $w_{ij}$ in the document $w_i$ with index $j$:

    1. sample a topic $z_{ij}$ from a $k$-dimensional multinomial distribution with parameter $\theta_i$: $z_{ij} \sim \text{Multinomial}(\theta_i)$

    2. sample a word $w_{ij}$ from a $V$-dimensional multinomial distribution with parameter $\beta_{z_{ij}}$: $w_{ij} \sim \text{Multinomial}(\beta_{z_{ij}})$

where $\alpha$ and $\beta$ are hyperparameters, with $\alpha \in \mathbb{R}^{k}$ and $\beta \in \mathbb{R}^{k \times V}$.
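
To make these steps concrete, here is a minimal sketch of the generative process in R. It is only an illustration under assumed settings: rdirichlet is taken from the gtools package (not from the original paper), and the values of K, V, lambda and the hyperparameters are made up.

library(gtools)   # assumed source of rdirichlet(); not part of base R
set.seed(1)

K <- 4                             # number of topics
V <- 10                            # vocabulary size
lambda <- 50                       # mean document length for the Poisson draw
alpha <- rep(0.5, K)               # Dirichlet hyperparameter for topic proportions
beta <- rdirichlet(K, rep(1, V))   # K x V matrix: one word distribution per topic

generate_document <- function() {
  N <- rpois(1, lambda)                       # step 1: document length
  theta <- as.numeric(rdirichlet(1, alpha))   # step 2: topic proportions theta_i
  sapply(seq_len(N), function(j) {
    z <- sample(K, 1, prob = theta)           # step 3.1: topic assignment z_ij
    sample(V, 1, prob = beta[z, ])            # step 3.2: word w_ij from topic z_ij
  })
}

generate_document()   # one simulated document, as a vector of word ids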

If we put everything into a matrix, we have:

$$
\underbrace{ \begin{bmatrix} \theta_{11} & \cdots & \theta_{1k} \\ & & \\ & M \times K & \\ & & \\ \theta_{m1} & \cdots & \theta_{mk} \end{bmatrix}}_{\text{Topic assignment} \ \theta \sim \text{Dir}(\alpha) } \ \ \underbrace{\begin{bmatrix} & & & & \\ & & & & \\ & & K \times V & & \\ & & & & \\ & & & & \end{bmatrix}}_{\beta} \approx \underbrace{\begin{bmatrix} & & & & \\ & & & & \\ & & M \times V & & \\ & & & & \\ & & & & \end{bmatrix}}_{\text{Corpus}} \tag{1}
$$

The original paper by Blei et al. did not use the above matrix representation, but it did give the following graphical representation, which is very helpful for understanding the generative process of LDA.

Figure 3. The original illustration of LDA in terms of generating a document from Blei et al. (2003).

Now it’s time to understand the probability distributions. For a vector $\theta_i \in \mathbb{R}^{k}$ that follows a $k$-dimensional Dirichlet distribution with parameter $\alpha$, we have:

$$
\begin{aligned}
\sum_{i=1}^{k} \theta_{i} &= 1 \\
f(\theta, \alpha) & = \frac{1}{B(\alpha)} \prod_{i=1}^{k} \theta_{i}^{\alpha_{i} - 1} \\
B(\alpha) & = \frac{\prod_{i=1}^{k} \Gamma(\alpha_{i})}{\Gamma(\sum_{i=1}^{k} \alpha_{i})}
\end{aligned}
\tag{2}
$$

where $\alpha = (\alpha_1, \alpha_2, \cdots, \alpha_k)$, $\alpha_i > 0$, and $B(\alpha)$ is the normalizing constant (a multivariate generalization of the beta function).
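
As a quick numerical check of equation (2), the sketch below draws one $\theta$ from a Dirichlet distribution and evaluates the density by hand; it assumes the gtools package (for rdirichlet and ddirichlet), and the values of $\alpha$ are arbitrary.

library(gtools)   # assumed source of rdirichlet() and ddirichlet()
set.seed(1)

alpha <- c(1, 2, 3)
theta <- as.numeric(rdirichlet(1, alpha))    # one draw from Dirichlet(alpha)
sum(theta)                                   # the components sum to 1

B <- prod(gamma(alpha)) / gamma(sum(alpha))  # normalizing constant B(alpha)
prod(theta^(alpha - 1)) / B                  # density from equation (2), by hand
ddirichlet(theta, alpha)                     # should agree with the line above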

For $N$ trials, the probability mass function of a multinomial distribution with parameter $\theta = (\theta_1, \cdots, \theta_k)$, where $\sum_{i=1}^{k} \theta_{i} = 1$, is:

$$
f(x, \theta, N) = \frac{N!}{x_1! \cdots x_k!} \prod_{i=1}^{k} \theta_{i}^{x_{i}} \tag{3}
$$

with $x = (x_1, \cdots, x_k)$, $x_i \geq 0$ and $\sum_{i=1}^{k} x_{i} = N$.
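
Equation (3) can be checked in base R with dmultinom; the counts and probabilities below are arbitrary and only serve as a sanity check.

theta <- c(0.2, 0.3, 0.5)   # event probabilities, summing to 1
x <- c(2, 1, 3)             # observed counts, summing to N = 6
N <- sum(x)

factorial(N) / prod(factorial(x)) * prod(theta^x)   # equation (3) by hand
dmultinom(x, prob = theta)                          # base R gives the same value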

The estimation of LDA

Before we estimate the parameters of LDA, let’s walk through the generative process again, this time with the probability distributions attached to each step. I find the generative process easier to understand that way.

Our goal is to choose a combination of topics and words that can best explain the corpus. First, we set up a key assumption: the number of topics $k$ is known and fixed before we start the estimation. Each topic is a mixture of words, which will be sampled to construct a document.

For $k$ topics, we assign weights $\theta_{ij}$ to each topic, where $i$ is the index of the document and $j$ is the index of the topic.

$$
\begin{aligned}
d_i & = \text{document}_i \ \ \text{with weights of topics} \\
& = [\theta_{i1}, \theta_{i2}, \cdots, \theta_{ik}] \\
& = [0.2, 0.07, \cdots, 0.03]
\end{aligned}
\tag{4}
$$

Equation (4) gives the weights of the topics in document $i$. It looks as if, once we have the weights of topics in a document, we can generate the document by sampling words from the topics. Figure 4 shows the process.

Figure 4. The illustration of LDA in terms of generating a document from Blei et al. (2003). We assume there are 4 topics in this illustration.
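
To make equation (4) and Figure 4 concrete, here is a tiny sketch with four topics. The weights are hypothetical (chosen so that they sum to 1); each word of a document first receives a topic drawn according to these weights.

set.seed(1)
theta_i <- c(0.2, 0.07, 0.7, 0.03)                   # hypothetical topic weights for document i
z <- sample(4, 20, replace = TRUE, prob = theta_i)   # draw a topic for each of 20 words
table(z)                                             # counts are roughly proportional to theta_i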

As posed in Figure 4, what are the key assumptions we set up in the above process? To generate a document and a corpus, we need to know:

  1. the weights of topics in a document
  2. the number of words in each topic

When I read many blogs and papers about LDA, I found that most of them did not explain the above two assumptions, especially the second one. Most blogs and papers only explain the first assumption, which is easy to understand.

For the weights of topics in a document, we can use the Dirichlet distribution to sample the weights. This means our prior knowledge about the weights of topics in a document is expressed as a Dirichlet distribution.

$$
\theta \sim \text{Dir}(\alpha) \tag{5}
$$

But how about the number of words in each topic? We can use the multinomial distribution to sample the number of words in each topic. This means our prior knowledge about the number of words in each topic is expressed as a multinomial distribution. That’s why we need to set up the hyperparameter $\beta$ at the beginning.

Now, we need to come up with a mechanism to determine the number of words in each topic. Suppose we have $V$ words in our corpus; we will assign a weight $\beta_{kv}$ to each word, where $k$ is the index of the topic and $v$ is the index of the word. This could be done by sampling from a Dirichlet distribution. I don’t know why the authors of the original paper did not mention this.

Figure 5. The illustration of LDA in terms of generating a document from Blei et al. (2003). We assume there are 4 topics in this illustration.

Suppose $V = 5$ and $k = 4$; we can simulate the $\beta$ matrix as follows.

# rdirichlet() is not part of base R; here it is assumed to come from the gtools package
library(gtools)

beta_hyperparameter <- c(1, 2, 3, 2.7, 9)
rdirichlet(4, beta_hyperparameter)   # 4 topics, each a distribution over 5 words
# 0.039350115	0.20518288	0.1656238	0.19104531	0.3987978
# 0.001968719	0.11850665	0.1971715	0.13461901	0.5477341
# 0.109902876	0.06116357	0.2768535	0.07761342	0.4744666
# 0.039633147	0.25311539	0.1254669	0.17837372	0.4034108

Each row gives the weights of the words in one topic.

Now, let’s walk through the generative process again. We have the weights of topics in a document as $\theta_d$, which is a $1 \times k$ vector. Then we will construct a $k \times V$ matrix $\beta$. To construct this matrix, we could do it in one of two ways:

  1. for each word, assign weights to each topic, and then repeat the process for all words - $V$ times
  2. for each topic, assign weights to each word, and then repeat the process for all topics - $k$ times

The authors of the original paper chose the second way. When we compute the probability of a word in a document, we just need to multiply the weights of the topics in the document by the weights of the words in each topic, that is, $\theta_{d} \times \beta$.

Summing over the $k$ topics, the probability of a word in a document is:

$$
P(w_{d,n} | \theta_{d}, \beta) = \sum_{j=1}^{k} \theta_{d,j} \times \beta_{j,w_{d,n}} \tag{6}
$$

Equation (6) is just the vector multiplication of $\theta_{d}$ and $\beta_{w_{d,n}}$, where $\theta_{d}$ is a $1 \times k$ vector and $\beta_{w_{d,n}}$ (the column of $\beta$ corresponding to word $w_{d,n}$) is a $k \times 1$ vector.

Now, with $N$ words in a document, we can write the probability of the document as:

$$
P(d_{i} | \theta_{i}, \beta) = \prod_{n=1}^{N} P(w_{d,n} | \theta_{d}, \beta) \tag{7}
$$
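
Here is a small sketch of equations (6) and (7): multiplying $\theta_d$ by $\beta$ gives the probability of every word in the vocabulary at once, and the probability of a document is the product over its words. It assumes the gtools package for rdirichlet, and every number below is made up for illustration.

library(gtools)   # assumed source of rdirichlet()
set.seed(1)

K <- 4; V <- 5
theta_d <- c(0.2, 0.07, 0.7, 0.03)          # 1 x K topic weights for one document
beta <- rdirichlet(K, rep(1, V))            # K x V topic-word weights

word_probs <- as.numeric(theta_d %*% beta)  # equation (6), for every word id at once
doc <- c(1, 3, 3, 5, 2)                     # a toy document as a sequence of word ids
prod(word_probs[doc])                       # equation (7): probability of the document
sum(log(word_probs[doc]))                   # log version, numerically safer for long documents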

We can use the above equation to compute the probability of a document. But how can we estimate the parameters $\theta_{d}$ and $\beta$? We can use maximum likelihood estimation to estimate the parameters:

$$
\begin{aligned}
\hat{\theta}_{d} & = \text{argmax}_{\theta_{d}} P(d_{i} | \theta_{i}, \beta) \\
\hat{\beta} & = \text{argmax}_{\beta} P(d_{i} | \theta_{i}, \beta)
\end{aligned}
\tag{8}
$$

Now, we can estimate the posterior distribution of $\theta_{d}$ and $\beta$ by using the Gibbs sampling method.
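
The original paper by Blei et al. (2003) actually estimates the model with variational inference; Gibbs sampling is an alternative that many later implementations use. Below is a minimal sketch of a collapsed Gibbs sampler in base R, with a made-up toy corpus, an assumed symmetric Dirichlet prior eta on the topic-word weights, and arbitrary hyperparameters; it only illustrates the mechanics, not a production implementation.

set.seed(42)

docs <- list(c(1, 2, 1, 3), c(3, 4, 4, 5), c(1, 2, 5, 5, 2))   # toy corpus of word ids
V <- 5          # vocabulary size
K <- 2          # number of topics (assumed known and fixed)
alpha <- 1      # symmetric Dirichlet prior on document-topic weights
eta <- 0.1      # symmetric Dirichlet prior on topic-word weights
M <- length(docs)

n_dk <- matrix(0, M, K)   # topic counts per document
n_kv <- matrix(0, K, V)   # word counts per topic
n_k  <- rep(0, K)         # total number of words assigned to each topic

# random initialisation of the topic assignments z
z <- lapply(docs, function(d) sample(K, length(d), replace = TRUE))
for (d in seq_len(M)) {
  for (j in seq_along(docs[[d]])) {
    k <- z[[d]][j]; v <- docs[[d]][j]
    n_dk[d, k] <- n_dk[d, k] + 1
    n_kv[k, v] <- n_kv[k, v] + 1
    n_k[k] <- n_k[k] + 1
  }
}

# Gibbs sweeps: resample each topic assignment from its conditional distribution
for (iter in 1:200) {
  for (d in seq_len(M)) {
    for (j in seq_along(docs[[d]])) {
      k <- z[[d]][j]; v <- docs[[d]][j]
      # remove the current assignment from the counts
      n_dk[d, k] <- n_dk[d, k] - 1; n_kv[k, v] <- n_kv[k, v] - 1; n_k[k] <- n_k[k] - 1
      # conditional probability of each topic given all other assignments
      p <- (n_dk[d, ] + alpha) * (n_kv[, v] + eta) / (n_k + V * eta)
      k <- sample(K, 1, prob = p)
      # put the new assignment back into the counts
      z[[d]][j] <- k
      n_dk[d, k] <- n_dk[d, k] + 1; n_kv[k, v] <- n_kv[k, v] + 1; n_k[k] <- n_k[k] + 1
    }
  }
}

# posterior point estimates of theta (document-topic) and beta (topic-word)
theta_hat <- (n_dk + alpha) / rowSums(n_dk + alpha)
beta_hat  <- (n_kv + eta)  / rowSums(n_kv + eta)
round(theta_hat, 2)
round(beta_hat, 2)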

  1. Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.
  2. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.