Dive into Latent Dirichlet Allocation Model
Not many models balance complexity and effectiveness as well as LDA does. I like this model so much because it is perhaps the best model to start with when you want to learn about machine learning and deep learning. Why? I will explain in this post.
What do we talk about when we talk about machine learning and deep learning? It is hard to say, as there is so much going on right now. They say one should follow a saint in turbulent times. Who should we follow, and where should we start? I think we should start with the basics and follow our intuition and logic.
So, let’s quote ‘the Godfather’ of deep learning, Geoffrey Hinton, who said: “The deep learning revolution has transformed the field of machine learning over the last decade. It was inspired by attempts to mimic the way the brain learns but it is grounded in basic principles of statistics, information theory, decision theory and optimization.”
To illustrate the above quote, I will use a simple model called Latent Dirichlet Allocation (LDA) to explain the basic principles of machine learning and deep learning. I think that after understanding LDA, one can easily pick up those basic principles.
This post is based on the paper by the author of LDA, David Blei, titled Probabilistic Topic Models (Blei, 2012). I recommend reading that paper after reading this post. Here is the roadmap of this post:
- The big picture
- The probability distributions
- The estimation of LDA
The big picture
Please look at the following figure. Notice that words are highlighted in different colors. This is because we want to show that words are grouped into different topics.

In the above article, titled Seeking Life’s Bare (Genetic) Necessities, four topics are highlighted: topic 1 is about genetics, topic 2 is about evolutionary biology, topic 3 is about data analysis, and topic 4 covers everything else.
| genetics | evolution | data analysis | others |
|---|---|---|---|
| gene 0.04 | life 0.02 | data 0.02 | brain 0.04 |
| dna 0.02 | evolve 0.01 | neuron 0.02 | number 0.02 |
This is how LDA works. It groups words into topics, and it assumes that each document is a mixture of topics. This is the big picture of LDA, and it aligns well with our common sense and intuition. When we read an article or a newspaper, we always try to identify the thesis (or theme) and the topics of the article. If the theme delivers the key message, then the topics are the supporting arguments or supporting evidence.
The probability distributions
After understanding the big picture of LDA, let’s dive into the details. The first thing we need to understand is the probability distributions. There are three probability distributions in LDA: the document-topic distribution, the topic-word distribution and the word distribution.
If you look at Figure 1, you will see that we have:
- several documents, which together are usually called a corpus in machine learning and deep learning
- one document
- several topics, which are usually called latent variables in machine learning and deep learning
Our goal is to extract the topics from the corpus and assign each document to one or more topics. But how can we find the topics? And another question worth thinking about: how many documents do we need in order to find a set of topics? Do you think an algorithm could find the topics by reading a single document? Could a human?
For a human being, it is easy to find the topics by reading one document, provided she or he is educated and familiar with the subject. For an algorithm, it is not so easy, because the algorithm does not have any prior knowledge about the topics. So we need to give the algorithm many documents from which to find the topics, which means learning from the data.
To let the algorithm learn from the data, we have to design a model and an algorithm for it to learn with. To do this, we need to think about how a document is generated.

In Figure 2, assume we have $M$ documents making up the corpus. Each document is a mixture of topics, and each topic is a mixture of words. Notice that a document does not have to use every word in the vocabulary.
Now, we define a document as a finite sequence of $N$ words:

$$\mathbf{w} = (w_1, w_2, \ldots, w_N)$$

and denote a corpus as a collection $D$ of $M$ documents:

$$D = \{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M\}$$

We assume there are $K$ topics in the corpus, and for a document $\mathbf{w}$ in the corpus with length $N$ we have the following generating process:

- for the length of the document, we assume $N \sim \mathrm{Poisson}(\xi)$
- for a vector $\theta$, which follows a $K$-dimensional Dirichlet distribution with parameter $\alpha$: $\theta \sim \mathrm{Dir}(\alpha)$
- for each word in the document with index $n = 1, \ldots, N$:
    - sample a topic $z_n$ from a $K$-dimensional multinomial distribution with parameter $\theta$: $z_n \sim \mathrm{Multinomial}(\theta)$
    - sample a word $w_n$ from a $V$-dimensional multinomial distribution with parameter $\beta_{z_n}$: $w_n \sim \mathrm{Multinomial}(\beta_{z_n})$

where $\alpha$ and $\beta$ are hyperparameters, with $\alpha \in \mathbb{R}_{>0}^{K}$ and $\beta \in \mathbb{R}^{K \times V}$ the topic–word matrix whose row $\beta_k$ is the word distribution of topic $k$.
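To make the generating process concrete, here is a minimal R simulation sketch. The vocabulary, the hyperparameter values, and the seed are made up purely for illustration; `rdirichlet()` comes from the gtools package.

```r
# A toy simulation of the LDA generating process (illustrative values only)
library(gtools)  # provides rdirichlet()

set.seed(42)
vocabulary <- c("gene", "dna", "life", "evolve", "data")      # V = 5 words
K     <- 2                                                    # number of topics (assumed known)
alpha <- rep(1, K)                                            # Dirichlet parameter for topic weights
beta  <- rdirichlet(K, rep(1, length(vocabulary)))            # K x V topic-word matrix

N     <- rpois(1, lambda = 20)                                # document length ~ Poisson
theta <- as.vector(rdirichlet(1, alpha))                      # topic weights of this document

words <- character(N)
for (n in seq_len(N)) {
  z_n      <- sample(K, 1, prob = theta)                      # sample a topic for word n
  words[n] <- sample(vocabulary, 1, prob = beta[z_n, ])       # sample a word from that topic
}
print(words)
```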
If we put everything into matrices, we have a document–topic matrix $\theta \in \mathbb{R}^{M \times K}$, whose row $d$ holds the topic weights of document $d$, and a topic–word matrix $\beta \in \mathbb{R}^{K \times V}$, whose row $k$ holds the word weights of topic $k$:

$$\theta = \begin{pmatrix} \theta_{1,1} & \cdots & \theta_{1,K} \\ \vdots & \ddots & \vdots \\ \theta_{M,1} & \cdots & \theta_{M,K} \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_{1,1} & \cdots & \beta_{1,V} \\ \vdots & \ddots & \vdots \\ \beta_{K,1} & \cdots & \beta_{K,V} \end{pmatrix}$$
The original paper by Blei et al. did not use the above matrix representation, but it did give the following graphical representation, which is very helpful for understanding the generating process of LDA.

Now it’s time to understand the probability distributions. For a vector $\theta = (\theta_1, \ldots, \theta_K)$, which follows a $K$-dimensional Dirichlet distribution with parameter $\alpha = (\alpha_1, \ldots, \alpha_K)$, we have:

$$p(\theta \mid \alpha) = \frac{1}{B(\alpha)} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}$$

where $\theta_i \ge 0$, $\sum_{i=1}^{K} \theta_i = 1$, and $B(\alpha) = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)}$ is the normalizing constant (a multivariate generalization of the beta function).

For $n$ trials, the probability mass function of a multinomial distribution with parameter $p = (p_1, \ldots, p_K)$ and counts $x = (x_1, \ldots, x_K)$ is:

$$p(x_1, \ldots, x_K \mid n, p) = \frac{n!}{x_1! \cdots x_K!} \prod_{i=1}^{K} p_i^{x_i}$$

with $x_i \in \{0, 1, \ldots, n\}$, $\sum_{i=1}^{K} x_i = n$, and $\sum_{i=1}^{K} p_i = 1$.
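As a quick numerical check of these two densities, the sketch below evaluates them in R. `ddirichlet()` is from the gtools package, `dmultinom()` is in base R, and the particular numbers are arbitrary.

```r
library(gtools)  # ddirichlet()

# Dirichlet density of theta = (0.2, 0.3, 0.5) with parameter alpha = (1, 2, 3)
ddirichlet(c(0.2, 0.3, 0.5), alpha = c(1, 2, 3))

# Multinomial pmf: probability of counts x = (2, 3, 5) in n = 10 trials
# with event probabilities p = (0.2, 0.3, 0.5)
dmultinom(c(2, 3, 5), size = 10, prob = c(0.2, 0.3, 0.5))
```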
The estimation of LDA
Before we estimate the parameters of LDA, let’s walk through the generative process again with our probability distributions. I found it easier to understand the generative process with probability distributions.
Our goal is to choose a combination of topics and words that best explains the corpus. First, we set up a key assumption: the number of topics $K$ is known and fixed before we start the estimation. Each topic is a mixture of words, which will be sampled to construct a document.
For the $K$ topics, we assign weights $\theta_{d,k}$ to each topic, where $d$ is the index of the document and $k$ is the index of the topic:

$$\theta_d = (\theta_{d,1}, \theta_{d,2}, \ldots, \theta_{d,K}), \qquad \sum_{k=1}^{K} \theta_{d,k} = 1 \tag{4}$$

Equation (4) gives the weights of the topics in document $d$. It looks as if, once we have the topic weights of a document, we can generate the document by sampling words from the topics. Figure 4 shows the process.

As posed in Figure 4, what are the key assumptions we set up in the above process? To generate a document and a corpus, we need to know:
- the weights of topics in a document
- the number of words in each topic
When I read blogs and papers about LDA, I found that most of them explain only the first assumption, which is easy to understand, and do not explain the second one.
For the weights of topics in a document, we can use the Dirichlet distribution to sample the weights. This means we have our prior knowledge about the weights of topics in a document as a Dirichlet distribution.
But how about the number of words in each topic? We can use the multinomial distribution to sample the number of words drawn from each topic. This means our prior knowledge about how the words of a document are allocated to topics is expressed as a multinomial distribution. That’s why we need to set up the hyperparameter $\beta$ in the beginning.
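To make these two assumptions concrete, here is a small R sketch: draw the topic weights of one document from a Dirichlet, then draw how many of its words fall into each topic from a multinomial. The values of $K$, $\alpha$, and the document length are made up for illustration.

```r
library(gtools)  # rdirichlet()

set.seed(2)
K     <- 4
alpha <- rep(0.5, K)    # Dirichlet prior on the topic weights (illustrative value)
N     <- 100            # document length

theta_d <- as.vector(rdirichlet(1, alpha))   # topic weights of one document; sums to 1
theta_d

rmultinom(1, size = N, prob = theta_d)       # number of words drawn from each topic
```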
Now, we need to come up with a mechanism to determine the weights of the words in each topic. Suppose we have $V$ distinct words in our vocabulary; we assign weights $\beta_{k,v}$ to each word, where $k$ is the index of the topic and $v$ is the index of the word. This can be done by sampling each row $\beta_k$ from a $V$-dimensional Dirichlet distribution. I am not sure why the authors of the original paper did not emphasize this.

Suppose $K = 4$, $V = 5$, and the Dirichlet hyperparameter is $(1, 2, 3, 2.7, 9)$; we can simulate the $\beta$ matrix as follows.

```r
library(gtools)  # rdirichlet() (MCMCpack also provides one)

beta_hyperparameter <- c(1, 2, 3, 2.7, 9)
rdirichlet(4, beta_hyperparameter)
# 0.039350115 0.20518288 0.1656238 0.19104531 0.3987978
# 0.001968719 0.11850665 0.1971715 0.13461901 0.5477341
# 0.109902876 0.06116357 0.2768535 0.07761342 0.4744666
# 0.039633147 0.25311539 0.1254669 0.17837372 0.4034108
```
Each row gives the weights of the words in a topic.
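A quick way to check this, and to see which word dominates each topic, is to store the draw and inspect it; the vocabulary labels below are made up for illustration.

```r
set.seed(123)
beta <- rdirichlet(4, beta_hyperparameter)                     # reuse the hyperparameter from above
colnames(beta) <- c("gene", "dna", "life", "evolve", "data")   # made-up vocabulary labels

rowSums(beta)                                        # every row sums to 1: a distribution over words
apply(beta, 1, function(p) names(p)[which.max(p)])   # the most probable word of each topic
```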
Now, let’s walk through the generative process again. We have the weights of the topics in document $d$ as $\theta_d$, which is a $1 \times K$ vector. Then we construct a $K \times V$ matrix $\beta$. To construct this matrix, we could proceed in two ways:
- for each word, assign weights to each of the $K$ topics, and repeat the process for all words ($V$ times)
- for each topic, assign weights to each of the $V$ words, and repeat the process for all topics ($K$ times)
The authors of the original paper chose the second way. When we compute the probability of a word in a document, we just multiply the weight of a topic in the document by the weight of the word in that topic, i.e. $\theta_{d,k}\,\beta_{k,v}$.
Summing over the $K$ topics, the probability of word $v$ in document $d$ is:

$$p(w_v \mid \theta_d, \beta) = \sum_{k=1}^{K} \theta_{d,k}\, \beta_{k,v} \tag{6}$$

Equation (6) is just the vector multiplication of $\theta_d$ and $\beta_{\cdot,v}$, where $\theta_d$ is a $1 \times K$ vector and $\beta_{\cdot,v}$ (the $v$-th column of $\beta$) is a $K \times 1$ vector.
Now, with $N$ words in a document, the probability of the document is:

$$p(\mathbf{w}_d \mid \theta_d, \beta) = \prod_{n=1}^{N} \sum_{k=1}^{K} \theta_{d,k}\, \beta_{k, w_n}$$
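Here is a minimal R sketch of Equation (6) and of the document probability above; the values of $\theta_d$, $\beta$, and the word indices are toy values chosen for illustration.

```r
library(gtools)  # rdirichlet()

set.seed(7)
K <- 4; V <- 5
theta_d <- as.vector(rdirichlet(1, rep(1, K)))   # 1 x K topic weights of document d
beta    <- rdirichlet(K, c(1, 2, 3, 2.7, 9))     # K x V topic-word matrix

# Probability of each vocabulary word in document d (Equation 6):
word_probs <- as.vector(theta_d %*% beta)        # length-V vector
sum(word_probs)                                  # sums to 1

# Probability of one document given theta_d and beta:
doc_word_ids <- c(1, 3, 3, 5, 2)                 # vocabulary indices of the words in the document
prod(word_probs[doc_word_ids])
```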
With these equations we can compute the probability of a document. But how can we estimate the parameters $\theta$ and $\beta$? In principle we could use maximum likelihood estimation, but the likelihood of LDA cannot be maximized exactly.
Instead, we can estimate the posterior distribution of $\theta$ and $\beta$ by using the Gibbs sampling method.
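In practice one rarely codes the sampler from scratch. As a rough sketch, the R package topicmodels wraps a Gibbs sampler behind a single call; the tiny document–term matrix below is invented purely for illustration, and the control settings are common defaults rather than tuned values.

```r
library(topicmodels)
library(slam)  # as.simple_triplet_matrix()

# A tiny made-up document-term matrix: 4 documents x 5 vocabulary words
counts <- matrix(c(5, 2, 0, 0, 1,
                   4, 3, 1, 0, 0,
                   0, 0, 6, 3, 2,
                   0, 1, 4, 4, 3),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(NULL, c("gene", "dna", "data", "neuron", "brain")))
dtm <- as.simple_triplet_matrix(counts)

fit <- LDA(dtm, k = 2, method = "Gibbs",
           control = list(seed = 42, burnin = 500, iter = 2000))

posterior(fit)$topics  # estimated document-topic weights (theta)
terms(fit, 3)          # the top 3 words of each estimated topic
```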
- Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.