Dirichlet Distribution and Its Applications

From latent Dirichlet allocation to Bayesian inference, and beyond, the Dirichlet distribution is a powerful tool in the data scientist's toolbox.

It has been twenty years since Blei et al. (2003) published the paper Latent Dirichlet Allocation. The paper introduced a new model for text mining that has since become one of the most popular. The model is built on the Dirichlet distribution, which is a distribution over probability distributions. The Dirichlet distribution is a powerful tool in the data scientist's toolbox, used in a variety of applications including Bayesian inference, text mining, and high-dimensional data analysis.

The original idea of using the Dirichlet distribution for inference of population structure was introduced in the context of population genetics by Pritchard et al. (2000).

Both papers are great and worth reading, but they are not easy to understand. To be fair, the paper by Pritchard et al. (2000) is easier to follow than the one by Blei et al. (2003).

In this post, I will try to explain the Dirichlet distribution in a simple way, and I will also discuss some of its applications.

From Bernoulli to binomial to Poisson

Before we can discuss the Dirichlet distribution, we need to discuss the concept of a probability distribution. A probability distribution is a function that assigns a probability to each possible outcome of a random experiment.

For example, the Bernoulli distribution is a probability distribution that assigns a probability to the outcome of a coin flip. The Bernoulli distribution has a single parameter, which is the probability of getting a head. The Bernoulli distribution is defined as follows (probability mass function):

$$ \mathcal{f}(x \mid p) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\} \tag{1} $$

where $x$ is the outcome of the coin flip, and $p$ is the probability of getting a head. The Bernoulli distribution is a discrete probability distribution, which means that it assigns a probability to each possible outcome of the coin flip. In this case, the possible outcomes are 0 and 1, and the probability of getting a head is $p$. The probability of getting a tail is $1-p$.
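As a quick illustration, here is a minimal sketch (assuming SciPy is available; the value of $p$ is made up) that evaluates the pmf in equation (1) and draws a few samples.

```python
from scipy import stats

p = 0.7  # assumed probability of getting a head (illustrative)

# Evaluate the pmf of equation (1) at both possible outcomes.
bern = stats.bernoulli(p)
print(bern.pmf([0, 1]))                      # [0.3 0.7]

# Draw a few i.i.d. coin flips (1 = head, 0 = tail).
print(bern.rvs(size=10, random_state=42))
```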

Now, if we run the coin flip experiment multiple times, we are conducting a binomial experiment. A binomial experiment is a random experiment that consists of $n$ coin flips, and it has two parameters: $n$ and $p$. The binomial distribution (probability mass function) is defined as follows:

$$ \mathcal{f}(x \mid n, p) = \binom{n}{x} p^x (1-p)^{n-x}, \tag{2} $$

where $x$ is the number of heads in $n$ coin flips, and $p$ is the probability of getting a head. The binomial distribution is a discrete probability distribution, which means that it assigns a probability to each possible outcome of the binomial experiment. In this case, the possible outcomes are the numbers of heads in $n$ coin flips, and the probability of getting $x$ heads is given by the binomial distribution.

Figure 1 gives the plot of the binomial distribution for different values of $n$ and $p$.

Figure 1. The plot of the binomial distribution for different values of $n$ and $p$. Notice that we are using different markers and connected lines to emphasize that it is a discrete probability distribution.
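For readers who want to reproduce a plot like Figure 1, here is a minimal sketch assuming matplotlib and SciPy; the specific $(n, p)$ pairs are illustrative, not the ones used in the figure.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

fig, ax = plt.subplots()
# Illustrative parameter pairs; markers plus connecting lines emphasize discreteness.
for n, p, marker in [(20, 0.5, "o"), (20, 0.7, "s"), (40, 0.5, "^")]:
    x = np.arange(0, n + 1)
    ax.plot(x, stats.binom.pmf(x, n, p), marker=marker, label=f"n={n}, p={p}")
ax.set_xlabel("number of heads x")
ax.set_ylabel("P(X = x)")
ax.legend()
plt.show()
```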

The connection between the Bernoulli distribution and the binomial distribution is that the binomial distribution is the distribution of the number of heads in $n$ coin flips, where each coin flip is a Bernoulli trial and each trial is run independently of the others.

At this stage, you need to understand the concept of independent and identically distributed (i.i.d.) random variables. These are random variables that are independent of each other and share the same distribution.

In this case, we say that the coin flips are i.i.d. Bernoulli trials: each coin flip is independent of the other coin flips, and each coin flip follows the same Bernoulli distribution.

The binomial distribution is also connected to the Poisson distribution. The Poisson distribution is a discrete probability distribution that assigns a probability to the number of events that occur in a fixed interval of time. It has a single parameter, the average number of events that occur in that interval. The Poisson distribution is defined as follows (probability mass function):

$$ \mathcal{f}(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x \in \mathbb{N} \tag{3} $$

where $x$ is the number of events that occur in a fixed interval of time, and $\lambda$ is the average number of events that occur in that interval. The Poisson distribution is a discrete probability distribution, which means that it assigns a probability to each possible outcome of the Poisson experiment. In this case, the possible outcomes are the numbers of events that occur in a fixed interval of time, and the probability of getting $x$ events is given by the Poisson distribution.

Figure 2 gives the plot of the Poisson distribution for different values of $\lambda$.

Figure 2. The plot of the Poisson distribution for different values of $\lambda$. Notice that we are using different markers and connected lines to emphasize that it is a discrete probability distribution.

Now, we will derive the Poisson distribution from the binomial distribution. To do this, we need to calculate the expectation of the binomial distribution (see the derivation). The expectation of the binomial distribution is given by:

$$ \mathbb{E}(X) = np \tag{4} $$

If you check Figure 1 again, you should notice that the binomial distribution shifts to the right as $n$ or $p$ increases. This is because its expectation is the expected number of heads in $n$ coin flips, which grows as $n$ or $p$ increases.

Now, we let $\lambda = np$. Then, we can write the probability mass function of the binomial distribution as follows:

$$ \begin{aligned} \mathcal{f}(x \mid n, p) & = \binom{n}{x} p^x (1-p)^{n-x} \\ & = \frac{n!}{x!(n-x)!} p^x (1-p)^{n-x} \\ & = \frac{n!}{x!(n-x)!} \left(\frac{\lambda}{n}\right)^x \left(1-\frac{\lambda}{n}\right)^{n-x} \\ & = \frac{n!}{x!(n-x)!} \frac{\lambda^x}{n^x} \left(1-\frac{\lambda}{n}\right)^{n} \left(1-\frac{\lambda}{n}\right)^{-x} \\ & = \frac{\lambda^x}{x!} \frac{n!}{(n-x)!\, n^x} \left(1-\frac{\lambda}{n}\right)^{n} \left(1-\frac{\lambda}{n}\right)^{-x} \end{aligned} \tag{5} $$

For the last three factors in equation (5), we have the following limits:

$$ \begin{aligned} \lim_{n \to \infty} \frac{n!}{(n-x)!\, n^x} & = \lim_{n \to \infty} \frac{n (n-1) \cdots (n-x+1)}{n^x} \\ & = \lim_{n \to \infty} \frac{O(n^x)}{O(n^x)} \quad \text{($x$ factors in the numerator)} \\ & = 1 \\ \lim_{n \to \infty} \left(1-\frac{\lambda}{n}\right)^{n} & = e^{-\lambda} \\ \lim_{n \to \infty} \left(1-\frac{\lambda}{n}\right)^{-x} & = 1 \end{aligned} \tag{6} $$

Therefore, we have

$$ \lim_{n \to \infty} \mathcal{f}(x \mid n, p) = \frac{\lambda^x}{x!} e^{-\lambda} \tag{7} $$

The intuition behind this derivation is that we fix the expectation of the binomial distribution at $\lambda$ and then let $n \to \infty$, asking: within a fixed interval of time, what is the probability of observing $x$ events? This is exactly the definition of the Poisson distribution.
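To see the limit in equations (5)-(7) numerically, here is a small sketch (assuming SciPy) that fixes $\lambda = np$ and compares the binomial pmf with the Poisson pmf as $n$ grows; the value of $\lambda$ is arbitrary.

```python
import numpy as np
from scipy import stats

lam = 4.0
x = np.arange(0, 15)
poisson_pmf = stats.poisson.pmf(x, lam)

for n in [10, 100, 1000, 10000]:
    binom_pmf = stats.binom.pmf(x, n, lam / n)   # keep p = lambda / n
    max_gap = np.max(np.abs(binom_pmf - poisson_pmf))
    print(f"n={n:6d}  max |binomial - Poisson| = {max_gap:.6f}")
# The gap shrinks toward 0 as n grows, matching the derivation above.
```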

Multinomial distribution

The multinomial distribution is a generalization of the binomial distribution. Instead of having two possible outcomes, the multinomial distribution has $k$ possible outcomes with probabilities $p_1, p_2, \cdots, p_k$ for each outcome. The number of successes for each outcome is $x_1, x_2, \cdots, x_k$, respectively.

Let’s have an example: suppose we run $n$ trials, and each trial can have one of three possible outcomes.

Now, let the vector $x = (x_1, x_2, \cdots, x_k)$ be the number of successes for each outcome, and $p = (p_1, p_2, \cdots, p_k)$. Then, the probability of getting $x$ is given by the multinomial distribution:

$$ \begin{aligned} \mathcal{f}(x \mid n, p) & = \binom{n}{x_1} \binom{n-x_1}{x_2} \cdots \binom{n-x_1-x_2-\cdots-x_{k-1}}{x_k} p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k} \\ & = \frac{n!}{x_1!(n-x_1)!} \frac{(n-x_1)!}{x_2!(n-x_1-x_2)!} \cdots \frac{(n-x_1-x_2-\cdots-x_{k-1})!}{x_k!(n-x_1-x_2-\cdots-x_{k-1}-x_k)!} p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k} \\ & = \frac{n!}{x_1! x_2! \cdots x_k!} p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k} \\ & = \frac{n!}{\prod_{i=1}^k x_i!} \prod_{i=1}^k p_i^{x_i} \end{aligned} \tag{8} $$

where $n = x_1 + x_2 + \cdots + x_k$, and $p_1 + p_2 + \cdots + p_k = 1$.
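Here is a small numeric sketch of equation (8) using SciPy's multinomial distribution; the counts and probabilities are made up for illustration.

```python
from math import factorial
from scipy import stats

n = 6
p = [0.2, 0.3, 0.5]          # probabilities for k = 3 outcomes, summing to 1
x = [1, 2, 3]                # counts for each outcome, summing to n

# Direct evaluation of equation (8).
coef = factorial(n) / (factorial(x[0]) * factorial(x[1]) * factorial(x[2]))
manual = coef * p[0]**x[0] * p[1]**x[1] * p[2]**x[2]

print(manual)
print(stats.multinomial.pmf(x, n, p))   # should match the manual value
```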

As you can see in equation (8), we are using quite a lot of factorials. When it comes to factorials, we can leverage the Gamma function to simplify the expression. The Gamma function is defined as

$$ \Gamma(x) = \int_0^\infty t^{x-1} e^{-t} dt \tag{9} $$

where the integral converges for $x > 0$. The Gamma function has a nice property, given by the following equation:

$$ \Gamma(x+1) = x \Gamma(x) \tag{10} $$

The proof of this equation is quite simple; it follows from integration by parts:

$$ \begin{aligned} \Gamma(x+1) & = \int_0^\infty t^x e^{-t} dt \\ & = \left[ t^x (-e^{-t}) \right]_0^\infty - \int_0^\infty x t^{x-1} (-e^{-t}) dt \\ & = (0 - 0) + x \int_0^\infty t^{x-1} e^{-t} dt \\ & = x \Gamma(x) \end{aligned} \tag{11} $$

For every positive integer nn, we have

$$ \begin{aligned} \Gamma(n+1) & = n \Gamma(n) \\ & = n (n-1) \cdots 2 \cdot 1 \\ & = n! \end{aligned} \tag{12} $$

Equation (12) follows by applying the recursion in equation (10) repeatedly.

The magic of the Gamma function is that it is defined not only for positive integers, but also for real numbers. For example, we have

$$ \Gamma\left(\frac{1}{2}\right) = \sqrt{\pi} \tag{13} $$
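Equations (10), (12), and (13) are easy to check numerically with SciPy's implementation of the Gamma function:

```python
import math
from scipy.special import gamma

# Gamma(n + 1) = n! for positive integers n (equation (12)).
for n in range(1, 6):
    print(n, gamma(n + 1), math.factorial(n))

# Gamma(1/2) = sqrt(pi) (equation (13)).
print(gamma(0.5), math.sqrt(math.pi))
```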

To prove equation (13), we use the density function of the Gaussian distribution, which is given by

$$ \mathcal{f}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}} \tag{14} $$

where $\mu$ is the mean, and $\sigma^2$ is the variance. When $\mu = 0, \sigma^2 = 1/2$, we have

$$ \mathcal{f}\left(x \mid 0, \frac{1}{2}\right) = \frac{1}{\sqrt{\pi}} e^{-x^2} \tag{15} $$

Since the Gaussian density is symmetric around $x = 0$ and integrates to 1, we have

$$ \begin{aligned} \int_{0}^\infty \mathcal{f}\left(x \mid 0, \frac{1}{2}\right) dx & = \int_{0}^\infty \frac{1}{\sqrt{\pi}} e^{-x^2} dx \\ & = \frac{1}{\sqrt{\pi}} \int_{0}^\infty e^{-x^2} dx \\ & = \frac{1}{2} \quad \text{(half of the area under the curve)} \end{aligned} \tag{16} $$

Therefore, we have

$$ \int_{0}^\infty e^{-x^2} dx = \frac{\sqrt{\pi}}{2} \tag{17} $$

With equation (17), we can prove equation (13) as follows:

$$ \begin{aligned} \Gamma\left(\frac{1}{2}\right) & = \int_0^\infty t^{-\frac{1}{2}} e^{-t} dt \\ & = 2 \int_0^\infty e^{-u^2} du \\ & = \sqrt{\pi} \end{aligned} \tag{18} $$

We are using the substitution $u^2 = t$ (that is, $u = \sqrt{t}$) in equation (18), which is valid because $t \geq 0$. Here is the substitution in detail:

$$ \begin{aligned} u^2 = t & \Rightarrow u = t^{\frac{1}{2}} \\ \frac{du}{dt} = \frac{1}{2} t^{-\frac{1}{2}} & \Rightarrow 2\, du = t^{-\frac{1}{2}} dt \end{aligned} $$

Jacobian matrix

In the last section, we said that we will deal with the multi-category multinomial distribution. We stated that we have $k$ categories, and each category has a probability $p_i$, where $i \in \{1, 2, \cdots, k\}$. For the different categories, we have a vector $x = (x_1, x_2, \cdots, x_k)$, where $x_i$ is the number of samples in category $i$.

Starting from the data-generating process, we can model the different stages of this process.

For each stage, no matter how we model the data, we can use a regression to link each element of the vector $x$ to a series of independent variables, such as

$$ \begin{aligned} x_1 & = \Phi_1(y_1, y_2, \cdots, y_m) \\ x_2 & = \Phi_2(y_1, y_2, \cdots, y_m) \\ \vdots & \quad\quad\quad\quad \vdots \\ x_k & = \Phi_k(y_1, y_2, \cdots, y_m) \end{aligned} \tag{19} $$

where $\Phi_i$ is a regression function, and $y_1, y_2, \cdots, y_m$ are independent variables. If we want to use machine learning or deep learning to learn the regression functions $\Phi_i$, we need to know the Jacobian matrix of this transformation.

The Jacobian matrix of this transformation is given by

$$ \frac{\partial (x_1, x_2, \cdots, x_k) }{\partial (y_1, y_2, \cdots, y_m)} = \begin{bmatrix} \frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} & \cdots & \frac{\partial x_1}{\partial y_m} \\ \frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2} & \cdots & \frac{\partial x_2}{\partial y_m} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial x_k}{\partial y_1} & \frac{\partial x_k}{\partial y_2} & \cdots & \frac{\partial x_k}{\partial y_m} \end{bmatrix} \tag{20}
$$

where each entry of equation (20) is the partial derivative of one component of $x$ with respect to one component of $y$.

Now, to make our life easier, we will set $m = k$, which means we have $k$ independent variables and $k$ regression functions. In this case, we can use the following equation to calculate the Jacobian matrix:

$$ \frac{\partial (x_1, x_2, \cdots, x_k) }{\partial (y_1, y_2, \cdots, y_k)} = \begin{bmatrix} \frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} & \cdots & \frac{\partial x_1}{\partial y_k} \\ \frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2} & \cdots & \frac{\partial x_2}{\partial y_k} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial x_k}{\partial y_1} & \frac{\partial x_k}{\partial y_2} & \cdots & \frac{\partial x_k}{\partial y_k} \end{bmatrix} \tag{21}
$$

The good thing about setting $m = k$ is that the transformation maps between spaces with the same number of coordinates, so the structure of the space is preserved. The intuitive meaning of the integral (whether it is an area or a volume) is then preserved, only stretched or compressed by the determinant of the Jacobian matrix.

We denote by $J$ the determinant of the Jacobian matrix, which is given by

$$ J = \left| \frac{\partial (x_1, x_2, \cdots, x_k) }{\partial (y_1, y_2, \cdots, y_k)} \right| \tag{22} $$

Now, when we map a region $D$ in $k$-dimensional space to another region $D'$ in $k$-dimensional space, we can use the following change-of-variables formula to relate integrals over the two regions:

$$ \int \dotsi \int_{D} f(x_1, x_2, \cdots, x_k)\, dx_1 dx_2 \cdots dx_k = \int \dotsi \int_{D'} f(y_1, y_2, \cdots, y_k)\, J\, dy_1 dy_2 \cdots dy_k \tag{23} $$
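As a concrete sanity check of equation (23), here is a small sketch (assuming SciPy's numerical integration) that evaluates the same Gaussian-type integral in Cartesian and polar coordinates; the Jacobian determinant of the polar map is $r$, and the integration bounds are chosen wide enough that the tails are negligible.

```python
import numpy as np
from scipy import integrate

# Integral of f(x, y) = exp(-(x^2 + y^2)), essentially over the whole plane.
f = lambda y, x: np.exp(-(x**2 + y**2))
cartesian, _ = integrate.dblquad(f, -10, 10, -10, 10)

# Same integral after the polar substitution x = r cos(t), y = r sin(t);
# the Jacobian determinant of that map is J = r, which appears in the integrand.
g = lambda t, r: np.exp(-r**2) * r
polar, _ = integrate.dblquad(g, 0, 10, 0, 2 * np.pi)

print(cartesian, polar, np.pi)   # all three values agree closely
```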

For those who want to refresh their memory about the Jacobian matrix, please refer to the following document: Jacobian.

From Chi-square distribution to Gamma distribution to Beta distribution

A random variable $X$ has a chi-square distribution with $k$ degrees of freedom if it can be written as

$$ X = Y_1^2 + Y_2^2 + \cdots + Y_k^2 \tag{24} $$

where $Y_1, Y_2, \cdots, Y_k$ are independent and identically distributed random variables with a standard normal distribution.

The Gamma distribution is a generalization of the chi-square distribution. If a random variable $Z$ has a chi-square distribution with $k$ degrees of freedom, and $\theta$ is a positive constant, then the random variable $X$ defined by

$$ X = \frac{\theta}{2} Z \tag{25} $$

has a Gamma distribution with shape parameter $k/2$ and scale parameter $\theta$. We often use $\alpha$ to denote the shape parameter and $\beta$ to denote the scale parameter. Then, the probability density function of the Gamma distribution is given by

$$ f(x) = \begin{cases} \frac{1}{\Gamma(\alpha) \beta^\alpha} x^{\alpha - 1} e^{-\frac{x}{\beta}} & x > 0 \\ 0 & x \leq 0 \end{cases} \tag{26} $$

where $\alpha > 0$, $\beta > 0$, and $\Gamma(\alpha)$ is the Gamma function, which is given by

$$ \Gamma(\alpha) = \int_0^\infty t^{\alpha - 1} e^{-t} dt \tag{27} $$

The Beta distribution is closely related to the Gamma distribution. If a random variable $X \sim \mathrm{Gamma}(\alpha, 1)$ and, independently, $Y \sim \mathrm{Gamma}(\beta, 1)$, then the random variable $Z = \frac{X}{X + Y}$ has a Beta distribution with parameters $\alpha$ and $\beta$:

$$ \frac{X}{X + Y} \sim \mathrm{Beta}(\alpha, \beta) \tag{28} $$
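A quick Monte Carlo sketch of equation (28) with NumPy and SciPy; the shape parameters are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, beta = 2.0, 5.0

x = rng.gamma(shape=alpha, scale=1.0, size=100_000)
y = rng.gamma(shape=beta, scale=1.0, size=100_000)
z = x / (x + y)                                      # should follow Beta(alpha, beta)

print(z.mean(), alpha / (alpha + beta))              # sample vs. theoretical mean
print(stats.kstest(z, "beta", args=(alpha, beta)))   # KS test against Beta(alpha, beta)
```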

The probability density function of the Beta distribution is given by

$$ f(x) = \begin{cases} \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1} & 0 < x < 1 \\ 0 & \text{otherwise} \end{cases} \tag{29} $$

where $B(\alpha, \beta)$ is the Beta function, which is given by

$$ B(\alpha, \beta) = \int_0^1 t^{\alpha - 1} (1 - t)^{\beta - 1} dt \tag{30} $$

We could also express equation (30) as

$$ B(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)} \tag{31} $$

To derive equation (31), we will first show

$$ \begin{aligned} & \int_0^1 f(x) dx = 1 = \int_0^1 \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1} dx \\ & \Rightarrow B(\alpha, \beta) = \int_0^1 x^{\alpha - 1} (1 - x)^{\beta - 1} dx \quad \text{(same as equation (30))} \end{aligned} \tag{32} $$

Now, let’s calculate the value of $\Gamma(\alpha) \Gamma(\beta)$:

$$ \begin{aligned} \Gamma(\alpha) \Gamma(\beta) & = \int_0^\infty u^{\alpha - 1} e^{-u} du \int_0^\infty v^{\beta - 1} e^{-v} dv \\ & = \int_0^\infty \int_0^\infty u^{\alpha - 1} v^{\beta - 1} e^{-(u + v)} du\, dv \end{aligned} $$

Now, we set $x = \frac{u}{u + v}$ and $y = u + v$, with the bounds $0 \leq x \leq 1$ and $0 \leq y < \infty$. Then, we have the mapping from $(u, v)$ to $(x, y)$:

$$ u = xy, \quad v = (1-x)y \tag{33} $$

The Jacobian matrix is given by

$$ \frac{\partial (u, v)}{\partial (x, y)} = \begin{bmatrix} \frac{\partial u}{\partial x} & \frac{\partial u}{\partial y} \\ \frac{\partial v}{\partial x} & \frac{\partial v}{\partial y} \end{bmatrix} = \begin{bmatrix} y & x \\ -y & 1 - x \end{bmatrix} \tag{34} $$

The Jacobian determinant is then given by

$$ J = \left| \frac{\partial (u, v)}{\partial (x, y)} \right| = y(1-x) - x(-y) = y \tag{35} $$

By transforming the integral, we have

$$ \begin{aligned} \Gamma(\alpha) \Gamma(\beta) & = \int_{y=0}^\infty \int_{x=0}^1 (xy)^{\alpha - 1} [(1 - x)y]^{\beta - 1} e^{-y} J\, dx\, dy \\ & = \int_{y=0}^\infty y^{\alpha + \beta - 1} e^{-y} dy \int_{x=0}^1 x^{\alpha - 1} (1 - x)^{\beta - 1} dx \\ & = \Gamma(\alpha + \beta) \int_{x=0}^1 x^{\alpha - 1} (1 - x)^{\beta - 1} dx \\ & = \Gamma(\alpha + \beta) B(\alpha, \beta) \end{aligned} \tag{36} $$

This concludes the proof of equation (31).
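Equation (31) can also be checked directly with SciPy's special functions; the parameter values are arbitrary.

```python
from scipy.special import beta, gamma

a, b = 2.5, 4.0   # arbitrary positive parameters
print(beta(a, b))
print(gamma(a) * gamma(b) / gamma(a + b))   # should agree with beta(a, b)
```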

Before we talk about the Dirichlet distribution, there is one more thing you need to know: the conjugate prior. For instance, the Beta distribution is the conjugate prior of the binomial distribution. This means that if the likelihood is binomial with parameters $n$ and $p$, and we place a $\mathrm{Beta}(\alpha, \beta)$ prior on $p$, then the posterior of $p$ is again a Beta distribution:

$$ \pi(p \mid x, n, \alpha, \beta) \sim \mathrm{Beta}(x + \alpha, n - x + \beta) \tag{37} $$

I have a post about conjugate priors, so I won’t go into details here.
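Still, here is a minimal sketch of the Beta-binomial update in equation (37); the prior parameters and the data are made up for illustration.

```python
from scipy import stats

# Hypothetical prior belief about the head probability p.
alpha, beta_ = 2.0, 2.0

# Hypothetical data: x heads out of n flips.
n, x = 20, 14

# Posterior from equation (37): Beta(x + alpha, n - x + beta).
posterior = stats.beta(x + alpha, n - x + beta_)
print(posterior.mean())             # posterior mean of p
print(posterior.interval(0.95))     # 95% credible interval for p
```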

Dirichlet distribution

The Dirichlet distribution is a generalization of the Beta distribution. It is a probability distribution describing the probability of a vector of probabilities. For instance, if we have a vector of probabilities $\mathbf{p} = (p_1, p_2, \cdots, p_k)$, then the Dirichlet distribution is given by

$$ \pi(\mathbf{p} \mid \alpha) \sim \mathrm{Dirichlet}(\alpha) \tag{38} $$

where $\alpha = (\alpha_1, \alpha_2, \cdots, \alpha_k)$ is a vector of positive real numbers. The distribution has support on vectors $\mathbf{p}$ whose components lie in the interval $[0, 1]$ and satisfy $\sum_{i=1}^k p_i = 1$. The probability density function of the Dirichlet distribution is given by

$$ f(\mathbf{p}) = \frac{1}{B(\alpha)} \prod_{i=1}^k p_i^{\alpha_i - 1} \tag{39} $$

where $B(\alpha)$ is the multivariate Beta function, which is given by

$$ B(\alpha) = \frac{\prod_{i=1}^k \Gamma(\alpha_i)}{\Gamma(\sum_{i=1}^k \alpha_i)} \tag{40} $$

If we set $k = 2$, then we have the Beta distribution.
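Here is a short sketch of equations (39)-(40) with NumPy and SciPy; the concentration parameters are illustrative.

```python
import numpy as np
from scipy import stats

alpha = np.array([2.0, 3.0, 5.0])   # illustrative concentration parameters
rng = np.random.default_rng(0)

# Each Dirichlet sample is a probability vector: non-negative entries summing to 1.
samples = rng.dirichlet(alpha, size=5)
print(samples)
print(samples.sum(axis=1))          # all ones

# Density of equation (39) at one point on the simplex.
p = np.array([0.2, 0.3, 0.5])
print(stats.dirichlet.pdf(p, alpha))
```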

The Dirichlet distribution is the conjugate prior of the multinomial distribution. This means that if the likelihood is multinomial with parameters $n$ and $\mathbf{p}$, and we place a $\mathrm{Dirichlet}(\alpha)$ prior on $\mathbf{p}$, then the posterior of $\mathbf{p}$ is again a Dirichlet distribution:

$$ \pi(\mathbf{p} \mid x, n, \alpha) \sim \mathrm{Dirichlet}(x + \alpha) \tag{41} $$

where $x$ is a vector of counts.
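And a minimal sketch of the update in equation (41), with made-up counts:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # symmetric prior over 3 categories
counts = np.array([10, 3, 7])       # hypothetical observed counts x

# Posterior from equation (41): Dirichlet(x + alpha).
posterior_alpha = counts + alpha
posterior_mean = posterior_alpha / posterior_alpha.sum()
print(posterior_alpha)    # [11.  4.  8.]
print(posterior_mean)     # posterior mean estimate of p
```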

Comments

For an undergraduate student, the probability in a model is often given, such as the head probability in the Bernoulli model of a coin flip. However, in real life, this probability is often unknown. In this case, we can use a prior distribution to represent our belief about it.

Therefore, overall we have two kinds of distributions: a distribution for the observed data given the parameters, and a distribution for the parameters themselves.

The first kind of distribution is called the likelihood distribution, and the second kind of distribution is called the prior distribution. The posterior distribution is the distribution that describes our belief about the probability of a model after we have observed the data. The posterior distribution is given by

$$ \pi(\theta \mid x) = \frac{\pi(x \mid \theta) \pi(\theta)}{\pi(x)} \tag{42} $$

where $\pi(x \mid \theta)$ is the likelihood distribution, $\pi(\theta)$ is the prior distribution, and $\pi(x)$ is the marginal likelihood. The marginal likelihood is given by

$$ \pi(x) = \int \pi(x \mid \theta) \pi(\theta) d\theta \tag{43} $$

The posterior distribution is proportional to the product of the likelihood distribution and the prior distribution. Therefore, if we use a conjugate prior, the posterior distribution belongs to the same family as the prior. This is very convenient because we can use the posterior distribution as the prior distribution for the next round of data.
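To make equations (42)-(43) concrete, here is a hedged sketch of a grid approximation for a coin-flip model; with a Beta prior it reproduces the conjugate result in equation (37). The prior parameters and data are made up.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 14 heads in 20 flips, with a Beta(2, 2) prior on theta.
n, x = 20, 14
theta = np.linspace(0.001, 0.999, 1000)
dtheta = theta[1] - theta[0]

likelihood = stats.binom.pmf(x, n, theta)     # pi(x | theta)
prior = stats.beta.pdf(theta, 2, 2)           # pi(theta)
unnormalized = likelihood * prior

# Approximate the marginal likelihood pi(x) of equation (43) on the grid.
marginal = unnormalized.sum() * dtheta
posterior = unnormalized / marginal           # equation (42)

# Compare with the exact conjugate posterior Beta(x + 2, n - x + 2).
exact = stats.beta.pdf(theta, x + 2, n - x + 2)
print(np.max(np.abs(posterior - exact)))      # should be small
```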

Latent Dirichlet allocation

Latent Dirichlet allocation (LDA) is a generative model for text data. It is a probabilistic model that assumes that each document is a mixture of topics, and each topic is a mixture of words.

Now let’s set up our notation.

The easiest way to represent a corpus is to use a matrix. Each row of the matrix represents a document, and each column of the matrix represents a word. The value of the matrix is the number of times the word appears in the document. For instance, the following matrix represents a corpus with two documents, where the first document has 2 words (hello hello) and the second document has 4 words (world, you, you, you). However, we are using the frequency of the words, so we are not using the actual words.

| document | hello | world | you |
|----------|-------|-------|-----|
| d1       | 2     | 0     | 0   |
| d2       | 0     | 1     | 3   |
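The tiny corpus above can be turned into such a document-term matrix with a few lines of plain Python (a sketch, not tied to any particular library):

```python
from collections import Counter
import numpy as np

docs = ["hello hello", "world you you you"]
vocab = ["hello", "world", "you"]

# Count how many times each vocabulary word appears in each document.
counts = np.array([[Counter(doc.split())[w] for w in vocab] for doc in docs])
print(counts)
# [[2 0 0]
#  [0 1 3]]
```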

Unlike tf-idf, LDA does not care about the frequency of the words; it cares about their co-occurrence. It is also a generative model, which means that we can generate new documents from it. To put it simply, LDA is a model about topics.

LDA illustration
Figure 3. The illustration of LDA (you can zoom the image if you want to see the details).

Figure 3 shows the illustration of LDA; it largely speaks for itself. If you ponder it for a while, you will understand how LDA works and how it differs from tf-idf, which is a bag-of-words model that cares about the frequency of the words rather than their co-occurrence.

As it is illustrated in Figure 3, we need a prior distribution to describe the probability of the topics. The Dirichlet distribution is a good choice because it is a conjugate prior of the multinomial distribution. Therefore, we could use the Dirichlet distribution to describe the probability of the topics, which is given by

$$ \begin{aligned} f(p \mid \alpha) & = \frac{1}{B(\alpha)} \prod_{i=1}^k p_i^{\alpha_i - 1} \\ & = \frac{\Gamma(\sum_{i=1}^k \alpha_i)}{\prod_{i=1}^k \Gamma(\alpha_i)} \prod_{i=1}^k p_i^{\alpha_i - 1} \end{aligned} \tag{44} $$

Remark: the number of topics is $k$, which is assumed to be known and fixed.

Now, look at Figure 3 again, where $x = (x_1, x_2, \cdots, x_k)$ is the number of words from each topic. However, we do not know which words, or how many words, come from each topic. That is why the model is called latent Dirichlet allocation: the topic assignments are latent variables. According to the Oxford dictionary, the word ‘latent’ means ‘hidden’.

How can we turn our multinomial distribution into a representation over words? Instead of using the frequency of the words, we can use the probability of the words. If we align all the words in the dictionary into a row and all the topics into a column, then we can assign a probability value to each word-topic pair. For instance, the following table shows the probability of the words from each topic.

| topic | hello | world | you |
|-------|-------|-------|-----|
| t1    | 0.2   | 0.1   | 0.7 |
| t2    | 0.3   | 0.6   | 0.1 |

Then the row-wise distribution of those probabilities (the random variable is now a probability instead of a frequency) defines a topic, whereas the column-wise values give the probability of each word in the dictionary under the different topics (each cell is the probability of that word given the topic).

Now, let’s put everything together:

$$ \beta = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1V} \\ p_{21} & p_{22} & \cdots & p_{2V} \\ \vdots & \vdots & \ddots & \vdots \\ p_{k1} & p_{k2} & \cdots & p_{kV} \end{pmatrix} $$

LDA illustration2
Figure 4. The illustration of LDA for parameters (you can zoom the image if you want to see the details).

Remark: I love matrices! They are so convenient for representing the model. Everything we have discussed can be summarized as a matrix product:

$$ \underbrace{ \begin{bmatrix} & & & & \\ & & & & \\ & & M \times K & & \\ & & & & \\ & & & & \end{bmatrix}}_{\text{Topic assignment}} \times \underbrace{\begin{bmatrix} & & & & \\ & & & & \\ & & K \times V & & \\ & & & & \\ & & & & \end{bmatrix}}_{\beta} \approx \underbrace{\begin{bmatrix} & & & & \\ & & & & \\ & & M \times V & & \\ & & & & \\ & & & & \end{bmatrix}}_{\text{Corpus}} $$

Based on Figure 4, given the parameters $\alpha$ and $\beta$, the joint distribution of a topic mixture $\theta$ (a vector of length $k$), a topic assignment $z$ (a vector of length $N$), and a document $w$ (a vector of length $N$) is given by

$$ \pi(\theta, z, w \mid \alpha, \beta) = \pi(\theta \mid \alpha) \prod_{i=1}^N \pi(z_i \mid \theta) \pi(w_i \mid z_i, \beta) \tag{45} $$

The model is called LDA because the topic assignment $z$ is latent. We do not know what topic each word belongs to. We only know the topic mixture $\theta$ and the probability of the words given the topics, $\beta$.
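Here is a hedged sketch of the generative process behind equation (45) for a single document: draw a topic mixture $\theta$ from the Dirichlet prior, then for each word position draw a topic $z_i$ and a word $w_i$. The values of $\alpha$, $\beta$, and the vocabulary below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["hello", "world", "you"]            # V = 3 words (illustrative)
alpha = np.array([0.8, 0.8])                 # k = 2 topics (illustrative prior)
beta = np.array([[0.2, 0.1, 0.7],            # topic-word probabilities (k x V)
                 [0.3, 0.6, 0.1]])

N = 8                                        # number of words in the document
theta = rng.dirichlet(alpha)                 # topic mixture, pi(theta | alpha)
z = rng.choice(len(alpha), size=N, p=theta)  # topic assignments, pi(z_i | theta)
words = [vocab[rng.choice(len(vocab), p=beta[zi])] for zi in z]  # pi(w_i | z_i, beta)

print(theta)   # the document's topic mixture
print(z)       # latent topic of each word
print(words)   # the generated document
```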

Let’s look at the components of equation (45) in more detail, starting with the product term:

$$ \prod_{i=1}^N \pi(z_i \mid \theta) \pi(w_i \mid z_i, \beta) \tag{46} $$

If we reorder equation (45), we get the following equation:

$$ \pi(\theta, z, w \mid \alpha, \beta) = \prod_{i=1}^N \pi(w_i \mid z_i, \beta) \pi(z_i \mid \theta) \pi(\theta \mid \alpha) \tag{47} $$

This means that the topic assignment $z$ is independent of the topic mixture $\theta$ given the document $w$ and the parameters $\alpha$ and $\beta$. This is a very important property of the LDA model. It means that we could sample the topic assignment $z$ and the topic mixture $\theta$ separately.

This is why we can use Gibbs sampling to sample the topic assignment $z$ and the topic mixture $\theta$.

The chain of conditions in equation (47) is summarized in the following figure.

LDA illustration3
Figure 5. The graphical model of LDA, illustrating the conditional independence.

This is the end of this post. In the future, I will discuss how to sample the topic assignment $z$ and the topic mixture $\theta$ using Gibbs sampling.

  1. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993–1022.
  2. Pritchard, J. K., Stephens, M., & Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2), 945–959.