Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm is a Markov chain Monte Carlo (MCMC) algorithm that generates a sequence of random variables from a probability distribution from which direct sampling is difficult.

It is beneficial to have a good understanding of the Metropolis-Hastings algorithm, as it is the basis for many other MCMC algorithms. The Metropolis-Hastings algorithm is a Markov Chain Monte Carlo (MCMC) algorithm that generates a sequence of random variables from a probability distribution from which direct sampling is difficult.

Much of the content in this post is based on the book by Johnson, et al. (Johnson et al., 2022). Compared to the original chapter in the book, my post will be more intuitive and less formal.

The big idea

Considering the following Normal-Normal model with numerical outcome YY that varies Normally around an unknown mean μ\mu and known standard deviation σ=0.75\sigma = 0.75.

Here is the model:

YμN(μ,0.752)μN(0,12) \begin{aligned} Y | \mu &\sim \mathcal{N}(\mu, 0.75^2) \\ \mu &\sim \mathcal{N}(0, 1^2) \end{aligned}

Remark: The Byesian approach is a hierarchical model, which means:

That’s why we call μ\mu a hyperparameter.

Let’s review the Normal-Normal Bayesian model. Let μ\mu be an unknown mean parameter, and YiY_i be a random variable that varies Normally around μ\mu with a known standard deviation σ\sigma. The Normal-Normal model complements the Normal data structure with a Normal prior on μ\mu:

YiμN(μ,σ2)μN(θ,τ2) \begin{aligned} Y_i | \mu &\sim \mathcal{N}(\mu, \sigma^2) \\ \mu &\sim \mathcal{N}(\theta, \tau^2) \end{aligned}

Upon observing data yˉ=1ni=1nyi\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i, we can compute the posterior distribution of μ\mu:

μyˉN(θσ2σ2+nτ2+yˉnτ2σ2+nτ2,σ2τ2σ2+nτ2) \mu | \bar{y} \sim \mathcal{N}\left (\theta \frac{\sigma^2}{\sigma^2 + n \tau^2} + \bar{y} \frac{n \tau^2}{\sigma^2 + n \tau^2}, \frac{\sigma^2 \tau^2}{\sigma^2 + n \tau^2} \right )

With the above mode, we can compute the posterior mean and variance of μ\mu:

μ(Y=6.25)N(4,0.62) \mu | (Y = 6.25) \sim \mathcal{N}(4, 0.6^2)

If we cannot calcualte in closed form, we can approximate it using MCMC simuation.

How could we approximate the posterior distribution of μ\mu? The Metropolis-Hastings algorithm relies on the fact that even if we do not know the posterior model, we do know the likelihood function and we also know that the posterior distribution is proportional to the product of the likelihood and the prior:

f(μyˉ)f(μ)L(μy=yˉ)(1) f(\mu | \bar{y}) \propto f(\mu) L(\mu | y = \bar{y}) \tag{1}

where the likelihood function is:

L(μy=yˉ)=i=1nf(yiμ)=i=1n12πσ2exp[(yiμ)22σ2] L(\mu | y = \bar{y}) = \prod_{i=1}^n f(y_i | \mu) = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left [ - \frac{(y_i - \mu)^2}{2 \sigma^2} \right ]

Like we said before, Monte Carlo Markov Chain (MCMC) methods are based on the idea that we can approximate the posterior distribution by simulating from the posterior distribution. It is all about searching. Then, when it comes to searching, we need two things:

First, let’s proposal a uniform distribution for μ\mu with half-width ww:

μμUniform(μw,μ+w) \mu' | \mu \sim \text{Uniform}(\mu - w, \mu + w)

with pdf:

f(μμ)=12w f(\mu' | \mu) = \frac{1}{2w}

With this proposed distribution, we now need to design a mechanism to decide whether to accept or reject the proposed value μ\mu', which in the end gives us the direction of the search. The principle of Metropolis-Hastings algorithm is to accept the proposed value μ\mu' with probability that the chain should spend more time exploring areas of high posterior plausibility but shouldn’t get stuck there forever.

How can we come up a measurement that make sure the chain should spend more time exploring areas of high posterior plausibility? The answer is to use the posterior distribution as a weight, which is given by the equation (1).

Here are the steps of the Metropolis-Hastings algorithm:

  1. Start with an initial value μ0\mu_0.
  2. Propose a new value μ\mu' from the proposal distribution.
  3. Compute the acceptance probability
    • if the (unnormalized) posterior probability of μ\mu' is greater than the (unnormalized) posterior probability of μ0\mu_0, then accept μ\mu' with probability 1, meaning we need to check

    f(μ)L(μy=yˉ)>f(μ0)L(μ0y=yˉ) f(\mu')L(\mu' | y = \bar{y}) > f(\mu_0)L(\mu_0 | y = \bar{y})

    • otherwise, accept μ\mu' with probability (maybe go there with a small probability)

With the above steps, we will implement the Metropolis-Hastings algorithm in the coming sections.

The Metropolis-Hastings algorithm

Before, we present the Metropolis-Hastings algorithm, let’s review a formula that we will use in the algorithm:

P(AB)=P(A)P(BA)=P(B)P(AB)(2) \mathbb{P}(A \cap B) = \mathbb{P}(A) \mathbb{P}(B | A) = \mathbb{P}(B) \mathbb{P}(A | B) \tag{2}

With this simple formula, we can derive the acceptance probability and implement the Metropolis-Hastings algorithm.

Metropolis-Hastings algorithm. Conditioned on data yy, let parameter μ\mu has a prior distribution f(μ)f(\mu) and a likelihood function L(μy)L(\mu | y). Then, we know that the posterior distribution is proportional to the product of the likelihood and the prior:

f(μy)f(μ)L(μy) f(\mu | y) \propto f(\mu) L(\mu | y)

A Metropolis-Hastings Markov Chain for f(μy)f(\mu | y), {μ0,μ1,μ2,μN}\{ \mu_0, \mu_1, \mu_2, \dots \mu_N \}, evolves as follows. Let μi\mu_i be the current value of the chain, and let μi+1\mu_{i+1} be a proposed value, then we update the chain as follows:

  1. Generate a proposed value μ\mu' from the proposal distribution q(μμi)q(\mu' | \mu_i). (We are not updating the chain yet.)

  2. Decide whether to accept μ\mu' or not by:

    • calculating the acceptance probability α\alpha:

    α=min{1,f(μ)L(uy)f(μ)L(μy)q(μμ)q(μμ)}(3) \alpha = \min \left \{ 1, \frac{f(\mu')L(u'|y)}{f(\mu)L(\mu |y)}\frac{q(\mu | \mu')}{q(\mu' | \mu)} \right \} \tag{3}

    • figuratively speaking, flip a weighted coin with probability α\alpha to decide whether to accept μ\mu' or not:

    μi+1={μwith probability αμiwith probability 1α(4) \mu_{i+1} = \begin{cases} \mu' & \text{with probability } \alpha \\ \mu_i & \text{with probability } 1 - \alpha \end{cases} \tag{4}

For the distribution q(μμi)q(\mu' | \mu_i), we can use a uniform distribution with half-width ww. Since the proposal distribution is symmetric, we can have:

q(μμ)=q(μμ)={12wwhen μ and μ are with w units of each other 0otherwise q(\mu | \mu') = q(\mu' | \mu) = \begin{cases} \frac{1}{2w} & \text{when $\mu$ and $\mu'$ are with $w$ units of each other }\\ 0 & \text{otherwise} \end{cases}

This symmetry means that the chance of proposing a chain move from μ\mu to μ\mu' is the same as the chance of proposing a chain move from μ\mu' to μ\mu. Therefore, we can have:

α=min{1,f(μ)L(uy)f(μ)L(μy)q(μμ)q(μμ)}=min{1,f(μ)L(uy)f(μ)L(μy)} \alpha = \min \left \{ 1, \frac{f(\mu')L(u'|y)}{f(\mu)L(\mu |y)}\frac{q(\mu | \mu')}{q(\mu' | \mu)} \right \} = \min \left \{ 1, \frac{f(\mu')L(u'|y)}{f(\mu)L(\mu |y)} \right \}

Now, apply the formula (2) to the acceptance probability:

α=min{1,f(μ)L(uy)f(μ)L(μy)}=min{1,f(μ)L(uy)/f(y)f(μ)L(μy)/f(y)}=min{1,f(μy)f(μy)}(5) \begin{aligned} \alpha = \min \left \{ 1, \frac{f(\mu')L(u'|y)}{f(\mu)L(\mu |y)} \right \} & = \min \left \{ 1, \frac{f(\mu')L(u'|y)/ f(y)}{f(\mu)L(\mu |y)/f(y)} \right \} \\ & = \min \left \{ 1, \frac{f(\mu'|y)}{f(\mu|y)} \right \} \tag{5} \end{aligned}

Equation (5) states that although we cannot compute the posterior distribution f(μy),f(μy)f(\mu | y), f(\mu'|y), their ratio is equivalent to that of the unnormalized posterior pdfs (which we can calculate).

Based on equation (5), we can see there are two cases:

Scenario 1 is easy to understand. Scenario 2 is a little bit tricky. We will use an example to illustrate this scenario.

# scenario 2
y_data <- 6.25
current_u <- 3

set.seed(8)
# set width = 1 and draw from uniform distribution
proposal_u <- runif(1, min = current_u - 1, max = current_u + 1)
# proposal_u = 2.93259

# calculate likelihood
# likelihood is just the probability of the data given the parameter
likelihood_current <- dnorm(y_data, mean = current_u, sd = 0.75)
likelihood_proposal <- dnorm(y_data, mean = proposal_u, sd = 0.75)
# calculate bayesian product = likelihood * prior
bayes_prod_current <- likelihood_current * dnorm(current_u, mean = 0, sd = 1)
bayes_prod_proposal <- likelihood_proposal * dnorm(proposal_u, mean = 0, sd = 1)

# calculate alpha
alpha <- min(1, bayes_prod_proposal / bayes_prod_current)

# accept or reject
next_u <- sample(c(proposal_u, current_u), size = 1,
                        prob = c(alpha, 1 - alpha))


# make a table to show the results
mh_table <- data.frame(
    current_u = current_u,
    proposal_u = proposal_u,
    likelihood_current = likelihood_current,
    likelihood_proposal = likelihood_proposal,
    bayes_prod_current = bayes_prod_current,
    bayes_prod_proposal = bayes_prod_proposal,
    alpha = alpha,
    next_u = next_u
)

# transpose the table
mh_table %>%
    t() %>%
    kable("pipe")
Parameter Value
current_u 3.0000000
proposal_u 2.9325905
likelihood_current 0.0000445
likelihood_proposal 0.0000300
bayes_prod_current 0.0000002
bayes_prod_proposal 0.0000002
alpha 0.8240205
next_u 2.9325905

This is merely one of countless possible outcomes for a single iteration of the Metropolis-Hastings algorithm for our Normal posterior. The Metropolis-Hastings algorithm is a stochastic algorithm, so the outcome of each iteration is random. The outcome of the algorithm depends on the initial value of the chain, the proposal distribution, and the data. The outcome of the algorithm is not deterministic.

Now, we will wrap the code above into a function and run the algorithm with different seeds to see how the chain moves.

## write a function to simulate the MH algorithm
one_mh_iteration <- function(w, current, y_data) {
    # a function to simulate one iteration of the MH algorithm
    # w is the width of the proposal distribution
    # current is the current value of the parameter
    # we are using unif(w) to draw from the proposal distribution
    # prior is normal(0, 1)
    # likelihood is normal(y_data, sd = 0.75)

    # draw from the proposal distribution
    proposal <- runif(1, min = current - w, max = current + w)

    # update the parameter based on the bayesian product
    # calculate the likelihood
    likelihood_current <- dnorm(y_data, mean = current, sd = 0.75)
    likelihood_proposal <- dnorm(y_data, mean = proposal, sd = 0.75)

    # calculate the bayesian product
    bayes_prod_current <- likelihood_current * dnorm(current,
                                                            mean = 0,
                                                            sd = 1)
    bayes_prod_proposal <- likelihood_proposal * dnorm(proposal,
                                                            mean = 0,
                                                            sd = 1)

    # calculate alpha
    alpha <- min(1, bayes_prod_proposal / bayes_prod_current)

    # accept or reject
    next_u <- sample(c(proposal, current), size = 1,
                        prob = c(alpha, 1 - alpha))

    return(data.frame(proposal, alpha, next_u))

}

set.seed(8)
one_mh_iteration(1, 3, 6.25)

set.seed(83)
one_mh_iteration(1, 3, 6.25)

seed_vec <- c(8, 83, 7) # set seed
foo <- data.frame()
for (s in seed_vec) {
    set.seed(s)
    temp <- one_mh_iteration(1, 3, 6.25)
    foo <- rbind(foo, temp)
}

foo$seed <- seed_vec

# reorder the columns
foo[, c("seed", "proposal", "alpha", "next_u")] %>%
    kable("pipe", digits = 3, align = 'c')
seed proposal alpha next_u
8 2.933 0.824 2.933
83 2.018 0.017 3.000
7 3.978 1.000 3.978

Now, we are ready to run the Metropolis-Hastings algorithm. We will run the algorithm for N iterations and plot the chain.

mh_sim <- function(n, w) {

    # set up the initial values
    current_u <- 3

    # initialize mu vector
    mu <- rep(0, n)

    # simulate N iterations
    for (i in 1:n) {
        # simulate one iteration of the MH algorithm
        temp <- one_mh_iteration(w, current_u, 6.25)

        # update the current value
        current_u <- temp$next_u

        # store the current value
        mu[i] <- current_u
    }

    # return data.frame
    return(data.frame(iteration = c(1:n), mu = mu))
}

set.seed(84735)
mh_simulate1 <- mh_sim(5000, 1)

# plot the results
options(repr.plot.width = 10, repr.plot.height = 5)
par(mfrow = c(1, 2))
mh_simulate1 %>% 
    with(plot(iteration, mu, type = "l", lwd = 1,
                    col = gray(0.1, 0.7),
                    main = "Trace of mu")) %>%
    with(hist(mh_simulate1$mu, breaks = 50, prob = TRUE,
                    xlab = "mu", main = "Histogram of mu")) %>%
    with(curve(dnorm(x, mean = 4, sd = 0.6), add = TRUE,
                    col = "red", lwd = 2)) %>%
    with(legend("topleft", legend = "(4, 0.6)", cex = 0.8,
                    bg = "transparent", box.col = "transparent",
                    col = "red", lwd = 2))
Inequality bounds compare
Figure 1. The plot of MCMC chain and the histogram of the posterior distribution based on Metropolis-Hastings algorithm.

As it is shown in Figure 1, the chain converges to the posterior distribution. The posterior distribution is centered at 4 and has a standard deviation of 0.6. The posterior distribution is close to the true posterior distribution as we derived in the previous section.

If you look at the above code, you should realize that two things are important for the Metropolis-Hastings algorithm to work. First, the proposal distribution should be symmetric. Second, the proposal distribution should be wide enough to cover the posterior distribution. If the proposal distribution is too narrow, the chain will not be able to explore the posterior distribution. If the proposal distribution is too wide, the chain will not be able to converge to the posterior distribution. The proposal distribution should be wide enough to cover the posterior distribution but not too wide to cover the posterior distribution.

Second, the initial value of the chain is important. If the initial value is too far away from the posterior distribution, the chain will not be able to converge to the posterior distribution. If the initial value is too close to the posterior distribution, the chain will not be able to explore the posterior distribution.

How could we choose the initial value? We can use the prior distribution and the evidence (data) we have to guide us. For instance, in the above example, we have y=6.25y = 6.25, then choose the initial value to be 33, which is not very far away from our data.

A Beta-Binomial example

For extra practice, let’s try to use the Metropolis-Hastings algorithm to estimate a Beta-Binomial model. Let n=2n = 2 be number of trials and y=1y = 1 be the number of successes. The prior distribution is Beta(α=2,β=3)Beta(\alpha = 2, \beta = 3) and the model is

YθBinomial(n,θ)θBeta(α=2,β=3) \begin{aligned} Y | \theta &\sim Binomial(n, \theta) \\ \theta &\sim Beta(\alpha = 2, \beta = 3) \end{aligned}

We know the closed form of the posterior distribution is

θyBeta(α=2+y,β=3+ny)=Beta(α=3,β=4) \theta | y \sim Beta(\alpha = 2 + y, \beta = 3 + n - y) = Beta(\alpha = 3, \beta = 4)

Instead of using the closed form of the posterior distribution, we will use the Metropolis-Hastings algorithm to estimate the posterior distribution.

This time, we will implement a special version of the Metropolis-Hastings algorithm , which is referred to as the independent Metropolis-Hastings algorithm. In the equation (3), the pdf of the proposal distribution is a function of conditional probability. In the independent Metropolis-Hastings algorithm, the pdf of the proposal distribution is a function of the current value of the chain.

This means we should have

α=min{1,f(θ)L(θy)f(θ)L(θy)q(θ)q(θ)}(6) \alpha = \min \left \{ 1, \frac{f(\theta') L(\theta'|y)}{f(\theta)L(\theta | y)} \frac{q(\theta)}{q(\theta')} \right \} \tag{6}

Here is the code.

bb_one_iteration <- function(a, b, current) {

    # simulate one iteration of the MH algorithm

    # step 1 - draw from the proposal distribution
    proposal <- rbeta(1, shape1 = a, shape2 = b)

    # step 2 - calculate the likelihood
    likelihood_current <- dbinom(1, size = 2, prob = current)
    likelihood_proposal <- dbinom(1, size = 2, prob = proposal)

    # step 3 - calculate the bayesian product with the prior
    # prior = dbeta(, 2, 3)
    bayes_prod_current <- likelihood_current * dbeta(current, 2, 3)
    bayes_prod_proposal <- likelihood_proposal * dbeta(proposal, 2, 3)

    # calculate independent pdf of the proposal distribution
    pdf_current <- dbeta(current, a, b)
    pdf_proposal <- dbeta(proposal, a, b)

    # calculate alpha
    alpha <- min(1, bayes_prod_proposal / bayes_prod_current *
                    pdf_current / pdf_proposal)

    # accept or reject
    next_theta <- sample(c(proposal, current), size = 1,
                        prob = c(alpha, 1 - alpha))
    
    return(data.frame(proposal, alpha, next_theta))
}


bb_sim <- function(n, a, b) {

    # initialize the current value
    current_theta <- 0.5

    # initialize theta vector
    theta <- rep(0, n)

    # simulate N iterations
    for (i in 1:n) {
        # simulate one iteration of the MH algorithm
        temp <- bb_one_iteration(a, b, current_theta)

        # update the current value
        current_theta <- temp$next_theta

        # store the current value
        theta[i] <- current_theta
    }

    # return data.frame
    return(data.frame(iteration = c(1:n), theta = theta))

}



set.seed(84735)
bb_simulate1 <- bb_sim(5000, 1, 1)

# plot the results
options(repr.plot.width = 10, repr.plot.height = 5)
par(mfrow = c(1, 2))
bb_simulate1 %>% 
    with(plot(iteration, theta, type = "l", lwd = 1,
                    col = gray(0.1, 0.7),
                    main = "Trace of theta")) %>%
    with(hist(bb_simulate1$theta, breaks = 50, prob = TRUE,
                    xlab = "theta", main = "Histogram of theta")) %>%
    with(curve(dbeta(x, 3, 4), add = TRUE,
                    col = "red", lwd = 2)) %>%
    with(legend("topleft", legend = "Beta(3, 4)", cex = 0.8,
                    bg = "transparent", box.col = "transparent",
                    col = "red", lwd = 2))
Inequality bounds compare
Figure 1. The plot of MCMC chain and the histogram of the posterior distribution based on Metropolis-Hastings algorithm for the Beta-Binomial model.

We finish the post here. Hope you now know how to use the Metropolis-Hastings algorithm to estimate the posterior distribution.

  1. Johnson, A. A., Ott, M. Q., & Dogucu, M. (2022). Bayes rules!: an introduction to applied Bayesian modeling. CRC Press.