The Beta-Binomial Bayesian Model

With more data being generated every day, I believe Bayesian statistics is the way to go, which is why I am writing this series of posts on it.

In one of my previous posts, I introduced the Beta-Binomial Bayesian model and its applications in sports recruiting and text mining (Latent Dirichlet Allocation). I am very impressed by this model: it is simple, powerful, and easy to implement. In this post, I'll revisit the Beta-Binomial Bayesian model and show how two communities (Python and R) have implemented it.

I strongly believe that Bayesian statistics will play a much bigger role in the future, which is why I am investing time in learning it and writing this series. I hope these posts help you learn Bayesian statistics as well.

Philosophy of Bayesian Statistics

Almost everyone learns linear regression in a statistics, machine learning, or business analytics course. From estimation to inference, much of the theory and practice of linear regression relies on Maximum Likelihood Estimation (MLE) and the Law of Large Numbers (LLN). In other words, we assume that the data are generated by a fixed distribution, we estimate the parameters of that distribution by maximizing the likelihood function, and we assume that the sample size is large enough that the sample mean is close to the population mean. This is the frequentist philosophy.

However, in the real world, we often do not know the distribution of the data, and very often we only have a small sample. Even with a big dataset, the size of the relevant sub-samples shrinks as the data become heterogeneous (which happens often, given the complexity of the real world). In these cases, we cannot rely on the frequentist philosophy; we need the Bayesian philosophy.

If you want a deeper understanding of Bayesian statistics, I highly recommend reading my previous post - Conjugate Priors - Binomial Beta Pair. The Latent Dirichlet Allocation (LDA) model is also a good example that shows the power of the Bayesian philosophy.

Before we start, let's set up the notation we will use in this post, beginning with the Beta distribution.

The Beta distribution

Let $X$ be a random variable that takes values in the interval $[0,1]$. The Beta distribution is a continuous probability distribution with two parameters $\alpha$ and $\beta$. Its probability density function is given by:

$$
f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)} \quad \text{for } x \in [0,1]
$$

where $B(\alpha, \beta)$ is the beta function:

$$
B(\alpha, \beta) = \int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\, dx = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}
$$

It is easy to verify that $f(x) = 1$ when $\alpha = \beta = 1$. The Beta distribution is symmetric when $\alpha = \beta$, skewed to the left when $\alpha > \beta$, and skewed to the right when $\alpha < \beta$.
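For instance, when $\alpha = \beta = 1$ we have $B(1,1) = \Gamma(1)\Gamma(1)/\Gamma(2) = 1$, so the density reduces to the uniform distribution on $[0,1]$:

$$
f(x) = \frac{x^{0}(1-x)^{0}}{B(1,1)} = 1
$$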

Figure 1. The plot of the Beta distribution with different values of $\alpha$ and $\beta$.

Now, let’s list some properties of the Beta distribution.

$$
\begin{aligned}
\mathrm{E}[X] &= \int x \, f(x)\, dx = \frac{\alpha}{\alpha + \beta} \\
\mathrm{Var}[X] &= \int (x - \mathrm{E}[X])^2 \, f(x)\, dx = \frac{\alpha \beta}{(\alpha + \beta)^2(\alpha + \beta + 1)} \\
\mathrm{Mode}[X] &= \argmax_x f(x) = \frac{\alpha - 1}{\alpha + \beta - 2} \quad (\alpha, \beta > 1)
\end{aligned}
$$

It is important to know that $\alpha$ and $\beta$ can be any real numbers with $\alpha > 0$ and $\beta > 0$.

Figure 2. The plot of the beta distribution.

A simple example

Let's consider a simple example. Suppose you are a basketball coach recruiting a new player, and you ask this player to shoot 10 free throws. Assume the player's probability of making a free throw is $p$. We can model this process as a Binomial distribution with $n = 10$ and success probability $p$:

$$
X \sim \mathrm{Binomial}(n=10, p), \qquad P(X = x) = \binom{10}{x} p^x (1-p)^{10-x}
$$

Now, let's assume three different scenarios for the true value of $p$:

- Scenario 1: the player is a rookie, with $p = 0.5$.
- Scenario 2: the player is a solid veteran, with $p = 0.7$.
- Scenario 3: the player is an elite shooter, with $p = 0.9$.

We plot the distribution of $X$ under the three scenarios in Figure 3.

Figure 3. The plot of the binomial distribution with different values of $p$.

Now, imagine that we ask this player to shoot 10 free throws and observe that he makes 6 of them. Can you tell which scenario is more likely? From Figure 3, we can read off the likelihood of observing 6 makes under each scenario.

The likelihood of Scenario 3 is much smaller than the likelihood of Scenario 1 or Scenario 2. Therefore, we can conclude that the player is more likely to be a rookie or a veteran than an elite shooter.
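To make this concrete, here is a quick sketch (using the same `scipy.stats` calls as the plotting code later in this post) that evaluates the binomial likelihood of 6 makes out of 10 under each scenario; the printed values are approximate:

```python
from scipy import stats

# likelihood of observing exactly 6 makes in 10 attempts under each scenario
for p in [0.5, 0.7, 0.9]:
    likelihood = stats.binom.pmf(6, 10, p)
    print(f"p = {p}: P(X = 6) = {likelihood:.4f}")

# prints approximately:
# p = 0.5: P(X = 6) = 0.2051
# p = 0.7: P(X = 6) = 0.2001
# p = 0.9: P(X = 6) = 0.0112
```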

But how confident are we in this conclusion? Right now we have only a single observation of 10 shots, and we do not know the true value of $p$. All we can say is that the player looks more like a rookie or a veteran.

There are two ways to solve this problem. The first way is to use frequentist statistics. The second way is to use Bayesian statistics.

With frequentist statistics, we can ask the player to shoot 10 free throws again and again (say, 5 rounds), then estimate $p$ from the pooled results and draw a conclusion.
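For instance, if $x_i$ denotes the number of makes in round $i$, one natural frequentist (maximum likelihood) estimate simply pools all the attempts:

$$
\hat{p}_{\mathrm{MLE}} = \frac{\sum_{i=1}^{5} x_i}{5 \times 10}
$$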

With Bayesian statistics, we can use the Beta distribution to model the distribution of $p$. Then, we can use the posterior distribution to draw a conclusion.

When should one use frequentist statistics and when should one use Bayesian statistics? The answer is that it depends on the problem. In this example, we can use either frequentist statistics or Bayesian statistics.

But for many problems, we can realistically only use Bayesian statistics. For example, if we want to estimate the probability that a particular person has a disease, we cannot rely on frequentist statistics because we do not have a large sample for that person.

You get the idea: Bayesian statistics is more flexible than frequentist statistics, but it is also more involved, since it requires specifying a prior distribution and computing a posterior distribution.

Tuning the hyperparameters of the prior distribution

According to Wikipedia, the average free throw percentage in the NBA is about $0.75$, so the average value of $p$ is $0.75$. Players who shoot much better than $0.75$ are rare, which means the distribution of $p$ is skewed to the left. Therefore, we choose a Beta distribution with $\alpha > \beta$ and

$$
\mathrm{E}[p] = \frac{\alpha}{\alpha + \beta} = 0.75 \;\Rightarrow\; \alpha = 3\beta
$$

Figure 4. The plot of the Beta distribution with different values of $\alpha$ and $\beta$.

Based on Figure 4, we choose $\alpha = 15$ and $\beta = 5$. Then, we can calculate the posterior distribution of $p$.
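As a sanity check on this choice, here is a small sketch comparing the three candidate priors from Figure 4: they all have mean $0.75$, but the larger the hyperparameters, the more concentrated (more confident) the prior.

```python
from scipy import stats

# candidate priors along the line alpha = 3 * beta
for alpha, beta in [(3, 1), (15, 5), (30, 10)]:
    mean, var = stats.beta.stats(alpha, beta, moments="mv")
    print(f"alpha = {alpha:>2}, beta = {beta:>2}: "
          f"mean = {float(mean):.2f}, sd = {float(var) ** 0.5:.3f}")

# prints approximately:
# alpha =  3, beta =  1: mean = 0.75, sd = 0.194
# alpha = 15, beta =  5: mean = 0.75, sd = 0.094
# alpha = 30, beta = 10: mean = 0.75, sd = 0.068
```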

First, let’s review the formula of the Bayesian update. The posterior distribution is

$$
\begin{aligned}
f(\theta \mid \text{data}) &= \frac{f(\text{data} \mid \theta)\, f(\theta)}{f(\text{data})} \\
\text{posterior} &\varpropto \text{likelihood} \times \text{prior}
\end{aligned}
$$

where $f(\text{data})$ is the evidence, obtained by integrating (or summing) the likelihood over the prior:

$$
f(\text{data}) = \int f(\text{data} \mid \theta)\, f(\theta)\, d\theta \quad \text{or} \quad f(\text{data}) = \sum_{i=1}^n f(\text{data} \mid \theta_i)\, f(\theta_i)
$$

We can calculate the posterior distribution of $p$ as follows (assuming our player makes 6 out of 10 free throws):

$$
\begin{aligned}
\pi(p \mid x=6, n=10, \alpha=15, \beta=5) &= \mathrm{Beta}(p \mid \alpha + x, \beta + n - x) \\
&= \mathrm{Beta}(p \mid 15 + 6, 5 + 10 - 6) \\
&= \mathrm{Beta}(p \mid 21, 9)
\end{aligned}
$$

Calculating the posterior distribution

Now, we can compute and plot the posterior distribution of $p$.
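For example, here is a minimal sketch summarizing the $\mathrm{Beta}(21, 9)$ posterior with `scipy.stats` (the interval endpoints below are approximate):

```python
from scipy import stats

posterior = stats.beta(21, 9)   # posterior after observing 6 makes in 10 attempts
print(f"posterior mean: {posterior.mean():.3f}")            # 0.700
print(f"posterior mode: {(21 - 1) / (21 + 9 - 2):.3f}")     # 0.714
low, high = posterior.interval(0.95)
print(f"95% credible interval: ({low:.3f}, {high:.3f})")    # roughly (0.53, 0.85)
```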

Figure 5. The prior Beta distribution and the posterior distribution of $p$.

From Figure 5, we can see that the posterior distribution shifts to the left of the prior: after observing 6 makes out of 10, the player looks more like a rookie or a veteran. How should we interpret the posterior distribution?

The blue curve is the prior distribution, which gives the probability of $p$ based on our prior knowledge. The red curve is the posterior distribution, which gives the probability of $p$ based on our prior knowledge combined with the observation.

In this example, the prior distribution describes $p$ based on what we observe across all NBA players; the posterior distribution describes $p$ after we also account for our player's 10 free throws.

The posterior distribution can be used to evaluate this basketball player, since it gives not just a point estimate of his free throw percentage but also the uncertainty around it.

# %%
import os
import itertools
import numpy as np
import scipy as sp
import scipy.stats  # make sp.stats available (the stats submodule is not always auto-imported)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


def plot_beta_dist():
    """Plot the Beta pdf for a 3x3 grid of (alpha, beta) hyperparameters"""
    hyperparams = [1, 3, 10]
    alpha_beta = list(itertools.product(hyperparams, hyperparams))
    
    # random variable
    x = np.linspace(0, 1, 1000)
    
    fig, axes = plt.subplots(3, 3, figsize=(9, 8))
    axes = axes.flatten()
    for idx, (alpha, beta) in enumerate(alpha_beta):
        y = sp.stats.beta.pdf(x, alpha, beta)
        axes[idx].plot(x, y, "k-")
        axes[idx].set_ylim(0, 6)
        axes[idx].set_title(f"alpha={alpha}, beta={beta}")
    fig.subplots_adjust(hspace=0.5)
    # plt.savefig('./docs/math/images/beta_dist.png', dpi=300, bbox_inches='tight')
    

def free_throws():
    """Plot binomial distribution for free throws"""
    p_vals = [0.5, 0.7, 0.9]
    markers = ["ko-", "ko--", "ko:"]
    n = 10
    x = np.arange(0, n+1, 1)
    fig, ax = plt.subplots(figsize=(7, 3.5))
    for idx in range(len(p_vals)):
        p = p_vals[idx]
        y = sp.stats.binom.pmf(x, n, p)
        mk = markers[idx]
        ax.plot(x, y, mk, label=f"p={p}")
    # add a vertical line at x=6
    ax.axvline(x=6, color="b", linestyle="-.")
    ax.set_title("Binomial distribution for free throws")
    plt.legend()
    plt.savefig('./docs/math/images/free_throws.png', dpi=300,
                                    bbox_inches='tight')
        
def tuning_beta():
    """plot beta distribution for tuning parameters
    alpha = 3 beta 
    """
    hyperparams = [(3, 1), (15, 5), (30, 10)]
    fig, axes = plt.subplots(1, 3, figsize=(8.5, 3))
    axes = axes.flatten()
    for idx, (alpha, beta) in enumerate(hyperparams):
        x = np.linspace(0, 1, 1000)
        y = sp.stats.beta.pdf(x, alpha, beta)
        axes[idx].plot(x, y, "k-")
        axes[idx].set_ylim(0, 6)
        axes[idx].set_title(f"alpha={alpha}, beta={beta}")
    plt.savefig('./docs/math/images/tuning_beta.png', dpi=300,
                                    bbox_inches='tight')


def plot_posterior():
    x = np.linspace(0, 1, 1000)
    hyperparams = [(15, 5), (21, 9)]
    fig, ax = plt.subplots(figsize=(7, 3.5))
    for alpha, beta in hyperparams:
        y = sp.stats.beta.pdf(x, alpha, beta)
        ax.plot(x, y, label=f"alpha={alpha}, beta={beta}")
    ax.set_title("Posterior distribution for free throws")
    ax.set_xlabel("p")
    ax.set_ylabel("Density")
    plt.legend()
    fig.savefig('./docs/math/images/free_throws_posterior.png', dpi=300,
                                    bbox_inches='tight')


if __name__ == "__main__":
    print(os.getcwd())
    # plt.style.use('default')
    # plt.style.use('seaborn')
    # plot_beta_dist()
    # free_throws()
    # tuning_beta()
    plot_posterior()

Baseball example

Let's take another example with baseball data. To be honest, I do not know much about baseball, but I think this example is interesting. In baseball, the batting average is used to measure a player's performance: it is the number of hits (H) divided by the number of at bats (AB). For example, if a player has 100 at bats and 30 hits, then the batting average is $0.3$.

The data come from an R package called Lahman, which contains the batting records of all MLB players from 1871 to 2016.

| name | H | AB (at bats) | average |
|---|---|---|---|
| Hank Aaron | 3771 | 12364 | 0.305 |
| Tommie Aaron | 216 | 944 | 0.229 |
| Andy Abad | 2 | 21 | 0.095 |
| John Abadie | 11 | 49 | 0.224 |
| Ed Abbaticchio | 772 | 3044 | 0.254 |

Now, if we want to use the batting average to measure the performance of a player, we run into the problem of comparing raw proportions. A player with 100 at bats and 30 hits has a batting average of $0.3$, and so does a player with 1000 at bats and 300 hits. Which player is better? Worse, the raw proportion rewards tiny samples: the following table gives the top 5 players by batting average.

| name | H | AB | average |
|---|---|---|---|
| Jeff Banister | 1 | 1 | 1 |
| Doc Bass | 1 | 1 | 1 |
| Steve Biras | 2 | 2 | 1 |
| C. B. Burns | 1 | 1 | 1 |
| Jackie Gallagher | 1 | 1 | 1 |

Looking at this table, we clearly cannot say that Jeff Banister is the best player.

The intuition is that we need to take into account the number of at bats because one could get lucky and get a lot of hits with a small number of at bats.

Get a distribution as a prior

Figure 6. The plot of the batting average of all players in the MLB from 1871 to 2016 and its fitted beta distribution.

As shown in Figure 6, by fitting a Beta distribution to the batting averages we obtain a prior distribution. The fitted parameters are $\alpha = 79.43$ and $\beta = 228$, which corresponds to a prior mean of $\alpha / (\alpha + \beta) \approx 0.2584$ and a prior "sample size" of $\alpha + \beta \approx 307$ pseudo at bats.
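The R code at the end of this post performs this fit with `MASS::fitdistr`. As a rough, self-contained illustration of the same step in Python (with synthetic data standing in for the Lahman batting averages, since I am not loading the real data here), maximum-likelihood fitting recovers the shape parameters:

```python
import numpy as np
from scipy import stats

# synthetic stand-in for the career batting averages of players with AB >= 500
rng = np.random.default_rng(42)
averages = stats.beta.rvs(79.43, 228, size=5000, random_state=rng)

# fix loc = 0 and scale = 1 so that only the two shape parameters are estimated
alpha_hat, beta_hat, _, _ = stats.beta.fit(averages, floc=0, fscale=1)
print(f"alpha = {alpha_hat:.1f}, beta = {beta_hat:.1f}")         # close to 79.4 and 228
print(f"prior mean = {alpha_hat / (alpha_hat + beta_hat):.4f}")  # close to 0.2584
```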

Now we can use this prior to compute a posterior estimate whenever we get new data. For example, for a new player with 100 at bats and 30 hits, the posterior mean batting average is:

$$
\frac{30 + 79.43}{100 + 79.43 + 228} \approx 0.2686
$$

How about a player with 10 at bats and 4 hits? His posterior mean is:

$$
\frac{4 + 79.43}{10 + 79.43 + 228} \approx 0.2628
$$

Therefore, the estimate is updated by the new data in a sensible way: even though $4/10 > 30/100$, the second player's posterior estimate ($0.2628$) is lower than the first player's ($0.2686$), because his evidence is weaker.
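The same calculation as a tiny helper function (the name `eb_estimate` is mine, not from the R code below), reproducing the two numbers above:

```python
def eb_estimate(hits, at_bats, alpha=79.43, beta=228):
    """Posterior mean (empirical Bayes) estimate of a batting average."""
    return (hits + alpha) / (at_bats + alpha + beta)

print(round(eb_estimate(30, 100), 4))  # 0.2686
print(round(eb_estimate(4, 10), 4))    # 0.2628
```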

The following table gives the ranking of the top 5 players with the highest batting average after we update the posterior distribution.

| name | H | AB | average | posterior |
|---|---|---|---|---|
| Rogers Hornsby | 2930 | 8173 | 0.358 | 0.355 |
| Shoeless Joe Jackson | 1772 | 4981 | 0.356 | 0.350 |
| Ed Delahanty | 2597 | 7510 | 0.346 | 0.342 |
| Billy Hamilton | 2164 | 6283 | 0.344 | 0.340 |
| Harry Heilmann | 2660 | 7787 | 0.342 | 0.338 |

The following table gives the ranking of the worst 5 players with the lowest batting average after we update the posterior distribution.

| name | H | AB | average | posterior |
|---|---|---|---|---|
| Bill Bergen | 516 | 3028 | 0.170 | 0.178 |
| Ray Oyler | 221 | 1265 | 0.175 | 0.191 |
| John Vukovich | 90 | 559 | 0.161 | 0.195 |
| John Humphries | 52 | 364 | 0.143 | 0.195 |
| George Baker | 74 | 474 | 0.156 | 0.196 |

Notice that in each of these cases, empirical Bayes didn’t simply pick the players who had 1 or 2 at-bats. It found players who batted well, or poorly, across a long career. What a load off our minds: we can start using these empirical Bayes estimates in downstream analyses and algorithms, and not worry that we’re accidentally letting 0/1 or 1/1 cases ruin everything.

Figure 7. The batting average of all players versus the posterior estimated average.

Figure 7 shows that players with a high number of at bats (AB) essentially define the benchmark: for them the data dominate the prior, which is why the red dots lie close to the diagonal line. Players with low AB but extreme raw averages are the outliers: their estimates (green dots) are shrunk toward the blue dashed line, which pulls them close to the prior mean.

The idea behind Figure 7 is that we have moved all our estimates toward the average. How much an estimate moves depends on how much evidence we have: with very little evidence (4 hits out of 10) we move it a lot; with a lot of evidence (300 hits out of 1000) we move it only a little. This is why the process is called "shrinking towards the prior".
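This intuition can be made precise: the posterior mean is a weighted average of the raw batting average and the prior mean, where the weight on the data grows with $AB$:

$$
\frac{H + \alpha}{AB + \alpha + \beta}
= \frac{AB}{AB + \alpha + \beta} \cdot \frac{H}{AB}
+ \frac{\alpha + \beta}{AB + \alpha + \beta} \cdot \frac{\alpha}{\alpha + \beta}
$$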

Conclusion

In this post we have discussed the concept of empirical Bayes and how to use it to update the posterior distribution. We have also discussed the application of empirical Bayes in baseball.

One can see that the key step in empirical Bayes is obtaining a prior distribution, which is why some scholars criticize the empirical Bayes method as being subjective. However, I think it is a practical way to obtain posterior estimates: it is easy to implement and easy to understand.

With more and more data, the posterior distribution will become more and more accurate. One could also construct the prior by modeling different factors, such as a player's age or league. This leads to a Bayesian hierarchical model, which is a powerful tool for analyzing data.

library(pacman)
p_load(
    tidyverse, data.table, dtplyr, reshape2,
    kableExtra, janitor, readxl, png, Cairo, rbenchmark,
    httr, jsonlite, fedmatch, patchwork,
    corrplot, tidygraph, ggraph, igraph,
    treemap, splitstackshape, stringr, lubridate,
    poweRlaw, voronoiTreemap, ggridges,
    DescTools, stringi, kit, pheatmap
)
# set option
options(dplyr.summarise.inform = FALSE)
options(jupyter.plot_mimetypes = "image/png")
options(repr.plot.width = 8, repr.plot.height = 5)

gray_scale <- c("#F3F4F8", "#D2D4DA",  "#bcc0ca",
                "#D3D3D3", "#B3B5BD", "#838383",
                "#9496A1", "#7d7f89", "#777986",
                "#656673", "#5B5D6B", "#4d505e",
                "#404352", "#2b2d3b", "#2B2B2B", "#282A3A",
                "#1b1c2a", "#191a2b", "#141626", "#101223")
color_scale <- c("#BB4444", "#EE9988", "#FFFFFF",
                            "#77AADD", "#4477AA")
blue_scale <- c("#DBE7F5", "#95BCE3",
                "#699BCB", "#6597C7", "#4879A9",
                "#206BAD", "#11549B")

cor_col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF",
                            "#77AADD", "#4477AA"))

# check working directory
getwd()

peepsample <- function(dt) {
    dt %>%
    .[sample(.N, 5)] %>%
    kable("pipe")
}

peephead <- function(dt, x=5) {
    dt %>%
    head(x) %>%
    kable("pipe")
}

### ------- get to know the data ------- ###
# use a package called Lahman
library(Lahman)
Batting %>% as.data.table %>% str()
Pitching %>% as.data.table %>% str()
People %>% as.data.table %>% str()

# create a subset of the data
career <- Batting %>%
  filter(AB > 0) %>%
  anti_join(Pitching, by = "playerID") %>%
  group_by(playerID) %>%
  summarize(H = sum(H), AB = sum(AB)) %>%
  mutate(average = H / AB)

career <- People %>%
  as_tibble() %>%
  select(playerID, nameFirst, nameLast) %>%
  unite(name, nameFirst, nameLast, sep = " ") %>%
  inner_join(career, by = "playerID") %>%
  select(-playerID)

career %>% str()

career %>% head(5) %>% kable("pipe", digits = 3)


career %>% as.data.table() -> career

career %>%
    .[order(-average)] %>% peephead(5)


# fit with beta distribution
career %>%
    .[AB >= 500] %>%
    with(MASS::fitdistr(average, "beta",
                start = list(shape1 = 1, shape2 = 10))) %>%
    pluck("estimate") %>% round(2)

p_seq = seq(0.1, 0.4, 0.01)

options(repr.plot.width = 9, repr.plot.height = 6)
career %>%
    .[AB >= 500] %>%
    with(hist(average, breaks = 20, freq = FALSE,
              xlab = "Career Batting Average",
    main = "Histogram of Career Batting Average")) %>%
    with(lines(p_seq, dbeta(p_seq, 79, 228), col = "red")) %>%
    with(legend("topright", legend = c("data", "fitted beta"),
                col = c("#242222", "red"), lty = 1, lwd = 2))


# calculate the posterior distribution

career %>%
    .[, posterior := (79 + H) / (79 + 228 + AB)] %>%
    .[order(posterior)] %>%
    head(5) %>% kable("pipe", digits = 3)


career %>%
    ggplot(aes(x = average, y = posterior, color = AB)) +
    geom_point() +
    theme_bw() +
    geom_abline(intercept = 0.259, slope = 0, lty = 2,
                    col = "blue") +
    labs(x = "Batting Average", y = "Posterior Estimated Average",
         title = "Posterior Probability of Baseball Batting Average") +
    scale_color_gradient(low = "#848588", high = "#e03a0c") +
    # hightlight selected observations
    geom_point(data = career[average == 1.0], color = "#238420",
                                        size = 2.5) +
    geom_point(data = career[average == 0.0], color = "#238420",
                                        size = 2.5) 

References

  1. Bayes Rules
  2. David Robinson’s post