I was preparing materials for a lab session for graduate students at NYU when I realized that there is no straightforward way to calculate uncertainty estimates for predicted probabilities from ordered logit models in R. The predict method for MASS::polr objects, for example, returns the predicted probabilities but not their estimated standard errors (try using the se = TRUE option; it won't work).

This created a problem. Due to traveling, I was not able to prepare in advance; so there I was, sitting in front of my laptop at 2 a.m. the day before class, thinking about what to do. In the end, I coded up a short function that calculates confidence intervals for predictions from ordered logit models based on Monte Carlo simulation and added it to the appendix of my teaching materials. The model, the code, and a baby Monte Carlo study checking whether the coverage rate of the calculated confidence intervals is close to the nominal rate are the topic of this post.

The Ordered Logit Model

The theory behind calculating confidence intervals for predicted probabilities is quite straightforward. We assume that we have a latent variable $Y_i^\ast$, which is generated by the DGP

$$Y_i^\ast = \mathbf x_i'\boldsymbol\beta + \epsilon_i^\ast, \qquad i = 1, 2, \ldots, n,$$

where $\epsilon_i^\ast \sim \text{Logistic}(0,1)$. Unfortunately, we don’t observe $Y_i^\ast$ but only a truncated version of it. The link between $Y_i^\ast$ and the observed but truncated outcome variable $Y_i$ is given as

$$Y_i = \sum_{k=1}^{K} k \, \mathbb I_{(\tau_{k-1},\, \tau_k]}(Y_i^\ast),$$

where $K$ is the number of response categories, $\mathbb I_A(x) = 1$ if $x\in A$ and $0$ otherwise, and where $\boldsymbol\tau = [\tau_0, \tau_1,…,\tau_{K-1},\tau_{K}]’$ are threshold parameters with $\tau_0 = -\infty$ and $\tau_K= \infty$ so that only $\{\tau_1,…,\tau_{K-1}\}$ are estimands.

As the latent error term has a standard Logistic distribution, the probability of the event $\{Y_i = k\}$ is

$$\Pr(Y_i = k) = \Lambda_{\mathbf x_i'\boldsymbol\beta,\,1}(\tau_k) - \Lambda_{\mathbf x_i'\boldsymbol\beta,\,1}(\tau_{k-1}),$$

where $\Lambda_{a, b}(\cdot)$ is the Logistic CDF with location and scale parameters, respectively, equal to $a$ and $b$. The likelihood of an iid sample of size $n$ under the model is

$$\mathcal L(\boldsymbol\theta) = \prod_{i=1}^{n} \prod_{k=1}^{K} \left[\Lambda_{\mathbf x_i'\boldsymbol\beta,\,1}(\tau_k) - \Lambda_{\mathbf x_i'\boldsymbol\beta,\,1}(\tau_{k-1})\right]^{\mathbb I_{\{k\}}(y_i)}.$$

Maximizing $\mathcal L(\boldsymbol\theta)$ with respect to $\boldsymbol\theta =[\boldsymbol\beta, \boldsymbol\tau]’$ gives us the MLE

$$\hat{\boldsymbol\theta}_\text{MLE} = \underset{\boldsymbol\theta\,\in\,\boldsymbol\Theta}{\arg\max}\; \mathcal L(\boldsymbol\theta),$$

where $\boldsymbol\Theta$ is the parameter space. Under the usual regularity conditions, we can rely on the Central Limit Theorem to show that

$$\sqrt{n}\,(\hat{\boldsymbol\theta}_\text{MLE} - \boldsymbol\theta) \longrightarrow \mathbf Z$$

in distribution as $n \rightarrow \infty$, where $\mathbf Z \sim \text{Normal}(\mathbf 0, \mathcal I (\boldsymbol\theta)^{-1})$ and where $\mathcal I(\boldsymbol\theta)$ is the Fisher Information. Hence, in large samples, we might approximate the distribution of $\hat{\boldsymbol\theta}_\text{MLE}$ with a $\text{Normal}(\boldsymbol\theta, n^{-1}\mathcal I(\hat{\boldsymbol\theta}_\text{MLE})^{-1})$ distribution.
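
Both ingredients of this approximation are directly available from a fitted MASS::polr object: coef() and the zeta element hold the estimates $\hat{\boldsymbol\beta}$ and $\hat{\boldsymbol\tau}$, and vcov() returns the estimated covariance matrix (fitting with Hess = TRUE avoids a re-fit to obtain the Hessian). Here is a minimal sketch with simulated toy data (all object names are mine, not part of any package):

library(MASS)

# toy data: one predictor, three ordered response categories
set.seed(123)
x_toy = rnorm(500)
ystar_toy = 0.5 * x_toy + rlogis(500)  # latent outcome
y_toy = cut(ystar_toy, c(-Inf, -0.5, 1, Inf), labels = 1:3)

# fit the ordered logit model
fit_toy = polr(y_toy ~ x_toy, Hess = TRUE)

# MLE of [beta, tau]' and its estimated covariance matrix
theta_hat = c(coef(fit_toy), fit_toy$zeta)
V_hat = vcov(fit_toy)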

Quantifying Uncertainty in Predicted Probabilities

After estimating the parameters $\hat{\boldsymbol\theta} = [\hat{\boldsymbol\beta}, \hat{\boldsymbol\tau}]’$ (we’ve dropped the MLE subscript for notational clarity), the predicted probabilities can be calculated straightforwardly as

$$\hat\pi^{(k)}(\mathbf x_i) = \Lambda_{\mathbf x_i'\hat{\boldsymbol\beta},\,1}(\hat\tau_k) - \Lambda_{\mathbf x_i'\hat{\boldsymbol\beta},\,1}(\hat\tau_{k-1})$$

for $i = 1,…,n$ and $k = 1,…, K$, where we define, again, $\hat\tau_0 = -\infty$ and $\hat\tau_K = \infty$.
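
Continuing the toy example from above, this plug-in calculation can be done by hand and checked against the predict method:

# plug-in predicted probabilities at x_toy = 1, computed by hand
tau_aug = c(-Inf, fit_toy$zeta, Inf)  # augmented thresholds
xb_toy = 1 * coef(fit_toy)            # linear predictor

plogis(tau_aug[-1], xb_toy) - plogis(tau_aug[-length(tau_aug)], xb_toy)

# should match the predict method
predict(fit_toy, newdata = data.frame(x_toy = 1), type = "probs")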

The uncertainty in these estimates might be quantified in several ways. Two often-used methods are Monte Carlo simulation, following the dictum that everything one wants to know about a distribution can be learned by simulating from it, and the delta method (also called the method of propagation of error).

Confidence Intervals via the Delta Method

The delta method relies on a first-order Taylor expansion at the MLE to approximate the distribution of a function of the MLE. Here we are interested in the distribution of the predicted probability at a point $\mathbf X = \mathbf x_\ast$. As $\hat{\boldsymbol\theta}$ is a consistent estimator of $\boldsymbol\theta$, and the mapping $\boldsymbol\theta \mapsto \Lambda_{\mathbf x_\ast’\boldsymbol\beta}(\tau_k)$ is continuous, it follows from the Continuous Mapping Theorem that

$$\hat\pi(\hat{\boldsymbol\theta}; \mathbf x_\ast) \longrightarrow \pi(\boldsymbol\theta; \mathbf x_\ast)$$

in probability as $n\rightarrow \infty$, where we have made the dependence of the predicted probabilities on the parameter vector explicit and dropped the superscript $(k)$ for clarity. Taylor-expanding $\hat\pi(\hat{\boldsymbol\theta}; \mathbf x_\ast)$ about $\boldsymbol\theta$, we obtain

$$\hat\pi(\hat{\boldsymbol\theta}; \mathbf x_\ast) = \pi(\boldsymbol\theta; \mathbf x_\ast) + \nabla\pi(\boldsymbol\theta; \mathbf x_\ast)'(\hat{\boldsymbol\theta} - \boldsymbol\theta) + R(\hat{\boldsymbol\theta}, \boldsymbol\theta),$$

where $R(\hat{\boldsymbol\theta}, \boldsymbol\theta)$ is the remainder. As $\hat{\boldsymbol\theta} \rightarrow \boldsymbol\theta$ in probability, $R \rightarrow 0$ in probability as well; so, by Slutsky’s Theorem,

$$\sqrt{n}\left[\hat\pi(\hat{\boldsymbol\theta}; \mathbf x_\ast) - \pi(\boldsymbol\theta; \mathbf x_\ast)\right] \longrightarrow \nabla\pi(\boldsymbol\theta; \mathbf x_\ast)'\,\mathbf Z$$

in distribution as $n\rightarrow\infty$. But as $\sqrt{n}(\hat{\boldsymbol\theta} - \boldsymbol\theta)$ converges to a $\text{Normal}(0, \mathcal I(\boldsymbol\theta)^{-1})$ distribution, $\hat\pi(\hat{\boldsymbol\theta}; \mathbf x_\ast)$ converges in distribution to the random variable

$$W \sim \text{Normal}\left(\pi(\boldsymbol\theta; \mathbf x_\ast),\; \nabla\pi(\boldsymbol\theta; \mathbf x_\ast)'\, \frac{\mathcal I(\boldsymbol\theta)^{-1}}{n}\, \nabla\pi(\boldsymbol\theta; \mathbf x_\ast)\right)$$

as $n\rightarrow\infty$.

The matrix sandwiched in the middle of the gradients is just the covariance matrix of the MLE, which can be estimated in the usual way. The gradient vector has the form

$$\nabla\pi(\boldsymbol\theta; \mathbf x_\ast) = \left[\frac{\partial \pi}{\partial \beta_1}, \ldots, \frac{\partial \pi}{\partial \beta_J}, \frac{\partial \pi}{\partial \tau_1}, \ldots, \frac{\partial \pi}{\partial \tau_{K-1}}\right]',$$

where $J$ is the number of predictors (not including the constant; this model has no constant). Further, for the probability of outcome $k$, the only non-zero partial derivatives are

$$\frac{\partial \pi}{\partial \beta_j} = \left[\lambda_{\mathbf x_\ast'\boldsymbol\beta,\,1}(\tau_{k-1}) - \lambda_{\mathbf x_\ast'\boldsymbol\beta,\,1}(\tau_k)\right] x_{\ast, j}, \qquad \frac{\partial \pi}{\partial \tau_{k-1}} = -\lambda_{\mathbf x_\ast'\boldsymbol\beta,\,1}(\tau_{k-1}), \qquad \frac{\partial \pi}{\partial \tau_k} = \lambda_{\mathbf x_\ast'\boldsymbol\beta,\,1}(\tau_k),$$

where $\lambda_{a,b}$ is the Logistic pdf with location parameter $a$ and scale parameter $b$, and $x_{\ast, j}$ the $j$th element of the vector $\mathbf x_\ast$. Notice that all partial derivatives are continuous functions of $\boldsymbol\theta$. So we might rely on the Continuous Mapping Theorem once again, and use the MLE-based plug-in estimators to estimate these quantities.
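
These formulas are easy to get wrong, so it is worth checking them against a numerical gradient before trusting an implementation. A small sketch, assuming the numDeriv package is installed (the parameter values and covariate vector below are arbitrary, chosen just for the check):

library(numDeriv)

# predicted probability of category k as a function of theta = [beta, tau]'
pi_k = function(theta, xvec, k) {
    beta = theta[seq_along(xvec)]
    tau = c(-Inf, theta[-seq_along(xvec)], Inf)  # augmented thresholds
    xb = sum(xvec * beta)
    plogis(tau[k + 1], xb) - plogis(tau[k], xb)
}

x_star = c(1, 0.5)              # arbitrary covariate vector
theta = c(0.8, 0.25, -0.5, 2)   # arbitrary [beta_1, beta_2, tau_1, tau_2]'
xb = sum(x_star * theta[1:2])

# analytic gradient for the middle category (k = 2)
g_analytic = c(
    (dlogis(theta[3], xb) - dlogis(theta[4], xb)) * x_star,  # w.r.t. beta
    -dlogis(theta[3], xb),                                   # w.r.t. tau_1
    dlogis(theta[4], xb)                                     # w.r.t. tau_2
)

g_numeric = grad(pi_k, theta, xvec = x_star, k = 2)
all.equal(g_analytic, g_numeric, tolerance = 1e-6)  # should be TRUE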

The R function below calculates delta-method-based standard errors, which can then be used to construct confidence intervals:

pred_polr_delta = function(
    m,      # fitted model
    cat,    # outcome to predict
    newdata) # data for which preds should be made
{
    
    if (!inherits(m, "polr"))
        stop("fitted model is not of class polr")
    
    if (!inherits(newdata, c("data.frame", "data.table", "matrix")))
        stop("newdata is not a data.table, data.frame, or matrix")
          
    # get estimates
    beta = coef(m)
    V = vcov(m)
    tau_a = c(-Inf, m$zeta, Inf)
    
    # coerce newdata to a matrix if necessary
    X = if (is.matrix(newdata)) newdata else as.matrix(newdata)
    
    # linear predictor
    xb = X %*% beta
    
    # beta gradient
    grad_beta = do.call(
        "rbind",
        lapply(
            seq_along(xb), 
            function(w) {
                
                (dlogis(tau_a[cat], xb[w], 1.0) - 
                     dlogis(tau_a[cat + 1], xb[w], 1.0)) * X[w, ]
                
            }
        )
    )
    
    # tau gradient: only the two thresholds bounding category `cat`
    # have non-zero partial derivatives; all other columns stay zero
    # (the original hard-coded two columns, which only works with
    # three outcome categories)
    n_tau = length(tau_a) - 2
    grad_tau = matrix(0.0, nrow = nrow(X), ncol = n_tau)
    
    if (cat > 1)
        grad_tau[, cat - 1] = -1.0 * dlogis(tau_a[cat], xb, 1.0)
    
    if (cat <= n_tau)
        grad_tau[, cat] = dlogis(tau_a[cat + 1], xb, 1.0)
    
    # gradient
    grad_xb = cbind(grad_beta, grad_tau)
    
    # get est. s.e.
    sqrt(
        apply(grad_xb, 1, function(w) {
            w %*% V %*% w
        })
    )
    
}
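
As a quick illustration of how the function might be used (here fit is a fitted polr object and pred_dat a data frame of covariate values; both are defined in the simulation section below), the standard errors translate into Wald-type confidence intervals in the usual way:

# delta-method 95% CI for the probability of outcome 2
se_2 = pred_polr_delta(m = fit, cat = 2, newdata = pred_dat)
p_2 = predict(fit, newdata = pred_dat, type = "probs")[, 2]

z = qnorm(0.975)
cbind(fit = p_2, lower = p_2 - z * se_2, upper = p_2 + z * se_2)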

Confidence Intervals via Simulation

Another way to estimate confidence intervals is to draw samples from the large-sample distribution of the MLE. To obtain 95% confidence intervals for our predictions, the procedure is as follows:

  1. We know (using somewhat sloppy notation) that $\boldsymbol\theta \sim \text{Normal}(\hat{\boldsymbol\theta},\, \widehat{\text{Cov}}(\hat{\boldsymbol\theta}))$ in large samples, where $\widehat{\text{Cov}}(\hat{\boldsymbol\theta})$ is the estimated covariance matrix of the MLE. So, we might take $S$ draws from this distribution, which we denote by $\boldsymbol\theta^{(1)}, \boldsymbol\theta^{(2)},…, \boldsymbol\theta^{(S)}$.

  2. For each simulated $\boldsymbol\theta^{(s)}, s= 1,2,…,S$, we calculate the predicted probabilities at $\mathbf x_\ast$. The predicted logit for the $s$th simulation draw is

    $$\eta_\ast^{(s)} = \mathbf x_\ast'\boldsymbol\beta^{(s)},$$

    and the predicted probability of outcome $k$ at $\mathbf x_\ast$ is

    $$\hat\pi(\mathbf x_\ast)^{(k),(s)} = \Lambda_{\eta_\ast^{(s)},\,1}\big(\tau_k^{(s)}\big) - \Lambda_{\eta_\ast^{(s)},\,1}\big(\tau_{k-1}^{(s)}\big).$$

  3. This will give us $S$ simulated values $\hat\pi(\mathbf x_\ast)^{(k),(s)}$ for each outcome $k = 1,2,…,K$. For each outcome $k$, we can then summarize uncertainty using these draws. For example, to obtain the 95% confidence interval for the predicted probability of the $k$th outcome, we calculate the $0.025$ and $0.975$ quantiles of $\hat\pi(\mathbf x_{\ast})^{(k),(s)}$ across the simulation draws $s = 1,2,…,S$.

The following function implements these steps:

sim_ci_pred_polr = function(
    m,                  # the model
    newdata,            # data for predictions
    level,              # confidence level
    n_sim = 1000,       # no of simulation draws
    return_fit = TRUE,  # return fitted values
    return_list = TRUE) # return list 
{
    
    if (!requireNamespace("abind", quietly = TRUE))
        stop("Install package abind to use this function")
    
    if (!inherits(m, "polr"))
        stop("fitted model is not of class polr")
    
    # transform newdata into a matrix
    X = as.matrix(newdata)
    
    # get coefs
    beta = coef(m)
    n_beta = length(beta)
    
    # get thresholds
    tau = m$zeta
    n_cat = length(tau) + 1

    # get variance-covariance matrix
    Sigma = vcov(m)
    
    message(
        paste0("Calculating ", 100 * level, 
               "% confidence intervals with ",
               n_sim, 
               " simulation draws ... ")
    )

    
    # simulate coefficients from multivariate normal
    sim_res = MASS::mvrnorm(
        n_sim,
        mu = c(beta, tau),
        Sigma = Sigma)
    
    # for each of simulated coefs, make predictions
    pred = lapply(1:nrow(sim_res), function(w) {
        
        # linear predictor
        xb = X %*% sim_res[w, 1:n_beta]
        aug_tau = c(-Inf, sim_res[w, (n_beta + 1):ncol(sim_res)], Inf)
        
        P = matrix(NA, nrow = nrow(X), ncol = n_cat)
        
        for (i in 2:length(aug_tau)) {
            
            P[, i - 1] = plogis(aug_tau[i], xb) - 
                plogis(aug_tau[i - 1], xb)
                                
        }
        
        return(P)
        
    })
    
    # transform to array
    pred_arr = simplify2array(pred)
    
    # get lower and upper bound probs
    lo = 0.5 * (1.0 - level)
    hi = 1.0 - lo
    
    # calculate CIs
    ci = apply(pred_arr, 1:2, quantile, probs = c(lo, hi))
    
    if (return_fit) {
        
        # linear predictor
        xb = X %*% beta
        aug_tau = c(-Inf, tau, Inf)
        
        P = matrix(NA, nrow = nrow(X), ncol = n_cat)
        
        for (i in 2:length(aug_tau)) {
            
            P[, i - 1] = plogis(aug_tau[i], xb) - 
                plogis(aug_tau[i - 1], xb)
            
        }
        
        # add fit and re-arrange
        res = aperm(
            abind::abind(P, ci, along = 1),
            c(2, 1, 3)
        )
        
        # add dimension names
        dimnames(res) = list(
            NULL,
            c("fit", "lower", "upper"),
            paste0("outcome_", 1:n_cat)
        )

        
    } else {
        
        # re-arrange
        res = aperm(ci, c(2, 1, 3))
        
        # add dimension names
        dimnames(res) = list(
            NULL,
            c("lower", "upper"),
            paste0("outcome_", 1:n_cat)
        )
    
    }
    
    if (!return_list) 
        return(res)
    
    # transform back to list    
    res_list = lapply(
        seq.int(dim(res)[3]), function(w) res[ , , w]
    )
    
    # add names
    names(res_list) = dimnames(res)[[3]]
    
    return(res_list)
    
}
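
Usage mirrors the delta-method function. For a fitted polr object fit and prediction data pred_dat (again, both defined in the simulation section below), a call might look like this:

# simulation-based 95% CIs; with the defaults, the function returns a
# list with one (fit, lower, upper) matrix per outcome category
ci_sim = sim_ci_pred_polr(
    m = fit,
    newdata = pred_dat,
    level = 0.95,
    n_sim = 1000)

ci_sim$outcome_2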

A Short Simulation

We might compare the coverage rates of the estimated 95% confidence intervals with the nominal level. First, we load the needed packages and set up a cluster for parallel computing:

library(doParallel)
library(doRNG)
registerDoParallel(cores = 4)

Next, we set the seed, set up the “true” parameter values, and generate the fixed part of the dataset:

set.seed(12562)

# obs
n = 3000

# coefficients
true_beta = c(.8, .25)
true_tau = c(-.5, 2)

# predictors (dummy & continuous)
x1 = sample(c(0, 1), n, replace = TRUE)
x2 = runif(n, -2, 2)

# linear predictor
xb = cbind(x1, x2) %*% true_beta

# data to make predictions for
pred_dat = data.frame(x1 = 1, x2 = c(-1, 0, 1))

Lastly, we run the simulation. We start with the simulation method:

res = foreach(ii = 1:1000) %dorng% {

    # generate latent outcome
    ystar = xb + rlogis(n)

    # generate observed outcome
    y = ifelse(
        ystar < true_tau[1], 1,
        ifelse(ystar >= true_tau[1] & ystar < true_tau[2], 2, 3)
    )

    # fit model
    fit = MASS::polr(factor(y) ~ x1 + x2, Hess = TRUE)

    # calculate 95% CI
    s = sim_ci_pred_polr(
        m = fit,
        newdata = pred_dat,
        level = .95,
        n_sim = 1000,
        return_fit = FALSE,
        return_list = TRUE)

    # return predictions
    cbind(do.call(rbind, s),
          x = rep(c(-1, 0, 1), 3),
          outcome = rep(1:3, each = 3)
    )

}

# true predicted probabilities
true_xb = cbind(1, c(-1, 0, 1)) %*% true_beta
true_p = c(
    plogis(true_tau[1], true_xb),
    plogis(true_tau[2], true_xb) - plogis(true_tau[1], true_xb),
    plogis(true_tau[2], true_xb, lower.tail = FALSE)
)

# check whether CIs contain true parameter
in_range = do.call(
    rbind,
    lapply(res, function(w) true_p > w[,1] & true_p < w[,2])
)

# coverage rate (with byrow = TRUE, rows are outcomes and
# columns are x2 values; the labels must match that layout)
cov_rate = matrix(colMeans(in_range), nrow = 3, byrow = TRUE)
rownames(cov_rate) = paste0("outcome = ", 1:3)
colnames(cov_rate) = paste0("x2 = ", c(-1, 0, 1))

Next, we check the confidence intervals created with the delta method:

res_delta = foreach(ii = 1:1000) %dorng% {
    
    # generate latent outcome
    ystar = xb + rlogis(n)
    
    # generate observed outcome
    y = ifelse(
        ystar < true_tau[1], 1,
        ifelse(ystar >= true_tau[1] & ystar < true_tau[2], 2, 3)
    )

    # fit model
    fit = MASS::polr(factor(y) ~ x1 + x2, Hess = TRUE)
    
    # calculate 95% CI
    se = do.call(
        `c`,
        lapply(
            1:3, 
            function(cat) pred_polr_delta(
                m = fit, 
                cat = cat, 
                newdata = pred_dat
            )
        )
    )
    
    # point estimates
    p = c(predict(fit, newdata = pred_dat, type = "probs"))
    
    # return intervals
    z = -1.0 * qnorm(0.025)
    cbind(p - z * se, p + z * se)
    
}

# resolve cluster
stopImplicitCluster()

# check whether CIs contain true parameter
in_range_delta = do.call(
    rbind,
    lapply(
        res_delta, 
        function(w) true_p > w[,1] & true_p < w[,2])
)

# coverage rate (rows = outcomes, columns = x2 values, as above)
cov_rate_delta = matrix(
    colMeans(in_range_delta), nrow = 3, byrow = TRUE
)
rownames(cov_rate_delta) = paste0("outcome = ", 1:3)
colnames(cov_rate_delta) = paste0("x2 = ", c(-1, 0, 1))

The results of the simulation are as follows:

# print results (simulation)
print(cov_rate)
##             x2 = -1 x2 = 0 x2 = 1
## outcome = 1   0.958  0.961  0.949
## outcome = 2   0.952  0.943  0.941
## outcome = 3   0.944  0.945  0.953
# print results (delta method)
print(cov_rate_delta)
##             x2 = -1 x2 = 0 x2 = 1
## outcome = 1   0.958  0.962  0.957
## outcome = 2   0.951  0.954  0.945
## outcome = 3   0.942  0.936  0.958

It’s hard to be certain with a Monte Carlo study this small, but the coverage rates seem to be quite close to the nominal level for both methods.