Barum Park | Predicted Probabilities from Ordered Logit Models

DISCLAIMER: This is a blog post from my graduate school years. The blog is no longer maintained and the material might include typos and errors.

I was preparing materials for a lab session to the graduate students at NYU, when I realized that there is no straightforward way to calculate uncertainty estimates for predicted probabilities from ordered logit models in R. The predict method for MASS::polr objects, for example, returns the predicted probabilities but not the estimated standard errors (try using the se = TRUE option; it won’t work.)

This created a problem. Due to traveling, I was not able to prepare in advance; so, I was sitting in front of my labtop at 2 am the day before class thinking about what to do. In the end, I coded up a short function that calculates confidence intervals for predictions of ordered logit models based on Monte Carlo simulation and added it to the appendix of my teaching materials. The model, code, and a baby Monte Carlo study to check whether the coverage rate of the calculated confidence intervals are close to the nominal rate, will be the main topic of the current post.

The Ordered Logit Model

The theory to calculate confidence intervals for predicted probabilities is quite straightforward. We assume that we have a latent variable $Y_i^\ast$, which is generated by the DGP

\[Y_i^\ast = \mathbf x_i'\boldsymbol\beta + \epsilon_i^\ast\]

where $\epsilon_i^\ast \sim \text{Logistic}(0,1)$. Unfortunately, we don’t observe $Y_i^\ast$ but only a truncated version of it. The link between $Y_i^\ast$ and the observed but truncated outcome variable $Y_i$ is given as

\[Y_i = k\times \mathbb I_{(\tau_{k-1}, \tau_k]}(Y_i^\ast), \qquad k = 1,2,...,K\]

where $K$ is the number of response categories, $\mathbb I_A(x) = 1$ if $x\in A$ and $0$ otherwise, and where $\boldsymbol\tau = [\tau_0, \tau_1,…,\tau_{K-1},\tau_{K}]’$ are threshold parameters with $\tau_0 = -\infty$ and $\tau_K= \infty$ so that only $\{\tau_1,…,\tau_{K-1}\}$ are estimands.

As the latent error term has a standard Logistic distribution, the probability of the event $\{Y_i = k\}$ is

\[\begin{aligned} \pi^{(k)}(\mathbf x_i) &= \Pr[Y_i = k \,\vert\, \mathbf X_i = \mathbf x_i] \\ &= \Pr[\mathbb I_{(\tau_{k-1}, \tau_k]}(Y_i^\ast) = 1 \,\vert\, \mathbf X_i = \mathbf x_i]\\ &= \Pr[\tau_{k-1} < Y_i^\ast \le \tau_k\,\vert\, \mathbf X_i = \mathbf x_i] \\ &= \Pr[\tau_{k-1}< \mathbf x_i'\boldsymbol\beta + \epsilon_i^\ast\le \tau_k]\\ &= \Pr[\tau_{k-1} - \mathbf x_i'\boldsymbol\beta < \epsilon_i^\ast \le \tau_{k} - \mathbf x_i'\boldsymbol\beta ]\\ &= \Lambda_{0,1}(\tau_k - \mathbf x_i'\boldsymbol\beta) - \Lambda_{0,1}(\tau_{k-1} - \mathbf x_i'\boldsymbol\beta) \\ &= \Lambda_{\mathbf x_i'\boldsymbol\beta, 1}(\tau_k) - \Lambda_{\mathbf x_i'\boldsymbol\beta, 1}(\tau_{k-1}), \end{aligned}\]

where $\Lambda_{a, b}(\cdot)$ is the Logistic CDF with location and scale parameters, respectively, equal to $a$ and $b$. The likelihood of an iid sample of size $n$ under the model is

\[\mathcal L(\boldsymbol\theta) = \prod_{i=1}^n \prod_{k=1}^K\Big[\pi^{(k)}(\mathbf x_i)\Big]^{\mathbb I_{\{k\}}(y_i)}.\]

Maximizing $\mathcal L(\boldsymbol\theta)$ with respect to $\boldsymbol\theta =[\boldsymbol\beta, \boldsymbol\tau]’$ gives us the MLE

\[\hat{\boldsymbol\theta}_\text{MLE} = \text{argmax}_{\boldsymbol\theta \in \boldsymbol\Theta}\mathcal L(\boldsymbol\theta)\]

where $\boldsymbol\Theta$ is the parameter space. Under the usual regularity conditions, we can rely on the Central Limit Theorem to show that

\[\sqrt{n} (\hat{\boldsymbol\theta}_\text{MLE} - \boldsymbol\theta) \longrightarrow \mathbf Z\]

in distribution as $n \rightarrow \infty$, where $\mathbf Z \sim \text{Normal}(\mathbf 0, \mathcal I (\boldsymbol\theta)^{-1})$ and where $\mathcal I(\boldsymbol\theta)$ is the Fisher Information. Hence, in large samples, we might approximate the distribution of $\hat{\boldsymbol\theta}_\text{MLE}$ with a $\text{Normal}(\boldsymbol\theta, n^{-1}\mathcal I(\hat{\boldsymbol\theta}_\text{MLE})^{-1})$ distribution.

Quantifying Uncertainty in Predicted Probabilities

After estimating the parameter $\hat{\boldsymbol\theta} = [\hat{\boldsymbol\beta}, \hat{\boldsymbol\tau}]’$ (we’ve dropped the MLE subscript for notational clarity), the predicted probabilities can be calculated straightforwardly as

\[\hat{\pi}(\mathbf x_i)^{(k)} = \Lambda_{\mathbf x_i'\hat{\boldsymbol\beta}, 1}(\hat{\tau}_k) - \Lambda_{\mathbf x_i' \hat{\boldsymbol\beta}, 1}(\hat{\tau}_{k-1})\]

for $i = 1,…,n$ and $k = 1,…, K$, where we define, again, $\hat\tau_0 = -\infty$ and $\hat\tau_K = \infty$.

The uncertainty in these estimates might be calculated in several ways. Two often used methods are Monte Carlo simulations, following the dictum that everything of a distribution can be learned by simulating from it, and the delta method (also called the method of propagation of error).

Confidence Intervals via the Delta Method

The delta method relies on the first-order Taylor Expension at a the MLE to approximate the distribution of a function of the MLE. Here we are interested in the distribution of the predicted probability at a point $\mathbf X = \mathbf x_\ast$. As $\hat{\boldsymbol\theta}$ is a consistent estimator of $\boldsymbol\theta$, and the mapping $\boldsymbol\theta \mapsto \Lambda_{\mathbf x_\ast’\boldsymbol\beta}(\tau_k)$ is continuous, it follows from the Continuous Mapping Theorem that

\[\hat\pi(\hat{\boldsymbol\theta}; \mathbf x_\ast) = \Lambda_{\mathbf x_\ast'\hat{\boldsymbol\beta}}(\hat\tau_k) - \Lambda_{\mathbf x_\ast'\hat{\boldsymbol\beta}}(\hat\tau_{k-1})\quad \longrightarrow\quad \Lambda_{\mathbf x_\ast'\boldsymbol\beta}(\tau_k) - \Lambda_{\mathbf x_\ast'\boldsymbol\beta}(\tau_{k-1})=\pi(\boldsymbol\theta; \mathbf x_\ast)\]

in probability as $n\rightarrow \infty$, where we have made the dependence of the predicted probabilities on the parameter vector explicit and dropped the superscript $\{(k)\}$ for clarity. Taylor-expanding $\hat\pi(\hat{\boldsymbol\theta}; \mathbf x_\ast)$ about $\boldsymbol\theta$, we obtain

\[\hat\pi(\hat{\boldsymbol\theta}; \mathbf x_\ast) = \pi(\boldsymbol\theta; \mathbf x_\ast) + \nabla \pi(\boldsymbol\theta; \mathbf x_\ast)'(\hat{\boldsymbol\theta} - \boldsymbol\theta) + R(\hat{\boldsymbol\theta}, \boldsymbol\theta)\]

where $R(\hat{\boldsymbol\theta}, \boldsymbol\theta)$ is the remainder. As $\hat{\boldsymbol\theta} \rightarrow \boldsymbol\theta$ in probability, $R \rightarrow 0$ in probability as well; so, by Slutsky’s Theorem,

\[\hat\pi(\hat{\boldsymbol\theta}; \mathbf x_\ast) \longrightarrow \pi(\boldsymbol\theta; \mathbf x_\ast) + \nabla \pi(\boldsymbol\theta; \mathbf x_\ast)'(\hat{\boldsymbol\theta} - \boldsymbol\theta)\]

in distribution as $n\rightarrow\infty$. But as $\sqrt{n}(\hat{\boldsymbol\theta} - \boldsymbol\theta)$ converges to a $\text{Normal}(0, \mathcal I(\boldsymbol\theta)^{-1})$ distribution, it follows that the asymptotic distribution $\hat\pi(\hat{\boldsymbol\theta}; \mathbf x_\ast)$ is

\[W \sim \text{Normal}\Big(\pi(\boldsymbol\theta; \mathbf x_\ast),\; \nabla \pi(\boldsymbol\theta; \mathbf x_\ast)' [\mathcal I(\boldsymbol\theta)^{-1}/n]\nabla \pi(\boldsymbol\theta; \mathbf x_\ast)\Big).\]

The matrix sandwiched in the middle of the gradients is just the covariance matrix of the MLE, which can be estimated in the usual way. The gradient vector has the form

\[\nabla \pi(\boldsymbol\theta; \mathbf x_\ast) = \left[\frac{\partial \pi(\boldsymbol\theta, \mathbf x_\ast)}{\partial \beta_1}, ..., \frac{\partial \pi(\boldsymbol\theta, \mathbf x_\ast)}{\partial \beta_J}, \frac{\partial \pi(\boldsymbol\theta, \mathbf x_\ast)}{\partial \tau_1},...,\frac{\partial \pi(\boldsymbol\theta, \mathbf x_\ast)}{\partial \tau_{K-1}}\right]'\]

where $J$ is the number of predictors (not including the constant; this model has no constant). Further,

\[\begin{aligned} \frac{\partial \pi(\boldsymbol\theta, \mathbf x_\ast)}{\partial \beta_j} &= \big[ \lambda_{\mathbf x_\ast'\boldsymbol\beta, 1}(\tau_{k-1}) -\lambda_{\mathbf x_\ast'\boldsymbol\beta, 1}(\tau_k)\big] x_{\ast, j}, \qquad j = 1,2,...,J \\ \frac{\partial \pi(\boldsymbol\theta, \mathbf x_\ast)}{\partial \tau_k} &= \lambda_{\mathbf x_\ast'\boldsymbol\beta, 1}(\tau_{k}), \\ \frac{\partial \pi(\boldsymbol\theta, \mathbf x_\ast)}{\partial \tau_{k-1}} &= - \lambda_{\mathbf x_\ast'\boldsymbol\beta, 1}(\tau_{k-1}),\\ \end{aligned}\]

where $\lambda_{a,b}$ is the Logistic pdf with location parameter $a$ and scale parameter $b$, and $x_{\ast, j}$ the $j$th element of the vector $\mathbf x_\ast$. Notice that all partial derivatives are continuous functions of $\boldsymbol\theta$. So we might rely on the Continuous Mapping Theorem once again, and use the MLE-based plug-in estimators to estimate these quantities.

The way to calculate delta method based standard errors, which can then be used to calculate confidence intervals, in R is shows below:

pred_polr_delta = function(
    m,      # fitted model

    cat,    # outcome to predict

    newdata) # data for which preds should be made

{
    
    if (class(m) != "polr")
        stop("fitted model is not of class polr")
    
    if (
        sum(
            c("data.frame", 
              "data.table", 
              "matrix") %in% class(newdata) 
        ) == 0
    ) 
        stop("newdata is not a data.table, data.frame, or matrix")
          
    # get estimates

    beta = coef(m)
    V = vcov(m)
    tau_a = c(-Inf, m$zeta, Inf)
    
    if (!is.matrix(newdata))
        X = as.matrix(newdata)
    
    # linear predictor

    xb = X %*% beta
    
    # beta gradient

    grad_beta = do.call(
        "rbind",
        lapply(
            seq_along(xb), 
            function(w) {
                
                (dlogis(tau_a[cat], xb[w], 1.0) - 
                     dlogis(tau_a[cat + 1], xb[w], 1.0)) * X[w, ]
                
            }
        )
    )
    
    # tau gradient

    if (cat == 1) {
        
        grad_tau = cbind(
            dlogis(tau_a[cat + 1], xb, 1.0), 0
        )
        
    } else if (cat == (length(tau_a) - 1)) {
        
        grad_tau = cbind(
            0,
            -1.0 * dlogis(tau_a[cat], xb, 1.0)
        )
        
    } else {
        
        grad_tau = cbind(
            -1.0 * dlogis(tau_a[cat], xb, 1.0),
            dlogis(tau_a[cat + 1], xb, 1.0)
        )
        
    }
    
    # gradient

    grad_xb = cbind(grad_beta, grad_tau)
    
    # get est. s.e.

    sqrt(
        apply(grad_xb, 1, function(w) {
            w %*% V %*% w
        })
    )
    
}

Confidence Intervals via Simulation

Another way to estimate confidence intervals is to draw samples from the large sample distribution of the MLE. To obtain 95% confidence intervals for our predictions, the procedure is as follows:

We know (using somewhat sloppy notation) that $\hat{\boldsymbol \theta} \sim \text{Normal}(\boldsymbol\theta, n^{-1}\mathcal I(\hat{\boldsymbol\theta})^{-1})$ in large samples. So, we might draw $S$ simulation draws from this distribution, which we denote by $\boldsymbol\theta^{(1)}, \boldsymbol\theta^{(2)},…, \boldsymbol\theta^{(S)}$.
For each simulated $\boldsymbol\theta^{(s)}, s= 1,2,…,S$, we calculate the predicted probabilities at $\mathbf x_\ast$. The predicted logit for the $s$th simulation draw is
\[\nu_\ast^{(s)} = \mathbf x_\ast'\boldsymbol\beta^{(s)}\]
and the predicted probability of outcome $k$ at $\mathbf x_\ast$
\[\hat{\pi}(\mathbf x_\ast)^{(k), (s)} = \Lambda_{\nu_\ast^{(s)}, 1}(\tau_k^{(s)}) - \Lambda_{\nu_\ast^{(s)}, 1}(\tau_{k-1}^{(s)})\]
This will give us $S$ simulated values of $\hat\pi(\mathbf x_\ast)^{(k),(s)}, k = 1,2,…,K$. For each outcome $k$, we can then summarize uncertainty using these draws. For example, to obtain the 95% confidence intervals for the predicted probability of the $k$th outcome, we can calculate the $0.025$th and $0.975$th quantile of the distribution of $\hat\pi(\mathbf x_{\ast})^{(k),(s)}$ across the simulation draws $s = 1,2,…,S$.

The following function implements these steps:

sim_ci_pred_polr = function(
    m,                  # the model

    newdata,            # data for predictions

    level,              # confidence level

    n_sim = 1000,       # no of simulation draws

    return_fit = TRUE,  # return fitted values

    return_list = TRUE) # return list 

{
    
    if (!requireNamespace("abind"))
        stop("Install package abind to use this function")
    
    if (class(m) != "polr")
        stop("fitted model is not of class polr")
    
    # transform new_dat into matrix

    X = as.matrix(newdata)
    
    # get coefs

    beta = coef(m)
    n_beta = length(beta)
    
    # get thresholds

    tau = m$zeta
    n_cat = length(tau) + 1

    # get variance-covariance matrix

    Sigma = vcov(m)
    
    message(
        paste0("Calculating ", 100 * level, 
               "% confidence intervals with ",
               n_sim, 
               " simulation draws ... ")
    )

    
    # simulate coefficients from multivariate normal

    sim_res = MASS::mvrnorm(
        n_sim,
        mu = c(beta, tau),
        Sigma = Sigma)
    
    # for each of simulated coefs, make predictions

    pred = lapply(1:nrow(sim_res), function(w) {
        
        # linear predictor

        xb = X %*% sim_res[w, 1:n_beta]
        aug_tau = c(-Inf, sim_res[w, (n_beta + 1):ncol(sim_res)], Inf)
        
        P = matrix(NA, nrow = nrow(X), ncol = n_cat)
        
        for (i in 2:length(aug_tau)) {
            
            P[, i - 1] = plogis(aug_tau[i], xb) - 
                plogis(aug_tau[i - 1], xb)
                                
        }
        
        return(P)
        
    })
    
    # transform to array

    pred_arr = simplify2array(pred)
    
    # get lower and upper bound probs

    lo = 0.5 * (1.0 - level)
    hi = 1.0 - lo
    
    # calculate CIs

    ci = apply(pred_arr, 1:2, quantile, probs = c(lo, hi))
    
    if (return_fit) {
        
        # linear predictor

        xb = X %*% beta
        aug_tau = c(-Inf, tau, Inf)
        
        P = matrix(NA, nrow = nrow(X), ncol = n_cat)
        
        for (i in 2:length(aug_tau)) {
            
            P[, i - 1] = plogis(aug_tau[i], xb) - 
                plogis(aug_tau[i - 1], xb)
                                
        
        }
        
        # add fit and re-arrange

        res = aperm(
            abind::abind(P, ci, along = 1),
            c(2, 1, 3)
        )
        
        # add dimension names

        dimnames(res) = list(
            NULL,
            c("fit", "lower", "upper"),
            paste0("outcome_", 1:n_cat)
        )

        
    } else {
        
        # re-arrange

        res = aperm(ci, c(2, 1, 3))
        
        # add dimension names

        dimnames(res) = list(
            NULL,
            c("lower", "upper"),
            paste0("outcome_", 1:n_cat)
        )
    
    }
    
    if (!return_list) 
        return(res)
    
    # transform back to list    

    res_list = lapply(
        seq.int(dim(res)[3]), function(w) res[ , , w]
    )
    
    # add names

    names(res_list) = dimnames(res)[[3]]
    
    return(res_list)
    
}

A Short Simulation

We might compare the coverage of the estimated 95% confidence interval with the nominal level. First, we load the needed packages and set up a cluster for parallel computing:

library(doParallel)
library(doRNG)
registerDoParallel(cores = 4)

Next, we set the seed, set up the “true” parameter values, and generate the fixed part of the dataset:

set.seed(12562)

# obs

n = 3000

# coefficients

true_beta = c(.8, .25)
true_tau = c(-.5, 2)

# predictors (dummy & continuous)

x1 = sample(c(0, 1), n, replace = T)
x2 = runif(n, -2, 2)

# linear predictor

xb = cbind(x1, x2) %*% true_beta

# data to make predictions for

pred_dat = data.frame(x1 = 1, x2 = c(-1, 0, 1))

Lastly, we run the simulation. We start with the simulation method:

res = foreach(ii = 1:1000) %dorng% {

    # generate latent outcome

    ystar = xb + rlogis(n)

    # generate obseved outcome

    y = ifelse(
        ystar < true_tau[1], 1,
            ifelse(ystar >= true_tau[1] & ystar < true_tau[2],
                   2, 3)
    )

    # fit model

    fit = MASS::polr(factor(y) ~ x1 + x2, Hess = T)

    # calculate 95% CI

    s = sim_ci_pred_polr(
        m = fit,
        newdata = pred_dat,
        level = .95,
        n_sim = 1000,
        return_fit = FALSE,
        return_list = TRUE)

    # return predictions

    cbind(do.call(rbind, s),
          x = rep(c(-1, 0, 1), 3),
          outcome = rep(1:3, each = 3)
    )

}

# true predicted probabilities

true_xb = cbind(1, c(-1, 0, 1)) %*% true_beta
true_p = c(
    plogis(true_tau[1], true_xb),
    plogis(true_tau[2], true_xb) - plogis(true_tau[1], true_xb),
    plogis(true_tau[2], true_xb, lower.tail = F)
)

# check whether CIs contain true parameter

in_range = do.call(
    rbind,
    lapply(res, function(w) true_p > w[,1] & true_p < w[,2])
)

# coverage rate

cov_rate = matrix(colMeans(in_range), nrow = 3, byrow = T)
rownames(cov_rate) = paste0("x2 = ", c(-1, 0, 1))
colnames(cov_rate) = paste0("outcome = ", 1:3)

Next, we check the confidence intervals created with the delta method:

res_delta = foreach(ii = 1:1000) %dorng% {
    
    # generate latent outcome

    ystar = xb + rlogis(n)
    
    # generate obseved outcome

    y = ifelse(
        ystar < true_tau[1], 1,
            ifelse(ystar >= true_tau[1] & ystar < true_tau[2],
                   2, 3)
    )

    # fit model

    fit = MASS::polr(factor(y) ~ x1 + x2, 
                     Hess = T)
    
    # calculate 95% CI

    se = do.call(
        `c`,
        lapply(
            1:3, 
            function(cat) pred_polr_delta(
                m = fit, 
                cat = cat, 
                newdata = pred_dat
            )
        )
    )
    
    # point estimates

    p = c(predict(fit, newdata = pred_dat, type = "p"))
    
    # return intervals

    z = -1.0 * qnorm(0.025)
    cbind(p - z * se, p + z * se)
    
}

# resolve cluster

stopImplicitCluster()

# check whether CIs contain true parameter

in_range_delta = do.call(
    rbind,
    lapply(
        res_delta, 
        function(w) true_p > w[,1] & true_p < w[,2])
)

# coverage rate

cov_rate_delta = matrix(
    colMeans(in_range_delta), nrow = 3, byrow = T
)
rownames(cov_rate_delta) = paste0("x2 = ", c(-1, 0, 1))
colnames(cov_rate_delta) = paste0("outcome = ", 1:3)

The results of the simulation are as follows:

# print results (simulation)

print(cov_rate)

##         outcome = 1 outcome = 2 outcome = 3
## x2 = -1       0.958       0.961       0.949
## x2 = 0        0.952       0.943       0.941
## x2 = 1        0.944       0.945       0.953

# print results (delta method)

print(cov_rate_delta)

##         outcome = 1 outcome = 2 outcome = 3
## x2 = -1       0.958       0.962       0.957
## x2 = 0        0.951       0.954       0.945
## x2 = 1        0.942       0.936       0.958

It’s hard to be certain with only a handful of simulation runs. But the coverage rate seems to be quite close to the nominal level for both methods.