Lecture 3

GLMs: Ordinal Logistic Regression

(Please sign in on iClicker)

Today’s Learning Goals

By the end of this lecture, you should be able to:

  • Outline the modelling framework of Ordinal Logistic regression.
  • Explain the concept of proportional odds.
  • Fit and interpret an Ordinal Logistic regression model.
  • Use Ordinal Logistic regression for prediction.
  • Assess the proportional odds assumption via the Brant-Wald test.
  • Contrast Ordinal Logistic regression models under proportional and non-proportional odds assumptions.

Outline

  1. The Need for an Ordinal Model
  2. College Juniors Dataset
  3. Exploratory Data Analysis
  4. Data Modelling Framework
  5. Estimation
  6. Inference
  7. Coefficient Interpretation
  8. Predictions
  9. Model Selection
  10. Non-proportional Odds
  11. Wrapping Up on Categorical Regression Models

1. The Need for an Ordinal Model

  • We will continue with Regression Analysis on categorical responses.


1.1. Reminder

  • Recall that a discrete response is considered categorical if it is of factor-type with more than two categories:
  1. Nominal. The categories do not follow any specific order; for example, the type of dwelling according to the Canadian census: single-detached house, semi-detached house, row house, apartment, and mobile home.
  2. Ordinal. The categories follow a specific order; for example, a Likert scale of survey items: strongly disagree, disagree, neutral, agree, and strongly agree.

1.2. A question to start with …

  • Why not just use Multinomial Logistic regression in the case of an ordinal response?
  • The previous Multinomial Logistic model is suboptimal for ordinal responses.
  • If we do not consider the order of the response levels when necessary, we risk losing valuable information when making inferences or predictions.
  • This valuable information lies in the ordinal scale, which can be set up via cumulative probabilities.

1.3. What to do then?

  • We need an alternative logistic regression that suits ordinal responses.
  • Hence the existence of the Ordinal Logistic regression model.
  • We will expand the regression mindmap to include this new model.
  • Note there is a key characteristic called proportional odds, which is reflected in the data modelling framework.

2. College Juniors Dataset

  • To illustrate the use of this generalized linear model (GLM), let us introduce an adequate dataset.
  • The data frame college_data was obtained from the webpage of the UCLA Statistical Consulting Group.
  • It contains the results of a survey administered to \(n = 400\) college juniors regarding the factors that influence their decision to apply to graduate school.

2.1. Read data in R
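Below is a minimal loading sketch, assuming the data sit in a local CSV file named college_data.csv with columns decision, parent_ed, and GPA (the file name and path are assumptions; adjust as needed).

```r
# A sketch: read the survey data from a local CSV file
# (file name and path are assumptions).
college_data <- read.csv("college_data.csv", stringsAsFactors = FALSE)

str(college_data)  # n = 400 rows: decision, parent_ed, GPA
```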

2.2. A quick look at data!

  • The dataset contains the following variables:
    • decision: how likely the student will apply to graduate school, a discrete and ordinal response with three increasing categories (unlikely, somewhat likely, and very likely).
    • parent_ed: whether at least one parent has a graduate degree, a discrete and binary explanatory variable (Yes and No).
    • GPA: the student’s current GPA, a continuous explanatory variable.

2.3. A quick data wrangling!

  • Variable decision will be our response of interest.
  • Before proceeding with our regression analysis, let us set it up as an ordered factor via function as.ordered(), as shown below.
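A sketch of this wrangling step; note we first set the level order explicitly, since as.ordered() alone would keep the alphabetical ordering of the labels.

```r
# Set the increasing level order explicitly (alphabetical order would be
# wrong for this scale), then coerce to an ordered factor.
college_data$decision <- factor(college_data$decision,
                                levels = c("unlikely", "somewhat likely", "very likely"))
college_data$decision <- as.ordered(college_data$decision)

levels(college_data$decision)
```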

2.4. Main Statistical Inquiries

  • Let us suppose we want to assess the following:
    • Is decision statistically related to parent_ed and GPA?
    • How can we interpret these statistical relationships (if there are any!)?

3. Exploratory Data Analysis (EDA)

  • Since decision is a discrete ordinal response, we have to be careful about the plots to be used in the EDA.
  • To explore the relationship between decision and GPA, we will use side-by-side boxplots and/or violin plots.
  • GPA is on the \(y\)-axis, and all categories of decision are on the \(x\)-axis.
  • The side-by-side violin plots show the mean of GPA by decision category as red points.

Plots!

(Figure: side-by-side boxplots and violin plots of GPA by decision category; red points mark the mean GPA in each category.)
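A hedged ggplot2 sketch of these plots (aesthetic choices such as fill colour and widths are assumptions):

```r
library(ggplot2)

# Side-by-side violin plots with overlaid boxplots; red points mark the
# mean GPA per decision category.
ggplot(college_data, aes(x = decision, y = GPA)) +
  geom_violin(fill = "lightblue") +
  geom_boxplot(width = 0.15) +
  stat_summary(fun = mean, geom = "point", colour = "red", size = 3) +
  labs(x = "Decision to apply to graduate school", y = "GPA")
```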

How can we visually explore the relationship between decision and parent_ed?

  • When both variables are categorical, as with decision and parent_ed, we can use stacked bar charts.
  • Each bar will show the percentages (depicted on the \(y\)-axis) of college students falling in each ordered category of decision, with the categories of parent_ed on the \(x\)-axis.

Plots!

(Figure: stacked bar chart of decision percentages within each parent_ed category.)
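A sketch of the stacked bar chart; position = "fill" rescales each bar so the \(y\)-axis reads as percentages:

```r
# Stacked bar chart of decision percentages within each parent_ed category.
ggplot(college_data, aes(x = parent_ed, fill = decision)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(x = "At least one parent with a graduate degree",
       y = "Percentage of students", fill = "Decision")
```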

In-class Question

  • What can we see descriptively from the previous plots?

4. Data Modelling Framework

  • Let us suppose that a given discrete ordinal response \(Y_i\) (for \(i = 1, \dots, n\)) has categories \(1, 2, \dots, m\) in a training set of size \(n\).

Heads-up: Categories \(1, 2, \dots, m\) imply an ordinal scale here, i.e., \(1 < 2 < \dots < m\).

  • Also, note there is more than one class of Ordinal Logistic regression.
  • We will review the proportional odds model (a cumulative logit model).

When \(m = 3\)

  • We will set up an Ordinal Logistic regression model for this specific dataset with \(m = 3\) levels of \(Y_i\), i.e.,

\[\begin{gather*} 1 = \texttt{unlikely} \\ 2 = \texttt{somewhat likely} \\ 3 = \texttt{very likely} \end{gather*}\]

The Regressors


  • For the \(i\)th observation, we have the continuous \(X_{i, \texttt{GPA}}\) and the dummy variable

\[ X_{i, \texttt{parent_ed}} = \begin{cases} 1 \; \; \; \; \mbox{at least one parent has a graduate degree},\\ 0 \; \; \; \; \mbox{neither of the parents has a graduate degree.} \end{cases} \]

Then…


  • The model will indicate how each one of the regressors affects the cumulative logarithm of the odds in decision for the following \(2\) situations:

\[\begin{gather*} \text{Level } \texttt{somewhat likely} \text{ or any lesser degree versus level } \texttt{very likely} \\ \text{Level } \texttt{unlikely} \text{ versus level } \texttt{somewhat likely} \text{ or any higher degree} \end{gather*}\]

These are our cumulative log-odds!

  • We have the following system of \(2\) equations:

\[\begin{align*} \eta_i^{(\texttt{somewhat likely})} &= \log\left[\frac{P(Y_i \leq \texttt{somewhat likely} \mid X_{i, \texttt{GPA}},X_{i, \texttt{parent_ed}})}{P(Y_i = \texttt{very likely} \mid X_{i, \texttt{GPA}}, X_{i, \texttt{parent_ed}})}\right] \\ &= \beta_0^{(\texttt{somewhat likely})} - \beta_1 X_{i, \texttt{GPA}} - \beta_2 X_{i, \texttt{parent_ed}} \end{align*}\] \[\begin{align*} \eta_i^{(\texttt{unlikely})} &= \log\left[\frac{P(Y_i = \texttt{unlikely} \mid X_{i, \texttt{GPA}},X_{i, \texttt{parent_ed}})}{P(Y_i \geq \texttt{somewhat likely} \mid X_{i, \texttt{GPA}},X_{i, \texttt{parent_ed}})}\right] \\ &= \beta_0^{(\texttt{unlikely})} - \beta_1 X_{i, \texttt{GPA}} - \beta_2 X_{i, \texttt{parent_ed}}. \end{align*}\]

Note the following…

  • The system has \(2\) intercepts but only \(2\) regression coefficients, which are shared across both equations.
  • Check the minus signs on the common \(\beta_1\) and \(\beta_2\); this is a fundamental parameterization of the proportional odds model.

Some algebraic manipulation…

  • To make coefficient interpretation easier, we can re-express the equations of \(\eta_i^{(\texttt{somewhat likely})}\) and \(\eta_i^{(\texttt{unlikely})}\) as follows:

\[\begin{gather*} \frac{P(Y_i = \texttt{very likely} \mid X_{i, \texttt{GPA}},X_{i, \texttt{parent_ed}})}{P(Y_i \leq \texttt{somewhat likely} \mid X_{i, \texttt{GPA}},X_{i, \texttt{parent_ed}})} = \\ \exp\big(-\beta_0^{(\texttt{somewhat likely})}\big) \exp\big(\beta_1 X_{i, \texttt{GPA}}\big) \exp\big(\beta_2 X_{i, \texttt{parent_ed}}\big) \end{gather*}\] \[\begin{gather*} \frac{P(Y_i \geq \texttt{somewhat likely} \mid X_{i, \texttt{GPA}},X_{i, \texttt{parent_ed}})}{P(Y_i = \texttt{unlikely} \mid X_{i, \texttt{GPA}},X_{i, \texttt{parent_ed}})} = \\ \exp\big(-\beta_0^{(\texttt{unlikely})}\big) \exp\big(\beta_1 X_{i, \texttt{GPA}}\big) \exp\big(\beta_2 X_{i, \texttt{parent_ed}}\big). \end{gather*}\]

Further Insights

  • Both regression coefficients (\(\beta_1\) and \(\beta_2\)) have a multiplicative effect on the odds that we have on the left-hand side.
  • Moreover, given the proportional odds assumption, the regressor effects are the same for both cumulative odds.


What is the math for the general case with \(m\) response categories and \(k\) regressors?

  • This model will indicate how each one of the \(k\) regressors \(X_{i,1}, \dots, X_{i,k}\) affects the cumulative logarithm of the odds in the ordinal response (with \(m\) levels) for the following \(m - 1\) situations:

\[\begin{gather*} \text{Level } m - 1 \text{ or any lesser degree versus level } m\\ \text{Level } m - 2 \text{ or any lesser degree versus level } m - 1 \text{ or any higher degree}\\ \vdots \\ \text{Level } 2 \text{ or any lesser degree versus level } 3 \text{ or any higher degree}\\ \text{Level } 1 \text{ versus level } 2 \text{ or any higher degree}\\ \end{gather*}\]

Cumulative Probabilities

  • Note that the previous system has \(m - 1\) intercepts but only \(k\) regression coefficients.
  • The previous \(m - 1\) equations can be written in general form for levels \(j = m - 1, \dots, 1\) as follows:

\[\begin{gather*} \eta_i^{(j)} = \log\left[\frac{P(Y_i \leq j \mid X_{i,1}, \ldots, X_{i,k})}{P(Y_i > j \mid X_{i,1}, \ldots, X_{i,k})}\right] = \beta_0^{(j)} - \beta_1 X_{i, 1} - \ldots - \beta_k X_{i, k} \\ \; \; \; \; \; \; \; \; \Rightarrow \; P(Y_i \leq j \mid X_{i,1}, \ldots, X_{i,k}) = \frac{\exp\left(\beta_0^{(j)} - \beta_1 X_{i, 1} - \ldots - \beta_k X_{i, k}\right)}{1 + \exp\left(\beta_0^{(j)} - \beta_1 X_{i, 1} - \ldots - \beta_k X_{i, k}\right)}. \end{gather*}\]

Standalone Probabilities

  • The standalone probability that \(Y_i\) falls in category \(j\) is computed as follows: \[ p_{i,j} = P(Y_i = j \mid X_{i,1}, \ldots, X_{i,k}) = P(Y_i \leq j \mid X_{i,1}, \ldots, X_{i,k}) - P(Y_i \leq j - 1 \mid X_{i,1}, \ldots, X_{i,k}). \]

  • Hence: \[ P(Y_i = 1) = p_{i,1}, \;\;\;\; P(Y_i = 2) = p_{i,2}, \;\; \dots, \;\; P(Y_i = m) = p_{i,m}, \]

  • where: \[ \sum_{j = 1}^m p_{i,j} = p_{i,1} + p_{i,2} + \dots + p_{i,m} = 1. \]

iClicker Question

  • Suppose we have a case with an ordered response of five levels and four continuous regressors.
  • How many link functions, intercepts, and coefficients will our Ordinal Logistic regression model have under a proportional odds assumption?

A. Four link functions, four intercepts, and sixteen regression coefficients.

B. Five link functions, five intercepts, and five regression coefficients.

C. Five link functions, five intercepts, and twenty regression coefficients.

D. Four link functions, four intercepts, and four regression coefficients.


A Graphical Intuition of the Proportional Odds Assumption

  • There is an optional section in the notes that shows mathematically what the proportional odds assumption stands for in the Ordinal Logistic regression model with \(k\) regressors and a training set size of \(n\), using a response of \(m\) levels in general.
  • That said, let us check the graphical intuition with an example of an ordinal response of \(5\) levels and a single continuous regressor \(X_{i , 1}\) (which is unbounded!).

System of Cumulative Logit Equations

\[\begin{gather*} \eta_i^{(4)} = \log\left[\frac{P(Y_i \leq 4 \mid X_{i,1})}{P(Y_i = 5 \mid X_{i,1})}\right] = \beta_0^{(4)} - \beta_1 X_{i, 1} \\ \eta_i^{(3)} = \log\left[\frac{P(Y_i \leq 3 \mid X_{i,1})}{P(Y_i > 3 \mid X_{i,1})}\right] = \beta_0^{(3)} - \beta_1 X_{i, 1} \\ \eta_i^{(2)} = \log\left[\frac{P(Y_i \leq 2 \mid X_{i,1})}{P(Y_i > 2 \mid X_{i,1})}\right] = \beta_0^{(2)} - \beta_1 X_{i, 1} \\ \eta_i^{(1)} = \log\left[\frac{P(Y_i = 1 \mid X_{i,1})}{P(Y_i > 1 \mid X_{i,1})}\right] = \beta_0^{(1)} - \beta_1 X_{i, 1}. \end{gather*}\]

Some algebraic manipulation…

\[\begin{gather*} P(Y_i \leq 4 \mid X_{i,1}) = \frac{\exp\left(\beta_0^{(4)} - \beta_1 X_{i, 1} \right)}{1 + \exp\left(\beta_0^{(4)} - \beta_1 X_{i, 1}\right)} \\ P(Y_i \leq 3 \mid X_{i,1}) = \frac{\exp\left(\beta_0^{(3)} - \beta_1 X_{i, 1} \right)}{1 + \exp\left(\beta_0^{(3)} - \beta_1 X_{i, 1}\right)} \\ P(Y_i \leq 2 \mid X_{i,1}) = \frac{\exp\left(\beta_0^{(2)} - \beta_1 X_{i, 1} \right)}{1 + \exp\left(\beta_0^{(2)} - \beta_1 X_{i, 1}\right)} \\ P(Y_i \leq 1 \mid X_{i,1}) = \frac{\exp\left(\beta_0^{(1)} - \beta_1 X_{i, 1} \right)}{1 + \exp\left(\beta_0^{(1)} - \beta_1 X_{i, 1}\right)}. \\ \end{gather*}\]

Note the following…

  • There are four intercepts \(\left( \beta_0^{(1)}, \beta_0^{(2)}, \beta_0^{(3)}, \beta_0^{(4)} \right)\) and a single coefficient \(\beta_1\).
  • Let us assume we estimate these five parameters to be the following: \[\begin{gather*} \hat{\beta_0}^{(1)} = 4 \quad \quad \hat{\beta_0}^{(2)} = 5.5 \\ \hat{\beta_0}^{(3)} = 6.3 \quad \quad \hat{\beta_0}^{(4)} = 10 \\ \hat{\beta_1} = 12. \end{gather*}\]

The sigmoid functions!

(Figure: the four estimated cumulative-probability sigmoid curves, which are horizontal shifts of one another.)
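A small base-R sketch of these curves using the hypothetical estimates above; under proportional odds, the four sigmoids share the same slope and never cross:

```r
# Cumulative-probability curves P(Y <= j | x) = plogis(b0_j - b1 * x)
# for the four hypothetical intercepts and the common slope above.
x  <- seq(0, 1, length.out = 200)
b0 <- c(4, 5.5, 6.3, 10)   # beta_0^(1), ..., beta_0^(4)
b1 <- 12                   # common slope beta_1

plot(NULL, xlim = range(x), ylim = c(0, 1),
     xlab = expression(X[i1]), ylab = expression(P(Y[i] <= j)))
for (j in 1:4) {
  lines(x, plogis(b0[j] - b1 * x), col = j, lwd = 2)
}
legend("topright", legend = paste0("j = ", 1:4), col = 1:4, lwd = 2)
```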

5. Estimation

  • Estimates are obtained through maximum likelihood, where we also assume a Multinomial joint probability mass function of the \(n\) responses \(Y_i\) via cumulative probabilities in the joint likelihood function.
  • To fit the model with the package MASS, we use polr(). The argument Hess = TRUE is required to compute the Hessian matrix of the log-likelihood function, which is used to obtain the standard errors of the estimates, as sketched below.
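A sketch of the fitting call, using the variable names from our wrangled data frame:

```r
library(MASS)

# Proportional odds model; Hess = TRUE stores the Hessian of the
# log-likelihood so summary() can report standard errors.
ordinal_model <- polr(decision ~ parent_ed + GPA,
                      data = college_data, Hess = TRUE)
```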

6. Inference

  • We also use the Wald statistic \(z_j = \frac{\hat{\beta}_j}{\mbox{se}(\hat{\beta}_j)}\) to test the hypotheses: \[\begin{gather*} H_0: \beta_j = 0\\ H_a: \beta_j \neq 0. \end{gather*}\]

  • Provided the sample size \(n\) is large enough, \(z_j\) has an approximately Standard Normal distribution under \(H_0\).

Code in R!
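Since polr() does not report \(p\)-values directly, a common sketch is to take the Wald \(z_j\) values from summary() and apply the Standard Normal approximation:

```r
# Coefficient table: columns Value, Std. Error, and t value (the Wald z_j).
summary_table <- coef(summary(ordinal_model))

# Two-sided p-values from the Standard Normal approximation.
p_values <- 2 * pnorm(abs(summary_table[, "t value"]), lower.tail = FALSE)
cbind(summary_table, "p value" = round(p_values, 4))
```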

  • Terms unlikely|somewhat likely and somewhat likely|very likely refer to \(\hat{\beta_0}^{(\texttt{unlikely})}\) and \(\hat{\beta_0}^{(\texttt{somewhat likely})}\), respectively.

Using \(\alpha = 0.05\)

  • We see that the coefficients associated with GPA and parent_ed are statistically significant at \(\alpha = 0.05\).
  • Function confint() can provide the 95% confidence intervals for the estimates contained in ordinal_model, as sketched below.
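For instance (95% is confint()'s default level; exponentiating moves the intervals to the odds scale):

```r
# Profile-likelihood confidence intervals for beta_1 and beta_2,
# exponentiated to the (multiplicative) odds scale.
exp(confint(ordinal_model))
```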

7. Coefficient Interpretation

  • The interpretation of the Ordinal Logistic regression coefficients will change since we are modelling cumulative probabilities.
  • Let us interpret the association of decision with GPA and parent_ed.

Interpretations

  • Using the column exp.estimate, along with the model equations on the original scale of the cumulative odds, we interpret the two regression coefficients for each set of odds as follows:
  • For the odds:

\[\frac{P(Y_i = \texttt{very likely} \mid X_{i, \texttt{GPA}},X_{i, \texttt{parent_ed}})}{P(Y_i \leq \texttt{somewhat likely} \mid X_{i, \texttt{GPA}},X_{i, \texttt{parent_ed}})} = \\ \exp\big(-\beta_0^{(\texttt{somewhat likely})}\big) \exp\big(\beta_1 X_{i, \texttt{GPA}}\big) \exp\big(\beta_2 X_{i, \texttt{parent_ed}}\big).\]

We have …

\(\beta_1\): “for each one-unit increase in the GPA, the odds that the student is very likely versus somewhat likely or unlikely to apply to graduate school increase by \(\exp \left( \hat{\beta}_1 \right) = 1.82\) times (while holding parent_ed constant).”

\(\beta_2\): “for those respondents whose parents attended graduate school, the odds that the student is very likely versus somewhat likely or unlikely to apply to graduate school increase by \(\exp \left( \hat{\beta}_2 \right) = 2.86\) times (when compared to those respondents whose parents did not attend graduate school, holding GPA constant).”

And…

  • For the odds:

\[\frac{P(Y_i \geq \texttt{somewhat likely} \mid X_{i, \texttt{GPA}},X_{i, \texttt{parent_ed}})}{P(Y_i = \texttt{unlikely} \mid X_{i, \texttt{GPA}},X_{i, \texttt{parent_ed}})} = \\ \exp\big(-\beta_0^{(\texttt{unlikely})}\big) \exp\big(\beta_1 X_{i, \texttt{GPA}}\big) \exp\big(\beta_2 X_{i, \texttt{parent_ed}}\big).\]

We have …

\(\beta_1\): “for each one-unit increase in the GPA, the odds that the student is very likely or somewhat likely versus unlikely to apply to graduate school increase by \(\exp \left( \hat{\beta}_1 \right) = 1.82\) times (while holding parent_ed constant).”

\(\beta_2\): “for those respondents whose parents attended graduate school, the odds that the student is very likely or somewhat likely versus unlikely to apply to graduate school increase by \(\exp \left( \hat{\beta}_2 \right) = 2.86\) times (when compared to those respondents whose parents did not attend graduate school, holding GPA constant).”

8. Predictions

  • Aside from our main statistical inquiries, using the function predict() with the object ordinal_model, we can obtain the estimated probabilities of applying to graduate school (associated with unlikely, somewhat likely, and very likely) for a student with a GPA of 3.5 whose parents attended graduate school, as shown below.
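A sketch of this prediction, assuming parent_ed is coded with levels "No" and "Yes":

```r
# Estimated standalone probabilities for each decision category.
new_student <- data.frame(parent_ed = "Yes", GPA = 3.5)
predict(ordinal_model, newdata = new_student, type = "probs")
```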

Thus…

  • We can see that somewhat likely is the most probable category, with an estimated probability of 0.48.
  • Thus, we could classify the student in this ordinal category.

9. Model Selection

  • To perform model selection, we can use the same techniques from Lecture 2. That said, some R functions might not be fully available for polr() models.
  • Still, these statistical techniques and metrics can be manually coded.

10. Non-proportional Odds

  • Now, we might wonder:

What happens if we do not fulfil the proportional odds assumption in our Ordinal Logistic regression model?


10.1. The Brant-Wald Test

  • The Ordinal Logistic model under the proportional odds assumption is the first step when performing Regression Analysis on an ordinal response.
  • Once this model has been fitted, it is possible to assess whether it fulfils this strong assumption statistically.
  • We can do it via the Brant-Wald test.
  • This tool statistically assesses whether our model globally fulfils this assumption: \[\begin{gather*} H_0: \text{Our Ordinal Logistic regression model globally} \\ \text{fulfils the proportional odds assumption.} \\ H_a: \text{Otherwise}. \end{gather*}\]

Regressor-specific Tests


  • It also performs further hypothesis testing on each regressor.
  • With \(k\) regressors, for \(j = 1, \dots, k\) we have the following hypotheses: \[\begin{gather*} H_0: \text{The } j \text{th regressor in our Ordinal Logistic regression} \\ \text{model fulfils the proportional odds assumption.} \\ H_a: \text{Otherwise}. \end{gather*}\]

10.2. Code in R!

  • Function brant() from package brant implements this tool, which can be used in our polr() object ordinal_model:
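A minimal call (the printed table layout may vary by package version):

```r
library(brant)

# Brant-Wald test on the fitted proportional odds model; the Omnibus row
# is the global test, the remaining rows are regressor-specific tests.
brant(ordinal_model)
```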

Notes on Output

  • The row Omnibus represents the global model, while the other two rows correspond to our two regressors: parent_ed and GPA.
  • With \(\alpha = 0.05\), we fail to reject \(H_0\) in all three tests; hence, there is no evidence against the proportional odds assumption (the column probability delivers the corresponding \(p\)-values).

Heads-up: The Brant-Wald test essentially compares this basic Ordinal Logistic regression model of \(m - 1\) cumulative logit functions (i.e., the ordinal response has \(m\) categories) versus a set of \(m - 1\) correlated Binary Logistic regressions.

10.3. A Non-proportional Odds Model

  • Suppose that our example case does not fulfil the proportional odds assumption according to the Brant-Wald test.
  • It is possible to fit a model under a non-proportional odds assumption:

\[\begin{align*} \eta_i^{(\texttt{somewhat likely})} &= \log\left[\frac{P(Y_i \leq \texttt{somewhat likely} \mid X_{i, \texttt{GPA}},X_{i, \texttt{parent_ed}})}{P(Y_i = \texttt{very likely} \mid X_{i, \texttt{GPA}}, X_{i, \texttt{parent_ed}})}\right] \\ &= \beta_0^{(\texttt{somewhat likely})} - \beta_1^{(\texttt{somewhat likely})} X_{i, \texttt{GPA}} - \\ & \qquad \beta_2^{(\texttt{somewhat likely})} X_{i, \texttt{parent_ed}} \end{align*}\] \[\begin{align*} \eta_i^{(\texttt{unlikely})} &= \log\left[\frac{P(Y_i = \texttt{unlikely} \mid X_{i, \texttt{GPA}},X_{i, \texttt{parent_ed}})}{P(Y_i \geq \texttt{somewhat likely} \mid X_{i, \texttt{GPA}},X_{i, \texttt{parent_ed}})}\right] \\ &= \beta_0^{(\texttt{unlikely})} - \beta_1^{(\texttt{unlikely})} X_{i, \texttt{GPA}} - \beta_2^{(\texttt{unlikely})} X_{i, \texttt{parent_ed}}. \end{align*}\]

Generalized Ordinal Logistic Regression Model


  • This class of models can be fit via the function cumulative() from package VGAM, as sketched below.
  • Note that estimation, inference, coefficient interpretation, and predictions are conducted similarly to the proportional odds model.
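A sketch of the call; setting parallel = FALSE in cumulative() frees one coefficient per cumulative logit, yielding the non-proportional odds model:

```r
library(VGAM)

# Generalized (non-proportional odds) Ordinal Logistic regression:
# each of the m - 1 cumulative logits gets its own coefficients.
nonprop_model <- vglm(decision ~ parent_ed + GPA,
                      family = cumulative(parallel = FALSE),
                      data = college_data)
summary(nonprop_model)
```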

In-class Question

  • Why would we have to be cautious when fitting a non-proportional odds model in the context of a response with a considerable number of ordinal levels and regressors?


11. Wrapping Up on Categorical Regression Models

  • Multinomial and Ordinal Logistic regression models address cases where our response is discrete and categorical.
  • These models are another class of GLMs.
  • Multinomial Logistic regression handles nominal responses. It relies heavily on a baseline response level and the natural logarithm of the odds.
  • Ordinal Logistic regression handles ordinal responses. It relies on cumulative odds on an ordinal scale.
  • We have to be careful with coefficient interpretations in these models, especially with the cumulative odds; careful interpretation is essential for communicating the estimated models effectively in inferential matters.