Lecture 2

GLMs: Model Selection and Multinomial Logistic Regression

(Please, sign in on iClicker)

Today’s Learning Goals

By the end of this lecture, you should be able to:

  • Perform likelihood-based model selection through analysis of deviance, Akaike Information Criterion, and Bayesian Information Criterion.
  • Extend the link function concept of the generalized linear models (GLMs) to other discrete categorical responses.
  • Outline the modelling framework of the Multinomial Logistic regression.
  • Fit and interpret the Multinomial Logistic regression.
  • Use the Multinomial Logistic regression for prediction.

Outline

  1. Likelihood-based Model Selection
  2. Categorical Type Responses
  3. Multinomial Logistic Regression

1. Likelihood-based Model Selection


  • In DSCI 561, you learned about model selection in Ordinary Least-squares (OLS) via Mallows' \(C_p\), the Akaike information criterion (AIC), and the Bayesian information criterion (BIC).
  • These metrics are also helpful tools to perform model selection in GLMs. Since GLMs are estimated via maximum likelihood, likelihood-based criteria apply to them naturally.

1.1. The Crabs Dataset

  • The data frame crabs (Brockmann, 1996) is a dataset detailing the counts of satellite male crabs residing around a female crab nest.

1.2. Estimation

To perform model selection, let us estimate two Poisson regression models with n_males as a response:

\[\begin{align*} \textbf{Model 1:} & \\ & h(\lambda_i) = \log(\lambda_i) = \beta_0 + \beta_1 X_{\texttt{width}_i}. \\ \textbf{Model 2:} & \\ & h(\lambda_i) = \log (\lambda_i) = \beta_0 + \beta_1 X_{\texttt{width}_i} + \beta_2 X_{\texttt{color_darker}_i} + \\ & \qquad \qquad \qquad \quad \beta_3 X_{\texttt{color_light}_i} + \beta_4 X_{\texttt{color_medium}_i}. \end{align*}\]

Main Statistical Inquiry

  • We want to determine which Poisson regression model fits the data better: Model 1 or Model 2.
  • Then, via glm() and crabs, we obtain our regression estimates.
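The two fits can be sketched as follows. Since the crabs data frame is not bundled with this export, the sketch uses simulated data under the assumption that the columns are named n_males, width, and color (with levels dark, darker, light, and medium, so that dark is the baseline and the dummies match the lecture's notation); the glm() calls themselves mirror Models 1 and 2.

```r
# Simulated stand-in for the crabs data frame (column names assumed).
set.seed(562)
n <- 173
crabs_sim <- data.frame(
  width = rnorm(n, mean = 26, sd = 2),
  color = factor(sample(c("dark", "darker", "light", "medium"), n, replace = TRUE))
)
# Poisson counts generated from a log-linear model in width only.
crabs_sim$n_males <- rpois(n, lambda = exp(-3 + 0.15 * crabs_sim$width))

# Model 1: width only.  Model 2: width plus the color dummies
# (dark is the alphabetical baseline, so the dummies are darker/light/medium).
poisson_model   <- glm(n_males ~ width,         family = poisson, data = crabs_sim)
poisson_model_2 <- glm(n_males ~ width + color, family = poisson, data = crabs_sim)

coef(poisson_model_2)  # intercept, width, and one dummy per non-baseline color
```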

1.3. Goodness of Fit Test

  • The deviance (\(D_k\)) criterion can be used to compare a given model with \(k\) regressors against a baseline model.

Heads-up: The usual baseline model is the saturated or full model, which perfectly fits the data because it allows a distinct Poisson mean \(\lambda_i\) for the \(i\)th observation in the training dataset (\(i = 1, \dots, n\)), unrelated to the \(k\) regressors.

  • The maximized likelihood of this full model is denoted as \(\hat{\mathscr{l}}_f\).
  • Let \(\hat{\mathscr{l}}_k\) be the value of the maximized likelihood computed from our dataset of \(n\) observations with \(k\) regressors.

Comparing the fits

  • We can compare the fits provided by these two models by the deviance \(D_k\) given by

\[D_k = -2 \log \left( \frac{\hat{\mathscr{l}}_k}{\hat{\mathscr{l}}_f} \right) = -2 \left[ \log \left( \hat{\mathscr{l}}_k \right) - \log \left( \hat{\mathscr{l}}_f \right) \right].\]

\(D_k\) expresses how much our given model deviates from the full model on the log-likelihood scale:

  • Large values of \(D_k\) arise when \(\hat{\mathscr{l}}_k\) is small relative to \(\hat{\mathscr{l}}_f\), indicating that our given model fits the data poorly compared to the baseline model.
  • Small values of \(D_k\) arise when \(\hat{\mathscr{l}}_k\) is similar to \(\hat{\mathscr{l}}_f\), indicating that our given model provides a good fit to the data compared to the baseline model.

Formula for \(D_k\) in Poisson regression


  • Specifically for Poisson regression with \(k\) regressors, it can be shown that \(D_k\) is:

\[\begin{gather} \hat{\lambda}_i = \exp{\left( \hat{\beta_0} + \hat{\beta}_1 x_{i,1} + \dots + \hat{\beta}_k x_{i,k} \right)} \\ D_k = 2 \sum_{i = 1}^n \left[ y_i \log \left( \frac{y_i}{\hat{\lambda}_i} \right) - \left( y_i - \hat{\lambda}_i \right) \right] \end{gather}\]

  • Note \(y_i\) is the \(i\)th observed response in the training set of size \(n\).
  • When \(y_i = 0\), the term \(y_i \log \left( \frac{y_i}{\hat{\lambda}_i} \right)\) is taken to be \(0\).
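The formula for \(D_k\) can be verified numerically against the residual deviance that R reports for a fitted Poisson GLM. This sketch uses simulated data, since the lecture's dataset is not available here:

```r
# Verify the Poisson deviance formula against deviance() on a fitted glm.
set.seed(562)
n <- 100
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.3 * x))
fit <- glm(y ~ x, family = poisson)

lambda_hat <- fitted(fit)
# Term-by-term deviance; y * log(y / lambda_hat) is taken as 0 when y == 0.
log_term <- ifelse(y == 0, 0, y * log(y / lambda_hat))
D_k <- 2 * sum(log_term - (y - lambda_hat))

all.equal(D_k, deviance(fit))  # TRUE
```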

Intuition behind \(D_k\)

  • \(D_k\) depicts the agreement of our model with \(k\) regressors to the observed data.
  • Hence, it can be used to test the goodness of fit; i.e., whether our fitted model is correctly specified, in the sense that it fits the data as well as the saturated model (with a level of significance \(\alpha\)!):

\[\begin{gather*} H_0: \text{Our}\textbf{ Model with $k$ regressors} \\ \text{ fits the data as well as the } \textbf{Saturated Model.} \\ H_a: \text{Otherwise.} \end{gather*}\]

Testing poisson_model_2!

\[\begin{align*} \textbf{Model 2:} & \\ h(\lambda_i) &= \log (\lambda_i) \\ &= \beta_0 + \beta_1 X_{\texttt{width}_i} + \beta_2 X_{\texttt{color_darker}_i} + \\ & \qquad \beta_3 X_{\texttt{color_light}_i} + \beta_4 X_{\texttt{color_medium}_i}. \end{align*}\]

Asymptotic distribution of \(D_k\)

  • Column deviance provides \(D_k\) (residual deviance), which is the test statistic. Asymptotically, we have the following null distribution: \[ D_k \sim \chi^2_{n - (k + 1)}. \]
  • The degrees of freedom (column df.residual in our glance() output) are the difference between the training set size \(n\) and the number of regression parameters in our model (including the intercept \(\beta_0\)).

Computing \(p\)-value via R

  • Let us obtain the corresponding \(p\text{-value}\) for this test. We can do it using pchisq():
  • We obtain a \(p\text{-value} < .001\), which gives statistical evidence that our poisson_model_2 is not correctly specified; it does not fit the data as well as the saturated model.
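The computation can be sketched as follows. The residual deviance and degrees of freedom below are illustrative placeholders (the real values come from the glance() output on poisson_model_2); the pchisq() call is the key step, using the upper tail of the \(\chi^2_{n - (k + 1)}\) null distribution:

```r
# Goodness-of-fit p-value from the residual deviance (illustrative numbers).
D_k <- 551.8   # residual deviance of the fitted model (assumed value)
df  <- 168     # n - (k + 1) residual degrees of freedom (assumed value)

p_value <- pchisq(D_k, df = df, lower.tail = FALSE)
p_value  # far below .001 here, so we reject H_0
```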

1.4. Analysis of Deviance for Nested Models

  • We can use analysis of deviance for model selection when two models are nested.
  • Hence, we will test our two Poisson models:

\[\begin{align*} \textbf{Model 1:} & \\ & h(\lambda_i) = \log(\lambda_i) = \beta_0 + \beta_1 X_{\texttt{width}_i}. \\ \textbf{Model 2:} & \\ & h(\lambda_i) = \log (\lambda_i) = \beta_0 + \beta_1 X_{\texttt{width}_i} + \beta_2 X_{\texttt{color_darker}_i} + \\ & \qquad \qquad \qquad \quad \beta_3 X_{\texttt{color_light}_i} + \beta_4 X_{\texttt{color_medium}_i}. \end{align*}\]

Hypothesis testing

  • This specific model selection will involve a hypothesis test. The hypotheses are:

\[\begin{gather*} H_0: \textbf{Model 1} \text{ fits the data as well as } \textbf{Model 2} \text{ (i.e., } \beta_2 = \beta_3 = \beta_4 = 0\text{)} \\ H_a: \textbf{Model 2} \text{ fits the data better than } \textbf{Model 1}. \end{gather*}\]

Analysis of deviance


  • Let \(D_2\) be the deviance (column Resid. Dev) for Model 2 (poisson_model_2) in row 2 and \(D_1\) (column Resid. Dev) the deviance for Model 1 (poisson_model) in row 1.
  • The test statistic \(\Delta_D\) (column Deviance) for the analysis of deviance is given by \(\Delta_D = D_1 - D_2\).
  • Under \(H_0\), \(\Delta_D\) is asymptotically (i.e., as \(n \rightarrow \infty\)) Chi-squared distributed with \(3\) degrees of freedom (column Df) for this specific case: \(\Delta_D \sim \chi^2_3\).
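In R, anova() with test = "Chisq" carries out this comparison directly. The sketch below uses simulated stand-in data (the real call would compare poisson_model and poisson_model_2 fitted on crabs), but the mechanics of the \(\Delta_D\) test are the same:

```r
# Nested-model comparison via analysis of deviance (simulated data).
set.seed(562)
n <- 173
dat <- data.frame(
  width = rnorm(n, mean = 26, sd = 2),
  color = factor(sample(c("dark", "darker", "light", "medium"), n, replace = TRUE))
)
dat$n_males <- rpois(n, lambda = exp(-3 + 0.15 * dat$width))

m1 <- glm(n_males ~ width,         family = poisson, data = dat)
m2 <- glm(n_males ~ width + color, family = poisson, data = dat)

tab <- anova(m1, m2, test = "Chisq")
tab$Deviance[2]   # Delta_D = D_1 - D_2
tab$Df[2]         # 3 extra regression parameters -> 3 degrees of freedom
```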

What is the conclusion?


  • In the context of model selection, we would choose poisson_model_2, which also includes the color of the female prosoma.

Heads-up: The degrees of freedom equal the difference in the number of regression parameters between the two models. Formally, this nested hypothesis test is called the likelihood-ratio test.

1.5. Akaike Information Criterion (AIC)

  • Analysis of deviance only allows us to compare nested regression models.
  • Thus, we need alternatives for model selection.
  • The AIC makes it possible to compare models whether or not they are nested.
  • For a model with \(k\) terms and a deviance \(D_k\), AIC is defined as: \[ \mbox{AIC}_k = D_k + 2k. \]

What are the AIC rules?

  • Models with smaller values of \(\mbox{AIC}_k\) are preferred.
  • Hence, \(\mbox{AIC}_k\) favours models with small values of \(D_k\).

Heads-up: \(\mbox{AIC}_k\) penalizes for including more regressors in the model. Hence, it discourages overfitting. This is why we select that model with the smallest \(\mbox{AIC}_k\).

AIC in R!

  • The function glance() shows us the \(\mbox{AIC}_k\) by model.

Following the results of the AIC column, we choose poisson_model_2 over poisson_model.
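A hedged sketch of the comparison with base R's AIC() on simulated data: stats::AIC() computes \(-2 \log \hat{\mathscr{l}}_k + 2(k + 1)\), which differs from the lecture's \(D_k + 2k\) only by a constant for models fitted to the same data with the same family, so both definitions rank the models identically.

```r
# AIC comparison of two Poisson GLMs (simulated data; stats::AIC differs
# from D_k + 2k only by a constant, so the model ranking is the same).
set.seed(562)
n <- 173
dat <- data.frame(x = rnorm(n))
dat$y <- rpois(n, lambda = exp(0.2 + 0.4 * dat$x))

m1 <- glm(y ~ 1, family = poisson, data = dat)  # intercept only
m2 <- glm(y ~ x, family = poisson, data = dat)  # adds the true regressor

AIC(m1, m2)  # m2 has the smaller AIC here, so it is preferred
```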

1.6. Bayesian Information Criterion

  • An alternative to AIC is the Bayesian Information Criterion (BIC).
  • The BIC also makes it possible to compare models whether or not they are nested.
  • For a model with \(k\) terms, a deviance \(D_k\), and fitted via a training set of size \(n\); BIC is defined as: \[ \mbox{BIC}_k = D_k + k \log (n). \]

What are the BIC rules?

  • Models with smaller values of \(\mbox{BIC}_k\) are preferred.
  • Like \(\mbox{AIC}_k\), \(\mbox{BIC}_k\) favours models with small values of \(D_k.\)

Heads-up: The differences between AIC and BIC will be more pronounced in datasets with large sample sizes \(n\). As the BIC penalty of \(k \log (n)\) will always be larger than the AIC penalty of \(2k\) when \(n > 7\), BIC tends to select models with fewer regressors than AIC.

BIC in R!

Following the results of the AIC column, we already chose poisson_model_2 over poisson_model. Nonetheless, the BIC penalizes poisson_model_2 more heavily for its extra model parameters, so poisson_model would be chosen under this criterion.
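The same comparison is available through base R's BIC(), whose per-parameter penalty is \(\log(n)\) instead of \(2\) and is therefore stricter than the AIC's whenever \(n > 7\). A sketch on simulated data, where one regressor is pure noise:

```r
# BIC comparison of two Poisson GLMs (simulated data; z is pure noise,
# so BIC's log(n) penalty will typically favour the smaller model).
set.seed(562)
n <- 1000
dat <- data.frame(x = rnorm(n), z = rnorm(n))
dat$y <- rpois(n, lambda = exp(0.2 + 0.4 * dat$x))

m_small <- glm(y ~ x,     family = poisson, data = dat)
m_big   <- glm(y ~ x + z, family = poisson, data = dat)

BIC(m_small, m_big)  # compare; BIC - AIC = (number of parameters) * (log(n) - 2)
```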

What is the statistical rationale behind both metrics?

  • AIC: From a given set of models (either nested or non-nested) fitted via a common training dataset, this metric aims to select the model with the best predictive performance even though this model chosen might not be the true one that generates the training data.
  • BIC: From a given set of models (either nested or non-nested) fitted via a common training dataset, this metric will tend to select the true model that is generating the training data as \(n\) approaches infinity and as long as the true model is within this given set of models.

2. Categorical Type Responses

  • Now, we will cover those discrete responses with more than two categories:

Nominal: We have categories that do not follow any specific order.

Ordinal: The categories, in this case, follow a specific order.

  • Let us pave the way to two new GLMs to address this matter: Multinomial and Ordinal Logistic regressions.

3. Multinomial Logistic Regression


Let us check the mind map in the lecture notes

  • Multinomial Logistic regression is a maximum likelihood-based GLM that addresses inferential (and predictive!) inquiries where the response is categorical and nominal.

3.1. The Spotify Dataset

A sub-sample of the Spotify song data originally collected by Kaylin Pavlik (kaylinquest) and distributed through the R for Data Science TidyTuesday project.

Copyright © 2025 Spotify AB. Spotify is a registered trademark of the Spotify Group.

Let us take a look at the data!

  • The spotify dataset is a sample of \(n = 350\) songs with different variables by column.
  • This dataset has 23 variables in total.
  • The \(n = 350\) songs belong to 44 different artists.

iClicker Question

  • Having checked the description of the spotify dataset, suppose we want to conduct an inferential study.
  • What would be the primary nature of our conclusions?

A. We could draw causation conclusions across all music platforms.

B. We could draw association conclusions on the Spotify platform.

C. We could draw association conclusions across all music platforms.

D. We could draw causation conclusions on the Spotify platform.

Main Statistical Inquiries

  • I am a big fan of the electronic dance music (edm) catalogue on Spotify. Nevertheless, I do not have access to their whole song database.
  • Still, I would like to:

    • Statistically quantify how danceable edm is compared to other genres on the platform, and by how much.
    • Statistically quantify how euphoric edm is compared to other genres on the platform, and by how much.

3.2. Data Wrangling and Summary

  • We will extract these key variables by song from the spotify dataset:
    1. genre: genre of the playlist where the song belongs (a categorical and nominal variable).
    2. danceability: a numerical score from 0 (not danceable) to 100 (danceable) based on features such as tempo, rhythm, etc.
    3. valence: a numerical score from 0 (the song is more negative, sad, angry) to 100 (the song is more positive, happy, euphoric).

Preparing our training dataset!

Checking all different genres!


Heads-up: Note that factor-type genre has six levels (edm refers to electronic dance music and r&b to rhythm and blues). Moreover, from a coding perspective, edm is the baseline level (it is located on the left-hand side in the levels() output).

Creating a dataset summary!

  • Let us summarize the data by genre.
  • Having checked this genre composition, we have an acceptable number of songs to draw inference from.
  • Moreover, any class imbalance will be handled through the distributional assumption of the Multinomial Logistic regression.

3.3. Exploratory Data Analysis

  • As done in our previous lecture, let us take a look at the data first.
  • Note that genre is a discrete nominal response.
  • Hence, we could use the following side-by-side plots for both danceability and valence:
    • Boxplots.
    • Violin plots.

Plots!

In-class Question

  • What can we see descriptively from the above plots?

3.4. Data Modelling Framework

  • A Multinomial Logistic Regression model is a suitable approach to our statistical inquiries: our response of interest genre is categorical and nominal, subject to the numerical regressors danceability and valence.
  • Moreover, its corresponding regression estimates will allow us to measure variable association.

3.4.1. Primary Intuition of the Multinomial Logistic Regression Model

  • Digging into the Multinomial regression model will require checking Binary Logistic regression first.
  • Thus, suppose we are initially interested in two genres: edm and latin.

A note on the baseline level!


  • edm is the baseline level (i.e., it appears on the left-hand side in the levels() output).


Reminder on Binary Logistic Modelling


  • Assuming we have a training set of \(n\) elements, recall that the \(i\)th response (for \(i = 1, \dots, n\)) in a Binary Logistic regression is

\[ Y_i = \begin{cases} 1 \; \; \; \; \mbox{if the genre is } \texttt{latin},\\ 0 \; \; \; \; \mbox{if the baseline genre is } \texttt{edm}. \end{cases} \]

  • Therefore, \(Y_i \sim \text{Bernoulli}(p_i)\), whose probability of success is \(p_i\).

Binary Logistic Regression Estimation via R

  • Recall that this model can be fitted as follows:
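The slide's code chunk did not survive this export, so here is a hedged reconstruction. It uses simulated data standing in for the two-genre subset of spotify (the column names danceability, valence, and genre follow the lecture); the essential piece is glm() with family = binomial and edm as the baseline factor level:

```r
# Hedged reconstruction of the Binary Logistic fit (simulated data).
set.seed(562)
n <- 120
dat <- data.frame(
  danceability = runif(n, 20, 95),
  valence      = runif(n, 5, 95)
)
# Songs with higher valence are made more likely to be latin here.
p_latin <- plogis(-3 + 0.05 * dat$valence)
dat$genre <- factor(ifelse(runif(n) < p_latin, "latin", "edm"),
                    levels = c("edm", "latin"))  # edm = baseline level

binary_model <- glm(genre ~ danceability + valence,
                    family = binomial, data = dat)
summary(binary_model)$coefficients
```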


Inference in Binary Logistic Regression

  • Note that only valence is statistically significant for the response genre (\(p\text{-value} < .01\)).


Coefficient Interpretation in Binary Logistic Regression

  • Then, we make the corresponding coefficient interpretations (the previous output already provides the estimate on the odds scale via exponentiate = TRUE):


For each unit increase in the valence score in the Spotify catalogue, the odds of a song being latin rather than edm increase by a factor of 1.05.

3.4.2. More than One Logit Function

What if our response is nominal and has more than two categories?

  • In that case, let us suppose that a given discrete nominal response \(Y_i\) has categories \(1, 2, \dots, m\).

Heads-up: Categories \(1, 2, \dots, m\) are merely labels here. Thus, they do not implicate an ordinal scale.

  • This regression approach assumes a Multinomial distribution where \(p_{i,1}, p_{i,2}, \dots, p_{i,m}\) are the probabilities that \(Y_i\) will belong to categories \(1, 2, \dots, m\) respectively; i.e.,

\[ P(Y_i = 1) = p_{i,1} \;\;\;\; P(Y_i = 2) = p_{i,2} \;\; \dots \;\; P(Y_i = m) = p_{i,m} \quad \text{where} \]

\[ \sum_{j = 1}^m p_{i,j} = p_{i,1} + p_{i,2} + \dots + p_{i,m} = 1. \]

Is one logarithm of the odds enough?

  • The Multinomial Logistic regression also models the logarithm of the odds.
  • However, a single logarithm of the odds (or logit) is no longer enough.
  • Here is what we can do:
    1. Pick one of the categories to be the baseline. For example, the category “\(1\)” (or edm in our context).
    2. For each of the other categories, we model the logarithm of the odds to the baseline category.

Checking response levels!

  • Let us set the Multinomial Logistic regression model with genre as a response subject to danceability and valence as continuous regressors.
  • Level edm will be the baseline:

Log-odds and odds ratio


\[\begin{align*} \eta_i^{(\texttt{latin},\texttt{edm})} &= \log\left[\frac{P(Y_i = \texttt{latin} \mid X_{i, \texttt{dance}}, X_{i, \texttt{val}})}{P(Y_i = \texttt{edm} \mid X_{i, \texttt{dance}}, X_{i, \texttt{val}})}\right] \\ &= \beta_0^{(\texttt{latin},\texttt{edm})} + \beta_1^{(\texttt{latin},\texttt{edm})} X_{i, \texttt{dance}} + \beta_2^{(\texttt{latin},\texttt{edm})} X_{i, \texttt{val}} \end{align*}\]


\[\begin{align*} \frac{P(Y_i = \texttt{latin}\mid X_{i,\texttt{dance}}, X_{i,\texttt{val}})}{P(Y_i = \texttt{edm} \mid X_{i,\texttt{dance}}, X_{i,\texttt{val}})} &= \exp\left[\beta_0^{(\texttt{latin},\texttt{edm})}\right] \times \exp\left[\beta_1^{(\texttt{latin},\texttt{edm})} X_{i,\texttt{dance}}\right] \times \\ & \qquad \exp\left[\beta_2^{(\texttt{latin},\texttt{edm})} X_{i,\texttt{val}}\right] \end{align*}\]

Probability


\[\begin{gather*} p_{i,\texttt{edm}} = P(Y_i = \texttt{edm} \mid X_{i,\texttt{dance}}, X_{i,\texttt{val}}) = \\ \frac{1}{1 + \exp\big(\eta_i^{(\texttt{latin},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{pop},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{r&b},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{rap},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{rock},\texttt{edm})}\big)} \end{gather*}\]

Similarly …


\[\begin{align*} \eta_i^{(\texttt{pop},\texttt{edm})} &= \log\left[\frac{P(Y_i = \texttt{pop} \mid X_{i, \texttt{dance}}, X_{i, \texttt{val}})}{P(Y_i = \texttt{edm} \mid X_{i, \texttt{dance}}, X_{i, \texttt{val}})}\right] \\ &= \beta_0^{(\texttt{pop},\texttt{edm})} + \beta_1^{(\texttt{pop},\texttt{edm})} X_{i, \texttt{dance}} + \beta_2^{(\texttt{pop},\texttt{edm})} X_{i, \texttt{val}} \\ \eta_i^{(\texttt{r&b},\texttt{edm})} &= \log\left[\frac{P(Y_i = \texttt{r&b} \mid X_{i, \texttt{dance}}, X_{i, \texttt{val}})}{P(Y_i = \texttt{edm} \mid X_{i, \texttt{dance}}, X_{i, \texttt{val}})}\right] \\ &= \beta_0^{(\texttt{r&b},\texttt{edm})} + \beta_1^{(\texttt{r&b},\texttt{edm})} X_{i, \texttt{dance}} + \beta_2^{(\texttt{r&b},\texttt{edm})} X_{i, \texttt{val}} \\ \eta_i^{(\texttt{rap},\texttt{edm})} &= \log\left[\frac{P(Y_i = \texttt{rap} \mid X_{i, \texttt{dance}}, X_{i, \texttt{val}})}{P(Y_i = \texttt{edm} \mid X_{i, \texttt{dance}}, X_{i, \texttt{val}})}\right] \\ &= \beta_0^{(\texttt{rap},\texttt{edm})} + \beta_1^{(\texttt{rap},\texttt{edm})} X_{i, \texttt{dance}} + \beta_2^{(\texttt{rap},\texttt{edm})} X_{i, \texttt{val}} \end{align*}\]

Similarly (cont’d)


\[\begin{align*} \eta_i^{(\texttt{rock},\texttt{edm})} &= \log\left[\frac{P(Y_i = \texttt{rock} \mid X_{i, \texttt{dance}}, X_{i, \texttt{val}})}{P(Y_i = \texttt{edm} \mid X_{i, \texttt{dance}}, X_{i, \texttt{val}})}\right] \\ &= \beta_0^{(\texttt{rock},\texttt{edm})} + \beta_1^{(\texttt{rock},\texttt{edm})} X_{i, \texttt{dance}} + \beta_2^{(\texttt{rock},\texttt{edm})} X_{i, \texttt{val}}. \end{align*}\]

Odds


\[\begin{align*} \frac{P(Y_i = \texttt{pop}\mid X_{i,\texttt{dance}}, X_{i,\texttt{val}})}{P(Y_i = \texttt{edm} \mid X_{i,\texttt{dance}}, X_{i,\texttt{val}})}& = \exp\left[\beta_0^{(\texttt{pop},\texttt{edm})}\right] \times \exp\left[\beta_1^{(\texttt{pop},\texttt{edm})} X_{i,\texttt{dance}}\right] \times \\ & \qquad \exp\left[\beta_2^{(\texttt{pop},\texttt{edm})} X_{i,\texttt{val}}\right] \\ \frac{P(Y_i = \texttt{r&b}\mid X_{i,\texttt{dance}}, X_{i,\texttt{val}})}{P(Y_i = \texttt{edm} \mid X_{i,\texttt{dance}}, X_{i,\texttt{val}})} &= \exp\left[\beta_0^{(\texttt{r&b},\texttt{edm})}\right] \times \exp\left[\beta_1^{(\texttt{r&b},\texttt{edm})} X_{i,\texttt{dance}}\right] \times \\ & \qquad \exp\left[\beta_2^{(\texttt{r&b},\texttt{edm})} X_{i,\texttt{val}}\right] \end{align*}\]

Odds (cont’d)


\[\begin{align*} \frac{P(Y_i = \texttt{rap}\mid X_{i,\texttt{dance}}, X_{i,\texttt{val}})}{P(Y_i = \texttt{edm} \mid X_{i,\texttt{dance}}, X_{i,\texttt{val}})} &= \exp\left[\beta_0^{(\texttt{rap},\texttt{edm})}\right] \times \exp\left[\beta_1^{(\texttt{rap},\texttt{edm})} X_{i,\texttt{dance}}\right] \times \\ & \qquad \exp\left[\beta_2^{(\texttt{rap},\texttt{edm})} X_{i,\texttt{val}}\right] \\ \frac{P(Y_i = \texttt{rock}\mid X_{i,\texttt{dance}}, X_{i,\texttt{val}})}{P(Y_i = \texttt{edm} \mid X_{i,\texttt{dance}}, X_{i,\texttt{val}})} &= \exp\left[\beta_0^{(\texttt{rock},\texttt{edm})}\right] \times \exp\left[\beta_1^{(\texttt{rock},\texttt{edm})} X_{i,\texttt{dance}}\right] \times \\ & \qquad \exp\left[\beta_2^{(\texttt{rock},\texttt{edm})} X_{i,\texttt{val}}\right]. \end{align*}\]

Probabilities

\[\begin{gather*} p_{i,\texttt{latin}} = P(Y_i = \texttt{latin} \mid X_{i,\texttt{dance}}, X_{i,\texttt{val}}) = \\ \frac{\exp\big(\eta_i^{(\texttt{latin},\texttt{edm})}\big)}{1 + \exp\big(\eta_i^{(\texttt{latin},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{pop},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{r&b},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{rap},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{rock},\texttt{edm})}\big)} \\ p_{i,\texttt{pop}} = P(Y_i = \texttt{pop} \mid X_{i,\texttt{dance}}, X_{i,\texttt{val}}) = \\ \frac{\exp\big(\eta_i^{(\texttt{pop},\texttt{edm})}\big)}{1 + \exp\big(\eta_i^{(\texttt{latin},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{pop},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{r&b},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{rap},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{rock},\texttt{edm})}\big)} \end{gather*}\]

\[\begin{gather*} p_{i,\texttt{r&b}} = P(Y_i = \texttt{r&b} \mid X_{i,\texttt{dance}}, X_{i,\texttt{val}}) = \\ \frac{\exp\big(\eta_i^{(\texttt{r&b},\texttt{edm})}\big)}{1 + \exp\big(\eta_i^{(\texttt{latin},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{pop},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{r&b},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{rap},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{rock},\texttt{edm})}\big)} \\ p_{i,\texttt{rap}} = P(Y_i = \texttt{rap} \mid X_{i,\texttt{dance}}, X_{i,\texttt{val}}) = \\ \frac{\exp\big(\eta_i^{(\texttt{rap},\texttt{edm})}\big)}{1 + \exp\big(\eta_i^{(\texttt{latin},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{pop},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{r&b},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{rap},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{rock},\texttt{edm})}\big)} \end{gather*}\]

Probabilities (cont’d)


\[\begin{gather*} p_{i,\texttt{rock}} = P(Y_i = \texttt{rock} \mid X_{i,\texttt{dance}}, X_{i,\texttt{val}}) = \\ \frac{\exp\big(\eta_i^{(\texttt{rock},\texttt{edm})}\big)}{1 + \exp\big(\eta_i^{(\texttt{latin},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{pop},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{r&b},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{rap},\texttt{edm})}\big) + \exp\big(\eta_i^{(\texttt{rock},\texttt{edm})}\big)}. \end{gather*}\]
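The six probabilities above always sum to \(1\) by construction, since they share the same denominator. A quick numeric sanity check, using arbitrary made-up values for the five linear predictors \(\eta_i^{(\cdot,\texttt{edm})}\):

```r
# Sanity check: the softmax-style probabilities sum to 1
# (the eta values below are arbitrary, made-up linear predictors).
eta <- c(latin = 0.4, pop = -0.2, `r&b` = 1.1, rap = 0.0, rock = -1.5)

denom   <- 1 + sum(exp(eta))
p_edm   <- 1 / denom          # baseline category
p_other <- exp(eta) / denom   # the five non-baseline categories

sum(c(p_edm, p_other))  # 1
```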

iClicker Question

  • Answer TRUE or FALSE:

Now that we removed the restriction of only one link function, we also removed the distributional assumption for \(Y_i\). Hence, our model has no distributional assumption.

A. TRUE

B. FALSE

iClicker Question

  • Answer TRUE or FALSE:

From the statistical point of view, the model is now non-parametric.

A. TRUE

B. FALSE

In-class Question

  • Suppose that, given our inquiries, a peer suggests we fit two separate OLS with genre as a regressor and danceability OR valence as the responses.
  • Can you think of some drawbacks of this suggested approach?

3.5. Estimation

  • To fit the model with the package nnet, we use the function multinom(), which obtains the corresponding estimates.
  • The estimates are obtained through maximum likelihood, where we assume a Multinomial joint probability mass function of the \(n\) responses \(Y_i\).
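A hedged sketch of the fit with nnet::multinom(). Simulated data with three genres stands in for the lecture's training set; on the real data, the call would be multinom(genre ~ danceability + valence, data = ...) with the spotify-derived training data frame (whose name is assumed here):

```r
library(nnet)  # multinom() lives in the nnet package

# Simulated stand-in for the training data (three genres; edm = baseline).
set.seed(562)
n <- 300
dat <- data.frame(
  danceability = runif(n, 20, 95),
  valence      = runif(n, 5, 95),
  genre = factor(sample(c("edm", "latin", "rock"), n, replace = TRUE),
                 levels = c("edm", "latin", "rock"))
)

multinomial_model <- multinom(genre ~ danceability + valence,
                              data = dat, trace = FALSE)
coef(multinomial_model)  # one row of log-odds estimates per non-baseline level
```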

The code!


3.6. Inference

  • We also use the Wald statistic \(z_j^{(u, v)} = \frac{\hat{\beta}_j^{(u, v)}}{\mbox{se}\left(\hat{\beta}_j^{(u, v)}\right)}\) to test the hypotheses \[\begin{gather*} H_0: \beta_j^{(u, v)} = 0\\ H_a: \beta_j^{(u, v)} \neq 0. \end{gather*}\]
  • Provided the sample size \(n\) is large enough, \(z_j^{(u, v)}\) has an approximately Standard Normal distribution under \(H_0\).

Confidence Intervals (CIs) for Odds

  • We can also use the function tidy() from the broom package along with argument conf.int = TRUE to get the 95% CIs by default.
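A sketch of that call, again on simulated stand-in data: exponentiate = TRUE returns the estimates and 95% CIs on the odds scale, with one row per term and non-baseline response level (column y.level):

```r
library(nnet)
library(broom)

# Simulated stand-in data (three genres; edm is the alphabetical baseline).
set.seed(562)
n <- 300
dat <- data.frame(
  danceability = runif(n, 20, 95),
  valence      = runif(n, 5, 95),
  genre = factor(sample(c("edm", "latin", "rock"), n, replace = TRUE))
)
fit <- multinom(genre ~ danceability + valence, data = dat, trace = FALSE)

# 95% CIs by default, reported on the odds scale via exponentiate = TRUE.
tidy(fit, conf.int = TRUE, exponentiate = TRUE)
```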

Filtering those significant estimates…

What can we conclude?


  • Since our baseline response is edm, we can conclude the following with \(\alpha = 0.05\) on the Spotify platform:
    • There is a statistical difference in danceability in edm versus r&b and rock.
    • There is a statistical difference in valence in edm versus latin.

3.7. Coefficient Interpretation

\(\beta_2^{(\texttt{latin},\texttt{edm})}\): for each unit increase in the valence score in the Spotify catalogue, the odds of a song being latin rather than edm increase by a factor of \(1.05\) (i.e., by 5%).

\(\beta_1^{(\texttt{r&b},\texttt{edm})}\): for each unit increase in the danceability score in the Spotify catalogue, the odds of a song being r&b rather than edm decrease by \((1 - 0.963) \times 100\% = 3.7\%\).

\(\beta_1^{(\texttt{rock},\texttt{edm})}\): for each unit increase in the danceability score in the Spotify catalogue, the odds of a song being rock rather than edm decrease by \((1 - 0.918) \times 100\% = 8.2\%\).

3.8. Predictions

  • Suppose we want to predict the genre probabilities for a given song with a danceability score of 27.5 and valence of 30.
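Prediction can be sketched with predict() and type = "probs", which returns one estimated probability per genre. The fit below uses simulated stand-in data (three genres rather than the lecture's six), but the prediction step is the same:

```r
library(nnet)

# Simulated stand-in training data and fit (three genres for brevity).
set.seed(562)
n <- 300
dat <- data.frame(
  danceability = runif(n, 20, 95),
  valence      = runif(n, 5, 95),
  genre = factor(sample(c("edm", "latin", "rock"), n, replace = TRUE))
)
fit <- multinom(genre ~ danceability + valence, data = dat, trace = FALSE)

# Predicted genre probabilities for the song from the lecture's inquiry.
new_song <- data.frame(danceability = 27.5, valence = 30)
probs <- predict(fit, newdata = new_song, type = "probs")
probs  # one probability per genre, summing to 1
```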

3.9. Model Selection

  • To perform model selection, we can use the same techniques from the Binary Logistic regression model.
  • That said, some R functions implementing these techniques are not fully available for multinom() models.
  • Still, these statistical techniques and metrics can be manually coded.