Lecture 1

GLMs: Link Functions and Count Regression

(Please sign in on iClicker)

Hello and welcome!


Teaching Team

Section 1: Payman Nickchi

Section 2: G. Alexi Rodríguez-Arelis

High-Level Goals of this Course

  • Describe the risk and value of making parametric assumptions in regression.
  • Fit model functions that represent probabilistic quantities besides the mean.
  • Identify situations where Ordinary Least-squares (OLS) regression is sub-optimal, and apply alternative regression methods that better address the situation.

Today’s Learning Goals

By the end of this lecture, you should be able to:

  • Explain the Data Science workflow in Regression Analysis.
  • Recall the basics of OLS regression.
  • Identify cases where OLS regression is not suitable.
  • Distinguish what makes a regression model “linear.”
  • Explain the concept of generalized linear models (GLMs).
  • Explore the concept of the link function.

Today’s Learning Goals (cont’d)

  • Outline the modelling framework of count regression.
  • Fit and interpret count regression.
  • Use count regression for prediction.
  • Explain and test overdispersion on count-type data.

Outline

  1. A Preamble on our Data Science Workflow
  2. Review of Ordinary Least-squares Regression
  3. Link Function
  4. Poisson Regression
  5. Overdispersion
  6. Negative Binomial Regression

1. A Preamble on our Data Science Workflow

We have structured this regression course in three big pillars:

  1. The use of an ordered Data Science workflow,
  2. the choice of the proper workflow flavour according to either an inferential or predictive paradigm, and
  3. the correct use of an appropriate regression model based on the response of interest.

What is this workflow?



Stages

  1. Study design.
  2. Data collection and wrangling.
  3. Exploratory data analysis.
  4. Data modelling.
  5. Estimation.
  6. Results.
  7. Storytelling.

Data Modelling

In Regression Analysis, we can model an unbounded continuous response in terms of its conditional mean, using OLS for inference or prediction.

2. Review of Ordinary Least-squares Regression

  • In DSCI 561, you learned comprehensive material about OLS regression. The term ordinary refers to a linear model with a response of a continuous nature: a variable that can take on an infinite number of real values in a given range.
  • This response might be subject to regressors of a continuous or discrete nature. When the regressors are discrete and factor-type (i.e., with different categories), they can be nominal or ordinal.

Conceptually…

  • A linear regression model can be expressed as:

\[\begin{equation*} \mbox{Response} = \mbox{Systematic Component} + \mbox{Random Component}. \end{equation*}\]

  • For the \(i\)th observation in our sample or training data \((i = 1, \dots, n)\), the conceptual model is mathematically represented as:

\[ \underbrace{Y_i}_\text{Response} = \underbrace{\beta_0 + \beta_1 g_1(X_{i, 1}) + \ldots + \beta_k g_k(X_{i,k})}_\text{Systematic Component} + \underbrace{\varepsilon_i}_\text{Random Component} \]

Note the following!

  • The response \(Y_i\) is equal to the sum of \(k + 2\) terms.
  • The systematic component is the sum of:
    • An unknown intercept \(\beta_0\) and
    • \(k\) regressor functions \(g_j(X_{i,j})\) multiplied by their respective unknown regression coefficient \(\beta_j\) \((j = 1, \dots, k)\).

Moreover…

  • \(\varepsilon_i\) is the random noise under the following assumptions: \[\begin{gather*} \mathbb{E}(\varepsilon_i) = 0 \\ \text{Var}(\varepsilon_i) = \sigma^2 \\ \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \\ \varepsilon_i \perp \!\!\! \perp \varepsilon_{i'} \; \; \; \; \text{for} \; i \neq i' \; \; \; \; \text{(independence)}. \end{gather*}\]
  • Hence, each \(Y_i\) is also assumed to be independent and normally distributed: \[\begin{equation*} Y_i \mid X_{i, j} \sim \mathcal{N} \big( \beta_0 + \beta_1 g_1(X_{i, 1}) + \ldots + \beta_k g_k(X_{i,k}), \sigma^2 \big). \end{equation*}\]

2.1. Can We Apply Linear Regression Here?

Having briefly discussed the components of the OLS model, let us explore three 2-\(d\) examples made with simulated data.

iClicker Question


In what example(s) can we apply linear regression?

A. Examples 1 and 2.

B. Example 1.

C. Examples 1 and 3.

D. Examples 1, 2, and 3.

2.2. What makes a regression model “linear”?

Let us revisit Example 3 from our previous activity.

Data Generation

  • Each synthetic data point was generated from a sinusoidal curve plus some normally distributed random noise: \[\begin{align*} Y_i &= \beta_0 + \beta_1 g(X_i) + \varepsilon_i \\ &= \beta_0 + \beta_1 \sin(X_i) + \varepsilon_i. \end{align*}\]
  • Hence, this example can be modelled via OLS if we transform each observed \(x_i\) in the training set as \(\sin(x_i)\), as sketched below.
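To make this concrete, here is a minimal sketch in R (the parameter values \(\beta_0 = 1\), \(\beta_1 = 2\), and the noise standard deviation 0.5 are made up for illustration):

# Simulate sinusoidal data and fit it with OLS; the model stays
# "linear" because it is linear in the coefficients.
set.seed(123)
x <- runif(100, 0, 2 * pi)
y <- 1 + 2 * sin(x) + rnorm(100, sd = 0.5)
sinusoidal_model <- lm(y ~ sin(x))
summary(sinusoidal_model)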

In general…

  • OLS uses the identity function \(g_j(X_{i, j}) = X_{i, j}\), leading to:

\[\begin{align*} Y_i &= \beta_0 + \beta_1 g_1(X_{i, 1}) + \ldots + \beta_k g_k(X_{i,k}) + \varepsilon_i \\ &= \beta_0 + \beta_1 X_{i,1} + \ldots + \beta_k X_{i,k} + \varepsilon_i. \end{align*}\]

3. Link Function

Nonetheless…

  • Modelling the mean \(\mu_i\) of a discrete-type response is not straightforward.
  • Hence, we rely on a monotonic and differentiable function \(h(\mu_i)\) called the link function:

\[ \begin{align*} h(\mu_i) &= \eta_i \\ &= \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \ldots + \beta_k X_{i,k} \\ & \qquad \qquad \qquad \qquad \qquad \text{for} \; i = 1, \ldots, n. \end{align*} \]

Why monotonic and differentiable?

  • The link function needs to be monotonic so that it is invertible, allowing us to express the mean \(\mu_i\) in terms of the systematic component \(\eta_i\), i.e.:

\[\mu_i = h^{-1}(\eta_i).\]

  • It needs to be differentiable since we rely on maximum likelihood estimation to obtain \(\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k\). For instance, the log link \(h(\mu_i) = \log(\mu_i)\) satisfies both requirements, with inverse \(\mu_i = e^{\eta_i}\).

Here comes the GLM!

  • For a training set of \(n\) elements, a GLM specifies the components of the conceptual regression model as follows:
  1. Random component: Each response \(Y_i\) \((i = 1, \dots, n)\) is a random variable with mean \(\mu_i\).
  2. Systematic component: \[\eta_i = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \ldots + \beta_k X_{i,k} \; \; \; \; \text{for} \; i = 1, \ldots, n.\]
  3. Link function: \[ h(\mu_i) = \eta_i. \]

4. Poisson Regression

4.1. The Crabs Dataset

  • The data frame crabs (Brockmann, 1996) details the counts of satellite male crabs residing around a female crab nest.

Let us take a look at the dataset!

  • Beforehand, we do some quick data wrangling.
  • Then, we display the first eight rows.
# We assume the crabs data frame ships with an R package that provides
# Brockmann's (1996) horseshoe crab data with these columns (e.g., {glmbb}).
library(dplyr)

data(crabs)
crabs <- crabs |>
  rename(n_males = satell) |>  # satellite male counts
  dplyr::select(-y)            # drop the binary indicator column
head(crabs, n = 8)
   color  spine width n_males weight
1 medium    bad  28.3       8   3050
2   dark    bad  22.5       0   1550
3  light   good  26.0       9   2300
4   dark    bad  24.8       0   2100
5   dark    bad  26.0       4   2600
6 medium    bad  23.8       0   2100
7  light   good  26.5       0   2350
8   dark middle  24.7       0   1900

Main Statistical Inquiries

Let us suppose we want to assess the following:

  • Whether n_males and female carapace width are statistically associated and by how much.
  • Whether n_males and female color of the prosoma are statistically associated and by how much.

4.2. Exploratory Data Analysis

iClicker Question


By checking the below plot, do you think n_males increases with the female carapace width?

A. Of course!

B. Not really.

C. I am pretty uncertain about an increasing or decreasing pattern.

Let us proceed with the data wrangling!

group_avg_width <- crabs |>
  mutate(intervals = cut(width, breaks = 10)) |>  # bin width into 10 intervals
  group_by(intervals) |>
  summarise(mean = mean(n_males), n = n())
group_avg_width
# A tibble: 10 × 3
   intervals    mean     n
   <fct>       <dbl> <int>
 1 (21,22.2]    0        2
 2 (22.2,23.5]  1       14
 3 (23.5,24.8]  1.77    26
 4 (24.8,26]    2.98    43
 5 (26,27.2]    2.53    32
 6 (27.2,28.5]  4.15    33
 7 (28.5,29.8]  4       13
 8 (29.8,31]    4.86     7
 9 (31,32.2]    3        2
10 (32.2,33.5]  7        1

Then, some plotting…

More plotting…

  • The response n_males is count-type, so a more appropriate standalone plot is a bar chart.
  • Note the right-skewness.

4.3. Data Modelling Framework

  • Poisson regression is the primary tool for modelling counts.
  • This model also fits into the GLM class.
  • A Poisson random variable refers to discrete data with non-negative integer values that count something.

These counts could occur during a given timeframe or even within a space, such as a geographic unit!

Distributional Insights

  • A Poisson regression model assumes \(n\) independent count observations \(Y_i\) (independent, but not identically distributed, since each has its own mean):

\[Y_i \sim \text{Poisson}(\lambda_i)\]

  • Note \(\mathbb{E}(Y_i) = \lambda_i > 0\) and \(\text{Var}(Y_i) = \lambda_i > 0\).

Heads-up: Given that \(\mathbb{E}(Y_i) = \text{Var}(Y_i) = \lambda_i\), we also assume equidispersion. Furthermore, \(\lambda_i\) is continuous!

Going back to the crabs dataset…

  • The events are the number of male crabs (n_males) around a space: the female breeding nest.
  • Let \(X_{\texttt{width}_i}\) be the \(i\)th value for the regressor width in our dataset. We could use the identity link function as in \[ h(\lambda_i) = \lambda_i = \beta_0 + \beta_1 X_{\texttt{width}_i}. \]

However…

  • We have a response range issue: the systematic component can take any real value, but the mean \(\lambda_i\) must be strictly positive!

  • Instead, we will use the natural logarithm of the mean \(\log(\lambda_i)\).

  • In the crabs dataset with width as a regressor, the model is depicted via this systematic component: \[ h(\lambda_i) = \log(\lambda_i) = \beta_0 + \beta_1 X_{\texttt{width}_i}. \]

  • The random component is implicitly contained in \[Y_i \sim \text{Poisson}(\lambda_i).\]

4.4. Estimation

  • Under a general framework with \(k\) regressors, the regression parameters \(\beta_0, \beta_1, \dots, \beta_k\) in the model are unknown.
  • We will use glm() to obtain the estimates \(\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k\); see the sketch after this list.
  • These estimates are obtained through maximum likelihood where we assume a Poisson joint probability mass function of the \(n\) responses \(Y_i\).
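For instance, a minimal sketch of the fit with width as the sole regressor (poisson_model is our own object name):

# Poisson regression of n_males on carapace width;
# family = poisson uses the log link by default.
poisson_model <- glm(n_males ~ width, family = poisson, data = crabs)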

4.5. Inference

  • We can determine whether a regressor is statistically associated with the logarithm of the response’s mean through hypothesis testing for \(\beta_j\).
  • We use \(\hat{\beta}_j\) and its standard error, \(\mbox{se}\left(\hat{\beta}_j\right)\).
  • Then, we test the below hypotheses via the Wald statistic \(z_j = \frac{\hat{\beta}_j}{\mbox{se}\left(\hat{\beta}_j\right)}\) as in: \[\begin{align*} H_0: \beta_j &= 0 \\ H_a: \beta_j &\neq 0. \end{align*}\]
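In R, the estimates, standard errors, Wald statistics, and \(p\)-values can be extracted with tidy() from {broom}; a sketch, assuming the poisson_model object fit above:

library(broom)

# One row per coefficient: estimate, std.error, statistic (z_j), and p.value.
tidy(poisson_model)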

Solving one part of our initial inquiries!

Is female carapace width statistically associated with the logarithm of the mean of the counts of satellite male crabs residing around a female crab nest?

  • We have evidence to state that female carapace width is statistically associated with the logarithm of the mean of n_males.

Plotting our estimated regression function!

Is there a positive relationship between carapace width and the original scale of the response n_males?
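One way to overlay the estimated regression function on the original response scale is geom_smooth() with a Poisson GLM; a sketch, assuming {ggplot2} is available:

library(ggplot2)

# Scatterplot of the raw counts with the fitted Poisson mean curve.
ggplot(crabs, aes(x = width, y = n_males)) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = poisson), se = FALSE) +
  labs(x = "Female carapace width (cm)", y = "n_males")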

4.6. Coefficient Interpretation

  • Let us fit a second model with two regressors: width (\(X_{\texttt{width}_i}\)) and color (\(X_{\texttt{color_darker}_i}\), \(X_{\texttt{color_light}_i}\), and \(X_{\texttt{color_medium}_i}\)): \[\begin{align*} h(\lambda_i) = \log (\lambda_i) &= \beta_0 + \beta_1 X_{\texttt{width}_i} + \beta_2 X_{\texttt{color_darker}_i} \\ &\qquad + \beta_3 X_{\texttt{color_light}_i} + \beta_4 X_{\texttt{color_medium}_i}. \end{align*}\]

Using glm() and tidy()
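A sketch of this second fit; we assume color is a factor whose baseline level is dark, so that R creates the three dummy variables above:

# Second Poisson model: width plus the categorical regressor color.
poisson_model_2 <- glm(n_males ~ width + color, family = poisson, data = crabs)
tidy(poisson_model_2)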


Firstly…


  • Let us focus on the coefficient corresponding to carapace width, while keeping color constant.
  • Consider an observation with a given value \(X_{\texttt{width}} = \texttt{w}\) cm, then:

\[\begin{align*} \log \left( \lambda_{\texttt{width}} \right) &= \beta_0 + \beta_1 \overbrace{\texttt{w}}^{X_{\texttt{width}}} + \\ & \qquad \underbrace{\beta_2 X_{\texttt{color_darker}} + \beta_3 X_{\texttt{color_light}} + \beta_4 X_{\texttt{color_medium}}}_{\text{Constant}}. \end{align*}\]

Now…


  • For another observation with a given \(X_{\texttt{width + 1}} = \texttt{w} + 1\) cm, we have that:

\[\begin{align*} \log \left( \lambda_{\texttt{width + 1}} \right) &= \beta_0 + \beta_1 \overbrace{\texttt{w} + 1}^{X_{\texttt{width + 1}}} + \\ & \qquad \underbrace{\beta_2 X_{\texttt{color_darker}} + \beta_3 X_{\texttt{color_light}} + \beta_4 X_{\texttt{color_medium}}}_{\text{Constant}}. \end{align*}\]

Playing around with the math!


  • If we set up the difference \(\log \left( \lambda_{\texttt{width + 1}} \right) - \log \left( \lambda_{\texttt{width}} \right)\), along with some logarithm properties and exponentiation, we end up with the following expression (worked out below): \[\frac{\lambda_{\texttt{width + 1}} }{\lambda_{\texttt{width}}} = e^{\beta_1}.\]

  • This expression indicates that the mean response changes multiplicatively by a factor of \(e^{\beta_1}\) for each 1 cm increase in carapace width.
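Spelling out the algebra (the intercept and the constant color terms cancel in the difference):

\[\begin{align*} \log \left( \lambda_{\texttt{width + 1}} \right) - \log \left( \lambda_{\texttt{width}} \right) &= \beta_1 (\texttt{w} + 1) - \beta_1 \texttt{w} = \beta_1 \\ \Rightarrow \quad \frac{\lambda_{\texttt{width + 1}}}{\lambda_{\texttt{width}}} &= e^{\beta_1}. \end{align*}\]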

Solving another part of our initial inquiries!

\(\frac{\hat{\lambda}_{\texttt{width + 1}} }{\hat{\lambda}_{\texttt{width}}} = e^{\hat{\beta}_1} = 1.16\) indicates the mean count of male crabs (n_males) around a female breeding nest increases by \(16\%\) when increasing the female carapace width by \(1\) cm, while keeping color constant.

Then…

  • Consider two observations: one with mean \(\lambda_{\texttt{D}}\), having a dark color of the prosoma (the baseline), and another with mean \(\lambda_{\texttt{L}}\), having a light color:

\[\begin{align*} \log \left( \lambda_{\texttt{D}} \right) = \beta_0 + \overbrace{\beta_1 X_{\texttt{width}}}^{\text{Constant}} + \beta_2 \overbrace{X_{\texttt{color_darker}_{\texttt{D}}}}^0 + \\ \beta_3 \underbrace{X_{\texttt{color_light}_{\texttt{D}}}}_0 + \beta_4 \underbrace{X_{\texttt{color_medium}_{\texttt{D}}}}_0 \end{align*}\] \[\begin{align*} \log \left( \lambda_{\texttt{L}} \right) = \beta_0 + \overbrace{\beta_1 X_{\texttt{width}}}^{\text{Constant}} + \beta_2 \overbrace{X_{\texttt{color_darker}_{\texttt{L}}}}^0 + \\ \beta_3 \underbrace{X_{\texttt{color_light}_{\texttt{L}}}}_1 + \beta_4 \underbrace{X_{\texttt{color_medium}_{\texttt{L}}}}_0. \end{align*}\]

Playing around with the math!

  • If we set up the difference \(\log \left( \lambda_{\texttt{L}} \right) - \log \left( \lambda_{\texttt{D}} \right)\), along with some logarithm properties and exponentiation, we end up with the following expression:

\[ \frac{\lambda_{\texttt{L}} }{\lambda_{\texttt{D}}} = e^{\beta_3}. \]

  • This expression indicates that the mean response changes multiplicatively by a factor of \(e^{\beta_3}\) when the color of the prosoma changes from dark to light.

Solving another part of our initial inquiries!

\(\frac{\hat{\lambda}_{\texttt{L}} }{\hat{\lambda}_{\texttt{D}}} = e^{\hat{\beta}_3} = 1.55\) indicates that the mean count of male crabs (n_males) around a female breeding nest increases by \(55\%\) when the color of the prosoma changes from dark to light, while keeping the female carapace width constant.

4.7. Predictions

  • Suppose we want to predict the mean count of male crabs (n_males) around a female breeding nest with a female carapace width of \(27.5\) cm and light color of the female prosoma.
  • We could use the model poisson_model_2 as follows:
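A sketch using predict() with type = "response", which applies the inverse link (exponentiation) to return the prediction on the count scale (new_obs is a hypothetical name, and "light" is assumed to be a valid level of color):

# Predicted mean count for a new observation.
new_obs <- data.frame(width = 27.5, color = "light")
predict(poisson_model_2, newdata = new_obs, type = "response")  # about 4.29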

Interpretation


  • The previous prediction can be interpreted in plain words as follows:

Given a female carapace width of \(27.5\) cm and light color of the female prosoma, we predict \(4.29\) male crabs around that female crab nest on average.

5. Overdispersion

  • Population variances of some random variables could be a function of their respective means:
    • If \(X \sim \text{Exponential}(\lambda)\) (with scale parameter \(\lambda\)), then \(\mathbb{E}(X) = \lambda\) and \(\text{Var}(X) = \lambda^2\).
    • If \(X \sim \text{Binomial}(n , p)\), then \(\mathbb{E}(X) = n p\) and \(\text{Var}(X) = n p (1 - p)\).
    • If \(X \sim \text{Poisson}(\lambda)\), then \(\mathbb{E}(X) = \lambda\) and \(\text{Var}(X) = \lambda\).

How does equidispersion affect our Poisson regression model?

Heteroscedasticity

  • We must clarify that GLMs naturally deal with some types of heteroscedasticity.
  • For example, note that each observation \(i = 1, \dots, n\) in our training set (used to estimate a Poisson regression model) is assumed to follow \[Y_i \sim \text{Poisson}(\lambda_i),\] so each response already has its own variance \(\lambda_i\).

In many cases…

  • The variance of our data is larger than the variance implied by our model, as in basic Poisson regression.

Heads-up: When the variance is larger than the mean in a random variable, we have overdispersion.
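As a quick, informal check on the crabs response (ignoring the regressors; the formal test comes next):

# If the sample variance of the counts greatly exceeds the sample mean,
# that hints at overdispersion.
crabs |>
  summarise(mean_n = mean(n_males), var_n = var(n_males))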

Going back to poisson_model_2!

  • Let us go back to our poisson_model_2 with the log link function from the crabs dataset with two regressors: width (\(X_{\texttt{width}_i}\)) and color (\(X_{\texttt{color_darker}_i}\), \(X_{\texttt{color_light}_i}\), and \(X_{\texttt{color_medium}_i}\)) for the \(i\)th observation:

\[ \begin{align*} h(\lambda_i) &= \log (\lambda_i) = \beta_0 + \beta_1 X_{\texttt{width}_i} + \\ & \qquad \beta_2 X_{\texttt{color_darker}_i} + \beta_3 X_{\texttt{color_light}_i} + \beta_4 X_{\texttt{color_medium}_i}. \end{align*} \]

Testing for overdispersion!

  • We will test whether there is overdispersion in this Poisson regression model via the function dispersiontest() from the package {AER}.
  • Let \(Y_i\) be the \(i\)th Poisson response in the count regression model.
  • Ideally, under equidispersion, \(Y_i\) has the following mean and variance: \[\begin{gather*} \mathbb{E}(Y_i) = \lambda_i \\ \text{Var}(Y_i) = \lambda_i. \end{gather*}\]

Test Specifics

  • The test uses the following mathematical expression: \[ \text{Var}(Y_i) = \overbrace{(1 + \gamma)}^\text{Dispersion Factor} \lambda_i, \]
  • Moreover, we have the following hypotheses: \[\begin{gather*} H_0: 1 + \gamma = 1 \\ H_a: 1 + \gamma > 1. \end{gather*}\]
  • When there is evidence of overdispersion in our data, we will reject \(H_0\) at a pre-specified significance level \(\alpha\).

Running the test on poisson_model_2
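A sketch of the call, assuming poisson_model_2 was fit as above:

library(AER)

# Tests H_0: equidispersion against H_a: overdispersion (one-sided).
dispersiontest(poisson_model_2)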

With \(\alpha = 0.05\), we reject \(H_0\) since the \(p\text{-value} < .001\). Hence, poisson_model_2 exhibits overdispersion.

  • Therefore, what modelling alternative do we have?

6. Negative Binomial Regression

  • Let \[Y_i \sim \text{Negative Binomial} (m, p_i) \quad \text{for} \quad i = 1, \dots, n.\]
  • This distribution has the following mean and variance: \[\begin{gather*} \mathbb{E}(Y_i) = \frac{m(1 - p_i)}{p_i} \\ \text{Var}(Y_i) = \frac{m(1 - p_i)}{p_i^2} \end{gather*}\]

6.1. Reparametrization

  • Let us reparametrize the previous mean and variance: \[\begin{equation*} \lambda_i = \frac{m (1 - p_i)}{p_i} \qquad \Rightarrow \qquad p_i = \frac{m}{m + \lambda_i}, \end{equation*}\]
  • Hence:

\[\begin{gather*} \mathbb{E}(Y_i) = \lambda_i \\ \text{Var}(Y_i) = \lambda_i \left( 1 + \frac{\lambda_i}{m} \right). \end{gather*}\]

6.2. General Modelling Framework

  • As in the case of Poisson regression, the Negative Binomial case is a GLM with the following link function: \[\begin{equation*} h(\lambda_i) = \log(\lambda_i) = \beta_0 + \beta_1 X_{i,1} + \dots + \beta_k X_{i,k}. \end{equation*}\]
  • Let \(\theta = \frac{1}{m}\). Then, Negative Binomial regression will assume the following variance: \[\begin{align*} \text{Var}(Y_i) = \lambda_i \left( 1 + \frac{\lambda_i}{m} \right) = \lambda_i + \theta \lambda_i^2. \end{align*}\]

6.3. Estimation

  • Via a training set of size \(n\) whose responses are independent counts \(Y_i\) (\(i = 1, \dots, n\)), we use maximum likelihood estimation to obtain \(\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k, \hat{\theta}\).
  • To fit a Negative Binomial regression via R, we can use the function glm.nb() from package {MASS}.
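A minimal sketch of the fit (negative_binomial_model is our own object name; note that the theta reported by glm.nb() parameterizes the variance as \(\lambda_i + \lambda_i^2 / \texttt{theta}\), i.e., it corresponds to \(m\) in our notation):

library(MASS)

# Negative Binomial regression with the same regressors as poisson_model_2;
# glm.nb() uses the log link by default and also estimates the dispersion.
negative_binomial_model <- glm.nb(n_males ~ width + color, data = crabs)
tidy(negative_binomial_model)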

Checking summaries!

iClicker Question

  • In the context of this crab problem, where overdispersion was confirmed by the test above, suppose we use poisson_model_2 to perform inference on the response and regressors.
  • What is the consequence of using its underestimated standard errors compared to the ones from negative_binomial_model?

A. We are more prone to commit Type I errors (i.e., rejecting \(H_a\) when in fact it is true).

B. We are more prone to commit Type I errors (i.e., rejecting \(H_0\) when in fact it is true).

C. We are more prone to commit Type II errors (i.e., rejecting \(H_a\) when in fact it is true).

D. We are more prone to commit Type II errors (i.e., rejecting \(H_0\) when in fact it is true).