Frequently Asked Questions (FAQs)

This appendix provides answers to common questions students ask about the course content, assignments, and tools. Each entry includes a clear explanation intended to help you clarify concepts quickly and resolve recurring points of confusion. The goal is to make it easy to find reliable explanations in one place.

Importantly, several of these items highlight key ideas you are expected to know and that may be assessed on quizzes (and, when relevant, in labs). If you would like to discuss any of these questions in more detail, feel free to reach out to the course instructors.

The Data Science Workflow

Question 1

Question: Take a look at the paper here and in particular the Model specification section. Do you spot any problem?

Answer: In the Model specification section, the authors appear to fit an Ordinary Least Squares (OLS) regression model to a binary response. Specifically, their \(Y_i\) indicates whether a first marriage ended in divorce or separation (a Bernoulli outcome encoded as 1 for “success” and 0 for “failure,” where “success” is defined as the marriage ending in divorce). This setup is both conceptually atypical and statistically problematic: OLS treats the response as continuous, can yield fitted values outside \([0,1]\), and relies on assumptions (e.g., constant variance and normally distributed errors) that are violated for Bernoulli data. In short, this raises the same concerns that arise when OLS is used to model probabilities directly (see the discussion in this linked note).

Later, in the Estimation and hypothesis testing section, the authors refer to “logit regression” (i.e., Binary Logistic regression), which would be the appropriate modelling choice for a Bernoulli outcome. However, the earlier reference to OLS creates ambiguity about what was actually estimated, and the paper would benefit from explicit clarification (e.g., the precise likelihood/model, link function, and correct estimation method).

Finally, the reporting of results is also unclear. The coefficient tables are labeled “Coefficient,” but it is not explicit whether these values are on the log-odds scale (as in standard Binary Logistic regression output) or exponentiated for the odds scale. That distinction matters for interpretation, and the authors should state clearly whether coefficients have been exponentiated and, ideally, report standard errors and confidence intervals on the same scale.

Count Regression

Question 2 (Not to be evaluated in the quiz)

Question: Why do we use the Wald test statistic when testing the significance of an estimated parameter in a Poisson regression model? Is this related to the link function or the GLM setup used for Poisson regression?

Answer: The use of the Wald test is not specific to Poisson regression, nor is it directly related to the log link function.

In Poisson regression, model parameters \(\beta_0, \beta_1, \dots, \beta_k\) (with \(k\) regressors) are estimated using maximum likelihood estimation (MLE). Under standard regularity conditions, MLE-based estimators are asymptotically normally distributed as the sample size becomes large enough (i.e., the sample size \(n \rightarrow \infty\)).

For example, an estimated coefficient \(\hat{\beta}_j\) (\(j = 1, \dots, k\)) is approximately normally distributed around the true (unknown) population parameter \(\beta_j\), with an estimated standard error. This approximate normality allows us to construct the Wald test statistic to test hypotheses such as

\[ H_0: \beta_j = 0 \quad \text{versus} \quad H_a: \beta_j \neq 0 \]

Because this reasoning relies on MLE and asymptotic normality (i.e., as the sample size \(n \rightarrow \infty\)), the Wald test can be used in many models beyond Poisson regression, including Binary Logistic regression, Negative Binomial regression, and other generalized linear models (GLMs). The detailed theoretical justification is beyond the scope of this course. You are welcome to discuss these ideas further during office hours with the instructors.

Takeaway: The Wald test is justified by MLE and asymptotic normality, not by the log link function or Poisson regression specifically.
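As a concrete illustration, here is a minimal R sketch (with simulated data, so all numbers are purely illustrative) showing that the “z value” reported by summary() for a Poisson regression is exactly the Wald statistic \(\hat{\beta}_j / \widehat{\mathrm{se}}(\hat{\beta}_j)\):

# Minimal sketch: the Wald statistic in a Poisson regression (simulated data)
set.seed(123)
n <- 500
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.3 + 0.5 * x))     # true beta0 = 0.3, beta1 = 0.5

fit <- glm(y ~ x, family = poisson)
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]

wald_z <- est / se                              # Wald statistic for H0: beta1 = 0
p_val  <- 2 * pnorm(abs(wald_z), lower.tail = FALSE)
c(wald_z, p_val)                                # matches summary(fit)'s "z value" and "Pr(>|z|)" columns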

Question 3

Question: Why do we need to check the Poisson assumption of equidispersion (i.e., the response’s mean is equal to its variance) after fitting a Poisson regression? What happens if we do not?

Answer: Checking the equidispersion assumption is important because violating it can lead to misleading results:

  • Incorrect standard errors: If the variance is larger than the mean in the observed data, we encounter overdispersion. In that case, the estimated standard errors for the regression parameters (coming from the Poisson regression fit) may be underestimated, which can make \(p\)-values and confidence intervals unreliable.

  • Invalid inference: Underestimated standard errors can lead to false positives when testing hypotheses about regressors.

Question 4

Question: Why do we not use a Negative Binomial model from the very beginning instead of Poisson regression when modelling count-type responses?

Answer: Although Negative Binomial regression can accommodate overdispersion (i.e., variance greater than the mean), there are good reasons to begin with a Poisson regression model:

  • Simplicity: Poisson regression is a simpler starting point. If the Poisson assumptions are reasonable, it can provide adequate inference without introducing an additional dispersion parameter (i.e., \(\theta\) in the Negative Binomial model).
  • Fewer parameters: Poisson regression estimates only an intercept and \(k\) regression coefficients, whereas Negative Binomial regression estimates those plus a dispersion parameter \(\theta\). All else equal, estimating fewer parameters is preferable as long as the model provides an acceptable fit (i.e., it has adequate goodness of fit).
  • Model checking: Starting with Poisson regression naturally supports a workflow of diagnose, then adapt. You can fit the Poisson model, assess evidence of overdispersion, and move to a Negative Binomial model only if the equidispersion assumption appears violated. Jumping directly to Negative Binomial may be unnecessary when equidispersion holds.

Takeaway: Fit Poisson regression first, check for overdispersion, and switch to Negative Binomial regression only when the data warrant it.
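This diagnose-then-adapt workflow can be sketched in R as follows (a minimal sketch using simulated overdispersed counts; glm.nb() comes from the MASS package):

# Minimal sketch: fit Poisson, check for overdispersion, then switch to Negative Binomial
library(MASS)                                    # provides glm.nb()
set.seed(123)
n <- 500
x <- rnorm(n)
y <- rnbinom(n, mu = exp(0.5 + 0.7 * x), size = 1.5)    # overdispersed counts

pois_fit   <- glm(y ~ x, family = poisson)
dispersion <- sum(residuals(pois_fit, type = "pearson")^2) / pois_fit$df.residual
dispersion                                       # values well above 1 suggest overdispersion

nb_fit <- glm.nb(y ~ x)                          # Negative Binomial regression
nb_fit$theta                                     # the estimated dispersion parameter theta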

Question 5 (Not to be evaluated in the quiz)

Question: How does a Negative Binomial distribution converge to a Poisson distribution?

Answer: Let the random variable

\[ Y_r \sim \text{Negative Binomial}(r,p), \]

where \(r > 0\) and \(0 < p < 1\). We use the convention that \(Y_r\) counts the number of failures before the \(r\)-th success. Recall that the probability mass function (PMF) is

\[ P(Y_r = y \mid r, p) = \binom{y-1+r}{y} p^r (1-p)^y \quad \text{for} \quad y = 0,1,2,\dots \]

Equivalently, we can write this PMF using Gamma functions as follows:

\[ P(Y_r = y \mid r, p) = \frac{\Gamma(y+r)}{\Gamma(r)\,y!} \, p^r (1-p)^y. \]

The mean and variance of the distribution are:

\[ \mathbb{E}[Y_r] = \frac{r(1-p)}{p}, \qquad \mathrm{Var}(Y_r) = \frac{r(1-p)}{p^2}. \]

To obtain a Poisson limit, let

\[ p = \frac{r}{r+\lambda}, \qquad \lambda > 0 \quad \text{and fixed}. \]

Then

\[ \frac{r(1-p)}{p} = \lambda, \]

so the mean of the Negative Binomial remains constant as \(r \to \infty\). Under this parametrization

\[ 1 - p = \frac{\lambda}{r+\lambda}, \]

and substituting into the PMF gives

\[ P(Y_r = y \mid r, \lambda) = \frac{\Gamma(y+r)}{\Gamma(r)\,y!} \left(\frac{r}{r+\lambda}\right)^r \left(\frac{\lambda}{r+\lambda}\right)^y. \]

We rewrite the PMF as

\[ P(Y_r = y \mid r, \lambda) = \frac{1}{y!} \left[ \frac{\Gamma(y+r)}{\Gamma(r)\,r^y} \right] \left(\frac{r}{r+\lambda}\right)^r \left(\frac{r\lambda}{r+\lambda}\right)^y. \]

This form separates all terms involving \(r\). Now, we take limits term by term.

Gamma Ratio Term

\[ \frac{\Gamma(y+r)}{\Gamma(r)\,r^y} = \prod_{k=0}^{y-1} \frac{r+k}{r} = \prod_{k=0}^{y-1} \left(1+\frac{k}{r}\right). \]

Since \(y\) is fixed,

\[ \prod_{k=0}^{y-1} \left(1+\frac{k}{r}\right) \;\longrightarrow\; 1 \quad \text{as } r \to \infty. \]

Exponential-Type Term

\[ \left(\frac{r}{r+\lambda}\right)^r = \left(1+\frac{\lambda}{r}\right)^{-r} \;\longrightarrow\; e^{-\lambda}. \]

Power Term

\[ \left(\frac{r\lambda}{r+\lambda}\right)^y = \lambda^y \left(\frac{r}{r+\lambda}\right)^y \;\longrightarrow\; \lambda^y, \] since \(\frac{r}{r+\lambda} \to 1\).

Finally, combining the limits of all components:

\[ P(Y_r = y \mid r, \lambda)\;\longrightarrow\; \frac{1}{y!} \cdot 1 \cdot e^{-\lambda} \cdot \lambda^y = \frac{e^{-\lambda}\lambda^y}{y!}. \]

This is exactly the PMF of a \(\text{Poisson}(\lambda)\) random variable!

Takeaway:

\[ \text{Negative Binomial}\!\left(r,\frac{r}{r+\lambda}\right)\;\xrightarrow[r\to\infty]{d}\;\text{Poisson}(\lambda) \]

where \(\xrightarrow[r\to\infty]{d}\) means that, as \(r\to\infty\), the Negative Binomial distribution converges in distribution (hence the \(d\)) to a Poisson distribution.
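You can check this limit numerically in R; with prob = r / (r + lambda), dnbinom() follows the same failures-before-the-\(r\)-th-success convention used above:

# Numerical check: Negative Binomial(r, r / (r + lambda)) approaches Poisson(lambda)
lambda <- 3
y <- 0:10
for (r in c(5, 50, 5000)) {
  p_nb   <- dnbinom(y, size = r, prob = r / (r + lambda))
  p_pois <- dpois(y, lambda)
  cat("r =", r, " max |NB - Poisson| =", max(abs(p_nb - p_pois)), "\n")
}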

Quantile Regression

Question 6 (Not to be evaluated in the quiz)

Question: In quantile regression, why does the fitted line have to pass through some data points, and why is this different from OLS?

Answer: This is due to the different loss functions and constraints between the two methods. In quantile regression, we must satisfy the condition that a proportion \(\tau\) of observations fall below the fitted line and \(1-\tau\) fall above it. With two parameters to estimate (intercept and slope), the optimization requires anchoring the line to two data points. This uses two degrees of freedom and allows the remaining observations to split in the correct proportions above and below the line while optimizing the quantile loss function.

In contrast, OLS minimizes the sum of squared residuals without any constraint on the proportion of points above or below the line. OLS has a direct analytical solution, which in the general case can be written in matrix–vector form as \(\hat \beta = (X^{T}X)^{-1}X^{T}y\). Finding this solution does not require the line to pass through specific points. The quantile regression loss function is based on absolute values of the error terms and has no closed-form solution; it requires numerical optimization methods that, combined with the proportion constraint, result in the line passing through two points in this simple case.

Note: In this formula \(\hat \beta = (X^{T}X)^{-1}X^{T}y\), \(X\) is the design matrix with \(n\) rows and \(p\) columns (the number of parameters) and \(y\) is the response vector. \(X^{T}\) denotes the transpose of the matrix \(X\), and \((\cdot)^{-1}\) indicates the inverse of a matrix.
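Here is a minimal R sketch of this behaviour (assuming the quantreg package; the simulated data are only illustrative). With two parameters, the median-regression fit interpolates exactly two observations (two residuals equal to zero), whereas the OLS fit generally interpolates none:

# Minimal sketch: median regression passes through (interpolates) two data points
library(quantreg)
set.seed(123)
n <- 50
x <- runif(n, 0, 10)
y <- 2 + 1.5 * x + rnorm(n)

qr_fit  <- rq(y ~ x, tau = 0.5)                  # median (tau = 0.5) regression
ols_fit <- lm(y ~ x)

sum(abs(residuals(qr_fit)) < 1e-8)               # typically 2 (one per estimated parameter)
sum(abs(residuals(ols_fit)) < 1e-8)              # typically 0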

Missing Data Imputation

Question 7

Question: What is the logic/pattern for how missing values are imputed? Is it finding similar profile houses (like \(k\)-NN regression), or is it using random predictions by default?

Answer: When you run mice() to perform multiple imputation, the algorithm does not simply guess randomly, and it is not purely a \(k\)-NN algorithm either. Instead, it uses a method called chained equations (to which the method predictive mean matching, PMM, belongs).

The idea of multiple imputation is to impute one variable at a time, using the other variables in the dataset as regressors (or predictors). For example, if house price is missing, the algorithm builds a regression model that predicts price from the other observed features (size, number of rooms, location indicators, etc.), using only the rows where price is observed. Then it fills in the missing values for price. After that, it moves to the next variable with missingness and repeats the process. This cycle continues multiple times so that the imputations stabilize.

For numeric variables, the method in mice() is PMM (other variable types use a suitable regression model, such as Binary Logistic regression, while still relying on chained equations). PMM is model-based, but it does not directly plug in the predicted value. Instead, it:

  1. Uses a regression model (i.e., OLS) to compute predicted values,
  2. Finds a small group of observed cases with similar predicted values,
  3. Randomly selects one of those similar observed values,
  4. Copies that observed value into the missing cell.

Thus, it is not raw \(k\)-NN regression in the feature space, but it does use “similarity” (defined by the regression model’s predicted values rather than by distances between raw features). It is also not “random prediction” in the sense of blind noise; the randomness is structured and purposeful. Randomness enters in three ways:

  • The regression coefficients are slightly varied to reflect parameter uncertainty,
  • The residual variance is drawn rather than fixed,
  • A donor is randomly selected from a small set of similar cases.

This produces plausible but slightly different imputed values across datasets. For a short toy example of a single imputation stage (i.e., suppose m = 1), see Question 10 below.

So, when you set m = 15 in mice(), the entire imputation process is repeated 15 times, each time using new random draws. The result is 15 completed datasets, each representing one plausible version of reality given the observed data.
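As a minimal sketch (assuming the mice package and a hypothetical data frame housing with missing numeric values such as price), you can inspect the 15 different values proposed for each missing cell:

# Minimal sketch: multiple imputation with 15 completed datasets via PMM
library(mice)
imp <- mice(housing, m = 15, method = "pmm", seed = 123, printFlag = FALSE)

imp$imp$price        # one row per missing price cell, one column per imputation (15 columns)
complete(imp, 1)     # the first of the 15 completed datasets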

Question 8

Question: When we use pool() on the 15 imputed datasets (i.e., suppose we use m = 15 when using mice() beforehand), does it average each NA cell (i.e., missing data) across the 15 imputations? Or is it the coefficients that are pooled? How does this work?

Answer: It is important to note that pool() does not average the imputed data cells. Actually, pool() does not look at each missing value and compute the average of its 15 imputed versions. That would collapse the uncertainty and defeat the purpose of multiple imputation.

Instead, here is what actually happens:

  1. You fit your analysis model (for example, a linear regression such as OLS) separately on each of the 15 completed datasets.
  2. This gives you 15 sets of results:
    • 15 sets of coefficient estimates,
    • 15 sets of standard errors for those coefficient estimates.
  3. pool() then combines those model results, not the raw data.

Pooling accounts for two sources of uncertainty:

  • Within-imputation uncertainty: The usual regression uncertainty you already know, even complete data would give you standard errors (since this is an observed random sample).
  • Between-imputation uncertainty: The variation in coefficient estimates across the 15 datasets. This reflects uncertainty due to missing data.

If the 15 regression slopes are nearly identical, missingness did not affect inference much. If they vary noticeably, that variability increases the final standard errors.

Takeaway: So what gets averaged are the coefficients and their associated uncertainty, not the missing cells.
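In code, steps 1 to 3 look like this (a minimal sketch reusing the hypothetical imp object and variable names from the sketch in Question 7):

# Minimal sketch: fit the analysis model on each completed dataset, then pool
fits   <- with(imp, lm(price ~ size + rooms))    # 15 fits, one per completed dataset
pooled <- pool(fits)                             # combines coefficients and their uncertainty (Rubin's rules)
summary(pooled)                                  # pooled estimates, standard errors, and p-values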

Question 9

Question: Why is multiple imputation valuable even though it injects randomness?

Answer: It may feel “weird” that multiple imputation introduces randomness. But that randomness is not “noise” in a careless sense; it represents real uncertainty. Single imputation methods (like the simple mean or regression imputations we saw in class before multiple imputation) behave as if the imputed values were known with certainty. This typically makes the data look cleaner than they truly are and leads to underestimated standard errors. Multiple imputation instead says:

“We do not know these missing values. Let us generate several plausible versions and see how much our conclusions change.”

If conclusions barely change across imputations, then missing data were not influential. If conclusions vary substantially, that variation should appear in the final uncertainty. In other words, the additional variability induced by multiple imputation is not a flaw; it is a correction. It prevents us from being overly confident in results that depend on partially unobserved data.

Question 10 (Not to be evaluated in the quiz)

Question: How does mice() work under the hood using multiple imputation via PMM?

Let us illustrate PMM assuming a toy dataset with two columns (numerical variables \(X\) and \(Y\)) with missing data (i.e., NAs).

Simple case where only one column has missing data (\(Y\) missing, \(X\) fully observed; i.e., single imputation \(Y | X\))

Data

We observe \(X\) for all rows, but two rows have missing \(Y\):

\[ \begin{array}{c|c|c} \text{Row} & X & Y\\ \hline 1 & 1 & 3\\ 2 & 2 & 5\\ 3 & 3 & 7\\ 4 & 4 & 9\\ 5 & 5 & 10\\ 6 & 6 & 13\\ 7 & 7 & 14\\ 8 & 8 & 16\\ 9 & 6 & \texttt{NA}\\ 10 & 4 & \texttt{NA}\\ \end{array} \tag{1}\]

We will impute \(Y_9\) and \(Y_{10}\) using PMM with \(5\) donors (i.e., using 5 complete rows in this donor pool).

Step 1: Working model

Use the following OLS model for the complete rows (i.e., \(1\) to \(8\)):

\[ Y = \beta_0 + \beta_1 X + \varepsilon \qquad \text{with} \qquad \varepsilon\sim \mathcal{N}(0,\sigma^2). \]

Use the standard Multiple Imputation by Chained Equations (MICE) non-informative prior distribution on the above OLS parameters:

\[ p(\beta_0,\beta_1,\sigma^2)\propto \frac{1}{\sigma^2}. \]

Definition of a non-informative prior distribution

In frequentist statistics, we already believe in:

  • Estimators.
  • Sampling distributions.
  • Asymptotics.
  • Likelihood (i.e., the data we observed).

In Bayesian statistics, a non-informative prior is basically a way of saying:

Let the likelihood dominate. I don’t want to inject real prior information for my parameters of interest.

Step 2. Fit OLS on observed \(Y\) rows

Observed \(Y\) rows are \(1, \dots,8\). Thus:

  • \(n_{\text{obs}} = 8\),
  • \(k = 2\) (intercept + slope),
  • Residual degrees of freedom \(\nu = n_{\text{obs}} - k = 6\).

Compute means:

\[ \begin{gather*} \bar{X}=\frac{1+2+\cdots+8}{8}=\frac{36}{8}=4.5 \\ \bar{Y}=\frac{3+5+7+9+10+13+14+16}{8}=\frac{77}{8}=9.625. \end{gather*} \]

Compute:

\[ \begin{gather*} S_{xx}=\sum_{i=1}^8 (x_i-\bar{x})^2 = 42 \\ S_{xy}=\sum_{i=1}^8 (x_i-\bar{x})(y_i-\bar{y}) = 77.5. \end{gather*} \]

So, the OLS slope and intercept are:

\[ \begin{gather*} \hat{\beta}_1=\frac{S_{xy}}{S_{xx}}=\frac{77.5}{42}=1.8452 \\ \hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}=9.625-(1.8452)(4.5)=1.3214. \end{gather*} \]

Thus, the fitted conditional mean is:

\[ \widehat{Y}=1.3214 + 1.8452X. \]

Step 3: Compute sum of squared errors (SSE)

Compute residuals \(e_i=y_i-\hat{y}_i\) for rows \(1,\dots,8\), and SSE:

\[ \text{SSE}=\sum_{i=1}^8 e_i^2 = 0.8690. \]

Step 4: Inverse-gamma posterior for \(\sigma^2\) and its hyperparameters (\(\alpha\) and \(\eta\))
Definition of a posterior distribution

In Bayesian statistics, the posterior distribution of a parameter (e.g., \(\sigma^2\)) is your updated belief about that parameter after you have seen the data (i.e., the likelihood, which here comes from the complete rows \(1\) to \(8\)).

Before seeing data (i.e., the likelihood), you might have some belief on the parameter (which is a prior distribution, either informative or uninformative). After seeing data (i.e., the likelihood), you update that belief on the parameter. The result of that update is called the posterior distribution of the parameter of interest.

Under the prior distribution \(p(\beta_0, \beta_1,\sigma^2)\propto 1/\sigma^2\), the posterior distribution of \(\sigma^2\) is:

\[ \sigma^2 \mid \text{observed complete data} \sim \text{Inverse Gamma}(\alpha,\eta), \]

with

\[ \alpha=\frac{n_{\text{obs}}-k}{2}=\frac{8-2}{2}=3, \qquad \eta=\frac{\text{SSE}}{2}=\frac{0.8690}{2}=0.4345. \]

Note the Inverse Gamma distribution is another case of a continuous nonnegative distribution. Think about it as one “cousin” of the Gamma distribution.

Important

Is the variance \(\sigma^2\) fixed or random in Bayesian imputation?

In frequentist regression, \(\sigma^2\) is treated as a fixed but unknown constant, and uncertainty is reflected through the sampling distribution of its estimator.

In Bayesian multiple imputation (such as in MICE), \(\sigma^2\) is treated as a random variable with its own probability distribution. We draw a plausible value of the variance at each imputation step, just like we draw regression coefficients.

This allows imputed values to properly reflect uncertainty about model fit, which prevents underestimating variability in the final analysis.

So:

\[ \sigma^2 \mid \text{observed complete data} \sim \text{Inverse Gamma}(3,\,0.4345). \]

Step 5: Draw \(\sigma^{2^*}\)

Draw one variance \(\sigma^{2^*}\) from this posterior distribution:

\[ \sigma^2 \mid \text{observed complete data} \sim \text{Inverse Gamma}(3,0.4345). \]

Important

What does it mean to “draw” a variance \(\sigma^{2^*}\) from an Inverse Gamma distribution?

When we write:

\[ \sigma^2 \mid \text{observed complete data} \sim \text{Inverse Gamma}(3,\,0.4345), \]

we mean that we generate one random value \(\sigma^{2^*}\) of the variance from that probability distribution.

There is no built-in rinvgamma() function in base R, but we can generate an Inverse Gamma draw using the Gamma distribution:

  • If \(Z \sim \text{Gamma}(\alpha, \text{rate}=\eta)\),
  • then \(1/Z \sim \text{Inverse Gamma}(\alpha, \eta)\).

So in R we would generate the draw using:

sigma2_star <- 1 / rgamma(1, shape = 3, rate = 0.4345)

This produces one random value \(\sigma^{2^*}\), which we then use in the regression draw step of the imputation algorithm. Each imputation cycle generates a new random draw, allowing parameter uncertainty to propagate properly.

For this numerical walkthrough, take a concrete draw (assume we already executed the above R code):

\[ \sigma^{2^*}=0.40. \]

Step 6: Draw \(\beta_{0}^*, \beta_{1}^* \mid \sigma^{2^*}, \text{observed complete data}\)

Conditional on \(\sigma^{2^*}\) and the observed complete data, we have this other posterior distribution:

\[ \beta_{0}^*, \beta_{1}^* \mid \sigma^{2^*}, \text{observed complete data} \sim \mathcal{N}\!\left((\hat{\beta}_0, \hat{\beta}_1)^\top,\ \sigma^{2*}(X^\top X)^{-1}\right). \]

Before doing the corresponding random draws of \(\beta_{0}^*\) and \(\beta_{1}^*\), we need to compute the parameters of the above multivariate Normal distribution. Hence, for those complete rows \(1,\dots,8\) in the dataset:

\[ \sum_{i = 1}^8 1 = 8,\qquad \sum_{i = 1}^8 X_i = 36,\qquad \sum_{i = 1}^8 X_i^2 = 204, \]

so

\[ X^\top X= \begin{pmatrix} 8 & 36\\ 36 & 204 \end{pmatrix} \quad \text{with} \qquad \det(X^\top X)=8\cdot 204-36^2=336, \]

and

\[ (X^\top X)^{-1} = \frac{1}{336} \begin{pmatrix} 204 & -36\\ -36 & 8 \end{pmatrix}. \]

Multiply by \(\sigma^{2^*}=0.40\):

\[ \text{Var}\left[(\beta_0^*, \beta_1^*)^\top \right]= 0.40\cdot (X^\top X)^{-1} = \begin{pmatrix} 0.2429 & -0.0429\\ -0.0429 & 0.00952 \end{pmatrix}. \]

Therefore, take concrete random draws from the above posterior multivariate Normal distribution with the corresponding parameters plugged in (assume we already used the corresponding R random number generator for a multivariate Normal distribution):

\[ \beta_0^* = 0.7714, \qquad \beta_1^* = 1.9619. \]

Therefore, the draw-based predicted mean used for matching is:

\[ \hat{y_i}^*(x_i)=0.7714 + 1.9619x_i. \]
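In R, this bivariate Normal draw can be sketched as follows (assuming the MASS package; an actual draw will of course differ from the illustrative values above):

# Minimal sketch of the Step 6 draw: (beta0*, beta1*) from a bivariate Normal
library(MASS)
beta_hat   <- c(1.3214, 1.8452)                  # OLS estimates from Step 2
Sigma_star <- 0.40 * matrix(c(204, -36,
                              -36,   8), nrow = 2) / 336   # sigma2* x (X^T X)^(-1)
beta_star  <- mvrnorm(1, mu = beta_hat, Sigma = Sigma_star)
beta_star                                        # one random draw of (beta0*, beta1*)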

Step 7: PMM matching with \(5\) donors

Donors are rows with observed \(Y\), i.e., rows \(1,\dots,8\). Compute donor predicted means:

\[ \hat{y}_i^* = 0.7714 + 1.9619x_i. \]

\[ \begin{array}{c|c|c|c} \text{Row} & X & Y\ (\text{observed}) & \hat{y}_i^*\\ \hline 1 & 1 & 3 & 2.7333\\ 2 & 2 & 5 & 4.6952\\ 3 & 3 & 7 & 6.6571\\ 4 & 4 & 9 & 8.6190\\ 5 & 5 & 10 & 10.5809\\ 6 & 6 & 13 & 12.5428\\ 7 & 7 & 14 & 14.5047\\ 8 & 8 & 16 & 16.4666\\ \end{array} \]

We impute row \(9\) (\(X_9=6\)) in the original dataset. The predicted mean for row \(9\) is:

\[ \hat{y}_9^* = 0.7714 + 1.9619(6)=12.5428. \]

The distances are the following for \(i = 1, \dots, 8\):

\[ d_i = \left|\hat{y}_i^*-\hat{y}_9^*\right|. \]

Hence, the \(5\) smallest distances correspond to donor rows:

\[ \{6,5,7,4,8\}. \]

PMM randomly selects one of these 5 donor rows; suppose it selects row \(8\). Then we impute \(Y_9\) with the non-missing value \(Y_8\):

\[Y_9 \leftarrow Y_8 = 16.\]

Now, we impute row \(10\) (\(X_{10}=4\)) in the original dataset. The predicted mean for row \(10\) is:

\[ \hat{y}_{10}^* = 0.7714 + 1.9619(4)=8.6190. \]

The distances are the following for \(i = 1, \dots, 8\):

\[ d_i = \left|\hat{y}_i^*-\hat{y}_{10}^*\right|. \]

Thus, the \(5\) nearest donor rows are:

\[ \{4,3,5,2,6\}. \]

Suppose PMM randomly selects row \(6\). Then we impute \(Y_{10}\) with the non-missing value \(Y_6\):

\[ Y_{10} \leftarrow Y_6 = 13. \]

The imputation results after the simple PMM step are the following (imputed values are shown in bold numbers in rows \(9\) and \(10\)):

\[ \begin{array}{c|c|c} \text{Row} & X & Y\\ \hline 1 & 1 & 3\\ 2 & 2 & 5\\ 3 & 3 & 7\\ 4 & 4 & 9\\ 5 & 5 & 10\\ 6 & 6 & 13\\ 7 & 7 & 14\\ 8 & 8 & 16\\ 9 & 6 & \textbf{16}\\ 10 & 4 & \textbf{13}\\ \end{array} \]
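If you would like to replicate a PMM imputation like this one in R (a minimal sketch assuming the mice package; the donors actually selected depend on the random seed, so the imputed values will generally differ from the hand-worked ones above):

# Minimal sketch: a single PMM imputation (m = 1) on the toy dataset
library(mice)
toy <- data.frame(X = c(1, 2, 3, 4, 5, 6, 7, 8, 6, 4),
                  Y = c(3, 5, 7, 9, 10, 13, 14, 16, NA, NA))

imp <- mice(toy, m = 1, method = "pmm", seed = 123, printFlag = FALSE)
imp$imp$Y            # the donor values selected for rows 9 and 10
complete(imp)        # the completed dataset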

Mixed-effects Modelling

Question 11 (Not to be evaluated in the quiz)

Note: Assume we have the fixed effect parameters of the Grunfeld case from lecture4.

Question: In a linear mixed-effects model, how are the fixed effects (e.g., \(\beta_0, \beta_1, \beta_2\)) and the variance/correlation parameters (e.g., \(\sigma^2, \sigma_0^2, \sigma_1^2, \sigma_2^2, \rho_{uv}\)) actually estimated? What is restricted maximum likelihood (REML) and why do we use it?

Answer: Mixed-effects models are still fit using maximum likelihood ideas, but the likelihood is not a simple product over observations (such as classical maximum likelihood estimation, MLE), because data points within the same group are correlated.

1) Start from the classical MLE picture

In the simplest “introductory MLE” setting for a standalone population parameter \(\theta\), we assume observations \(y_1, \dots, y_n\) are independent so the likelihood looks like

\[ L(\theta | y_1, \dots, y_n) = \prod_{i=1}^{n} f_{Y_i}(y_i\mid \theta), \]

and we choose the MLE \(\hat\theta\) that maximizes \(L(\theta)\) (or the log-likelihood).

2) What changes in a Mixed-effects model: correlation within groups

In lecture4, we indexed years within a firm by \(i\) and firms by \(j\). A random-intercept model is

\[ \texttt{investment}_{i,j} = \beta_0 + \beta_1\,\texttt{marketValue}_{i,j} + \beta_2\,\texttt{capital}_{i,j} + b_{0,j} + \varepsilon_{i,j}, \qquad i = 1,\ldots,n_j, \]

with

\[ b_{0,j} \sim \mathcal{N}(0,\sigma_0^2), \qquad \varepsilon_{i,j} \sim \mathcal{N}(0,\sigma^2), \qquad b_{0,j} \perp\!\!\!\perp \varepsilon_{i,j} \quad \text{(i.e., they are independent)}. \]

Because the same \(b_{0,j}\) is shared by all observations in the \(j\)th firm, responses inside a firm are correlated. In particular, for the same firm \(j\) and two different years \(i\neq i'\),

\[ \mathrm{Cov}(\texttt{investment}_{i,j},\texttt{investment}_{i',j}) = \sigma_0^2, \]

and the marginal variance of each observation is

\[ \mathrm{Var}(\texttt{investment}_{i,j}) = \sigma_0^2 + \sigma^2. \]

So the likelihood is no longer a plain product over all \((i,j)\) observations. Instead, for each firm \(j\), the \(n_j\) outcomes are treated as a jointly Normal set of correlated observations (with the covariance pattern above), and the full likelihood multiplies those firm-level joint densities across firms.

3) What is actually estimated by maximum likelihood in Mixed-effects modelling?

There are two kinds of unknowns in a Mixed-effects model:

  1. Fixed effects: \(\beta_0,\beta_1,\beta_2\) (population-level average relationship).
  2. Variance/correlation parameters: things like \(\sigma^2\) (residual variance) and the random-effect variances/correlations (e.g., \(\sigma_0^2\); and in random-slope models \(\sigma_1^2, \sigma_2^2, \rho_{uv}\)).

A full maximum likelihood (ML) fit chooses values of all these unknowns to make the observed grouped data most likely under the model. In practice, lmer() evaluates the log-likelihood implied by the correlated-Normal structure and maximizes it numerically.

4) Why REML exists (and what it does, in plain MLE language)

REML stands for restricted (or residual) maximum likelihood. The motivation is that ML tends to underestimate variance components in finite samples because it behaves a bit like it “forgets” that we used information to estimate the fixed effects. REML addresses this by constructing a likelihood that depends only on the variance/correlation parameters (like \(\sigma^2,\sigma_0^2,\sigma_1^2,\sigma_2^2,\rho_{uv}\)), using the part of the data that remains after accounting for the fixed-effects trend. Then:

  1. REML chooses the variance/correlation parameters that maximize this restricted likelihood, and
  2. given those variance estimates, the fixed effects \(\beta_0,\beta_1,\beta_2\) are estimated afterward in a way that properly accounts for within-firm correlation.

This is why output often says “fit by REML”: REML is mainly about variance components, and then the fixed effects are computed using those variance estimates.
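In lme4, this choice corresponds to the REML argument of lmer(). The following is a minimal sketch; it assumes a data frame named grunfeld with columns investment, marketValue, capital, and firm, matching the lecture's notation rather than any particular packaged version of the data:

# Minimal sketch: REML (default) versus full ML fits of a random-intercept model
library(lme4)
fit_reml <- lmer(investment ~ marketValue + capital + (1 | firm),
                 data = grunfeld, REML = TRUE)    # "fit by REML" in the output
fit_ml   <- lmer(investment ~ marketValue + capital + (1 | firm),
                 data = grunfeld, REML = FALSE)   # full maximum likelihood

VarCorr(fit_reml)   # variance components under REML (usually a bit larger...)
VarCorr(fit_ml)     # ...than under ML, which tends to underestimate them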

Question 12 (Not to be evaluated in the quiz)

Question: What are the likelihood and log-likelihood functions in REML?

Answer: The big picture is that

  • ML chooses parameters by maximizing the (marginal) likelihood of the observed data.
  • REML chooses the variance/covariance parameters by maximizing a restricted likelihood that corrects for the fact that we also had to learn the fixed effects from the same data.

1) The simplest Mixed-effects model (random intercept; intercept-only fixed effect)

Important

In the below simple intercept-only model, the regressors part is intentionally removed so we can write a closed-form REML likelihood in scalar notation; once regressors like \(\texttt{marketValue}_{i,j}\) and \(\texttt{capital}_{i,j}\) are included, REML is the same idea but the restricted likelihood is optimized numerically rather than shown in a simple group-mean formula.

We work with a random-intercept model where group \(j\) (i.e., a firm) has its own intercept shift:

\[ \texttt{investment}_{i,j} = \beta_0 + b_{0,j} + \varepsilon_{i,j}, \qquad i = 1,\ldots,n_j,\quad j=1,\ldots,J. \]

The assumptions are the following:

\[ b_{0,j} \sim \mathcal{N}(0,\sigma_0^2), \qquad \varepsilon_{i,j} \sim \mathcal{N}(0,\sigma^2), \qquad b_{0,j} \ \perp\!\!\!\perp\ \varepsilon_{i,j}. \]

We define the usual within-group mean:

\[ \overline{\texttt{investment}}_{\cdot,j} = \frac{1}{n_j}\sum_{i=1}^{n_j}\texttt{investment}_{i,j}. \]

and the within-group sum of squares:

\[ \text{SSW} = \sum_{j=1}^J \sum_{i=1}^{n_j} \Big(\texttt{investment}_{i,j} - \overline{\texttt{investment}}_{\cdot,j}\Big)^2. \]

Also define the marginal variance of the group mean:

\[ \tau_j^2 = \mathrm{Var}\!\left(\overline{\texttt{investment}}_{\cdot,j}\right) = \sigma_0^2 + \frac{\sigma^2}{n_j}, \qquad w_j = \frac{1}{\tau_j^2}. \]

2) The REML (restricted) likelihood, in scalar form

REML is easiest to see by splitting the information in the data into two parts:

  1. Within-group deviations \(\texttt{investment}_{i,j}-\overline{\texttt{investment}}_{\cdot,j}\): These tell you about \(\sigma^2\) (because the random intercept cancels when you subtract the group mean).

  2. Between-group variation in the group means \(\overline{\texttt{investment}}_{\cdot,j}\): These tell you about \(\sigma_0^2\) (because group means move when firms have different random intercepts).

But REML also removes the fixed intercept \(\beta_0\) from this second part. The REML “best” estimate of \(\beta_0\) given \(\sigma^2\) and \(\sigma_0^2\) is the weighted mean of group means:

\[ \widehat{\beta}_0(\sigma^2,\sigma_0^2) = \frac{\sum_{j=1}^{11} w_j\,\overline{\texttt{investment}}_{\cdot,j}} {\sum_{j=1}^{11} w_j}. \]

Then the restricted likelihood for \(\sigma^2\) and \(\sigma_0^2\) can be written as

\[ \begin{align*} L_R(\sigma^2,\sigma_0^2) &= (2\pi)^{-\frac{n-1}{2}} \; (\sigma^2)^{-\frac{n-11}{2}} \; \left(\prod_{j=1}^{11} (\tau_j^2)^{-\frac{1}{2}}\right) \; \left(\sum_{j=1}^{11} w_j\right)^{-\frac{1}{2}} \\ & \qquad \exp\!\left[ -\frac{\text{SSW}}{2\sigma^2} -\frac{1}{2}\sum_{j=1}^{11} w_j\Big(\overline{\texttt{investment}}_{\cdot,j} - \widehat{\beta}_0\Big)^2 \right], \end{align*} \]

where \(n = \sum_{j=1}^{11} n_j\) is the total sample size (recall we have \(11\) firms in our dataset), and \(\widehat{\beta}_0\) is shorthand for \(\widehat{\beta}_0(\sigma^2,\sigma_0^2)\).

Important

REML uses \(n-1\) effective pieces of information here (because we used the data to learn one fixed effect, \(\beta_0\)), which is why you see \((2\pi)^{-(n-1)/2}\) rather than \((2\pi)^{-n/2}\).

3) The REML log-likelihood

Taking logs, the restricted log-likelihood is

\[ \begin{align*} \log \left[ L (\sigma^2,\sigma_0^2) \right] &= -\frac{n-1}{2}\log(2\pi) -\frac{n-11}{2}\log(\sigma^2) -\frac{1}{2}\sum_{j=1}^{11} \log(\tau_j^2) - \\ & \qquad \frac{1}{2}\log\!\left(\sum_{j=1}^{11} w_j\right) -\frac{\text{SSW}}{2\sigma^2} - \\ & \qquad \quad \frac{1}{2}\sum_{j=1}^{11} w_j\Big(\overline{\texttt{investment}}_{\cdot,j} - \widehat{\beta}_0\Big)^2. \end{align*} \]

Important

REML fitting means: choose \(\hat\sigma^2\) and \(\hat\sigma_0^2\) that maximize \(\log \left[ L (\sigma^2,\sigma_0^2) \right]\).
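To make this concrete, here is a minimal numerical sketch (simulated data with 11 groups; this is not how lmer() is implemented internally, just a direct maximization of the restricted log-likelihood written above):

# Minimal sketch: maximize the restricted log-likelihood above and compare with lmer()
library(lme4)
set.seed(42)
dat <- data.frame(g = rep(1:11, each = 20))
dat$y <- 5 + rep(rnorm(11, sd = 2), each = 20) + rnorm(220)   # random intercepts + noise

restricted_loglik <- function(log_pars, y, g) {
  sigma2  <- exp(log_pars[1])                         # residual variance
  sigma02 <- exp(log_pars[2])                         # random-intercept variance
  n_j     <- tapply(y, g, length)
  ybar_j  <- tapply(y, g, mean)
  n <- length(y); J <- length(n_j)
  SSW    <- sum((y - ybar_j[as.character(g)])^2)      # within-group sum of squares
  tau2_j <- sigma02 + sigma2 / n_j
  w_j    <- 1 / tau2_j
  beta0_hat <- sum(w_j * ybar_j) / sum(w_j)           # weighted mean of group means
  -(n - 1) / 2 * log(2 * pi) - (n - J) / 2 * log(sigma2) -
    0.5 * sum(log(tau2_j)) - 0.5 * log(sum(w_j)) -
    SSW / (2 * sigma2) - 0.5 * sum(w_j * (ybar_j - beta0_hat)^2)
}

opt <- optim(c(0, 0), restricted_loglik, y = dat$y, g = dat$g,
             control = list(fnscale = -1))            # maximize over (log sigma^2, log sigma_0^2)
exp(opt$par)                                          # REML estimates of (sigma^2, sigma_0^2)

fit <- lmer(y ~ 1 + (1 | g), data = dat, REML = TRUE)
as.data.frame(VarCorr(fit))[, c("grp", "vcov")]       # should closely agree with exp(opt$par)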

4) Quick rationale for the full mixed model from class (with regressors and random slopes)

In the Grunfeld-style model from class, we have the full mixed model:

\[ \begin{align*} \texttt{investment}_{i,j} &= \overbrace{\beta_{0,j}}^{\text{Mixed Effect}} + \overbrace{\beta_{1,j}}^{\text{Mixed Effect}} \times \texttt{marketValue}_{i,j} + \\ & \qquad \overbrace{\beta_{2,j}}^{\text{Mixed Effect}} \times \texttt{capital}_{i,j} + \varepsilon_{i,j} \\ &= (\beta_0 + b_{0,j}) + (\beta_1 + b_{1,j}) \times \texttt{marketValue}_{i,j} + \\ & \qquad (\beta_2 + b_{2,j}) \times \texttt{capital}_{i,j} + \varepsilon_{i,j} \\ & \qquad \qquad \qquad \text{for} \; i = 1, \ldots, n_j \; \; \text{and} \; \; j = 1, \ldots, 11; \end{align*} \]

with \((b_{0,j},b_{1,j},b_{2,j})^{T}\) jointly Normal and correlated (via \(\sigma_0^2,\sigma_1^2,\sigma_2^2\) and \(\rho_{uv}\)). What changes conceptually?

  • The same two goals remain: estimate fixed effects and estimate variance/correlation parameters.
  • But the within versus between split is no longer just a matter of subtracting group means, because the fixed-effects part is now a regression line, not a single mean.
  • REML still works by using the part of the data that is left over after accounting for the fixed-effects regression; that is, it adjusts the likelihood to reflect that we learned \(\beta_0, \beta_1, \beta_2\) from the data first.
  • In practice, the full REML objective becomes more complicated (because the correlation structure within groups depends on \(\sigma^2\) and on all random-effect variances/correlations), so lmer() maximizes the restricted log-likelihood numerically.

Question 13 (Not to be evaluated in the quiz)

Question: In a Mixed-effects regression model, why are the realized random effects not estimated in the same way as fixed effects? If we do not “estimate” them directly, how do we obtain the random-effects values that lmer() reports?

Answer: In our lecture notes, a Mixed-effects model lets the regression line shift (and possibly tilt) from one group to another. For the Grunfeld example, group \(j\) is a firm, and observation \(i\) is one year within that firm. We will use this example to illustrate all the points below.

1) What the random effects are

A mixed-intercept model (random intercept only) is written as

\[ \texttt{investment}_{i,j} = (\beta_0 + b_{0,j}) + \beta_1\,\texttt{marketValue}_{i,j} + \beta_2\,\texttt{capital}_{i,j} + \varepsilon_{i,j}, \qquad i = 1,\ldots,n_j. \]

  • The fixed effects \(\beta_0, \beta_1, \beta_2\) are unknown constants that describe the population-level mean relationship.
  • The random effect \(b_{0,j}\) is the firm-specific deviation from the population intercept, so the firm’s mixed intercept is \(\beta_{0,j} = \beta_0 + b_{0,j}\).
  • We assume \[ b_{0,j} \sim \mathcal{N}(0,\sigma_0^2), \qquad \varepsilon_{i,j} \sim \mathcal{N}(0,\sigma^2), \] and \(b_{0,j}\) is independent of \(\varepsilon_{i,j}\).

A mixed-intercept-and-slopes model (random intercept and random slopes) is written as:

\[ \texttt{investment}_{i,j} = (\beta_0 + b_{0,j}) + (\beta_1 + b_{1,j})\,\texttt{marketValue}_{i,j} + (\beta_2 + b_{2,j})\,\texttt{capital}_{i,j} + \varepsilon_{i,j}. \]

Now \(b_{0,j}, b_{1,j}, b_{2,j}\) are realizations of correlated random variables. In scalar terms, that means we assume (recall the multivariate Normal distribution discussed in class):

\[ \mathbb{E}[b_{0,j}] = \mathbb{E}[b_{1,j}] = \mathbb{E}[b_{2,j}] = 0, \]

with

\[ \mathrm{Var}(b_{0,j}) = \sigma_0^2, \quad \mathrm{Var}(b_{1,j}) = \sigma_1^2, \quad \mathrm{Var}(b_{2,j}) = \sigma_2^2, \]

and correlations (equivalently, covariances)

\[ \mathrm{Corr}(b_{u,j}, b_{v,j}) = \rho_{uv}, \qquad \mathrm{Cov}(b_{u,j}, b_{v,j}) = \sigma_{u,v} = \rho_{uv}\,\sigma_u\,\sigma_v, \qquad u\neq v. \]

This is exactly what it means to say “the random effects are correlated”: within the same firm, a high (or low) intercept deviation tends to co-occur with slope deviations according to the fitted \(\rho_{uv}\) values.

2) Why realized random effects are not estimated like fixed effects

Fixed effects are treated like ordinary regression parameters (similar to the usual OLS estimates): we estimate \(\hat\beta_0, \hat\beta_1, \hat\beta_2\) because those are the population-level quantities we want to learn. Random effects are different: \(b_{0,j}, b_{1,j}, b_{2,j}\) are not “extra terms to estimate such as in OLS”. They are latent realizations whose distribution is part of the model (again, recall the multivariate Normal distribution we saw in class for the random effects). In other words, we do not primarily try to learn each firm’s exact effects \(b\)’s as if they were fixed unknown numbers; instead we learn:

  1. the fixed effects \(\hat\beta_0, \hat\beta_1, \hat\beta_2\) (the population effects), and
  2. the variance components and correlations (e.g., \(\hat\sigma_0^2\), \(\hat\sigma_1^2\), \(\hat\sigma_2^2\), and \(\hat\rho_{uv}\)) that describe how much and in what correlated way firms deviate from that population line.
Important

Only after estimating \(\hat\beta_0, \hat\beta_1, \hat\beta_2\) and all variance components do we compute firm-specific random effects.

3) What lmer() reports for random effects: BLUPs

The numbers reported as “random effects” (i.e., \(\hat b_{0,j}\), \(\hat b_{1,j}\), and \(\hat b_{2,j}\)) are usually BLUPs: Best Linear Unbiased Predictors. Why do they have this name?

  • Predictor: We are predicting the unobserved realization \(b_{0,j}\) (or \(b_{1,j}, b_{2,j}\)) from the observed data.
  • Linear: The predictor can be written as a linear combination of the observed responses.
  • Unbiased: It is constructed so it is centered correctly under the model’s specification.
  • Best: among linear unbiased predictors, it has the smallest mean-squared prediction error.

In linear Mixed-effects models with normality assumptions (as in our lecture), the BLUP is also the conditional mean of the random effect given the observed data, using the fitted variance components. When we plug in estimated variance components (which is what software does), people often call this an empirical BLUP.

BLUP for a random intercept (in a random-intercept-only model)

This case gives a clean scalar formula that shows the main idea.

Step A: remove the fixed part. Define the fixed-effects-only fitted value

\[ \widehat{\texttt{investment}}^{\,(\text{fixed})}_{i,j} = \hat\beta_0 + \hat\beta_1\,\texttt{marketValue}_{i,j} + \hat\beta_2\,\texttt{capital}_{i,j}, \]

and the corresponding residuals

\[ e_{i,j} = \texttt{investment}_{i,j} - \widehat{\texttt{investment}}^{\,(\text{fixed})}_{i,j}. \]

In a random-intercept-only model, these satisfy (conceptually)

\[ e_{i,j} = b_{0,j} + \varepsilon_{i,j}. \]

Step B: average within the group. Let

\[ \overline{e}_{\cdot,j} = \frac{1}{n_j}\sum_{i=1}^{n_j} e_{i,j}. \]

Because \(\varepsilon_{i,j}\) averages out, we have that:

\[ \overline{e}_{\cdot,j} \mid b_{0,j} \sim \mathcal{N}\!\left(b_{0,j},\,\frac{\sigma^2}{n_j}\right). \]

So \(\overline{e}_{\cdot,j}\) is a noisy “measurement” of \(b_{0,j}\).

Step C: combine model belief plus data evidence. Marginally, the model says (note this is a property of the Multivariate normal distribution):

\[ b_{0,j} \sim \mathcal{N}(0,\sigma_0^2), \]

so values far from \(0\) are possible but less likely when \(\sigma_0^2\) is small.

Under these two Normal pieces, the conditional distribution of \(b_{0,j}\) given the data in firm \(j\) is also Normal, and its mean is

\[ \mathbb{E}[b_{0,j} \mid \text{data in firm } j] = \lambda_j\,\overline{e}_{\cdot,j}, \qquad \lambda_j = \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2/n_j}. \]

This conditional mean is the BLUP. In practice, lmer() plugs in variance estimates, so you see

\[ \hat b_{0,j} = \hat\lambda_j\,\overline{e}_{\cdot,j}, \qquad \hat\lambda_j = \frac{\hat\sigma_0^2}{\hat\sigma_0^2 + \hat\sigma^2/n_j}. \]

You can also read off the uncertainty in this prediction from the conditional variance:

\[ \mathrm{Var}(b_{0,j} \mid \text{data in firm } j) = \left(\frac{1}{\sigma_0^2} + \frac{n_j}{\sigma^2}\right)^{-1} = \frac{\sigma_0^2\,\sigma^2/n_j}{\sigma_0^2 + \sigma^2/n_j}. \]

Note that shrinkage happens since \(0 < \lambda_j < 1\):

  • If \(n_j\) is large (many observations in firm \(j\)), then \(\sigma^2/n_j\) is small, \(\lambda_j \approx 1\), and \(\hat b_{0,j}\) is close to the raw firm-average residual \(\overline{e}_{\cdot,j}\).
  • If \(n_j\) is small (little information in firm \(j\)), then \(\sigma^2/n_j\) is large, \(\lambda_j\) is smaller, and \(\hat b_{0,j}\) is pulled toward \(0\).
  • If \(\sigma_0^2\) is small (firms do not differ much), then \(\lambda_j\) is small and everyone’s \(\hat b_{0,j}\) is strongly shrunk toward \(0\).

Finally, the firm-specific intercept used for prediction is exactly what the lecture notes describe for the \(j\)th firm:

\[ \hat\beta_{0,j} = \hat\beta_0 + \hat b_{0,j}. \]
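A minimal R sketch of this shrinkage formula (reusing the hypothetical grunfeld data frame from Question 11) reproduces the random-intercept values that ranef() reports:

# Minimal sketch: reproduce lmer()'s random-intercept BLUPs with the shrinkage formula
library(lme4)
fit <- lmer(investment ~ marketValue + capital + (1 | firm), data = grunfeld)

sigma2_hat  <- sigma(fit)^2                            # estimated residual variance
sigma02_hat <- as.numeric(VarCorr(fit)$firm[1, 1])     # estimated random-intercept variance

e      <- grunfeld$investment - predict(fit, re.form = NA)   # residuals from the fixed part only
e_bar  <- tapply(e, grunfeld$firm, mean)                     # firm-average residuals
n_j    <- tapply(e, grunfeld$firm, length)
lambda <- sigma02_hat / (sigma02_hat + sigma2_hat / n_j)     # shrinkage factors

cbind(manual = lambda * e_bar,                               # hand-computed BLUPs
      lmer   = ranef(fit)$firm[["(Intercept)"]])             # what lmer() reports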

What changes when we also have random slopes?

When \((b_{0,j}, b_{1,j}, b_{2,j})^{T}\) are included, the same idea applies, but now we are predicting three realized random effects at once from firm \(j\)’s data. A helpful scalar way to think about this is that for each firm \(j\), the predicted deviations \(\hat b_{0,j}, \hat b_{1,j}, \hat b_{2,j}\) are the values that balance two forces:

  1. Fit the \(j\)th firm well: They reduce the within-firm residual sum of squares

\[ \sum_{i=1}^{n_j} \Big[ \texttt{investment}_{i,j} - (\hat\beta_0 + b_{0,j}) - (\hat\beta_1 + b_{1,j})\,\texttt{marketValue}_{i,j} - (\hat\beta_2 + b_{2,j})\,\texttt{capital}_{i,j} \Big]^2. \]

  2. Stay plausible under the random-effects distribution: Very large deviations are penalized according to the fitted variances and correlations.
    • If the random effects were independent (i.e, all \(\rho_{uv}=0\)), the penalty would look like a familiar “ridge” term: \[ \frac{b_{0,j}^2}{\hat\sigma_0^2} + \frac{b_{1,j}^2}{\hat\sigma_1^2} + \frac{b_{2,j}^2}{\hat\sigma_2^2}. \]
    • When the random effects are correlated (\(\rho_{uv} \neq 0\)), the penalty also contains cross-terms that couple the deviations, schematically of the form \[ C_{01}\,b_{0,j}b_{1,j} + C_{02}\,b_{0,j}b_{2,j} + C_{12}\,b_{1,j}b_{2,j}, \]

Note that these constants \(C_{uv}\) are determined by \(\hat\sigma_0^2, \hat\sigma_1^2, \hat\sigma_2^2\) and the fitted correlations \(\hat\rho_{uv}\). This is precisely how correlated random variables show up in the computation: evidence about one deviation (say, an intercept shift) informs the predicted slope deviations because the model has learned that they move together.

Takeaway: Fixed effects \(\beta_0,\beta_1,\beta_2\) are estimated parameters about the population mean relationship. Random effects \(b_{0,j},b_{1,j},b_{2,j}\) are group-specific realizations of correlated random variables. We estimate the variance/correlation structure (e.g., \(\sigma_0^2,\sigma_1^2,\sigma_2^2\) and \(\rho_{uv}\)), and then obtain reported random-effects values as BLUP-style conditional predictions that automatically shrink toward \(0\) when a group has limited or noisy data.

Survival Analysis

Question 14 (Not to be evaluated in the quiz)

Question: Can we obtain an unrestricted mean survival time with Cox regression?

Answer: Not reliably, unless you are willing to add extra assumptions about the tail of the survival distribution beyond the observed follow-up window (i.e., beyond your training data). Suppose we have a single regressor \(x\). We know that the unrestricted mean survival time is the total area under the whole survival curve:

\[ \mathbb{E}[T \mid x] \;=\; \int_{0}^{\infty} S(t \mid x)\,dt . \]

A Cox model estimates \(S(t\mid x)\) well over the time range supported by the training data, but it leaves the baseline hazard (and therefore the long-run tail behavior of \(S(t\mid x)\)) unspecified. With right censoring, it is common that the estimated survival curve does not reach \(0\) by the end of follow-up, which means the above integral to \(\infty\) is not identified from the data without extrapolation. This is the same core issue that already appears with Kaplan–Meier: the estimated (unrestricted) mean survival time is undefined if the largest observation is censored, and any “mean” requires a convention or extrapolation.
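A minimal R sketch of this point (assuming the survival package; the lung dataset ships with it):

# Minimal sketch: Cox survival curves rarely reach 0, so only restricted means are identified
library(survival)
cox_fit <- coxph(Surv(time, status) ~ age, data = lung)
sf_cox  <- survfit(cox_fit, newdata = data.frame(age = 60))  # estimated S(t | age = 60)
min(sf_cox$surv)               # with right censoring this usually stays well above 0

km <- survfit(Surv(time, status) ~ 1, data = lung)
print(km, print.rmean = TRUE)  # reports a *restricted* mean (area up to a cutoff), not E[T]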