Missing Data
By the end of this lecture, you should be able to:
{mice} package in R to fit multiple imputed models.
NA in one of the columns).
Heads-up: It is the ideal class of missing data since there is no missingness pattern.
time, Camera_1, …, Camera_100.
time.
Example
With the previous example, if we did not record time, we would have MNAR data, since we would have no information other than the car counts.
Another missingness cause of this type would be that, as the cameras get older, their probability of failure increases, but we cannot check this from our observed data.
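The difference between the three missingness classes can be sketched with a small simulation. This is only an illustration under assumed settings: the variables `time` and `count`, the sample size, and the missingness probabilities are all made up for this sketch and do not come from the camera data.

```r
set.seed(562)  # arbitrary seed so the sketch is reproducible

n <- 10000
time  <- runif(n, 0, 24)               # hour of the day (hypothetical)
count <- rpois(n, lambda = 10 + time)  # car count that grows with the hour (made-up relationship)

# MCAR: every record has the same 30% chance of being lost.
miss_mcar <- runif(n) < 0.3

# MAR: the chance of a missing count depends only on the observed `time`.
miss_mar <- runif(n) < plogis(-3 + 0.2 * time)

# MNAR: the chance of a missing count depends on the unobserved count itself.
miss_mnar <- runif(n) < plogis(-6 + 0.2 * count)

# Bias of the complete-case mean under each mechanism.
true_mean <- mean(count)
bias <- c(
  MCAR = mean(count[!miss_mcar]),
  MAR  = mean(count[!miss_mar]),
  MNAR = mean(count[!miss_mnar])
) - true_mean
round(bias, 2)
```

In this sketch the complete-case mean stays close to the truth only under MCAR. Under MAR the bias can be addressed because `time` is observed, whereas under MNAR the cause of the missingness is not in the observed data.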
Are these higher standard errors related to the MCAR nature of the deleted rows?
A. Yes.
B. No.
flights dataset from the package {nycflights13}.
dep_delay: departure delays in minutes.
arr_delay: arrival delays in minutes.
carrier: two-letter carrier abbreviation.
dep_delay per carrier, but we have missing data in dep_delay, which will return a summary with NAs.
carrier, we just use na.rm = TRUE in mean().
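As a minimal base R sketch of this behaviour, the toy data frame below stands in for `flights`; its values are invented for illustration, and `tapply()` is used here only to keep the example free of extra packages.

```r
# Toy stand-in for `flights`; the values below are made up for illustration.
toy_flights <- data.frame(
  carrier   = c("AA", "AA", "AA", "UA", "UA", "UA"),
  dep_delay = c(5, NA, 13, -2, 10, NA)
)

# Without na.rm, any carrier whose delays contain an NA gets an NA mean.
mean_with_na <- tapply(toy_flights$dep_delay, toy_flights$carrier, mean)

# With na.rm = TRUE, NAs are dropped within each carrier before averaging.
mean_without_na <- tapply(
  toy_flights$dep_delay, toy_flights$carrier, mean, na.rm = TRUE
)

mean_with_na
mean_without_na
```

Dropping the NAs this way is listwise deletion within each group, so it is only harmless when the missing values are MCAR.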
flights is not MCAR, we can introduce serious bias in our estimates (means in this case) for some given statistical approach.
A. MCAR.
B. MAR.
C. MNAR.
Heads-up: This imputation technique can only be used with continuous and count-type variables!
flights dataset, for the exercise’s sake (not to be generalized to other data cases), we will filter those observations with NAs in arr_delay OR dep_delay values larger than 100 minutes.
dep_delay > 100 is also depicted in the plot.
dep_delay versus arr_delay with 593 complete observations.
is.na(arr_delay) | dep_delay > 100 (i.e., OR), we would eventually have an imputed subset of observations with negative dep_delay, i.e., flights on time but missing arr_delay.
md.pattern() from the package mice to display missing data patterns.
dep_delay and arr_delay, 53 with arr_delay missing, and 354 with both dep_delay and arr_delay missing.
md.pattern() summary shows that our sample does NOT contain observations where dep_delay is missing and arr_delay is present.
dep_delay and arr_delay.
R!
mice() from the mice package!
arr_delay changes with dep_delay.
mice() using method = "norm.predict".
R!
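To make the idea behind regression imputation concrete, here is a base R sketch on simulated data (not the flights sample). It fits a linear regression on the complete cases and fills each missing arr_delay with its fitted value. This mirrors the idea of method = "norm.predict" but is not the mice implementation; the data frame `sim`, the coefficients, and the number of missing rows are assumptions made for this example.

```r
set.seed(562)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in data; the coefficients and noise level are made up.
n <- 200
sim <- data.frame(dep_delay = runif(n, 100, 300))
sim$arr_delay <- -5 + 1.0 * sim$dep_delay + rnorm(n, sd = 15)
sim$arr_delay[sample(n, 40)] <- NA  # make 40 arrival delays missing

# Fit the regression on complete cases only (lm() drops NA rows by default).
fit <- lm(arr_delay ~ dep_delay, data = sim)

# Replace each missing arr_delay with its predicted value.
is_missing <- is.na(sim$arr_delay)
imputed <- sim
imputed$arr_delay[is_missing] <- predict(
  fit, newdata = sim[is_missing, , drop = FALSE]
)

# Refit on the completed data: the coefficients do not move, because the
# imputed points sit exactly on the fitted line, but the residual standard
# error shrinks since those points add no residual noise.
refit <- lm(arr_delay ~ dep_delay, data = imputed)
c(sigma_complete_cases = sigma(fit), sigma_after_imputation = sigma(refit))
```

The smaller residual standard error after imputation shows that a single regression imputation understates uncertainty, which is part of the motivation for multiple imputation.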
mice() function, which offers this multiple imputation method. MICE stands for Multiple Imputation by Chained Equations.
R!
data, the four automatic steps in this R function are the following:
m copies of this dataset: data_1, data_2, …, data_m.
data_1, data_2, …, data_m when you have a continuous variable to impute.
Source: van Buuren (2012).
mice() implements the predictive mean matching (PMM) found in Rubin (1987) on page 166.
NAs), we follow a univariate Bayesian method.
mice()) from all the \(\hat{y}_i^*\) (corresponding to those non-missing rows in \(Y\)) which are the closest ones to each missing case.
m times.
Note that this algorithm also has a multivariate version where NAs can be present in more than one column.
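The univariate PMM idea can be sketched in base R as follows. This is a simplified illustration on simulated data, not the code that mice() runs: the function name `pmm_once`, the simulated variables, and the way the regression parameters are drawn are assumptions for this example. The number of donors is set to 5, which is the documented default of the PMM method in mice.

```r
set.seed(562)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in data (not the flights sample); values are made up.
n <- 200
x <- runif(n, 100, 300)
y <- -5 + x + rnorm(n, sd = 15)
y[sample(n, 40)] <- NA

# One PMM pass (hypothetical helper): returns y with its NAs filled in.
pmm_once <- function(x, y, donors = 5) {
  obs <- !is.na(y)
  fit <- lm(y ~ x, subset = obs)
  beta_hat <- coef(fit)

  # Draw the residual variance, then the coefficients, from their
  # (approximate) posterior distribution.
  sigma2_star <- sum(residuals(fit)^2) / rchisq(1, df = fit$df.residual)
  V <- sigma2_star * summary(fit)$cov.unscaled
  beta_star <- beta_hat + drop(t(chol(V)) %*% rnorm(length(beta_hat)))

  X <- cbind(1, x)
  yhat_obs <- drop(X[obs, , drop = FALSE] %*% beta_hat)    # predictions, observed rows
  yhat_mis <- drop(X[!obs, , drop = FALSE] %*% beta_star)  # predictions, missing rows

  y_obs <- y[obs]
  y_imp <- y
  y_imp[!obs] <- vapply(yhat_mis, function(target) {
    # Donors are the observed rows whose predictions are closest to the target.
    closest <- order(abs(yhat_obs - target))[seq_len(donors)]
    # Borrow the observed value of one randomly chosen donor.
    y_obs[closest[sample.int(donors, 1)]]
  }, numeric(1))
  y_imp
}

# Repeat the whole procedure m times to obtain m imputed versions of y.
m <- 20
imputed_list <- replicate(m, pmm_once(x, y), simplify = FALSE)
```

Because every imputed value is borrowed from an observed row, PMM never produces values outside the observed range of the variable.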
sampled_flights with m = 20.
R!
arr_delay) in these 20 datasets as follows:
m imputed datasets that were created.
m imputed datasets with complete().
R!
We can plot the imputed dataset number 3 as follows:
sampled_flights_imputed_PMM.
estimate column is just the average across all m models. Full details about the computation on these columns are provided by Rubin (1987) on pages 76 and 77.
estimate: the average of the regression coefficients across the m models.
ubar: the average variance (i.e., average SE^2) across the m models.
b: the sample variance of the m regression coefficients.
t: a final estimate of the SE^2 of each regression coefficient: t = ubar + (1 + 1/m) * b.
dfcom: the degrees of freedom of the complete-data analysis.
df: the degrees of freedom associated with the final regression coefficient estimates. A (1 - alpha) confidence interval is estimate +/- qt(1 - alpha/2, df) * sqrt(t).
riv: the relative increase in variance due to missingness: riv = t/ubar - 1.
lambda: the proportion of total variance due to missingness.
fmi: the fraction of missing information.
Where are the usual outputs (std.error, p.value, …)?
We can call summary() on the pooled model to obtain them.
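The pooling formulas for estimate, ubar, b, t, riv, and lambda can be checked by hand with a few lines of base R. The estimates and standard errors below are hypothetical numbers chosen for illustration, and `pool_rubin` is a helper written for this sketch, not a function from mice. It leaves out the degrees of freedom and fmi.

```r
# Hypothetical slope estimates and standard errors from m = 5 imputed-data fits.
est <- c(1.02, 0.98, 1.05, 1.00, 0.95)
se  <- c(0.10, 0.11, 0.10, 0.09, 0.10)

# Pool one coefficient across the m fits using the formulas above.
pool_rubin <- function(est, se) {
  m         <- length(est)
  ubar      <- mean(se^2)               # average within-imputation variance
  b         <- var(est)                 # between-imputation variance
  total_var <- ubar + (1 + 1 / m) * b   # the column called t
  c(
    estimate  = mean(est),
    ubar      = ubar,
    b         = b,
    t         = total_var,
    std.error = sqrt(total_var),
    riv       = (1 + 1 / m) * b / ubar,      # same as t / ubar - 1
    lambda    = (1 + 1 / m) * b / total_var
  )
}

pooled <- pool_rubin(est, se)
round(pooled, 4)
```

The pooled standard error is always at least as large as the square root of ubar, because the between-imputation variance b adds the uncertainty caused by the missing values.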
Data imputation involves some wrangling effort and proper missingness visualizations.
We have to be careful when defining our class of data missingness since this will determine the type of data imputation we need to make (or maybe data deletion!).
In general, multiple imputation will work OK for MCAR and MAR data. MNAR data is harder to handle because the missingness depends on information we did not observe.
We only saw continuous imputation in this example. Nonetheless, the mice() approach can be extended to other data types such as binary or categorical. In those cases, we have to switch to generalized linear models (GLMs), even with a Bayesian approach.

DSCI 562 - Regression II