DSCI_562_regr-2

Lecture 8: Missing Data

Worksheet: html; .Rmd

Identify and explain the three common types of missing data mechanisms.
Identify a potential consequence of removing missing data on downstream analyses.
Identify a potential consequence of a mean imputation method on downstream analyses.
Identify the three steps involved with a multiple imputation method for handling missing data.
Use the mice package in R to fit multiply imputed models.

Tutorials on the mice package, in increasing order of depth:

my brief mice tutorial
data science plus
jstat article on mice R package
- Extensively covers the idea behind the mice package.
- Section 2.2 provides a good overview of the package workflow, which I recommend reading.
- Section 2.3 provides details on the imputation. It’s theoretical, so only take a look if you really need to know how the imputation is happening.
- Section 2.4 provides an example of using the mice package. Perhaps more verbose than it needs to be?

If you want a deeper look on multiple imputation:

Book: “Multiple Imputation for Nonresponse in Surveys” by Rubin (1987), available through UBC library.
- Extensively covers the theory behind multiple imputation.
- Page 76-77 covers pooling of models, and uses variable names that you see in the output of mice::pool().

There are three common missing data mechanisms:
- Missing Completely At Random (MCAR): when the chance of missingness does not depend on any variable; missingness is totally random.
- Missing At Random (MAR): when the chance of missingness depends on other observed variables.
- Missing Not At Random (MNAR): when the chance of missingness depends on unobserved variables.
Proceeding with an analysis by removing missing data can result in a model with standard errors of the estimates that are larger than they could be by including partially complete records.
Proceeding with an analysis by imputing missing data by an estimate of the mean can result in a model with standard errors of the estimates that are smaller than they ought to be.
An approach that uses the information contained in partially complete records, yet does not assume any more information, is to use multiple imputations. The approach contains three steps:
1. Form multiple datasets containing imputed values. Each dataset should be formed by imputing the missing records in each unit/row with a random draw from a predictive distribution for those records.
2. Fit the model of interest on each imputed dataset.
3. Combine the models to obtain one final model.

This site is open source. Improve this page.