Lecture 8: Missing Data
- Worksheet: using the
- 15-minute break: 10 of which is to be used for instructor evaluations.
- Discussion: types of missingness
- Review of the course (time permitting)
Worksheet: html; .Rmd
- Identify and explain the three common types of missing data mechanisms.
- Identify a potential consequence of removing missing data on downstream analyses.
- Identify a potential consequence of a mean imputation method on downstream analyses.
- Identify the three steps involved with a multiple imputation method for handling missing data.
- Use the
mice package in R to fit multiply imputed models.
Tutorials on the
mice package, in increasing order of depth:
- my brief
- data science plus
- jstat article on
mice R package
- Extensively covers the idea behind the
- Section 2.2 provides a good overview of the package workflow, which I recommend reading.
- Section 2.3 provides details on the imputation. It’s theoretical, so only take a look if you really need to know how the imputation is happening.
- Section 2.4 provides an example of using the
mice package. Perhaps more verbose than it needs to be?
If you want a deeper look on multiple imputation:
- Book: “Multiple Imputation for Nonresponse in Surveys” by Rubin (1987), available through UBC library.
- Extensively covers the theory behind multiple imputation.
- Page 76-77 covers pooling of models, and uses variable names that you see in the output of
- There are three common missing data mechanisms:
- Missing Completely At Random (MCAR): when the chance of missingness does not depend on any variable; missingness is totally random.
- Missing At Random (MAR): when the chance of missingness depends on other observed variables.
- Missing Not At Random (MNAR): when the chance of missingness depends on unobserved variables.
- Proceeding with an analysis by removing missing data can result in a model with standard errors of the estimates that are larger than they could be by including partially complete records.
- Proceeding with an analysis by imputing missing data by an estimate of the mean can result in a model with standard errors of the estimates that are smaller than they ought to be.
- An approach that uses the information contained in partially complete records, yet does not assume any more information, is to use multiple imputations. The approach contains three steps:
- Form multiple datasets containing imputed values. Each dataset should be formed by imputing the missing records in each unit/row with a random draw from a predictive distribution for those records.
- Fit the model of interest on each imputed dataset.
- Combine the models to obtain one final model.