Predictive mean matching (PMM) is a technique for missing value imputation. For each missing entry, it computes a predicted value from a regression model fitted on the complete data, then imputes one observed value taken from the cases whose predictions are closest to it. Because the imputed values are always real observed values, PMM is robust to transformations and less vulnerable to model misspecification. More theoretical details for PMM can be found here.
Assumption for PMM: the distribution of the missing values is the same as that of the observed data of the candidates whose predicted values are closest to the predicted value for the missing entry.
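The matching step described above can be sketched in a few lines of base R. This is only an illustration under simplifying assumptions, not the implementation used by any package: the function name pmm_impute, its arguments, the toy data and the choice of k = 3 candidates are all hypothetical, and the sketch uses a plain least-squares fit without the parameter draws that a full multiple-imputation procedure would add.

```r
# Minimal PMM sketch in base R (illustrative only).
# pmm_impute(), its arguments and the toy data are hypothetical names/values.
pmm_impute <- function(y, x, k = 3) {
  dat <- data.frame(y = y, x = x)
  obs <- !is.na(dat$y)                       # rows where the target is observed
  fit <- lm(y ~ x, data = dat[obs, ])        # regression fitted on complete rows
  pred <- predict(fit, newdata = dat)        # predicted values for every row
  y_obs <- dat$y[obs]
  pred_obs <- pred[obs]
  y_imp <- dat$y
  for (i in which(!obs)) {
    # distance between this row's prediction and each observed row's prediction
    d <- abs(pred_obs - pred[i])
    donors <- order(d)[seq_len(min(k, length(y_obs)))]  # k closest candidates
    pick <- donors[sample.int(length(donors), 1)]       # choose one at random
    y_imp[i] <- y_obs[pick]                  # borrow that candidate's observed value
  }
  y_imp
}

set.seed(1)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(2.1, NA, 6.2, 7.9, NA, 12.1, 13.8, NA, 18.2, 20.1)
y_imp <- pmm_impute(y, x, k = 3)
y_imp
```

Every imputed entry in y_imp is one of the originally observed values of y, which is why PMM never produces values outside the observed range.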
Available R package
mice is a powerful R package developed by Stef van Buuren, Karin Groothuis-Oudshoorn and other contributors.
We use the small dataset nhanes included in the mice package. It has 25 rows, and three of its four variables have missing values.
The original NHANES data come from a large national-level survey; part of it is publicly available via the R package nhanes.
library(mice)
Attaching package: 'mice'
The following object is masked from 'package:stats':
filter
The following objects are masked from 'package:base':
cbind, rbind
# load example dataset from mice
head(nhanes)
  age  bmi hyp chl
1   1   NA  NA  NA
2   2 22.7   1 187
3   1   NA   1 187
4   3   NA  NA  NA
5   1 20.4   1 113
6   3   NA  NA 184
summary(nhanes)
      age            bmi             hyp             chl
 Min.   :1.00   Min.   :20.40   Min.   :1.000   Min.   :113.0
 1st Qu.:1.00   1st Qu.:22.65   1st Qu.:1.000   1st Qu.:185.0
 Median :2.00   Median :26.75   Median :1.000   Median :187.0
 Mean   :1.76   Mean   :26.56   Mean   :1.235   Mean   :191.4
 3rd Qu.:2.00   3rd Qu.:28.93   3rd Qu.:1.000   3rd Qu.:212.0
 Max.   :3.00   Max.   :35.30   Max.   :2.000   Max.   :284.0
                NA's   :9       NA's   :8       NA's   :10
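The NA's row of the summary applies to bmi, hyp and chl; age has no missing values. A quick way to cross-check the counts per variable, sketched here with base R functions only, is:

```r
library(mice)  # provides the nhanes example data

# Number of missing values in each variable
na_counts <- colSums(is.na(nhanes))
na_counts

# Number of rows without any missing value
sum(complete.cases(nhanes))
```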
Impute with PMM
Imputing with PMM is straightforward: specify the method with method = "pmm" in the call to mice().
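A minimal call might look like the sketch below. The number of imputations (m = 5), the seed and printFlag = FALSE are illustrative choices that are not taken from the text; adjust them to your own analysis.

```r
library(mice)

# Impute the nhanes example data with predictive mean matching.
# m, seed and printFlag are illustrative choices, not prescribed values.
imp <- mice(nhanes, method = "pmm", m = 5, seed = 123, printFlag = FALSE)

# Imputed values for bmi: one row per missing entry, one column per imputation
imp$imp$bmi

# Extract the first completed dataset and confirm nothing is missing
completed <- complete(imp, 1)
colSums(is.na(completed))
```

Each of the m completed datasets can be extracted by changing the second argument of complete().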
Stef van Buuren, Karin Groothuis-Oudshoorn (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. DOI 10.18637/jss.v045.i03