Multiple Imputation: Predictive Mean Matching

Overview

Predictive mean matching (PMM) is a technique for imputing missing values. It fits a regression model to the complete data and computes a predicted value for each missing entry, then imputes an observed value taken from a case whose prediction is closest to it. Because imputed values are always drawn from the observed data, PMM is robust to transformations of the target variable and less vulnerable to model misspecification. More theoretical details for PMM can be found here.

Assumption for PMM: the distribution of a missing entry is the same as that of the observed data of the candidates whose predicted values are closest to the predicted value for the missing entry.
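
The matching step described above can be sketched in base R for a single incomplete variable. This is a simplified illustration under stated assumptions, not the code used by mice: the function name pmm_sketch, its arguments, and the toy data are hypothetical, the predictors are assumed to be fully observed, and the sketch ignores the uncertainty in the regression coefficients that the mice implementation also accounts for.

```r
# Simplified single-variable PMM sketch (hypothetical helper, not the mice code).
pmm_sketch <- function(data, target, predictors, donors = 5) {
  obs <- !is.na(data[[target]])
  # Fit a linear model on the rows where the target is observed.
  fit <- lm(reformulate(predictors, response = target), data = data[obs, ])
  # Predicted means for every row, observed and missing alike.
  pred <- predict(fit, newdata = data)
  y_obs <- data[[target]][obs]
  pred_obs <- pred[obs]
  imputed <- data[[target]]
  k <- min(donors, length(y_obs))
  for (i in which(!obs)) {
    # The k observed cases whose predictions are closest to this one.
    nearest <- order(abs(pred_obs - pred[i]))[seq_len(k)]
    # Draw one donor at random and borrow its observed value.
    pick <- nearest[sample.int(k, 1)]
    imputed[i] <- y_obs[pick]
  }
  imputed
}

# Toy data (made up for illustration): y depends on x, three values removed.
set.seed(1)
toy <- data.frame(x = 1:20)
toy$y <- 2 * toy$x + rnorm(20)
toy$y[c(3, 8, 15)] <- NA

filled <- pmm_sketch(toy, target = "y", predictors = "x", donors = 3)
filled[c(3, 8, 15)]
```

Each imputed value is a value that was actually observed for another case, which is why PMM never produces values outside the observed range.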

Available R package

mice is a powerful R package developed by Stef van Buuren, Karin Groothuis-Oudshoorn and other contributors.

mice implements PMM through the imputation methods pmm and midastouch.

Example

We use the small dataset nhanes included in the mice package. It has 25 rows, and three of its four variables have missing values.

The original NHANES data come from a large national-level survey; part of it is publicly available via the R package nhanes.

library(mice)

Attaching package: 'mice'
The following object is masked from 'package:stats':

    filter
The following objects are masked from 'package:base':

    cbind, rbind
# load example dataset from mice
head(nhanes)
  age  bmi hyp chl
1   1   NA  NA  NA
2   2 22.7   1 187
3   1   NA   1 187
4   3   NA  NA  NA
5   1 20.4   1 113
6   3   NA  NA 184
summary(nhanes)
      age            bmi             hyp             chl       
 Min.   :1.00   Min.   :20.40   Min.   :1.000   Min.   :113.0  
 1st Qu.:1.00   1st Qu.:22.65   1st Qu.:1.000   1st Qu.:185.0  
 Median :2.00   Median :26.75   Median :1.000   Median :187.0  
 Mean   :1.76   Mean   :26.56   Mean   :1.235   Mean   :191.4  
 3rd Qu.:2.00   3rd Qu.:28.93   3rd Qu.:1.000   3rd Qu.:212.0  
 Max.   :3.00   Max.   :35.30   Max.   :2.000   Max.   :284.0  
                NA's   :9       NA's   :8       NA's   :10     
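
The missingness summarised above can also be counted directly. The following is a small sketch that uses base R together with md.pattern() from mice; the object names na_counts and n_complete are chosen here for illustration.

```r
library(mice)

# Number of missing values per variable.
na_counts <- colSums(is.na(nhanes))
na_counts

# Number of rows without any missing value.
n_complete <- sum(complete.cases(nhanes))
n_complete

# Missing data patterns as a matrix (plot suppressed).
md.pattern(nhanes, plot = FALSE)
```

The per-variable counts should agree with the NA's row printed by summary(nhanes).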

Impute with PMM

Imputing with PMM is straightforward: specify method = 'pmm' in the call to mice().

imp_pmm <- mice(nhanes, method = 'pmm', m=5, maxit=10)

 iter imp variable
  1   1  bmi  hyp  chl
  1   2  bmi  hyp  chl
  1   3  bmi  hyp  chl
  1   4  bmi  hyp  chl
  1   5  bmi  hyp  chl
  2   1  bmi  hyp  chl
  2   2  bmi  hyp  chl
  2   3  bmi  hyp  chl
  2   4  bmi  hyp  chl
  2   5  bmi  hyp  chl
  3   1  bmi  hyp  chl
  3   2  bmi  hyp  chl
  3   3  bmi  hyp  chl
  3   4  bmi  hyp  chl
  3   5  bmi  hyp  chl
  4   1  bmi  hyp  chl
  4   2  bmi  hyp  chl
  4   3  bmi  hyp  chl
  4   4  bmi  hyp  chl
  4   5  bmi  hyp  chl
  5   1  bmi  hyp  chl
  5   2  bmi  hyp  chl
  5   3  bmi  hyp  chl
  5   4  bmi  hyp  chl
  5   5  bmi  hyp  chl
  6   1  bmi  hyp  chl
  6   2  bmi  hyp  chl
  6   3  bmi  hyp  chl
  6   4  bmi  hyp  chl
  6   5  bmi  hyp  chl
  7   1  bmi  hyp  chl
  7   2  bmi  hyp  chl
  7   3  bmi  hyp  chl
  7   4  bmi  hyp  chl
  7   5  bmi  hyp  chl
  8   1  bmi  hyp  chl
  8   2  bmi  hyp  chl
  8   3  bmi  hyp  chl
  8   4  bmi  hyp  chl
  8   5  bmi  hyp  chl
  9   1  bmi  hyp  chl
  9   2  bmi  hyp  chl
  9   3  bmi  hyp  chl
  9   4  bmi  hyp  chl
  9   5  bmi  hyp  chl
  10   1  bmi  hyp  chl
  10   2  bmi  hyp  chl
  10   3  bmi  hyp  chl
  10   4  bmi  hyp  chl
  10   5  bmi  hyp  chl
imp_pmm
Class: mids
Number of multiple imputations:  5 
Imputation methods:
  age   bmi   hyp   chl 
   "" "pmm" "pmm" "pmm" 
PredictorMatrix:
    age bmi hyp chl
age   0   1   1   1
bmi   1   0   1   1
hyp   1   1   0   1
chl   1   1   1   0
# imputations for bmi
imp_pmm$imp$bmi
      1    2    3    4    5
1  29.6 35.3 35.3 30.1 22.0
3  22.0 30.1 27.2 22.0 35.3
4  27.4 30.1 20.4 22.7 25.5
6  24.9 24.9 21.7 22.7 27.4
10 27.4 26.3 26.3 27.4 26.3
11 30.1 27.2 30.1 29.6 27.5
12 22.5 28.7 26.3 28.7 22.5
16 35.3 22.7 35.3 27.2 30.1
21 30.1 22.5 30.1 30.1 33.2
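
The imputations can be checked and turned into completed datasets. The sketch below reruns the imputation with a seed and with the iteration log switched off so that it is self-contained; the seed value and the object names imp, observed_bmi and completed_1 are arbitrary choices made for this illustration.

```r
library(mice)

# Same call as above, with a seed for reproducibility and no iteration log.
imp <- mice(nhanes, method = "pmm", m = 5, maxit = 10,
            seed = 2024, printFlag = FALSE)

# PMM borrows observed values, so every imputed bmi should already
# occur among the observed bmi values.
observed_bmi <- nhanes$bmi[!is.na(nhanes$bmi)]
all(unlist(imp$imp$bmi) %in% observed_bmi)

# Extract the first of the five completed datasets.
completed_1 <- complete(imp, 1)
anyNA(completed_1)
```

complete(imp, 1) returns the original data with the missing entries replaced by the first set of imputations; changing the second argument selects another imputed dataset.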

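An alternative to the standard PMM is midastouch, a PMM variant in mice that uses distance-aided selection of donors. It is requested with method = 'midastouch'.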

imp_pmms <- mice(nhanes, method = 'midastouch', m=5, maxit=10)

 iter imp variable
  1   1  bmi  hyp  chl
  1   2  bmi  hyp  chl
  1   3  bmi  hyp  chl
  1   4  bmi  hyp  chl
  1   5  bmi  hyp  chl
  2   1  bmi  hyp  chl
  2   2  bmi  hyp  chl
  2   3  bmi  hyp  chl
  2   4  bmi  hyp  chl
  2   5  bmi  hyp  chl
  3   1  bmi  hyp  chl
  3   2  bmi  hyp  chl
  3   3  bmi  hyp  chl
  3   4  bmi  hyp  chl
  3   5  bmi  hyp  chl
  4   1  bmi  hyp  chl
  4   2  bmi  hyp  chl
  4   3  bmi  hyp  chl
  4   4  bmi  hyp  chl
  4   5  bmi  hyp  chl
  5   1  bmi  hyp  chl
  5   2  bmi  hyp  chl
  5   3  bmi  hyp  chl
  5   4  bmi  hyp  chl
  5   5  bmi  hyp  chl
  6   1  bmi  hyp  chl
  6   2  bmi  hyp  chl
  6   3  bmi  hyp  chl
  6   4  bmi  hyp  chl
  6   5  bmi  hyp  chl
  7   1  bmi  hyp  chl
  7   2  bmi  hyp  chl
  7   3  bmi  hyp  chl
  7   4  bmi  hyp  chl
  7   5  bmi  hyp  chl
  8   1  bmi  hyp  chl
  8   2  bmi  hyp  chl
  8   3  bmi  hyp  chl
  8   4  bmi  hyp  chl
  8   5  bmi  hyp  chl
  9   1  bmi  hyp  chl
  9   2  bmi  hyp  chl
  9   3  bmi  hyp  chl
  9   4  bmi  hyp  chl
  9   5  bmi  hyp  chl
  10   1  bmi  hyp  chl
  10   2  bmi  hyp  chl
  10   3  bmi  hyp  chl
  10   4  bmi  hyp  chl
  10   5  bmi  hyp  chl
# imputations for bmi (midastouch)
imp_pmms$imp$bmi
      1    2    3    4    5
1  35.3 22.0 30.1 30.1 30.1
3  30.1 30.1 30.1 30.1 30.1
4  27.2 28.7 25.5 25.5 21.7
6  24.9 35.3 25.5 25.5 21.7
10 21.7 33.2 22.7 27.5 28.7
11 35.3 22.0 30.1 30.1 30.1
12 21.7 22.5 35.3 27.5 28.7
16 27.2 33.2 30.1 30.1 27.5
21 26.3 22.0 30.1 30.1 33.2
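
To see how the two methods affect a downstream analysis, the same model can be fitted to each set of imputed datasets and pooled. The following is a minimal sketch; the regression chl ~ age + bmi is a hypothetical analysis model chosen only for illustration, and the seed and object names are arbitrary.

```r
library(mice)

# Impute with both methods (seeded, iteration log suppressed).
imp_a <- mice(nhanes, method = "pmm", m = 5, maxit = 10,
              seed = 2024, printFlag = FALSE)
imp_b <- mice(nhanes, method = "midastouch", m = 5, maxit = 10,
              seed = 2024, printFlag = FALSE)

# Fit the illustrative model on each completed dataset.
fit_a <- with(imp_a, lm(chl ~ age + bmi))
fit_b <- with(imp_b, lm(chl ~ age + bmi))

# Pool the estimates across imputations with Rubin's rules.
pooled_a <- summary(pool(fit_a))
pooled_b <- summary(pool(fit_b))
pooled_a
pooled_b
```

With only 25 rows, differences between the two sets of pooled estimates are likely to reflect simulation noise as much as the imputation method, so the comparison is illustrative rather than conclusive.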

Reference

Stef van Buuren, Karin Groothuis-Oudshoorn (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. DOI 10.18637/jss.v045.i03