The aim of this vignette is to introduce {missRanger} for imputation of missing values and to explain how to use it for multiple imputation.
{missRanger} uses the {ranger} package (Wright and Ziegler 2017) to do fast missing value imputation by chained random forests. As such, it can be used as an alternative to {missForest}, a beautiful algorithm introduced by Stekhoven and Buehlmann (2011). Basically, each variable is imputed by predictions from a random forest that uses all other variables as covariates. The main function missRanger() iterates multiple times over all variables until the average out-of-bag prediction error of the models stops improving.
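To make the idea of chained random forests concrete, here is a minimal sketch written directly against {ranger}. It is a simplified illustration, not the code used by missRanger(): the function name chained_rf_sketch() and its arguments are hypothetical, it runs a fixed number of iterations instead of stopping once the out-of-bag error stops improving, and it assumes a {ranger} version that supports the x/y interface.

```r
library(ranger)

# Hypothetical helper for illustration only; missRanger() itself is more refined.
chained_rf_sketch <- function(data, n_iter = 3, num.trees = 50) {
  was_na <- is.na(data)                          # remember where values were missing
  to_impute <- names(data)[colSums(was_na) > 0]  # only columns with missing values

  # Start values: random draws from the observed values of each column
  for (v in to_impute) {
    miss <- was_na[, v]
    data[[v]][miss] <- sample(data[[v]][!miss], sum(miss), replace = TRUE)
  }

  # Fixed number of sweeps (assumption; the real stop rule uses the OOB error)
  for (i in seq_len(n_iter)) {
    for (v in to_impute) {
      miss <- was_na[, v]
      covariates <- setdiff(names(data), v)
      # Fit on rows where v was observed, using all other columns
      fit <- ranger(
        x = data[!miss, covariates, drop = FALSE],
        y = data[[v]][!miss],
        num.trees = num.trees
      )
      # Overwrite the originally missing cells with fresh predictions
      data[[v]][miss] <- predict(fit, data[miss, covariates, drop = FALSE])$predictions
    }
  }
  data
}

set.seed(1)
iris_na <- iris
iris_na[sample(nrow(iris), 30), "Sepal.Length"] <- NA
iris_na[sample(nrow(iris), 30), "Species"] <- NA
filled <- chained_rf_sketch(iris_na)
anyNA(filled)
```

Each sweep re-imputes a variable using the current imputations of all other variables, which is why several iterations are usually needed.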
Why should you consider {missRanger}?
It is fast.
It is flexible and intuitive to apply: for example, calling missRanger(data, . ~ 1) would impute all variables univariately, while missRanger(data, Species ~ Sepal.Width) would use Sepal.Width to impute Species.
It can deal with most realistic variable types, even dates and times, without destroying the original data structure.
It combines random forest imputation with predictive mean matching. This generates realistic variability and avoids “new” values like 0.3334 in a 0-1 coded variable. As a result, missRanger() can be used for realistic multiple imputation scenarios; see e.g. Rubin (1987) for the statistical background.
In the examples below, we will meet two functions from {missRanger}:
generateNA(): To replace values in a data set by missing values.
missRanger(): To impute missing values in a data frame.
We first generate a data set with about 20% missing values per column and fill them again with missRanger().
library(missRanger)
set.seed(84553)
head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
# Generate data with missing values in all columns
irisWithNA <- generateNA(iris, p = 0.2)
head(irisWithNA)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 NA 0.2 setosa
#> 2 4.9 3.0 1.4 NA setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 NA NA NA 0.2 setosa
#> 5 5.0 3.6 1.4 NA setosa
#> 6 5.4 3.9 NA 0.4 setosa
# Impute missing values with missRanger
irisImputed <- missRanger(irisWithNA, num.trees = 100, verbose = 0)
head(irisImputed)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.100000 3.500000 1.503583 0.2000000 setosa
#> 2 4.900000 3.000000 1.400000 0.2845833 setosa
#> 3 4.700000 3.200000 1.300000 0.2000000 setosa
#> 4 5.673567 3.273117 2.505867 0.2000000 setosa
#> 5 5.000000 3.600000 1.400000 0.1914333 setosa
#> 6 5.400000 3.900000 1.509900 0.4000000 setosa
It worked! Unfortunately, the new values look somewhat unnatural due to different rounding. To avoid this, we just set the pmm.k argument to a positive number. All imputations done during the process are then combined with a predictive mean matching (PMM) step, leading to more natural imputations and improved distributional properties of the resulting values:
irisImputed <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100, verbose = 0)
head(irisImputed)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 5.8 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.4 0.4 setosa
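The PMM step can be pictured with a few lines of base R. The sketch below is an illustration under simplifying assumptions, not the implementation used by missRanger(): the helper pmm_sketch() and its argument names are hypothetical. For each prediction of a missing value, it looks up the k observed cases with the closest predictions and returns the actually observed value of one of them, picked at random.

```r
# Hypothetical PMM helper for illustration; numeric responses only
pmm_sketch <- function(pred_obs, y_obs, pred_miss, k = 3) {
  stopifnot(length(pred_obs) == length(y_obs), k >= 1, k <= length(y_obs))
  vapply(pred_miss, function(p) {
    # Indices of the k observed cases whose predictions are closest to p
    nearest <- order(abs(pred_obs - p))[seq_len(k)]
    # Return the observed value of one randomly chosen neighbour
    y_obs[nearest[sample.int(k, 1)]]
  }, FUN.VALUE = numeric(1))
}

set.seed(1)
y_obs     <- c(0, 0, 1, 1, 1)                  # observed 0-1 coded values
pred_obs  <- c(0.10, 0.20, 0.70, 0.80, 0.90)   # model predictions for observed cases
pred_miss <- c(0.3334, 0.85)                   # model predictions for missing cases
imputed <- pmm_sketch(pred_obs, y_obs, pred_miss, k = 3)
imputed
```

Because only observed values are ever returned, a prediction like 0.3334 is replaced by either 0 or 1, and the random choice among the k neighbours adds the variability needed for multiple imputation.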
missRanger() offers a ... argument to pass options to ranger(), e.g. num.trees or min.node.size. How would we use its “extremely randomized trees” variant with 50 trees?
irisImputed_et <- missRanger(
irisWithNA,
pmm.k = 3,
splitrule = "extratrees",
num.trees = 50,
verbose = 0
)
head(irisImputed_et)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.3 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.8 2.7 1.3 0.2 setosa
#> 5 5.0 3.6 1.4 0.4 setosa
#> 6 5.4 3.9 1.3 0.4 setosa
It is that simple!
{missRanger} also plays well together with the pipe:
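A minimal sketch of such a pipeline, assuming R 4.1 or later for the native pipe |> (the {magrittr} pipe %>% would be used the same way):

```r
library(missRanger)

set.seed(1)
piped <- iris |>
  generateNA(p = 0.2) |>                      # add about 20% missing values per column
  missRanger(num.trees = 50, verbose = 0)     # impute them again

head(piped)
```

This works because both generateNA() and missRanger() take the data as their first argument and return a data frame.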
Since {missRanger} 2.4.0, setting data_only = FALSE returns not just the imputed data but a “missRanger” object containing more information.
(imp <- missRanger(irisWithNA, data_only = FALSE, verbose = 0))
#> missRanger object. Extract imputed data via $data
#> - best iteration: 5
#> - best average OOB imputation error: 0.02276647
# Summary
summary(imp)
#> missRanger object. Extract imputed data via $data
#> - best iteration: 5
#> - best average OOB imputation error: 0.02276647
#>
#> Sequence of OOB prediction errors:
#>
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> [1,] 1.00000000 1.05103455 0.355630263 0.21719804 0.075000000
#> [2,] 0.03032799 0.08326453 0.006800955 0.01048203 0.008333333
#> [3,] 0.03118913 0.07219809 0.006673501 0.01263541 0.000000000
#> [4,] 0.02816872 0.07309112 0.005963934 0.01021492 0.000000000
#> [5,] 0.02908707 0.06850496 0.005953581 0.01028676 0.000000000
#> [6,] 0.02947424 0.06854362 0.005493755 0.01056056 0.000000000
#>
#> Corresponding means:
#> [1] 0.53977257 0.02784177 0.02453923 0.02348774 0.02276647 0.02281443
#>
#> First rows of imputed data:
#>
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.494435 0.2000000 setosa
#> 2 4.9 3.0 1.400000 0.2630367 setosa
#> 3 4.7 3.2 1.300000 0.2000000 setosa
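As the printed output indicates, the imputed data frame sits in the $data element of the returned object. A short sketch of how one might pull it out, using a small number of trees to keep the run time low (other elements of the object are not accessed here, since their names are not shown above):

```r
library(missRanger)

set.seed(1)
irisWithNA <- generateNA(iris, p = 0.2)

# Ask for the full "missRanger" object instead of just the data
imp <- missRanger(irisWithNA, num.trees = 20, data_only = FALSE, verbose = 0)

imputed <- imp$data   # the imputed data frame
head(imputed)
anyNA(imputed)
```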
By default, missRanger() uses all columns in the data set to impute all columns with missings. To override this behaviour, you can use an intuitive formula interface: the left-hand side specifies the variables to be imputed (variable names separated by a +), while the right-hand side lists the variables used for imputation.
# Impute all variables with all (default behaviour). Note that variables without
# missing values will be skipped from the left hand side of the formula.
m <- missRanger(
irisWithNA, formula = . ~ ., pmm.k = 3, num.trees = 10, seed = 1, verbose = 0
)
head(m)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.6 0.2 setosa
#> 2 4.9 3.0 1.4 0.3 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 5.5 3.6 4.4 0.2 setosa
#> 5 5.0 3.6 1.4 0.3 setosa
#> 6 5.4 3.9 1.4 0.4 setosa
# Same
m <- missRanger(irisWithNA, pmm.k = 3, num.trees = 10, seed = 1, verbose = 0)
head(m)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.6 0.2 setosa
#> 2 4.9 3.0 1.4 0.3 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 5.5 3.6 4.4 0.2 setosa
#> 5 5.0 3.6 1.4 0.3 setosa
#> 6 5.4 3.9 1.4 0.4 setosa
# Impute all variables with all except Species
m <- missRanger(irisWithNA, . ~ . - Species, pmm.k = 3, num.trees = 10, verbose = 0)
head(m)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 3.5 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 5.4 2.9 3.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.4 0.4 setosa
# Impute Sepal.Width by Species
m <- missRanger(
irisWithNA, Sepal.Width ~ Species, pmm.k = 3, num.trees = 10, verbose = 0
)
head(m)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 NA 0.2 setosa
#> 2 4.9 3.0 1.4 NA setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 NA 3.0 NA 0.2 setosa
#> 5 5.0 3.6 1.4 NA setosa
#> 6 5.4 3.9 NA 0.4 setosa
# No success. Why? Species contains missing values and thus can only
# be used for imputation if it is being imputed as well
m <- missRanger(
irisWithNA, Sepal.Width + Species ~ Species, pmm.k = 3, num.trees = 10, verbose = 0
)
head(m)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 NA 0.2 setosa
#> 2 4.9 3.0 1.4 NA setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 NA 3.8 NA 0.2 setosa
#> 5 5.0 3.6 1.4 NA setosa
#> 6 5.4 3.9 NA 0.4 setosa
# Impute all variables univariately
m <- missRanger(irisWithNA, . ~ 1, verbose = 0)
head(m)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 6.7 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 5.4 3.3 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 1.3 setosa
#> 6 5.4 3.9 1.5 0.4 setosa
missRanger() is based on iteratively fitting random forests for each variable with missing values. Since the underlying random forest implementation ranger() uses 500 trees by default, a huge number of trees might be calculated. For larger data sets, the overall process can take very long.
Here are tweaks to make things faster:
Use fewer trees, e.g. by setting num.trees = 50. Even one single tree might be sufficient. Typically, the number of iterations until convergence will increase with fewer trees though.
Use smaller bootstrap samples by setting e.g. sample.fraction = 0.1.
Use the less greedy splitrule = "extratrees".
Use a low tree depth, e.g. max.depth = 6.
Use large leaves, e.g. min.node.size = 10000.
Use a low maxiter, e.g. 1 or 2.
Evaluated on a normal laptop:
library(ggplot2) # for diamonds data
dim(diamonds) # 53940 10
diamonds_with_NA <- generateNA(diamonds)
# Takes 270 seconds (10 * 500 trees per iteration!)
system.time(
m <- missRanger(diamonds_with_NA, pmm.k = 3)
)
# Takes 19 seconds
system.time(
m <- missRanger(diamonds_with_NA, pmm.k = 3, num.trees = 50)
)
# Takes 6 seconds
system.time(
m <- missRanger(diamonds_with_NA, pmm.k = 3, num.trees = 1)
)
# Takes 9 seconds
system.time(
m <- missRanger(diamonds_with_NA, pmm.k = 3, num.trees = 50, sample.fraction = 0.1)
)
Use case.weights to weight down the contribution of rows with many missings
Using the case.weights argument, you can pass case weights to the imputation models. This might be useful to weight down the contribution of rows with many missings.
# Count the number of non-missing values per row
non_miss <- rowSums(!is.na(irisWithNA))
table(non_miss)
#> non_miss
#> 1 2 3 4 5
#> 2 6 28 68 46
# No weighting
m <- missRanger(irisWithNA, num.trees = 20, pmm.k = 3, seed = 5, verbose = 0)
head(m)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.5 0.2 setosa
#> 2 4.9 3.0 1.4 0.1 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 5.7 3.8 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.5 0.4 setosa
# Weighted by number of non-missing values per row.
m <- missRanger(
irisWithNA, num.trees = 20, pmm.k = 3, seed = 5, verbose = 0, case.weights = non_miss
)
head(m)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.3 0.2 setosa
#> 2 4.9 3.0 1.4 0.1 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 5.4 3.4 1.4 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.1 0.4 setosa