https://github.com/shangzhi-hong/rfempimp

Multiple Imputation using Chained Random Forests
Host: GitHub
URL: https://github.com/shangzhi-hong/rfempimp
Owner: shangzhi-hong
Created: 2020-03-08T01:04:13.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2022-10-20T08:37:13.000Z (over 3 years ago)
Last Synced: 2025-10-22T03:52:56.045Z (8 months ago)
Topics: imputation, missing-data, random-forest
Language: R
Homepage:
Size: 336 KB
Stars: 5
Watchers: 2
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.Rmd
README

---
output: github_document
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  fig.align = "center"
)
```

# RfEmpImp 

[![CRAN Status Badge](http://www.r-pkg.org/badges/version/RfEmpImp)](https://CRAN.R-project.org/package=RfEmpImp)

[![GitHub Version Badge](https://img.shields.io/static/v1?label=GitHub&message=2.1.8&color=3399ff)](https://github.com/shangzhi-hong/RfEmpImp)

An R package for random-forest-empowered imputation of missing data

## Random-forest-based multiple imputation evolved

`RfEmpImp` is an R package for multiple imputation using chained random forests
(RF).  
It provides prediction-based and node-based multiple imputation algorithms
using random forests, and currently operates under the multiple imputation
computation framework [`mice`](https://CRAN.R-project.org/package=mice).  
For details of the implemented imputation algorithms, please refer to
[arXiv:2004.14823](https://arxiv.org/abs/2004.14823) (further updates soon).

## Installation

Users can install the released version of `RfEmpImp` from CRAN, or the latest
development version from GitHub:  

```r
# Install from CRAN
install.packages("RfEmpImp")
# Install the development version from GitHub
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("shangzhi-hong/RfEmpImp")
# Install from a released source package
# (path_to_source_file is a placeholder for the path to the downloaded file)
install.packages(path_to_source_file, repos = NULL, type = "source")
# Attach
library(RfEmpImp)
```

## Prediction-based imputation

### For mixed types of variables

For data with mixed types of variables, users can call function `imp.rfemp()` to
use the `RfEmp` method, which applies the `RfPred-Emp` method to continuous
variables and the `RfPred-Cate` method to categorical variables
(of type `logical` or `factor`, etc.).  
Starting with version `2.0.0`, the names of parameters were further simplified.
Please refer to the documentation for details.
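
Because parameter names differ between package versions, it can help to check what the installed version accepts before passing arguments to `imp.rfemp()`. The lines below are a small sketch using only base R introspection; they make no assumption about what the parameter names are.

```r
library(RfEmpImp)

# Which version is installed? Parameter names were simplified in 2.0.0.
packageVersion("RfEmpImp")

# Argument names accepted by the installed imp.rfemp()
names(formals(imp.rfemp))

# Full parameter documentation
help("imp.rfemp")
```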

### Prediction-based imputation for continuous variables

For continuous variables, the `RfPred-Emp` method uses the empirical
distribution of the random forest's out-of-bag prediction errors when
constructing the conditional distribution of the variable under imputation, so
that no parametric form is imposed on the prediction errors. Users can set
`method = "rfpred.emp"` in the call to `mice` to use it.
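
The idea of drawing from the empirical out-of-bag error distribution can be illustrated outside the package. The following is a simplified sketch on simulated data, not the package's implementation: it fits one `ranger` forest on the observed cases and adds a randomly drawn out-of-bag error to each prediction. The toy data and all variable names are assumptions made for illustration.

```r
library(ranger)

set.seed(1)
# Toy data (hypothetical): y depends on x1 and x2; 40 values of y are set to missing
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n)
mis <- sample(n, 40)
dat$y[mis] <- NA

# Fit the forest on the observed cases only
obs <- dat[-mis, ]
rf <- ranger(y ~ x1 + x2, data = obs, num.trees = 100)

# For regression forests, rf$predictions holds the out-of-bag predictions,
# so these are the out-of-bag prediction errors
oob_err <- obs$y - rf$predictions

# Imputation = point prediction + a random draw from the empirical OOB errors
pred_mis <- predict(rf, data = dat[mis, c("x1", "x2")])$predictions
dat$y[mis] <- pred_mis + sample(oob_err, length(mis), replace = TRUE)
```

Within `mice`, a step of this kind would be repeated for every incomplete variable in every iteration; the sketch shows a single step for a single variable.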

In the `RfPred-Norm` method, normality is instead assumed for the RF prediction
errors, as proposed by Shah *et al.* Users can set `method = "rfpred.norm"`
in the call to `mice` to use it.
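
As a minimal sketch of the two `method` settings named above, both can be passed directly to `mice()`. The example assumes the `nhanes` data shipped with `mice`, in which all columns are numeric and are therefore treated as continuous here; the seed and the values of `m` and `maxit` are arbitrary choices.

```r
library(mice)      # provides mice(), complete() and the nhanes data
library(RfEmpImp)  # makes the "rfpred.emp" and "rfpred.norm" methods available

set.seed(2020)
# Empirical out-of-bag error distribution
imp_emp <- mice(nhanes, method = "rfpred.emp", m = 5, maxit = 5,
                printFlag = FALSE)
# Normality assumed for the prediction errors
imp_norm <- mice(nhanes, method = "rfpred.norm", m = 5, maxit = 5,
                 printFlag = FALSE)

# Compare the first completed data set from each method
summary(complete(imp_emp, 1)$chl)
summary(complete(imp_norm, 1)$chl)
```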

### Prediction-based imputation for categorical variables

For categorical variables, the `RfPred-Cate` method uses probability machine
theory: the category imputed for each missing observation is based on the
class probabilities predicted for that observation. Users can set
`method = "rfpred.cate"` in the call to `mice` to use it.
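
When a data set mixes variable types, `mice()` also accepts one method per column. The sketch below assumes the `nhanes2` data from `mice`, where `age` and `hyp` are factors and `bmi` and `chl` are numeric; the method vector is an illustrative choice, not a recommendation from the package.

```r
library(mice)
library(RfEmpImp)

set.seed(2020)
# One method per column of nhanes2; age is complete, so it needs no method
meth <- c(age = "", bmi = "rfpred.emp", hyp = "rfpred.cate", chl = "rfpred.emp")
imp <- mice(nhanes2, method = meth, m = 5, maxit = 5, printFlag = FALSE)

# Imputed categories stay within the factor levels of hyp
table(complete(imp, 1)$hyp)
```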

### Example for prediction-based imputation

```r
# Attach packages (mice provides the nhanes data and pool())
library(mice)
library(RfEmpImp)
# Prepare data: convert age and hyp to factors
df <- conv.factor(nhanes, c("age", "hyp"))
# Do imputation
imp <- imp.rfemp(df)
# Do analyses
regObj <- with(imp, lm(chl ~ bmi + hyp))
# Pool analyzed results
poolObj <- pool(regObj)
# Extract estimates
res <- reg.ests(poolObj)
```

## Node-based imputation

For continuous or categorical variables, the observations under the predicting
nodes of random forest are used as candidates for imputation.  
Two methods are now available for the `RfNode` algorithm series.  
Note that categorical variables should be of type `logical` or `factor`, etc.
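
The notion of drawing candidates from predicting nodes can be sketched with `ranger` alone. This is a simplified illustration on simulated data, not the package's algorithm: for each missing value it picks one tree at random, finds the observed cases that fall into the same terminal node, and copies one of their values. It uses all observed cases as donors and ignores in-bag weights; the toy data and the helper `draw_one()` are hypothetical.

```r
library(ranger)

set.seed(1)
# Toy data (hypothetical): 40 values of y are set to missing
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n)
mis <- sample(n, 40)
dat$y[mis] <- NA
obs <- dat[-mis, ]
xcols <- c("x1", "x2")

rf <- ranger(y ~ x1 + x2, data = obs, num.trees = 50)

# Terminal node IDs: one row per observation, one column per tree
nodes_obs <- predict(rf, data = obs[, xcols], type = "terminalNodes")$predictions
nodes_mis <- predict(rf, data = dat[mis, xcols], type = "terminalNodes")$predictions

draw_one <- function(i) {
  tree <- sample.int(ncol(nodes_mis), 1)                    # pick a random tree
  donors <- which(nodes_obs[, tree] == nodes_mis[i, tree])  # observed cases in the same node
  obs$y[donors[sample.int(length(donors), 1)]]              # copy one donor's observed value
}
dat$y[mis] <- vapply(seq_along(mis), draw_one, numeric(1))
```

Because imputed values are copied from observed donors, they always lie within the observed range of the variable.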

### Node-based imputation using predicting nodes

Users can call function `imp.rfnode.cond()` to use the `RfNode-Cond` method,
which performs imputation using the conditional distribution formed by the
predicting nodes.  
The weight changes of observations caused by the bootstrapping of random
forest are taken into account, and only the "in-bag" observations are used as
candidates for imputation.  
Users can also set `method = "rfnode.cond"` in the call to `mice` to use it.

### Node-based imputation using proximities

Users can call function `imp.rfnode.prox()` to use the `RfNode-Prox` method,
which performs imputation using the proximity matrices of random forests.  
All observations that fall under the same predicting nodes are used as
candidates for imputation, including the out-of-bag ones.  
Users can also set `method = "rfnode.prox"` in the call to `mice` to use it.
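
As a minimal sketch, both node-based methods can also be requested through `mice()` directly. The example assumes the `nhanes2` data from `mice`, which contains both factor and numeric columns; the seed and the values of `m` and `maxit` are arbitrary.

```r
library(mice)
library(RfEmpImp)

set.seed(2020)
# Candidates restricted to in-bag observations of the predicting nodes
imp_cond <- mice(nhanes2, method = "rfnode.cond", m = 5, maxit = 5,
                 printFlag = FALSE)
# Candidates from proximities, including out-of-bag observations
imp_prox <- mice(nhanes2, method = "rfnode.prox", m = 5, maxit = 5,
                 printFlag = FALSE)

# Inspect the first completed data set from each method
head(complete(imp_cond, 1))
head(complete(imp_prox, 1))
```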

### Example for node-based imputation

```r
# Attach packages (mice provides the nhanes data and pool())
library(mice)
library(RfEmpImp)
# Prepare data: convert age and hyp to factors
df <- conv.factor(nhanes, c("age", "hyp"))
# Do imputation
imp <- imp.rfnode.cond(df)
# Or: imp <- imp.rfnode.prox(df)
# Do analyses
regObj <- with(imp, lm(chl ~ bmi + hyp))
# Pool analyzed results
poolObj <- pool(regObj)
# Extract estimates
res <- reg.ests(poolObj)
```

## Imputation functions

| Type                        | Impute function   | Univariate sampler        | Variable type |
|-----------------------------|-------------------|---------------------------|---------------|
| Prediction-based imputation | imp.rfemp()       | mice.impute.rfemp()       | Mixed         |
|                             | /                 | mice.impute.rfpred.emp()  | Continuous    |
|                             | /                 | mice.impute.rfpred.norm() | Continuous    |
|                             | /                 | mice.impute.rfpred.cate() | Categorical   |
| Node-based imputation       | imp.rfnode.cond() | mice.impute.rfnode.cond() | Mixed         |
|                             | imp.rfnode.prox() | mice.impute.rfnode.prox() | Mixed         |
|                             | /                 | mice.impute.rfnode()      | Mixed         |

## Package structure

The figure below shows how the imputation functions are organized in this R
package.  



## Support for parallel computation

Random forests can be compute-intensive on their own, and multiple imputation
multiplies that cost: a random forest model is built for each variable
containing missing data in every iteration (usually 5 to 10 iterations), and the
whole process is repeated for each imputation (usually 5 to 20 imputations).
Computational efficiency is therefore crucial for multiple imputation using
chained random forests, especially for large data sets.  
In `RfEmpImp`, the random forest model building process is accelerated using
parallel computation powered by [`ranger`](https://CRAN.R-project.org/package=ranger),
which provides support for parallel computation using native C++.
In our simulations, parallel computation made the imputation process about 4x
faster on a quad-core laptop.
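
To make the cost concrete, the sketch below first counts the forests built in a small run and then times a single `ranger` forest with one and with four threads. It calls `ranger` directly and uses simulated data; the data size, tree count and thread counts are assumptions, and how `RfEmpImp` forwards thread settings to `ranger` should be checked in the package documentation.

```r
library(ranger)

# Rough count of forests in one run:
# incomplete variables x iterations x imputations
n_incomplete <- 3
n_iter <- 5
n_imp <- 5
n_forests <- n_incomplete * n_iter * n_imp
n_forests  # 75

set.seed(1)
# Simulated data (hypothetical): 10 predictors named X1..X10 and a response y
n <- 5000
dat <- data.frame(matrix(rnorm(n * 10), ncol = 10))
dat$y <- dat$X1 + rnorm(n)

# Time the same forest with 1 and with 4 threads
t1 <- system.time(ranger(y ~ ., data = dat, num.trees = 200, num.threads = 1))
t4 <- system.time(ranger(y ~ ., data = dat, num.trees = 200, num.threads = 4))
rbind(one_thread = t1, four_threads = t4)[, "elapsed"]
```

The measured speed-up depends on the number of available cores and on the data size, so timings from this sketch will vary between machines.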

## References

1. Hong, Shangzhi, et al. "Multiple imputation using chained random forests."
Preprint, submitted April 30, 2020. https://arxiv.org/abs/2004.14823.
2. Zhang, Haozhe, et al. "Random forest prediction intervals."
The American Statistician (2019): 1-15.
3. Wright, Marvin N., and Andreas Ziegler. "ranger: A Fast Implementation of
Random Forests for High Dimensional Data in C++ and R." Journal of Statistical
Software 77.1 (2017).
4. Shah, Anoop D., et al. "Comparison of random forest and parametric imputation
models for imputing missing data using MICE: a CALIBER study." American Journal
of Epidemiology 179.6 (2014): 764-774.
5. Doove, Lisa L., Stef Van Buuren, and Elise Dusseldorp. "Recursive partitioning
for missing data imputation in the presence of interaction effects."
Computational Statistics & Data Analysis 72 (2014): 92-104.
6. Malley, James D., et al. "Probability machines." Methods of Information in
Medicine 51.01 (2012): 74-81.
7. Van Buuren, Stef, and Karin Groothuis-Oudshoorn. "mice: Multivariate Imputation
by Chained Equations in R." Journal of Statistical Software 45.3 (2011).