{"id":32200611,"url":"https://github.com/shangzhi-hong/rfempimp","last_synced_at":"2025-10-22T03:52:59.946Z","repository":{"id":56936240,"uuid":"245730649","full_name":"shangzhi-hong/RfEmpImp","owner":"shangzhi-hong","description":"Multiple Imputation using Chained Random Forests","archived":false,"fork":false,"pushed_at":"2022-10-20T08:37:13.000Z","size":344,"stargazers_count":5,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-10-22T03:52:56.045Z","etag":null,"topics":["imputation","missing-data","random-forest"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shangzhi-hong.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-03-08T01:04:13.000Z","updated_at":"2024-06-26T13:44:53.000Z","dependencies_parsed_at":"2022-08-21T01:10:29.417Z","dependency_job_id":null,"html_url":"https://github.com/shangzhi-hong/RfEmpImp","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/shangzhi-hong/RfEmpImp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shangzhi-hong%2FRfEmpImp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shangzhi-hong%2FRfEmpImp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shangzhi-hong%2FRfEmpImp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shangzhi-hong%2FRfEmpImp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shangzhi-hong","download_url":"https://codeload.github.com/shangzhi-hong/RfEmpI
mp/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shangzhi-hong%2FRfEmpImp/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":280376536,"owners_count":26320276,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-22T02:00:06.515Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["imputation","missing-data","random-forest"],"created_at":"2025-10-22T03:52:58.936Z","updated_at":"2025-10-22T03:52:59.941Z","avatar_url":"https://github.com/shangzhi-hong.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\noutput: github_document\n---\n\n\u003c!-- README.md is generated from README.Rmd. 
Please edit that file --\u003e\n\n```{r setup, include = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"man/figures/README-\",\n  out.width = \"100%\",\n  fig.align = \"center\"\n)\n```\n\n# RfEmpImp \u003ca href='https://github.com/shangzhi-hong/RfEmpImp'\u003e\u003cimg src='man/figures/logo.png' height=\"160\" style=\"float:right;\"/\u003e\u003c/a\u003e\n\n[![CRAN Status Badge](http://www.r-pkg.org/badges/version/RfEmpImp)](https://CRAN.R-project.org/package=RfEmpImp)\n[![GitHub Version Badge](https://img.shields.io/static/v1?label=GitHub\u0026message=2.1.8\u0026color=3399ff)](https://github.com/shangzhi-hong/RfEmpImp)\n\nAn R package for random-forest-empowered imputation of missing data\n\n## Random-forest-based multiple imputation evolved\n`RfEmpImp` is an R package for multiple imputation using chained random forests\n(RF).  \nThis R package provides prediction-based and node-based multiple imputation\nalgorithms using random forests, and currently runs within the multiple\nimputation framework [`mice`](https://CRAN.R-project.org/package=mice).  
\nFor more details on the implemented imputation algorithms, please refer to\n[arXiv:2004.14823](https://arxiv.org/abs/2004.14823) (further updates soon).\n\n\n## Installation\nUsers can install the released version of `RfEmpImp` from CRAN, or the latest\ndevelopment version from GitHub:  \n```r\n# Install from CRAN\ninstall.packages(\"RfEmpImp\")\n# Install from GitHub online\nif (!requireNamespace(\"remotes\", quietly = TRUE)) install.packages(\"remotes\")\nremotes::install_github(\"shangzhi-hong/RfEmpImp\")\n# Install from a released source package (set path_to_source_file to the downloaded file)\ninstall.packages(path_to_source_file, repos = NULL, type = \"source\")\n# Attach\nlibrary(RfEmpImp)\n```\n\n\n## Prediction-based imputation\n### For mixed types of variables\nFor data with mixed types of variables, users can call the function `imp.rfemp()` to\nuse the `RfEmp` method, which applies the `RfPred-Emp` method to continuous variables and\nthe `RfPred-Cate` method to categorical variables\n(of type `logical` or `factor`, etc.).  \nStarting with version `2.0.0`, the parameter names were further simplified;\nplease refer to the documentation for details.\n\n### Prediction-based imputation for continuous variables\nFor continuous variables, the `RfPred-Emp` method uses the empirical distribution of\nthe random forest's out-of-bag prediction errors when constructing the\nconditional distributions of the variable under imputation, providing conditional\ndistributions with better quality. 
Users can set `method = \"rfpred.emp\"` in\nthe function call to `mice` to use it.\n\nAlso, the `RfPred-Norm` method assumes normality for RF prediction errors,\nas proposed by Shah *et al.*, and users can set `method = \"rfpred.norm\"`\nin the function call to `mice` to use it.\n\n### Prediction-based imputation for categorical variables\nFor categorical variables, the `RfPred-Cate` method uses probability machine\ntheory, and the predictions of missing categories are based on the\npredicted probabilities for each missing observation. Users can set\n`method = \"rfpred.cate\"` in the function call to `mice` to use it.\n\n### Example for prediction-based imputation\n```r\n# Prepare data: convert \"age\" and \"hyp\" to factors\ndf \u003c- conv.factor(nhanes, c(\"age\", \"hyp\"))\n# Do imputation\nimp \u003c- imp.rfemp(df)\n# Do analyses\nregObj \u003c- with(imp, lm(chl ~ bmi + hyp))\n# Pool analyzed results\npoolObj \u003c- pool(regObj)\n# Extract estimates\nres \u003c- reg.ests(poolObj)\n```\n\n## Node-based imputation\nFor continuous or categorical variables, the observations under the predicting\nnodes of the random forest are used as candidates for imputation.  \nTwo methods are now available in the `RfNode` algorithm series.  \nNote that categorical variables should be of type `logical` or\n`factor`, etc.\n\n### Node-based imputation using predicting nodes\nUsers can call the function `imp.rfnode.cond()` to use the `RfNode-Cond` method,\nwhich performs imputation using the conditional distribution formed by the\npredicting nodes.  \nThe changes in observation weights caused by the bootstrapping in random\nforests are taken into account, and only the \"in-bag\" observations are used as candidates\nfor imputation.  \nAlso, users can set `method = \"rfnode.cond\"` in the function call to `mice` to use\nit.\n\n### Node-based imputation using proximities\nUsers can call the function `imp.rfnode.prox()` to use the `RfNode-Prox` method,\nwhich performs imputation using the proximity matrices of random forests.  
\nAll the observations that fall under the same predicting nodes are used as candidates\nfor imputation, including the out-of-bag ones.  \nAlso, users can set `method = \"rfnode.prox\"` in the function call to `mice`\nto use it.\n\n### Example for node-based imputation\n```r\n# Prepare data: convert \"age\" and \"hyp\" to factors\ndf \u003c- conv.factor(nhanes, c(\"age\", \"hyp\"))\n# Do imputation\nimp \u003c- imp.rfnode.cond(df)\n# Or: imp \u003c- imp.rfnode.prox(df)\n# Do analyses\nregObj \u003c- with(imp, lm(chl ~ bmi + hyp))\n# Pool analyzed results\npoolObj \u003c- pool(regObj)\n# Extract estimates\nres \u003c- reg.ests(poolObj)\n```\n\n\n## Imputation functions\n| Type                        | Imputation function | Univariate sampler        | Variable type |\n|-----------------------------|---------------------|---------------------------|---------------|\n| Prediction-based imputation | imp.rfemp()         | mice.impute.rfemp()       | Mixed         |\n|                             | /                   | mice.impute.rfpred.emp()  | Continuous    |\n|                             | /                   | mice.impute.rfpred.norm() | Continuous    |\n|                             | /                   | mice.impute.rfpred.cate() | Categorical   |\n| Node-based imputation       | imp.rfnode.cond()   | mice.impute.rfnode.cond() | Mixed         |\n|                             | imp.rfnode.prox()   | mice.impute.rfnode.prox() | Mixed         |\n|                             | /                   | mice.impute.rfnode()      | Mixed         |\n\n\n## Package structure\nThe figure below shows how the imputation functions are organized in this R\npackage.  
\n\u003cimg src=\"man/figures/package-structure.png\" alt=\"Package structure\" width = \"80%\"/\u003e\n\n\n## Support for parallel computation\nRandom forests can be compute-intensive on their own, and during multiple\nimputation, random forest models are built for each variable containing missing\ndata over a number of iterations (usually 5 to 10), and the whole process is\nrepeated for each imputation performed (usually 5 to 20).\nComputational efficiency is therefore crucial for multiple imputation\nusing chained random forests, especially for large data sets.  \nIn `RfEmpImp`, random forest model building is accelerated using\nparallel computation powered by [`ranger`](https://CRAN.R-project.org/package=ranger).\nThe `ranger` R package supports parallel computation in native C++.\nIn our simulations, parallel computation provided a substantial speedup\nof the imputation process (about 4x faster on a quad-core laptop).\n\n\n## References\n1. Hong, Shangzhi, et al. \"Multiple imputation using chained random forests.\"\nPreprint, submitted April 30, 2020. https://arxiv.org/abs/2004.14823.\n2. Zhang, Haozhe, et al. \"Random forest prediction intervals.\"\nThe American Statistician (2019): 1-15.\n3. Wright, Marvin N., and Andreas Ziegler. \"ranger: A Fast Implementation of\nRandom Forests for High Dimensional Data in C++ and R.\" Journal of Statistical\nSoftware 77.i01 (2017).\n4. Shah, Anoop D., et al. \"Comparison of random forest and parametric imputation\nmodels for imputing missing data using MICE: a CALIBER study.\" American Journal\nof Epidemiology 179.6 (2014): 764-774.\n5. Doove, Lisa L., Stef Van Buuren, and Elise Dusseldorp. \"Recursive partitioning\nfor missing data imputation in the presence of interaction effects.\"\nComputational Statistics \u0026 Data Analysis 72 (2014): 92-104.\n6. Malley, James D., et al. 
\"Probability machines.\" Methods of information in\nmedicine 51.01 (2012): 74-81.\n7. Van Buuren, Stef, and Karin Groothuis-Oudshoorn. \"mice: Multivariate Imputation\nby Chained Equations in R.\" Journal of Statistical Software 45.i03 (2011).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshangzhi-hong%2Frfempimp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshangzhi-hong%2Frfempimp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshangzhi-hong%2Frfempimp/lists"}