{"id":30278297,"url":"https://github.com/farrellday/miceranger","last_synced_at":"2026-03-07T21:32:23.645Z","repository":{"id":45465970,"uuid":"232341116","full_name":"FarrellDay/miceRanger","owner":"FarrellDay","description":"miceRanger: Fast Imputation with Random Forests in R","archived":false,"fork":false,"pushed_at":"2022-08-24T14:13:39.000Z","size":2135,"stargazers_count":69,"open_issues_count":9,"forks_count":13,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-07-07T22:14:00.694Z","etag":null,"topics":["imputation-methods","machine-learning","mice","missing-data","missing-values","r","random-forests"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FarrellDay.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-01-07T14:23:55.000Z","updated_at":"2025-05-19T09:57:35.000Z","dependencies_parsed_at":"2022-07-14T15:00:29.922Z","dependency_job_id":null,"html_url":"https://github.com/FarrellDay/miceRanger","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/FarrellDay/miceRanger","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FarrellDay%2FmiceRanger","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FarrellDay%2FmiceRanger/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FarrellDay%2FmiceRanger/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FarrellDay%2FmiceRanger/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FarrellDay","download_u
rl":"https://codeload.github.com/FarrellDay/miceRanger/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FarrellDay%2FmiceRanger/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270709088,"owners_count":24631992,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-16T02:00:11.002Z","response_time":91,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["imputation-methods","machine-learning","mice","missing-data","missing-values","r","random-forests"],"created_at":"2025-08-16T12:35:48.215Z","updated_at":"2026-03-07T21:32:23.583Z","avatar_url":"https://github.com/FarrellDay.png","language":"R","readme":"\n[![CRAN\\_Status\\_Badge](http://www.r-pkg.org/badges/version/miceRanger)](http://cran.r-project.org/package=miceRanger)\n[![DEV\\_Version\\_Badge](https://img.shields.io/badge/Dev-1.5.0-blue.svg)](http://cran.r-project.org/package=miceRanger)\n[![MIT\nlicense](http://img.shields.io/badge/license-MIT-brightgreen.svg)](http://opensource.org/licenses/MIT)\n[![Build\nStatus](https://travis-ci.com/FarrellDay/miceRanger.svg?branch=master)](https://travis-ci.com/FarrellDay/miceRanger)\n[![CRAN\\_Downloads](https://cranlogs.r-pkg.org/badges/miceRanger)](https://CRAN.R-project.org/package=miceRanger)\n[![Coverage\nStatus](https://codecov.io/gh/FarrellDay/miceRanger/branch/master/graph/badge.svg)](https://co
decov.io/gh/FarrellDay/miceRanger/branch/master)\n\n## miceRanger: Fast Imputation with Random Forests\n\n\u003ca href='https://github.com/FarrellDay/miceRanger'\u003e\u003cimg src='icon.png' align=\"right\" height=\"300\" /\u003e\u003c/a\u003e\n\nFast, memory efficient Multiple Imputation by Chained Equations (MICE)\nwith random forests. It can impute categorical and numeric data without\nmuch setup, and has an array of diagnostic plots available.\n\nThis document contains a thorough walkthrough of the package,\nbenchmarks, and an introduction to multiple imputation. More information\non MICE can be found in Stef van Buuren’s excellent online book, which\nyou can find\n[here](https://stefvanbuuren.name/fimd/ch-introduction.html).\n\n#### Table of Contents:\n\n  - [Using\n    miceRanger](https://github.com/FarrellDay/miceRanger#Using-miceRanger)\n      - [Simple\n        Example](https://github.com/FarrellDay/miceRanger#Simple-Example)\n      - [Running in\n        Parallel](https://github.com/FarrellDay/miceRanger#Running-in-Parallel)\n      - [Adding\n        Iterations/Datasets](https://github.com/FarrellDay/miceRanger#adding-more-iterationsdatasets)\n      - [Custom Imputation\n        Schemas](https://github.com/FarrellDay/miceRanger#Creating-a-Custom-Imputation-Schema)\n      - [Imputing New Data with Existing\n        Models](https://github.com/FarrellDay/miceRanger#Imputing-New-Data-with-Existing-Models)\n  - [Diagnostic\n    Plotting](https://github.com/FarrellDay/miceRanger#Diagnostic-Plotting)\n      - [Imputed\n        Distributions](https://github.com/FarrellDay/miceRanger#Distribution-of-Imputed-Values)\n      - [Correlation\n        Convergence](https://github.com/FarrellDay/miceRanger#Convergence-of-Correlation)\n      - [Center/Dispersion\n        Convergence](https://github.com/FarrellDay/miceRanger#Center-and-Dispersion-Convergence)\n      - [Model OOB\n        Error](https://github.com/FarrellDay/miceRanger#Model-OOB-Error)\n      - [Variable\n  
      Importance](https://github.com/FarrellDay/miceRanger#Variable-Importance)\n      - [Inter-Dataset\n        Variance](https://github.com/FarrellDay/miceRanger#Imputed-Variance-Between-Datasets)\n  - [Using the Imputed\n    Data](https://github.com/FarrellDay/miceRanger#Using-the-Imputed-Data)  \n  - [Benchmarks](https://github.com/FarrellDay/miceRanger#Benchmarks)  \n  - [The MICE\n    Algorithm](https://github.com/FarrellDay/miceRanger#The-MICE-Algorithm)\n      - [Introduction](https://github.com/FarrellDay/miceRanger#The-MICE-Algorithm)\n      - [Common Use\n        Cases](https://github.com/FarrellDay/miceRanger#Common-Use-Cases)\n      - [Predictive Mean\n        Matching](https://github.com/FarrellDay/miceRanger#Predictive-Mean-Matching)\n\n## Installation\n\nYou can download the latest stable version from CRAN:\n\n``` r\ninstall.packages(\"miceRanger\")\n```\n\nYou can also download the latest development version from this\nrepository:\n\n``` r\nlibrary(devtools)\ndevtools::install_github(\"FarrellDay/miceRanger\")\n```\n\nFor more information about updates, please see\n[NEWS.md](https://github.com/FarrellDay/miceRanger/blob/master/NEWS.md).\n\n## Using miceRanger\n\nIn these examples we will be looking at a simple example of multiple\nimputation. We need to load the packages, and define the data:\n\n``` r\nrequire(miceRanger)\nset.seed(1)\n\n# Load data\ndata(iris)\n\n# Ampute the data. 
iris contains no missing values by default.\nampIris \u003c- amputeData(iris,perc=0.25)\nhead(ampIris,10)\n```\n\n    ##     Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n    ##  1:          5.1         3.5           NA         0.2    \u003cNA\u003e\n    ##  2:          4.9         3.0          1.4         0.2  setosa\n    ##  3:          4.7         3.2          1.3         0.2  setosa\n    ##  4:          4.6         3.1          1.5         0.2  setosa\n    ##  5:          5.0         3.6          1.4         0.2  setosa\n    ##  6:          5.4         3.9          1.7         0.4  setosa\n    ##  7:           NA         3.4          1.4         0.3  setosa\n    ##  8:          5.0         3.4          1.5         0.2  setosa\n    ##  9:          4.4         2.9          1.4         0.2    \u003cNA\u003e\n    ## 10:          4.9         3.1          1.5         0.1  setosa\n\n### Simple example\n\n``` r\n# Perform mice, return 6 datasets. \nseqTime \u003c- system.time(\n  miceObj \u003c- miceRanger(\n      ampIris\n    , m=6\n    , returnModels = TRUE\n    , verbose=FALSE\n  )\n)\nmiceObj\n```\n\n    ## Class:          miceDefs\n    ## Datasets:       6 \n    ## Iterations:     5 \n    ## Total Seconds:  3 \n    ## Imputed Cols:   5 \n    ## Estimated Time per Additional Iteration is 1 Seconds \n    ## Estimated Time per Additional Dataset is 0 Seconds \n    ## \n    ## For additional metrics, see the different plotting functions.\n\nPrinting the miceDefs object will tell you some high level information,\nincluding how long the process took, and how long it estimates adding\nmore datasets/iterations will take.\n\n### Running in Parallel\n\nRunning in parallel is usually not necessary. By default, `ranger` will\nuse all available cores, and `data.table`s assignment by reference is\nalready lightning fast. 
However, in certain cases, we can still save\nsome time by sending each dataset imputation to a different R back end.\nTo do this, we need to set up a core cluster and use `parallel = TRUE`.\n*This causes the dataset to be copied for each back end, which may eat\nup your RAM. If the process is memory constrained, this can cause the\nparallel implementation to actually take more time than the sequential\nimplementation.*\n\n``` r\nlibrary(doParallel)\n\n# Set up back ends.\ncl \u003c- makeCluster(2)\nregisterDoParallel(cl)\n\n# Perform mice \nparTime \u003c- system.time(\n  miceObjPar \u003c- miceRanger(\n      ampIris\n    , m=6\n    , parallel = TRUE\n    , verbose = FALSE\n  )\n)\nstopCluster(cl)\nregisterDoSEQ()\n```\n\nLet’s take a look at the time we saved running in parallel:\n\n``` r\nperc \u003c- round(1-parTime[[3]]/seqTime[[3]],2)*100\nprint(paste0(\"The parallel process ran \",perc,\"% faster using 2 R back ends.\"))\n```\n\n    ## [1] \"The parallel process ran 7% faster using 2 R back ends.\"\n\nWe did not save that much time by running in parallel. `ranger` already\nmakes full use of our CPU. Running in parallel will save you time if you\nare using a high `meanMatchCandidates`, or if you are working with large\ndata and use a low `num.trees`. See\n[`benchmarks/`](https://github.com/FarrellDay/miceRanger/tree/master/benchmarks)\nfor more details.\n\n### Adding More Iterations/Datasets\n\nIf you plot your data and notice that you may need to run more\niterations, or you would like more datasets for your analysis, you can\nuse the following functions:\n\n``` r\nmiceObj \u003c- addIterations(miceObj,iters=2,verbose=FALSE)\nmiceObj \u003c- addDatasets(miceObj,datasets=1,verbose=FALSE)\n```\n\n### Creating a Custom Imputation Schema\n\nIt is possible to customize our imputation procedure by variable. By\npassing a named list to `vars`, you can specify the predictors for each\nvariable to impute.\n
You can also select which variables should be\nimputed using mean matching, as well as the mean matching candidates, by\npassing a named vector to `valueSelector` and `meanMatchCandidates`,\nrespectively:\n\n``` r\nv \u003c- list(\n  Sepal.Width = c(\"Sepal.Length\",\"Petal.Width\",\"Species\")\n  , Sepal.Length = c(\"Sepal.Width\",\"Petal.Width\")\n  , Species = c(\"Sepal.Width\")\n)\nvs \u003c- c(\n    Sepal.Width = \"meanMatch\"\n  , Sepal.Length = \"value\"\n  , Species = \"meanMatch\"\n)\nmmc \u003c- c(\n    Sepal.Width = 4\n  , Species = 10\n)\n\nmiceObjCustom \u003c- miceRanger(\n    ampIris\n  , vars = v\n  , valueSelector = vs\n  , meanMatchCandidates = mmc\n  , verbose=FALSE\n)\n```\n\n### Imputing New Data with Existing Models\n\nMultiple Imputation can take a long time. If you wish to impute a\ndataset using the MICE algorithm, but don’t have time to train new\nmodels, it is possible to impute new datasets using a `miceDefs` object.\nThe `impute` function uses the random forests returned by `miceRanger`\nto perform multiple imputation without updating the random forest at\neach iteration:\n\n``` r\nnewDat \u003c- amputeData(iris)\nnewImputed \u003c- impute(newDat,miceObj,verbose=FALSE)\n```\n\nAll of the imputation parameters (valueSelector, vars, etc) will be\ncarried over from the original `miceDefs` object. When mean matching,\nthe candidate values are pulled from the original dataset. This method\nreturns results just as good as re-running the data through MICE in\nbenchmarking:\n\n\u003cimg src=\"benchmarks/graphics/accuracyImputeVsMICE.png\" width=\"800px\" /\u003e\n\nIn the chart above, a dataset with 15 variables (a-j numeric, k-p\ncategorical) and 51200 rows was imputed using `miceRanger`. A different\ndataset with the same dimensions, but different data, was then imputed\nusing the models created with `miceRanger`. 
See the\n[`benchmarks/`](https://github.com/FarrellDay/miceRanger/tree/master/benchmarks)\nfolder for scripts and more information on this chart. See `?impute` for\nmore details on the function.\n\n## Diagnostic Plotting\n\n`miceRanger` comes with an array of diagnostic plots that tell you how\nvalid the imputations may be, how they are distributed, which variables\nwere used to impute other variables, and so on.\n\n### Distribution of Imputed Values\n\nWe can take a look at the imputed distributions compared to the original\ndistribution for each variable:\n\n``` r\nplotDistributions(miceObj,vars='allNumeric')\n```\n\n![](README_files/figure-gfm/plotDistributions-1.png)\u003c!-- --\u003e\n\nThe red line is the density of the original, nonmissing data. The\nsmaller, black lines are the density of the imputed values in each of\nthe datasets. If these don’t match up, it’s not necessarily a\nproblem; however, it may tell you that your data was not Missing\nCompletely at Random (MCAR).\n\n### Convergence of Correlation\n\nWe are probably interested in knowing how our values between datasets\nconverged over the iterations. The `plotCorrelations` function shows you\na boxplot of the correlations between imputed values in every\ncombination of datasets, at each iteration:\n\n``` r\nplotCorrelations(miceObj,vars='allNumeric')\n```\n\n![](README_files/figure-gfm/plotCorrelations-1.png)\u003c!-- --\u003e\n\nDifferent correlation measures can be plotted by specifying\n`factCorrMetric` and `numbCorrMetric`. See `?plotCorrelations` for more\ndetails.\n\n### Center and Dispersion Convergence\n\nSometimes, if the missing data locations are correlated with higher or\nlower values, we need to run multiple iterations for the process to\nconverge to the true theoretical mean (given the information that exists\nin the dataset).\n
We can see if the imputed data converged, or if we need\nto run more iterations:\n\n``` r\nplotVarConvergence(miceObj,vars='allNumeric')\n```\n\n![](README_files/figure-gfm/plotVarConvergence-1.png)\u003c!-- --\u003e\n\nIt doesn’t look like this dataset had a convergence issue. We wouldn’t\nexpect one, since we amputed the data above completely at random for\neach variable. When plotting categorical variables, the center and\ndispersion metrics plotted are the percent of the mode and the entropy,\nrespectively.\n\n### Model OOB Error\n\nRandom Forests give us a cheap way to determine model error without\ncross validation. Each model returns the OOB accuracy for\nclassification, and r-squared for regression. We can see how these\nconverged as the iterations progress:\n\n``` r\nplotModelError(miceObj,vars='allNumeric')\n```\n\n![](README_files/figure-gfm/plotModelError-1.png)\u003c!-- --\u003e\n\nIt looks like the variables were imputed with a reasonable degree of\naccuracy. That spike after the first iteration was due to the nature of\nhow the missing values are filled in before the models are run.\n\n### Variable Importance\n\nNow let’s plot the variable importance for each imputed variable. The\ntop axis contains the variable that was used to impute the variable on\nthe left axis.\n\n``` r\nplotVarImportance(miceObj)\n```\n\n![](README_files/figure-gfm/plotVarImportance-1.png)\u003c!-- --\u003e\n\nThe variable importance metric used is returned by ranger when\n`importance = 'impurity'`. Due to large possible variances in the\nreturned value, the data plotted here has been 0-1 scaled within each\nimputed variable. 
Use `display = 'Absolute'` to show unscaled variable\nimportance.\n\n### Imputed Variance Between Datasets\n\nWe are probably interested in how “certain” we were of our imputations.\nWe can get a feel for the variance experienced for each imputed value\nbetween the datasets by using the `plotImputationVariance()` function:\n\n``` r\nplotImputationVariance(miceObj,ncol=2,widths=c(5,3))\n```\n\n![](README_files/figure-gfm/plotImputationVariance-1.png)\u003c!-- --\u003e\n\nWhen plotting categorical data, the distribution of the number of unique\nimputed levels is compared to the theoretical distribution of unique\nlevels, given they were drawn randomly. You can see that most of the\nsamples had only 1 unique imputed value across our 8 datasets, which\nmeans that the imputation process was fairly ‘certain’ of that imputed\nclass. According to the graph, most of our samples would have drawn 3\ndifferent levels if the imputations had been drawn randomly for each\ndataset.  \nWhen plotting the variance of numeric features, the standard deviation\nof the imputed values is calculated for each sample. This is then\ncompared to the total population standard deviation. The percentage of\nsamples with a SD below the population SD is shaded in the densities\nabove, and the quantile is shown in the title.\n
The `iris` dataset tends\nto be full of correlation, so all of our imputations had a SD lower than\nthe population SD, however this will not always be the case.\n\n## Using the Imputed Data\n\nTo return the imputed data simply use the `completeData` function:\n\n``` r\ndataList \u003c- completeData(miceObj)\nhead(dataList[[1]],10)\n```\n\n    ##     Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n    ##  1:          5.1         3.5          1.3         0.2  setosa\n    ##  2:          4.9         3.0          1.4         0.2  setosa\n    ##  3:          4.7         3.2          1.3         0.2  setosa\n    ##  4:          4.6         3.1          1.5         0.2  setosa\n    ##  5:          5.0         3.6          1.4         0.2  setosa\n    ##  6:          5.4         3.9          1.7         0.4  setosa\n    ##  7:          5.1         3.4          1.4         0.3  setosa\n    ##  8:          5.0         3.4          1.5         0.2  setosa\n    ##  9:          4.4         2.9          1.4         0.2  setosa\n    ## 10:          4.9         3.1          1.5         0.1  setosa\n\nWe can see how the imputed data compares to the original data before it\nwas amputed:\n\n![](README_files/figure-gfm/impAccuracy-1.png)\u003c!-- --\u003e\n\nIt looks like most of our variables were imputed with a high degree of\naccuracy. Sepal.Width had a relatively poor Spearman correlation,\nhowever we expected this when we saw the results from `plotModelError()`\nabove.\n\n## Benchmarks\n\nScripts and more details on benchmarks can be found in\n[`benchmarks/`](https://github.com/FarrellDay/miceRanger/tree/master/benchmarks).\n\nUsing artificial data, the time and performance of miceRanger, mice\n(`method = \"rf\"`) and missForest were recorded. parlmice was used to run\n`mice` in parallel, and a parallel back end was set up for missForest.\nAll runs used 5 cores. 
miceRangerPar refers to miceRanger being run with\n`parallel = TRUE`.\n\n### Timing - Small and Medium Data\n\n\u003cimg src=\"benchmarks/graphics/timeBenchmarks.png\" width=\"650px\" /\u003e\n\n## The MICE Algorithm\n\nMultiple Imputation by Chained Equations ‘fills in’ (imputes) missing\ndata in a dataset through an iterative series of predictive models. In\neach iteration, each specified variable in the dataset is imputed using\nthe other variables in the dataset. These iterations should be run until\nit appears that convergence has been met.\n\n\u003cimg src=\"vignettes/MICEalgorithm.png\" style=\"display: block; margin: auto;\" /\u003e\n\nThis process is continued until all specified variables have been\nimputed. Additional iterations can be run if it appears that the average\nimputed values have not converged, although no more than 5 iterations\nare usually necessary.\n\n### Common Use Cases\n\n##### **Data Leakage:**\n\nMICE is particularly useful if missing values are associated with the\ntarget variable in a way that introduces leakage. For instance, let’s\nsay you wanted to model customer retention at the time of sign up. A\ncertain variable is collected at sign up or 1 month after sign up. The\nabsence of that variable is a data leak, since it tells you that the\ncustomer did not retain for 1 month.\n\n##### **Funnel Analysis:**\n\nInformation is often collected at different stages of a ‘funnel’. MICE\ncan be used to make educated guesses about the characteristics of\nentities at different points in a funnel.\n\n##### **Confidence Intervals:**\n\nMICE can be used to impute missing values, however it is important to\nkeep in mind that these imputed values are a prediction. Creating\nmultiple datasets with different imputed values allows you to do two\ntypes of inference:\n\n  - Imputed Value Distribution: A profile can be built for each imputed\n    value, allowing you to make statements about the likely distribution\n    of that value.  
\n  - Model Prediction Distribution: With multiple datasets, you can build\n    multiple models and create a distribution of predictions for each\n    sample. Samples whose values could not be imputed with much\n    confidence will have a larger variance in their predictions.\n\n### Predictive Mean Matching\n\n`miceRanger` can make use of a procedure called predictive mean matching\n(PMM) to select which values are imputed. PMM involves selecting a\ndatapoint from the original, nonmissing data that has a predicted value\nclose to the predicted value of the missing sample. The closest N\n(`meanMatchCandidates` parameter in `miceRanger()`) values are chosen as\ncandidates, from which a value is chosen at random. This can be\nspecified on a column-by-column basis in `miceRanger`. Going into more\ndetail from our example above, we see how this works in practice:\n\n\u003cimg src=\"vignettes/PMM.png\" style=\"display: block; margin: auto;\" /\u003e\n\nThis method is very useful if a variable that needs imputing has any of\nthe following characteristics:\n\n  - Multimodal  \n  - Integer  \n  - Skewed\n\n### Effects of Mean Matching\n\nAs an example, let’s construct a dataset with some of the above\ncharacteristics:\n\n``` r\n# random uniform variable\nnrws \u003c- 1000\ndat \u003c- data.table(Uniform_Variable = runif(nrws))\n\n# slightly bimodal variable correlated with Uniform_Variable\ndat$Close_Bimodal_Variable \u003c- sapply(\n    dat$Uniform_Variable\n  , function(x) sample(c(rnorm(1,-2),rnorm(1,2)),prob=c(x,1-x),size=1)\n) + dat$Uniform_Variable\n\n# very bimodal variable correlated with Uniform_Variable\ndat$Far_Bimodal_Variable \u003c- sapply(\n    dat$Uniform_Variable\n  , function(x) sample(c(rnorm(1,-3),rnorm(1,3)),prob=c(x,1-x),size=1)\n)\n\n# Highly skewed variable correlated with Uniform_Variable\ndat$Skewed_Variable \u003c- exp((dat$Uniform_Variable*runif(nrws)*3)) + runif(nrws)*3\n\n# Integer variable correlated with Close_Bimodal_Variable and Uniform_Variable\ndat$Integer_Variable \u003c- round(dat$Uniform_Variable + dat$Close_Bimodal_Variable/3 + runif(nrws)*2)\n\n# Ampute the data.\nampDat \u003c- amputeData(dat,0.2)\n\n# Plot the original data\nplot(dat)\n```\n\n![](README_files/figure-gfm/skewedData-1.png)\u003c!-- --\u003e\n\nWe can see how our variables are distributed and correlated in the graph\nabove. Now let’s run our imputation process twice, once using mean\nmatching, and once using the model prediction.\n\n``` r\nmrMeanMatch \u003c- miceRanger(ampDat,valueSelector = \"meanMatch\",verbose=FALSE)\nmrModelOutput \u003c- miceRanger(ampDat,valueSelector = \"value\",verbose=FALSE)\n```\n\nLet’s look at the effect on the different variables.\n\n#### Bimodal Variable\n\n\u003cimg src=\"vignettes/mmEffectsFarBimodal.png\" width=\"800px\" style=\"display: block; margin: auto;\" /\u003e\n\nThe effect of mean matching on our imputations is immediately apparent.\nIf we were only looking at model error, we may be inclined to use the\nPrediction Value, since it has a higher OOB R-Squared. However, we are\nleft with imputations that do not match our original distribution, and\ntherefore, do not behave like our original data.\n\n#### Skewed Variable\n\n\u003cimg src=\"vignettes/mmEffectsSkewed.png\" width=\"800px\" style=\"display: block; margin: auto;\" /\u003e\n\nWe see a similar occurrence in the skewed variable - the distribution\nof the values imputed with the Prediction Value is shifted towards the\nmean.\n\n#### Integer Variable\n\n\u003cimg src=\"vignettes/mmEffectsInteger.png\" width=\"800px\" style=\"display: block; margin: auto;\" /\u003e\n\nThe most obvious variable affected by mean matching was our integer\nvariable - using `valueSelector = 'value'` allows interpolation in the\nnumeric variables.\n
Using mean matching has allowed us to keep the\ndistribution and distinct values of the original data, without\nsacrificing accuracy.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffarrellday%2Fmiceranger","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffarrellday%2Fmiceranger","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffarrellday%2Fmiceranger/lists"}