{"id":13665868,"url":"https://github.com/ck37/varimpact","last_synced_at":"2026-05-18T10:17:06.643Z","repository":{"id":37580332,"uuid":"59684252","full_name":"ck37/varimpact","owner":"ck37","description":"Variable importance through targeted causal inference, with Alan Hubbard","archived":false,"fork":false,"pushed_at":"2025-06-05T15:14:41.000Z","size":669,"stargazers_count":57,"open_issues_count":12,"forks_count":12,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-06-05T15:37:53.651Z","etag":null,"topics":["causal-inference","cv-tmle","observational-study","targeted-learning","tmle","variable-importance"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ck37.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2016-05-25T17:37:25.000Z","updated_at":"2024-04-30T12:40:53.000Z","dependencies_parsed_at":"2022-08-29T09:42:14.550Z","dependency_job_id":"f0c6daab-9c0f-4fc7-a00e-bbe6f25d8808","html_url":"https://github.com/ck37/varimpact","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ck37/varimpact","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ck37%2Fvarimpact","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ck37%2Fvarimpact/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ck37%2Fvarimpact/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ck37%2Fvarimpact/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ck37","download_url":"https:
//codeload.github.com/ck37/varimpact/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ck37%2Fvarimpact/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33174144,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-18T09:27:30.708Z","status":"ssl_error","status_checked_at":"2026-05-18T09:27:28.300Z","response_time":71,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["causal-inference","cv-tmle","observational-study","targeted-learning","tmle","variable-importance"],"created_at":"2024-08-02T06:00:52.558Z","updated_at":"2026-05-18T10:17:01.635Z","avatar_url":"https://github.com/ck37.png","language":"R","funding_links":[],"categories":["Causal Effect Estimation","R"],"sub_categories":["With i.i.d Data"],"readme":"\n\u003c!-- README.md is generated from README.Rmd. 
Please edit that file --\u003e\n\n# varimpact - variable importance through causal inference\n\n[![Build\nStatus](https://travis-ci.org/ck37/varimpact.svg?branch=master)](https://travis-ci.org/ck37/varimpact)\n[![AppVeyor Build\nStatus](https://ci.appveyor.com/api/projects/status/github/ck37/varimpact?branch=master\u0026svg=true)](https://ci.appveyor.com/project/ck37/varimpact)\n[![codecov](https://codecov.io/gh/ck37/varimpact/branch/master/graph/badge.svg)](https://codecov.io/gh/ck37/varimpact)\n\n## Summary\n\nvarimpact uses causal inference statistics to generate variable\nimportance estimates for a given dataset and outcome. It answers the\nquestion: which of my Xs are most related to my Y? Each variable’s\ninfluence on the outcome is estimated semiparametrically, without\nassuming a linear relationship or other functional form, and the\ncovariate list is ranked by order of importance. This can be used for\nexploratory data analysis, for dimensionality reduction, for\nexperimental design (e.g. to determine blocking and re-randomization),\nto reduce variance in an estimation procedure, etc. See Hubbard,\nKennedy, \u0026 van der Laan (2018) for more details, or Hubbard \u0026 van der\nLaan (2016) for an earlier description.\n\n## Details\n\nEach covariate is analyzed using targeted minimum loss-based estimation\n([TMLE](https://CRAN.R-project.org/package=tmle)) as though it were a\ntreatment, with all other variables serving as adjustment variables via\n[SuperLearner](https://github.com/ecpolley/SuperLearner). Then the\nstatistical significance of the estimated treatment effect for each\ncovariate determines the variable importance ranking. This formulation\nallows the asymptotics of TMLE to provide valid standard errors and\np-values, unlike other variable importance algorithms.\n\nThe results provide raw p-values as well as p-values adjusted for false\ndiscovery rate using the Benjamini-Hochberg (1995) procedure. 
Adjustment\nvariables are automatically clustered hierarchically using HOPACH (van\nder Laan \u0026 Pollard 2003) in order to reduce dimensionality. The package\nsupports multi-core and multi-node parallelization, which are detected\nand used automatically when a parallel backend is registered. Missing\nvalues are automatically imputed using K-nearest neighbors (Troyanskaya\net al. 2001, Jerez et al. 2010) and missingness indicator variables are\nincorporated into the analysis.\n\nvarimpact is under active development so please submit any bug reports\nor feature requests to the [issue\nqueue](https://github.com/ck37/varimpact/issues), or email Alan and/or\nChris directly.\n\n## Installation\n\n### GitHub\n\n``` r\n# Install remotes if necessary:\n# install.packages(\"remotes\")\nremotes::install_github(\"ck37/varimpact\")\n```\n\n### CRAN\n\nForthcoming fall 2022\n\n## Examples\n\n### Example: basic functionality\n\n``` r\nlibrary(varimpact)\n#\u003e Loading required package: SuperLearner\n#\u003e Loading required package: nnls\n#\u003e Super Learner\n#\u003e Version: 2.0-27-9000\n#\u003e Package created on 2021-03-28\n\n####################################\n# Create test dataset.\nset.seed(1, \"L'Ecuyer-CMRG\")\nN \u003c- 300\nnum_normal \u003c- 5\nX \u003c- as.data.frame(matrix(rnorm(N * num_normal), N, num_normal))\nY \u003c- rbinom(N, 1, plogis(.2*X[, 1] + .1*X[, 2] - .2*X[, 3] + .1*X[, 3]*X[, 4] - .2*abs(X[, 4])))\n# Add some missing data to X so we can test imputation.\nfor (i in 1:10) X[sample(nrow(X), 1), sample(ncol(X), 1)] \u003c- NA\n\n####################################\n# Basic example\nvim \u003c- varimpact(Y = Y, data = X)\n#\u003e Finished pre-processing variables.\n#\u003e \n#\u003e Processing results:\n#\u003e - Factor variables: 0 \n#\u003e - Numeric variables: 5 \n#\u003e \n#\u003e No factor variables - skip VIM estimation.\n#\u003e \n#\u003e Estimating variable importance for 5 numerics.\n\n# Review consistent and significant 
results.\nvim\n#\u003e No significant and consistent results.\n#\u003e All results:\n#\u003e       Type    Estimate              CI95    P-value Adj. p-value   Est. RR\n#\u003e V4 Ordered  0.17058432 (-0.0518 - 0.393) 0.06639069    0.3319535 1.3174241\n#\u003e V1 Ordered  0.03831094  (-0.158 - 0.234) 0.35081119    0.8770280 1.0724707\n#\u003e V3 Ordered -0.05171291  (-0.339 - 0.235) 0.63807247    0.8835731 0.9673808\n#\u003e V2 Ordered -0.06678388  (-0.307 - 0.174) 0.70685848    0.8835731 0.9305320\n#\u003e V5 Ordered -0.12419619 (-0.304 - 0.0561) 0.91152962    0.9115296 0.8468485\n#\u003e           CI95 RR P-value RR Adj. p-value RR Consistent\n#\u003e V4 (0.953 - 1.82)  0.0474590       0.2372950       TRUE\n#\u003e V1 (0.446 - 2.58)  0.4379413       0.8101835       TRUE\n#\u003e V3 (0.654 - 1.43)  0.5658283       0.8101835      FALSE\n#\u003e V2 (0.642 - 1.35)  0.6481468       0.8101835      FALSE\n#\u003e V5 (0.634 - 1.13)  0.8696742       0.8696742      FALSE\n\n# Look at all results.\nvim$results_all\n#\u003e       Type    Estimate              CI95    P-value Adj. p-value   Est. RR\n#\u003e V4 Ordered  0.17058432 (-0.0518 - 0.393) 0.06639069    0.3319535 1.3174241\n#\u003e V1 Ordered  0.03831094  (-0.158 - 0.234) 0.35081119    0.8770280 1.0724707\n#\u003e V3 Ordered -0.05171291  (-0.339 - 0.235) 0.63807247    0.8835731 0.9673808\n#\u003e V2 Ordered -0.06678388  (-0.307 - 0.174) 0.70685848    0.8835731 0.9305320\n#\u003e V5 Ordered -0.12419619 (-0.304 - 0.0561) 0.91152962    0.9115296 0.8468485\n#\u003e           CI95 RR P-value RR Adj. 
p-value RR Consistent\n#\u003e V4 (0.953 - 1.82)  0.0474590       0.2372950       TRUE\n#\u003e V1 (0.446 - 2.58)  0.4379413       0.8101835       TRUE\n#\u003e V3 (0.654 - 1.43)  0.5658283       0.8101835      FALSE\n#\u003e V2 (0.642 - 1.35)  0.6481468       0.8101835      FALSE\n#\u003e V5 (0.634 - 1.13)  0.8696742       0.8696742      FALSE\n\n# Plot the V2 impact.\nplot_var(\"V2\", vim)\n```\n\n![](images/README-example_1-1.png)\u003c!-- --\u003e\n\n``` r\n\n# Generate latex tables with results.\nexportLatex(vim)\n#\u003e NULL\n\n# Clean up - will get a warning if there were no consistent results.\nsuppressWarnings({\n  file.remove(c(\"varimpByFold.tex\", \"varImpAll.tex\", \"varimpConsistent.tex\"))\n})\n#\u003e [1]  TRUE  TRUE FALSE\n```\n\n### Example: customize outcome and propensity score estimation\n\n``` r\nQ_lib = c(\"SL.mean\", \"SL.glmnet\", \"SL.ranger\", \"SL.rpartPrune\")\ng_lib = c(\"SL.mean\", \"SL.glmnet\")\nset.seed(1, \"L'Ecuyer-CMRG\")\n(vim = varimpact(Y = Y, data = X, Q.library = Q_lib, g.library = g_lib))\n#\u003e Finished pre-processing variables.\n#\u003e \n#\u003e Processing results:\n#\u003e - Factor variables: 0 \n#\u003e - Numeric variables: 5 \n#\u003e \n#\u003e No factor variables - skip VIM estimation.\n#\u003e \n#\u003e Estimating variable importance for 5 numerics.\n#\u003e No significant and consistent results.\n#\u003e All results:\n#\u003e       Type    Estimate               CI95   P-value Adj. p-value   Est. RR\n#\u003e V4 Ordered -0.02595001    (-0.25 - 0.198) 0.5897958    0.9863267 0.9926791\n#\u003e V3 Ordered -0.12688688   (-0.391 - 0.137) 0.8267887    0.9863267 0.8304033\n#\u003e V2 Ordered -0.11547591   (-0.355 - 0.124) 0.8277832    0.9863267 0.8529795\n#\u003e V1 Ordered -0.17014276  (-0.397 - 0.0571) 0.9288760    0.9863267 0.6823365\n#\u003e V5 Ordered -0.19094845 (-0.361 - -0.0213) 0.9863267    0.9863267 0.6210945\n#\u003e           CI95 RR P-value RR Adj. 
p-value RR Consistent\n#\u003e V4 (0.719 - 1.37)  0.5177707       0.9838749      FALSE\n#\u003e V3  (0.55 - 1.25)  0.7944201       0.9838749      FALSE\n#\u003e V2 (0.584 - 1.25)  0.8117625       0.9838749      FALSE\n#\u003e V1  (0.44 - 1.06)  0.9560802       0.9838749       TRUE\n#\u003e V5 (0.402 - 0.96)  0.9838749       0.9838749      FALSE\n```\n\n### Example: parallel via multicore\n\n``` r\nlibrary(future)\nplan(\"multisession\")\nvim = varimpact(Y = Y, data = X)\n#\u003e Finished pre-processing variables.\n#\u003e \n#\u003e Processing results:\n#\u003e - Factor variables: 0 \n#\u003e - Numeric variables: 5 \n#\u003e \n#\u003e No factor variables - skip VIM estimation.\n#\u003e \n#\u003e Estimating variable importance for 5 numerics.\n```\n\n### Example: parallel via SNOW\n\n``` r\nlibrary(RhpcBLASctl)\n# Detect the number of physical cores on this computer using RhpcBLASctl.\ncl = parallel::makeCluster(get_num_cores())\nplan(\"cluster\", workers = cl)\nvim = varimpact(Y = Y, data = X)\n#\u003e Finished pre-processing variables.\n#\u003e \n#\u003e Processing results:\n#\u003e - Factor variables: 0 \n#\u003e - Numeric variables: 5 \n#\u003e \n#\u003e No factor variables - skip VIM estimation.\n#\u003e \n#\u003e Estimating variable importance for 5 numerics.\nparallel::stopCluster(cl)\n```\n\n### Example: mlbench breast cancer\n\n``` r\ndata(BreastCancer, package = \"mlbench\")\ndata = BreastCancer\n\n# Create a numeric outcome variable.\ndata$Y = as.integer(data$Class == \"malignant\")\n\n# Use multicore parallelization to speed up processing.\nplan(\"multisession\")\n(vim = varimpact(Y = data$Y, data = subset(data, select = -c(Y, Class, Id))))\n#\u003e Finished pre-processing variables.\n#\u003e \n#\u003e Processing results:\n#\u003e - Factor variables: 9 \n#\u003e - Numeric variables: 0 \n#\u003e \n#\u003e Estimating variable importance for 9 factors.\n#\u003e Significant and consistent results:\n#\u003e                Type  Estimate            CI95      
P-value Adj. p-value\n#\u003e Bare.nuclei  Factor 0.6284939 (0.503 - 0.754) 0.000000e+00 0.000000e+00\n#\u003e Mitoses      Factor 0.4097166 (0.336 - 0.483) 0.000000e+00 0.000000e+00\n#\u003e Cl.thickness Factor 0.5344847 (0.378 - 0.691) 1.040124e-11 2.340278e-11\n#\u003e Cell.size    Factor 0.5577438 (0.386 - 0.729) 8.920165e-11 1.605630e-10\n#\u003e               Est. RR       CI95 RR   P-value RR Adj. p-value RR\n#\u003e Bare.nuclei  3.697125 (2.15 - 6.35) 0.000000e+00    0.000000e+00\n#\u003e Mitoses      2.095869 (1.85 - 2.37) 7.227108e-12    3.252199e-11\n#\u003e Cl.thickness 3.103819 (2.23 - 4.31) 4.128421e-07    9.288948e-07\n#\u003e Cell.size    3.326385 (1.93 - 5.73) 1.062140e-06    1.911853e-06\nplot_var(\"Mitoses\", vim)\n```\n\n![](images/README-example_5-1.png)\u003c!-- --\u003e\n\n## Authors\n\nAlan E. Hubbard and Chris J. Kennedy, University of California, Berkeley\n\n## References\n\nBenjamini, Y., \u0026 Hochberg, Y. (1995). Controlling the false discovery\nrate: a practical and powerful approach to multiple testing. Journal of\nthe royal statistical society. Series B (Methodological), 289-300.\n\nGruber, S., \u0026 van der Laan, M. J. (2012). tmle: An R Package for\nTargeted Maximum Likelihood Estimation. Journal of Statistical Software,\n51(i13).\n\nHubbard, A. E., Kennedy, C. J., van der Laan, M. J. (2018).\nData-adaptive target parameters. In M. van der Laan \u0026 S. Rose (2018)\nTargeted Learning in Data Science. Springer.\n\nHubbard, A. E., Kherad-Pajouh, S., \u0026 van der Laan, M. J. (2016).\nStatistical Inference for Data Adaptive Target Parameters. The\ninternational journal of biostatistics, 12(1), 3-19.\n\nHubbard, A., Munoz, I. D., Decker, A., Holcomb, J. B., Schreiber, M. A.,\nBulger, E. M., … \u0026 Rahbar, M. H. (2013). Time-Dependent Prediction and\nEvaluation of Variable Importance Using SuperLearning in High\nDimensional Clinical Data. The journal of trauma and acute care surgery,\n75(1 0 1), S53.\n\nHubbard, A. 
E., \u0026 van der Laan, M. J. (2016). Mining with inference:\ndata-adaptive target parameters (pp. 439-452). In P. Bühlmann et\nal. (Ed.), Handbook of Big Data. CRC Press, Taylor \u0026 Francis Group, LLC:\nBoca Raton, FL.\n\nJerez, J. M., Molina, I., García-Laencina, P. J., Alba, E., Ribelles,\nN., Martín, M., \u0026 Franco, L. (2010). Missing data imputation using\nstatistical and machine learning methods in a real breast cancer\nproblem. Artificial intelligence in medicine, 50(2), 105-115.\n\nRozenholc, Y., Mildenberger, T., \u0026 Gather, U. (2010). Combining regular\nand irregular histograms by penalized likelihood. Computational\nStatistics \u0026 Data Analysis, 54(12), 3313-3323.\n\nTroyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T.,\nTibshirani, R., Botstein, D., \u0026 Altman, R. B. (2001). Missing value\nestimation methods for DNA microarrays. Bioinformatics, 17(6), 520-525.\n\nvan der Laan, M. J. (2006). Statistical inference for variable\nimportance. The International Journal of Biostatistics, 2(1).\n\nvan der Laan, M. J., \u0026 Pollard, K. S. (2003). A new algorithm for hybrid\nhierarchical clustering with visualization and the bootstrap. Journal of\nStatistical Planning and Inference, 117(2), 275-303.\n\nvan der Laan, M. J., Polley, E. C., \u0026 Hubbard, A. E. (2007). Super\nlearner. Statistical applications in genetics and molecular biology,\n6(1).\n\nvan der Laan, M. J., \u0026 Rose, S. (2011). Targeted learning: causal\ninference for observational and experimental data. Springer Science \u0026\nBusiness Media.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fck37%2Fvarimpact","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fck37%2Fvarimpact","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fck37%2Fvarimpact/lists"}