{"id":27362044,"url":"https://github.com/tidymodels/important","last_synced_at":"2025-09-08T15:41:44.237Z","repository":{"id":257867890,"uuid":"873019677","full_name":"tidymodels/important","owner":"tidymodels","description":"Tools for Measuring Predictor Importance","archived":false,"fork":false,"pushed_at":"2025-02-22T16:21:19.000Z","size":387,"stargazers_count":12,"open_issues_count":5,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-13T02:14:41.907Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://tidymodels.github.io/important/","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tidymodels.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-15T13:24:36.000Z","updated_at":"2025-02-22T16:17:03.000Z","dependencies_parsed_at":"2024-12-18T01:31:21.535Z","dependency_job_id":"3442ea17-240c-4816-90d5-15a8ab650746","html_url":"https://github.com/tidymodels/important","commit_stats":null,"previous_names":["topepo/important","tidymodels/important"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tidymodels%2Fimportant","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tidymodels%2Fimportant/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tidymodels%2Fimportant/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tidymodels%2Fimportant/manifests","owner_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/owners/tidymodels","download_url":"https://codeload.github.com/tidymodels/important/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248654095,"owners_count":21140236,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-13T02:15:43.108Z","updated_at":"2025-09-08T15:41:44.219Z","avatar_url":"https://github.com/tidymodels.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\noutput: github_document\n---\n\n\u003c!-- README.md is generated from README.Rmd. Please edit that file --\u003e\n\n```{r, include = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"man/figures/README-\",\n  out.width = \"100%\"\n)\n```\n\n# important\n\n\u003c!-- badges: start --\u003e\n[![R-CMD-check](https://github.com/tidymodels/important/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tidymodels/important/actions/workflows/R-CMD-check.yaml)\n[![Codecov test coverage](https://codecov.io/gh/tidymodels/important/graph/badge.svg)](https://app.codecov.io/gh/tidymodels/important)\n\u003c!-- badges: end --\u003e\n\nThe important package has a succinct interface for obtaining estimates of predictor importance with tidymodels objects. A few of the main features: \n\n- Any performance metrics from the yardstick package can be used. \n- Importance can be calculated for either the original columns or at the level of any derived model terms created during feature engineering. 
\n- The computations that loop across permutation iterations and predictor columns are easily parallelized. \n- The results are returned in a tidy format. \n\nThere are also recipe steps for supervised feature selection: \n\n- `step_predictor_retain()` can filter the predictors using a single conditional statement (e.g., absolute correlation with the outcome \u003e 0.75). \n- `step_predictor_best()` can retain the most important predictors for the outcome using a single scoring function. \n- `step_predictor_desirability()` retains the most important predictors for the outcome using multiple scoring functions, blended using desirability functions.  \n\nThe latter two steps can be tuned over the proportion of predictors to be retained.  \n\n## Installation\n\nYou can install the development version of important from [GitHub](https://github.com/) with:\n\n``` r\n# install.packages(\"pak\")\npak::pak(\"tidymodels/important\")\n```\n\n## Do we really need another package that computes variable importances?\n\nThe main reason for making important is censored regression models. tidymodels released tools for fitting and qualifying models that have censored outcomes. This included some dynamic performance metrics that were evaluated at different time points. This was a substantial change for us, and it would have been even more challenging to add to other packages. \n\n## Variable Importance Example\n\nLet's look at an analysis that models [food delivery times](https://aml4td.org/chapters/whole-game.html#sec-delivery-times). The outcome is the time between an order being placed and the delivery (all data are complete - there is no censoring). We model this in terms of the order day/time, the distance to the restaurant, and which items are contained in the order. Exploratory data analysis shows several nonlinear trends in the data and some interactions between these trends.  \n\nWe'll load the tidymodels and important packages to get started. 
\n\n```{r}\n#| label: startup-sshh\n#| include: false\nlibrary(tidymodels)\nlibrary(important)\ntheme_set(theme_bw())\n```\n\n```{r}\n#| label: startup\nlibrary(tidymodels)\nlibrary(important)\n```\n\nThe data are split into training, validation, and testing sets. \n\n```{r}\n#| label: food-data\ndata(deliveries, package = \"modeldata\")\n\nset.seed(991)\ndelivery_split \u003c- initial_validation_split(deliveries, prop = c(0.6, 0.2), strata = time_to_delivery)\ndelivery_train \u003c- training(delivery_split)\n```\n\nThe model uses a recipe with spline terms for the hour and distance. The nonlinear trend over the time of the order changes with the day, so we added interactions between these two sets of terms. Finally, a simple linear regression model is used for estimation:  \n\n```{r}\n#| label: model\ndelivery_rec \u003c- \n  recipe(time_to_delivery ~ ., data = delivery_train) |\u003e \n  step_dummy(all_factor_predictors()) |\u003e \n  step_zv(all_predictors()) |\u003e \n  step_spline_natural(hour, distance, deg_free = 10) |\u003e \n  step_interact(~ starts_with(\"hour_\"):starts_with(\"day_\"))\n\nlm_wflow \u003c- workflow(delivery_rec, linear_reg())\nlm_fit \u003c- fit(lm_wflow, delivery_train)\n```\n\nFirst, let’s capture the effect of the individual model terms. These terms are from the derived features in the model, such as dummy variables, spline terms, interaction columns, etc. \n\n```{r}\n#| label: derived-importance\nset.seed(382)\nlm_deriv_imp \u003c- \n  importance_perm(\n    lm_fit,\n    data = delivery_train,\n    metrics = metric_set(mae, rsq),\n    times = 50,\n    type = \"derived\"\n  )\nlm_deriv_imp\n```\n\nUsing mean absolute error as the metric of interest, the top 5 features are: \n\n```{r}\nlm_deriv_imp |\u003e \n\tfilter(.metric == \"mae\") |\u003e \n\tslice_max(importance, n = 5)\n```\n\nTwo notes: \n\n- The importance scores are the ratio of the mean change in performance and the associated standard error. 
The mean value is always increasing with importance, no matter which direction is preferred for the specific metric(s). \n\n- We can run these in parallel by loading the future package and specifying a parallel backend using the `plan()` function.  \n\nThere is a plot method that can help visualize the results: \n\n```{r}\n#| label: derived-plot\n#| fig.height: 8\n\nautoplot(lm_deriv_imp, top = 50)\n```\n\nSince there are spline terms and interactions for the hour column, we might not care about the importance of a term such as `hour_06` (the sixth spline feature). In aggregate, we might want to know the effect of the original predictor columns. The `type` option is used for this purpose:\n\n```{r}\n#| label: original-importance\nset.seed(382)\nlm_orig_imp \u003c- \n\timportance_perm(\n\t\tlm_fit,\n\t\tdata = delivery_train,\n\t\tmetrics = metric_set(mae, rsq),\n\t\ttimes = 50,\n\t\ttype = \"original\"\n\t)\n\n# Top five: \nlm_orig_imp |\u003e \n\tfilter(.metric == \"mae\") |\u003e \n\tslice_max(importance, n = 5)\n```\n\n```{r}\n#| label: original-plot\n\nautoplot(lm_orig_imp)\n```\n\n## Supervised Feature Selection Example\n\nUsing the same dataset, let's illustrate the most common tool for filtering predictors: using random forest importance scores. \n\nimportant can use any of the \"scoring functions\" from the [filtro](https://filtro.tidymodels.org/) package. 
You can supply one, along with the proportion of the predictors to retain: \n\n```{r}\n#| label: select-top\nset.seed(491)\nselection_rec \u003c- \n\trecipe(time_to_delivery ~ ., data = delivery_train) |\u003e \n\tstep_predictor_best(all_predictors(), score = \"imp_rf\", prop_terms = 1/4) |\u003e \n\tstep_dummy(all_factor_predictors()) |\u003e \n\tstep_zv(all_predictors()) |\u003e \n\tstep_spline_natural(any_of(c(\"hour\", \"distance\")), deg_free = 10) |\u003e \n\tstep_interact(~ starts_with(\"hour_\"):starts_with(\"day_\")) |\u003e \n\tprep()\nselection_rec\n```\n\nA list of possible scores is contained in the help page for the recipe steps. \n\nNote that we changed the selectors in `step_spline_natural()` to use `any_of()` instead of specific names. Any step downstream of any filtering steps should be generalized so that there is no failure if the columns were removed. Using `any_of()` selects these two columns _if they still remain in the data_.  \n\nWhich predictors were removed? \n\n```{r}\n#| label: tidy-filter\nselection_res \u003c- \n\ttidy(selection_rec, number = 1) |\u003e \n\tarrange(desc(score))\n\nselection_res\n\nmean(selection_res$removed)\n```\n\nThis example shows the basic usage of the recipe. In practice, we would probably do things differently: \n\n - This step would be included in a workflow so that it is coupled to a model. \n - It would be a good idea to optimize how much selection is done by setting `prop_terms = tune()` in the step and using one of the tuning functions to find a good proportion. \n \n*Inappropriate* use of these selection steps occurs when they are used before the data are split or outside of a resampling step.  \n\n## Code of Conduct\n  \nPlease note that the important project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/1/CODE_OF_CONDUCT.html). 
By contributing to this project, you agree to abide by its terms.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftidymodels%2Fimportant","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftidymodels%2Fimportant","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftidymodels%2Fimportant/lists"}