https://github.com/tidymodels/filtro

Tidy tools to apply filter-based supervised feature selection methods

Host: GitHub
URL: https://github.com/tidymodels/filtro
Owner: tidymodels
License: other
Created: 2025-06-12T20:39:50.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-08-15T21:59:03.000Z (11 months ago)
Last Synced: 2025-08-15T23:35:49.303Z (11 months ago)
Language: R
Homepage: https://filtro.tidymodels.org/dev/
Size: 1.92 MB
Stars: 5
Watchers: 1
Forks: 0
Open Issues: 22
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE

README

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# filtro

[![R-CMD-check](https://github.com/tidymodels/filtro/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tidymodels/filtro/actions/workflows/R-CMD-check.yaml)

[![Codecov test coverage](https://codecov.io/gh/tidymodels/filtro/graph/badge.svg)](https://app.codecov.io/gh/tidymodels/filtro)

[![CRAN status](https://www.r-pkg.org/badges/version/filtro)](https://CRAN.R-project.org/package=filtro)

[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)

> ⚠️ **filtro is under active development; breaking changes may occur.**

## Overview

filtro provides tidy tools for applying filter-based supervised feature
selection methods. These methods score and rank the relevance of each feature
using metrics such as p-values, correlation, feature importance, and
information gain.

The package provides functions to rank features and select a top proportion or
number of them, using built-in methods and the
[desirability2](https://desirability2.tidymodels.org) package. It also
supports streamlined preprocessing, either standalone or as part of a
tidymodels workflow, for example through the
[recipes](https://recipes.tidymodels.org) package.

For a detailed introduction, please see [vignette("filtro")](https://filtro.tidymodels.org/dev/articles/filtro.html). 
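
The score-and-rank idea described above can be illustrated without filtro at all. The following is a minimal base R sketch, not filtro's implementation: it assumes absolute Pearson correlation as the score and uses the built-in `mtcars` data, with `mpg` as the outcome. The object names (`abs_cor`, `ranked`) are made up for this illustration.

```r
# Illustrative only: base R, not filtro's implementation.
# Score each numeric predictor by its absolute Pearson correlation with the
# outcome (mpg), then rank the predictors from highest to lowest score.
predictors <- setdiff(names(mtcars), "mpg")

abs_cor <- vapply(
  predictors,
  function(p) abs(cor(mtcars[[p]], mtcars$mpg)),
  numeric(1)
)

ranked <- data.frame(
  predictor = names(abs_cor),
  score = unname(abs_cor)
)
ranked <- ranked[order(ranked$score, decreasing = TRUE), ]
rownames(ranked) <- NULL

ranked
```

A filter method works on each predictor independently of any downstream model, which is why the scoring step reduces to a simple loop over columns.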

## Installation

Install the released version of filtro from [CRAN](https://CRAN.R-project.org) with:

``` r
install.packages("filtro")
```

Install the development version from GitHub with:

``` r
# install.packages("pak")
pak::pak("tidymodels/filtro")
```

## Feature selection methods

Currently, the implemented filters include:

1. ANOVA F-test 

2. Correlation

3. Random forest feature importance 

4. Information gain

5. Area under the ROC curve 

6. Cross tabulation (Chi-squared test and Fisher's exact test) 
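
As one concrete instance of the filters listed above, an ANOVA F-test filter can be sketched in base R. This is an assumption-laden illustration rather than filtro's code: it treats `mpg` from the built-in `mtcars` data as a numeric outcome, converts three columns to factors, and computes a one-way ANOVA p-value per predictor. The helper `aov_p_value()` is hypothetical.

```r
# Illustrative only: base R, not filtro's implementation.
# Treat three mtcars columns as categorical predictors of a numeric outcome.
dat <- transform(
  mtcars[, c("mpg", "cyl", "gear", "am")],
  cyl = factor(cyl),
  gear = factor(gear),
  am = factor(am)
)

# Hypothetical helper: p-value of the one-way ANOVA F-test for one predictor.
aov_p_value <- function(predictor, outcome, data) {
  model <- lm(reformulate(predictor, response = outcome), data = data)
  anova(model)[predictor, "Pr(>F)"]
}

p_values <- vapply(
  c("cyl", "gear", "am"),
  aov_p_value,
  numeric(1),
  outcome = "mpg",
  data = dat
)

# Smaller p-values suggest a stronger association with the outcome.
sort(p_values)
```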

## Scoring examples

```{r}
#| label: start
#| include: false
library(filtro)
library(desirability2)
library(dplyr)
library(modeldata)
```

```{r}
library(filtro)
library(desirability2)
library(dplyr)
library(modeldata)
```

```{r}
ames_subset <- modeldata::ames |>
  # Use a subset of data for demonstration
  dplyr::select(
    Sale_Price,
    MS_SubClass,
    MS_Zoning,
    Lot_Frontage,
    Lot_Area,
    Street
  )

ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))
```

```{r}
# ANOVA p-value
ames_aov_pval_res <-
  score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)

ames_aov_pval_res@results
```

```{r}
# Pearson correlation
ames_cor_pearson_res <-
  score_cor_pearson |>
  fit(Sale_Price ~ ., data = ames_subset)

ames_cor_pearson_res@results
```

```{r}
# Forest importance
ames_imp_rf_reg_res <-
  score_imp_rf |>
  fit(Sale_Price ~ ., data = ames_subset, seed = 42)

ames_imp_rf_reg_res@results
```

```{r}
# Information gain
ames_info_gain_reg_res <-
  score_info_gain |>
  fit(Sale_Price ~ ., data = ames_subset)

ames_info_gain_reg_res@results
```

## Filtering examples for score *singular*

```{r}
ames_aov_pval_res@results
```

```{r}
# Show best score, based on proportion of predictors
ames_aov_pval_res |> show_best_score_prop(prop_terms = 0.2)
```
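
To make the idea of keeping a proportion of predictors concrete, here is a small base R sketch. It is not how `show_best_score_prop()` is implemented; the helper `keep_top_prop()`, its arguments, and the rounding rule (round to the nearest count, but keep at least one predictor) are all assumptions chosen for illustration, and the scores are made up.

```r
# Hypothetical helper, not part of filtro: keep the best-scoring fraction
# of predictors from a named vector of scores.
keep_top_prop <- function(scores, prop, larger_is_better = TRUE) {
  stopifnot(is.numeric(scores), prop > 0, prop <= 1)
  scores <- scores[!is.na(scores)]
  # Assumption: round to the nearest count and always keep at least one.
  n_keep <- max(1L, as.integer(round(length(scores) * prop)))
  ranked <- sort(scores, decreasing = larger_is_better)
  head(ranked, n_keep)
}

# Made-up scores for five predictors.
scores <- c(a = 0.9, b = 0.1, c = 0.5, d = 0.7, e = 0.3)

keep_top_prop(scores, prop = 0.2)
keep_top_prop(scores, prop = 0.4)
```

With five predictors, a proportion of 0.2 keeps one of them under this rounding rule, and 0.4 keeps two.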

```{r}
# Fill safe value, then show best score
ames_aov_pval_res <- ames_aov_pval_res |> fill_safe_value()

ames_aov_pval_res |> show_best_score_prop(prop_terms = 0.2)
```
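
The chunk above fills in a "safe" value before selecting. One plausible reading, stated here as an assumption rather than a description of filtro's internals, is that predictors whose score could not be computed (a missing value) receive a fallback score so that they are not dropped by accident. The base R sketch below shows that idea with made-up scores; `fill_missing_scores()` and its `fallback` argument are hypothetical.

```r
# Assumption: a "safe" value stands in for a missing score so the predictor
# is retained rather than silently filtered out. Not filtro's implementation.
fill_missing_scores <- function(scores, fallback) {
  scores[is.na(scores)] <- fallback
  scores
}

# Made-up scores where larger is better; x2 could not be scored.
scores <- c(x1 = 2.1, x2 = NA, x3 = 0.4)

# Using the best possible value as the fallback keeps x2 at the top.
filled <- fill_missing_scores(scores, fallback = Inf)

sort(filled, decreasing = TRUE)
```

The appropriate fallback depends on the direction of the score: for a metric where smaller is better, the conservative choice would be the smallest possible value instead.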

## Filtering examples for scores *plural* 

```{r}
# Create a list
class_score_list <- list(
  ames_cor_pearson_res,
  ames_imp_rf_reg_res,
  ames_info_gain_reg_res
)
```

```{r}
# Fill safe values
ames_scores_results <- class_score_list |>
  fill_safe_values() |>
  # Remove outcome
  dplyr::select(-outcome)

ames_scores_results
```

```{r}
# Single and multi-parameter optimization using desirability functions

# Optimize correlation alone
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1)
  )

# Optimize correlation and forest importance
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf)
  )

# Optimize correlation, forest importance and information gain
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf),
    maximize(infogain)
  )

# Same as above, but retain only a proportion of predictors
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf),
    maximize(infogain),
    prop_terms = 0.2
  )

# Optimize toward a target
ames_scores_results |>
  show_best_desirability_prop(
    target(cor_pearson, low = 0.2, target = 0.255, high = 0.9)
  )

# Optimize with box constraints
ames_scores_results |>
  show_best_desirability_prop(
    constrain(cor_pearson, low = 0.2, high = 1)
  )
```
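
For readers new to desirability functions, the multi-score ranking above can be approximated by hand. The sketch below is a simplified base R illustration, not the desirability2 implementation: it assumes a linear "larger is better" function that maps each score onto a 0 to 1 scale, defaults to the observed range when no limits are given, and combines the per-score desirabilities with a geometric mean. The helper `d_max_linear()` and the score values are made up.

```r
# Simplified illustration, not the desirability2 implementation.
# Linear "larger is better" desirability: 0 at or below `low`,
# 1 at or above `high`, and linear in between.
d_max_linear <- function(x, low = min(x), high = max(x)) {
  pmin(pmax((x - low) / (high - low), 0), 1)
}

# Made-up scores for four predictors.
scores <- data.frame(
  predictor = c("a", "b", "c", "d"),
  cor = c(0.80, 0.10, 0.55, 0.30),
  imp = c(12, 3, 20, 1)
)

d_cor <- d_max_linear(scores$cor, low = 0, high = 1)
d_imp <- d_max_linear(scores$imp) # assumption: limits from the observed range

# Overall desirability as the geometric mean of the individual values.
# A zero on any single score gives an overall desirability of zero.
scores$overall <- exp((log(d_cor) + log(d_imp)) / 2)

scores[order(scores$overall, decreasing = TRUE), ]
```

The geometric mean is what makes the combination strict: a predictor that is unacceptable on one score cannot be rescued by a high value on another.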

## Contributing

Please note that the filtro project is released with a [Contributor Code of Conduct](https://filtro.tidymodels.org/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.

- For questions and discussions about tidymodels packages, modeling, and machine learning, please [post on Posit Community](https://forum.posit.co/new-topic?category_id=15&tags=tidymodels,question).

- If you think you have encountered a bug, please [submit an issue](https://github.com/tidymodels/filtro/issues).

- Either way, learn how to create and share a [reprex](https://reprex.tidyverse.org/articles/articles/learn-reprex.html) (a minimal, reproducible example), to clearly communicate about your code.

- Check out further details on [contributing guidelines for tidymodels packages](https://www.tidymodels.org/contribute/) and [how to get help](https://www.tidymodels.org/help/).
