---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# filtro

[![R-CMD-check](https://github.com/tidymodels/filtro/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tidymodels/filtro/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/tidymodels/filtro/graph/badge.svg)](https://app.codecov.io/gh/tidymodels/filtro)
[![CRAN status](https://www.r-pkg.org/badges/version/filtro)](https://CRAN.R-project.org/package=filtro)
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)

> ⚠️ **filtro is under active development; breaking changes may occur.**

## Overview

filtro provides tidy tools for applying filter-based supervised feature
selection methods. These methods score and rank features by relevance
using metrics such as p-values, correlation, feature importance, and
information gain.

The package provides functions to rank features and select a top proportion
or number of them, using built-in methods and the
[desirability2](https://desirability2.tidymodels.org) package. It
supports streamlined preprocessing, either standalone or within tidymodels
workflows, for example via the [recipes](https://recipes.tidymodels.org) package.

For a detailed introduction, please see [vignette("filtro")](https://filtro.tidymodels.org/dev/articles/filtro.html).

## Installation

Install the released version of filtro from [CRAN](https://CRAN.R-project.org) with:

``` r
install.packages("filtro")
```

Install the development version from GitHub with:

``` r
# install.packages("pak")
pak::pak("tidymodels/filtro")
```

## Feature selection methods

Currently, the implemented filters include:

1. ANOVA F-test

2. Correlation

3. Random forest feature importance

4. Information gain

5. Area under the ROC curve

6. Cross tabulation (Chi-squared test and Fisher's exact test)
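
To make the idea of a filter score concrete, here is a minimal base R sketch of two of the scores listed above, computed by hand on the built-in `mtcars` data. It is an illustration only and does not use or reproduce filtro's implementation; the helper `aov_pval()` is a hypothetical name introduced for this example.

```{r}
# Conceptual sketch in base R; NOT filtro's implementation.

# Pearson correlation between each numeric predictor and a numeric outcome
cor_scores <- vapply(
  mtcars[, c("wt", "hp", "qsec")],
  function(x) stats::cor(x, mtcars$mpg),
  numeric(1)
)

# ANOVA F-test p-value for a factor predictor against a numeric outcome.
# `aov_pval()` is a hypothetical helper, not part of any package.
aov_pval <- function(x, y) {
  fit <- stats::lm(y ~ x)
  # The first row of the ANOVA table corresponds to the predictor `x`
  stats::anova(fit)[["Pr(>F)"]][1]
}
pval_cyl <- aov_pval(factor(mtcars$cyl), mtcars$mpg)

cor_scores
pval_cyl
```

Each predictor is scored independently of the others, which is what distinguishes filter methods from approaches that evaluate subsets of predictors through a fitted model.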

## Scoring examples

```{r}
#| label: start
#| include: false
library(filtro)
library(desirability2)
library(dplyr)
library(modeldata)
```

```{r}
library(filtro)
library(desirability2)
library(dplyr)
library(modeldata)
```

```{r}
# Use a subset of data for demonstration
ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price,
    MS_SubClass,
    MS_Zoning,
    Lot_Frontage,
    Lot_Area,
    Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))
```

```{r}
# ANOVA p-value
ames_aov_pval_res <-
  score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results
```

```{r}
# Pearson correlation
ames_cor_pearson_res <-
  score_cor_pearson |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_cor_pearson_res@results
```

```{r}
# Random forest importance
ames_imp_rf_reg_res <-
  score_imp_rf |>
  fit(Sale_Price ~ ., data = ames_subset, seed = 42)
ames_imp_rf_reg_res@results
```

```{r}
# Information gain
ames_info_gain_reg_res <-
  score_info_gain |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_info_gain_reg_res@results
```

## Filtering examples for score *singular*

```{r}
ames_aov_pval_res@results
```

```{r}
# Show best score, based on proportion of predictors
ames_aov_pval_res |> show_best_score_prop(prop_terms = 0.2)
```

```{r}
# Fill safe value, then show best score
ames_aov_pval_res <- ames_aov_pval_res |> fill_safe_value()
ames_aov_pval_res |> show_best_score_prop(prop_terms = 0.2)
```
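
As a rough illustration of what selecting a top proportion of predictors means, the base R sketch below ranks a named vector of scores and keeps a fraction of them. The scores, the helper name `top_prop()`, and the rounding rule (round down, keep at least one) are all assumptions made for this example; filtro's own handling of ties, rounding, and score direction may differ.

```{r}
# Conceptual sketch in base R; NOT filtro's implementation.

# Hypothetical scores for five predictors, where larger is better
scores <- c(a = 0.10, b = 0.85, c = 0.40, d = 0.05, e = 0.60)

# `top_prop()` is a hypothetical helper: keep the best `prop` share of scores.
# Assumption: the count is rounded down, and at least one score is kept.
top_prop <- function(scores, prop, decreasing = TRUE) {
  n_keep <- max(1L, floor(length(scores) * prop))
  ranked <- sort(scores, decreasing = decreasing)
  utils::head(ranked, n_keep)
}

# Keep the top 40% of five predictors, i.e. two of them
top_prop(scores, prop = 0.4)
```

For scores where smaller is better, such as raw p-values, the same helper can be called with `decreasing = FALSE`.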

## Filtering examples for scores *plural*

```{r}
# Create a list
class_score_list <- list(
  ames_cor_pearson_res,
  ames_imp_rf_reg_res,
  ames_info_gain_reg_res
)
```

```{r}
# Fill safe values
ames_scores_results <- class_score_list |>
  fill_safe_values() |>
  # Remove outcome
  dplyr::select(-outcome)
ames_scores_results
```

```{r}
# Single and multi-parameter optimization using desirability functions
# Optimize correlation alone
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1)
  )

# Optimize correlation and forest importance
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf)
  )

# Optimize correlation, forest importance and information gain
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf),
    maximize(infogain)
  )

# Same as above, but retain only a proportion of predictors
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf),
    maximize(infogain),
    prop_terms = 0.2
  )

# Optimize toward a target
ames_scores_results |>
  show_best_desirability_prop(
    target(cor_pearson, low = 0.2, target = 0.255, high = 0.9)
  )

# Optimize with box constraints
ames_scores_results |>
  show_best_desirability_prop(
    constrain(cor_pearson, low = 0.2, high = 1)
  )
```
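
The idea behind combining several scores can be sketched in base R: each score is mapped onto a 0 to 1 desirability scale, and the per-score desirabilities are merged into one overall value, here with a geometric mean. This is a simplified illustration and not the desirability2 implementation; the helper `d_linear_max()`, the toy data, and the choice of the observed minimum and maximum as bounds for the second score are assumptions made for this example.

```{r}
# Conceptual sketch in base R; NOT the desirability2 implementation.

# `d_linear_max()` is a hypothetical helper for a linear "larger is better"
# desirability: 0 at or below `low`, 1 at or above `high`.
d_linear_max <- function(x, low, high) {
  pmin(pmax((x - low) / (high - low), 0), 1)
}

# Hypothetical scores for three predictors
toy_scores <- data.frame(
  predictor = c("a", "b", "c"),
  cor = c(0.9, 0.5, 0.1),
  imp = c(10, 40, 20)
)

d_cor <- d_linear_max(toy_scores$cor, low = 0, high = 1)
# Assumption: with no explicit bounds, use the observed range of the score
d_imp <- d_linear_max(
  toy_scores$imp,
  low = min(toy_scores$imp),
  high = max(toy_scores$imp)
)

# Overall desirability as the geometric mean of the two desirabilities
toy_scores$d_all <- sqrt(d_cor * d_imp)

# Rank predictors from most to least desirable
toy_scores[order(toy_scores$d_all, decreasing = TRUE), ]
```

With a geometric mean, a predictor that scores zero on any one criterion receives an overall desirability of zero, so a single very poor score cannot be offset by strong scores elsewhere.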

## Contributing

Please note that the filtro project is released with a [Contributor Code of Conduct](https://filtro.tidymodels.org/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.

- For questions and discussions about tidymodels packages, modeling, and machine learning, please [post on Posit Community](https://forum.posit.co/new-topic?category_id=15&tags=tidymodels,question).

- If you think you have encountered a bug, please [submit an issue](https://github.com/tidymodels/filtro/issues).

- Either way, learn how to create and share a [reprex](https://reprex.tidyverse.org/articles/articles/learn-reprex.html) (a minimal, reproducible example), to clearly communicate about your code.

- Check out further details on [contributing guidelines for tidymodels packages](https://www.tidymodels.org/contribute/) and [how to get help](https://www.tidymodels.org/help/).