https://github.com/habedi/feature-factory

A high-performance feature engineering library for Rust powered by Apache DataFusion 🦀

Host: GitHub
URL: https://github.com/habedi/feature-factory
Owner: habedi
License: apache-2.0
Created: 2025-03-05T22:20:55.000Z (7 months ago)
Default Branch: main
Last Pushed: 2025-03-29T19:47:35.000Z (6 months ago)
Last Synced: 2025-07-03T01:17:12.792Z (3 months ago)
Topics: data-preprocessing, data-science, feature-engineering, feature-selection, machine-learning, rust-lang, rust-library
Language: Rust
Homepage:
Size: 87.9 KB
Stars: 16
Watchers: 1
Forks: 0
Open Issues: 3
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE-APACHE
- Code of conduct: CODE_OF_CONDUCT.md

## Feature Factory

[![Tests](https://img.shields.io/github/actions/workflow/status/habedi/feature-factory/tests.yml?label=tests&style=flat&labelColor=282c34&color=4caf50&logo=github)](https://github.com/habedi/feature-factory/actions/workflows/tests.yml)
[![Lints](https://img.shields.io/github/actions/workflow/status/habedi/feature-factory/lints.yml?label=lints&style=flat&labelColor=282c34&color=4caf50&logo=github)](https://github.com/habedi/feature-factory/actions/workflows/lints.yml)
[![Code Coverage](https://img.shields.io/codecov/c/github/habedi/feature-factory?style=flat&labelColor=282c34&color=ffca28&logo=codecov)](https://codecov.io/gh/habedi/feature-factory)
[![CodeFactor](https://img.shields.io/codefactor/grade/github/habedi/feature-factory?style=flat&labelColor=282c34&color=4caf50&logo=codefactor)](https://www.codefactor.io/repository/github/habedi/feature-factory)
[![Crates.io](https://img.shields.io/crates/v/feature-factory.svg?style=flat&labelColor=282c34&color=f46623&logo=rust)](https://crates.io/crates/feature-factory)
[![Docs.rs](https://img.shields.io/badge/docs.rs-feature--factory-66c2a5?style=flat&labelColor=282c34&logo=docs.rs)](https://docs.rs/feature-factory)
[![Downloads](https://img.shields.io/crates/d/feature-factory?style=flat&labelColor=282c34&color=4caf50&logo=rust)](https://crates.io/crates/feature-factory)
[![MSRV](https://img.shields.io/badge/MSRV-1.83.0-007ec6?label=msrv&style=flat&labelColor=282c34&logo=rust)](https://github.com/rust-lang/rust/releases/tag/1.83.0)
[![License](https://img.shields.io/badge/license-MIT%2FApache--2.0-007ec6?style=flat&labelColor=282c34&logo=open-source-initiative)](https://github.com/habedi/feature-factory)
[![Status: Alpha](https://img.shields.io/badge/status-alpha-ec407a.svg?style=flat&labelColor=282c34)](https://github.com/habedi/feature-factory)

Feature Factory is a feature engineering library for Rust built on top
of [Apache DataFusion](https://datafusion.apache.org/).
It uses DataFusion internally for fast, in-memory data processing.
It is inspired by the [Feature-engine](https://feature-engine.readthedocs.io/en/latest/) Python library and
provides a wide range of components (referred to as transformers) for common feature engineering tasks like imputation,
encoding, discretization, and feature selection.

Feature Factory aims to be feature-rich and provide an API similar to [Scikit-learn](https://scikit-learn.org/stable/),
with the performance benefits of Rust and Apache DataFusion. Feature Factory transformers follow
a [fit-transform paradigm](https://scikit-learn.org/stable/data_transforms.html), where each transformer provides a
constructor, a `fit` method, and a `transform` method. Given an input dataframe, a transformer applies a
transformation to the data and returns a new dataframe.
The library also provides a pipeline API that allows users to chain multiple transformers together to create data
transformation pipelines for feature engineering.

> [!IMPORTANT]
> Feature Factory is currently in an early stage of development. APIs are unstable and may change without notice.
> Inconsistencies in documentation are expected, and not all features have been implemented yet.
> It has not yet been thoroughly tested, benchmarked, or optimized for performance.
> Bug reports, feature requests, and contributions are welcome!

### Features

- **High Performance**: Feature Factory uses Apache DataFusion as the backend data processing engine.
- **Scikit-learn API**: It provides a Scikit-learn-like API which is familiar to most data scientists.
- **Pipeline API**: Users can chain multiple transformers together to build a feature engineering pipeline.
- **Large Set of Transformers**: Currently, Feature Factory includes the following transformers:

| **Task** | **Transformers** | **Status** |
|----------|------------------|------------|
| [**Imputation**](src/transformers/imputation.rs) | - `MeanMedianImputer`: Replace missing values with the mean (or median).<br>- `ArbitraryNumberImputer`: Replace missing values with an arbitrary number.<br>- `EndTailImputer`: Replace missing values with values at distribution tails.<br>- `CategoricalImputer`: Replace missing values with an arbitrary string or most frequent category.<br>- `AddMissingIndicator`: Add a binary indicator for missing values.<br>- `DropMissingData`: Remove rows with missing values. | Tested |
| [**Categorical Encoding**](src/transformers/categorical.rs) | - `OneHotEncoder`: Perform one-hot encoding.<br>- `CountFrequencyEncoder`: Replace categories with their frequencies.<br>- `OrdinalEncoder`: Replace categories with ordered numbers.<br>- `MeanEncoder`: Replace categories with target mean.<br>- `WoEEncoder`: Replace categories with the weight of evidence.<br>- `RareLabelEncoder`: Group infrequent categories. | Tested |
| [**Variable Discretization**](src/transformers/discretization.rs) | - `ArbitraryDiscretizer`: Discretize based on user-defined intervals.<br>- `EqualFrequencyDiscretizer`: Discretize into equal-frequency bins.<br>- `EqualWidthDiscretizer`: Discretize into equal-width bins.<br>- `GeometricWidthDiscretizer`: Discretize into geometric intervals. | Tested |
| [**Outlier Handling**](src/transformers/outliers.rs) | - `ArbitraryOutlierCapper`: Cap outliers at user-defined bounds.<br>- `Winsorizer`: Cap outliers using percentile thresholds.<br>- `OutlierTrimmer`: Remove outliers from the dataset. | Tested |
| [**Numerical Transformations**](src/transformers/numerical.rs) | - `LogTransformer`: Apply logarithmic transformation.<br>- `LogCpTransformer`: Apply log transformation with a constant.<br>- `ReciprocalTransformer`: Apply reciprocal transformation.<br>- `PowerTransformer`: Apply power transformation.<br>- `BoxCoxTransformer`: Apply Box-Cox transformation.<br>- `YeoJohnsonTransformer`: Apply Yeo-Johnson transformation.<br>- `ArcsinTransformer`: Apply arcsin transformation. | Tested |
| [**Feature Creation**](src/transformers/feature_creation.rs) | - `MathFeatures`: Create new features with mathematical operations.<br>- `RelativeFeatures`: Combine features with reference features.<br>- `CyclicalFeatures`: Encode cyclical features using sine or cosine. | Tested |
| [**Datetime Features**](src/transformers/datetime.rs) | - `DatetimeFeatures`: Extract features from datetime values.<br>- `DatetimeSubtraction`: Compute time differences between datetime values. | Tested |
| [**Feature Selection**](src/transformers/feature_selection.rs) | - `DropFeatures`: Drop specific features.<br>- `DropConstantFeatures`: Remove constant and quasi-constant features.<br>- `DropDuplicateFeatures`: Remove duplicate features.<br>- `DropCorrelatedFeatures`: Remove highly correlated features.<br>- `SmartCorrelatedSelection`: Select the best features from correlated groups.<br>- `DropHighPSIFeatures`: Drop features based on Population Stability Index (PSI).<br>- `SelectByInformationValue`: Select features based on information value.<br>- `SelectBySingleFeaturePerformance`: Select features based on univariate estimators.<br>- `SelectByTargetMeanPerformance`: Select features based on target mean encoding.<br>- `MRMR`: Select features using Maximum Relevance Minimum Redundancy. | Tested |

> [!NOTE]
> The status column shows whether a module has been `Tested` (with unit, integration, and documentation tests) and
> `Benchmarked`. An empty status means the module has been neither tested nor benchmarked.
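
To make two of the listed transformations concrete, here is a minimal sketch in plain Rust of what equal-width discretization and cyclical encoding compute. It does not use Feature Factory or DataFusion. The function names `equal_width_bins` and `cyclical_encode` are hypothetical helpers made up for this illustration, and returning both the sine and the cosine component is an assumption. See the API documentation for the actual `EqualWidthDiscretizer` and `CyclicalFeatures` interfaces.

```rust
use std::f64::consts::PI;

/// Equal-width discretization: split [min, max] into `n_bins` intervals of
/// equal width and return the bin index of each value.
/// Hypothetical helper for illustration; not part of the Feature Factory API.
fn equal_width_bins(values: &[f64], n_bins: usize) -> Vec<usize> {
    assert!(n_bins > 0, "n_bins must be positive");
    let min = values.iter().copied().fold(f64::INFINITY, f64::min);
    let max = values.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let width = (max - min) / n_bins as f64;
    values
        .iter()
        .map(|&v| {
            if width == 0.0 {
                0 // all values are identical, so everything falls into one bin
            } else {
                // The maximum would land in bin `n_bins`; clamp it into the last bin.
                (((v - min) / width) as usize).min(n_bins - 1)
            }
        })
        .collect()
}

/// Cyclical encoding: map a periodic value (e.g. hour of day, period 24)
/// onto the unit circle so that the end of a cycle sits next to its start.
/// Hypothetical helper for illustration; not part of the Feature Factory API.
fn cyclical_encode(value: f64, period: f64) -> (f64, f64) {
    let angle = 2.0 * PI * value / period;
    (angle.sin(), angle.cos())
}
```

With this sketch, hour 0 and hour 24 map to (nearly) the same point on the circle, which is the property that makes the encoding useful for periodic features.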

### Installation

```shell
cargo add feature-factory
```

Or add this to your `Cargo.toml`:

```toml
[dependencies]
feature-factory = "0.1"
```

*Feature Factory requires Rust 1.83 or later.*

### Documentation

You can find the latest API documentation at [docs.rs/feature-factory](https://docs.rs/feature-factory).

### Architecture

The main building blocks of Feature Factory are *transformers* and *pipelines*.

#### Transformers

A transformer takes one or more columns from an input DataFrame and creates new columns based on a transformation.
Transformers can be *stateful* or *stateless*:

- A stateful transformer needs to learn one or more parameters from the data during training (by calling `fit`) before
  it can transform the data. A stateful transformer with learned parameters is referred to as a *fitted* transformer.
- A stateless transformer can transform the data directly, without learning any parameters.

All transformers implement the [`Transformer`](src/pipeline.rs) trait, which includes:

| **Method** | **Description** |
|---------------|-----------------------------------------------------------------------------------------------------|
| `new` | Creates a new transformer instance. Can accept hyperparameters and column names as input arguments. |
| `fit` | Learns parameters from data. For stateless transformers this is a no-op. |
| `transform` | Applies the transformation to data. Stateful transformers require calling `fit` first. |
| `is_stateful` | Returns `true` if the transformer is stateful, otherwise `false`. |
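
The fit-transform contract in the table can be sketched with a simplified, self-contained analogue. The code below is not the library's `Transformer` trait. It assumes a hypothetical `ColumnTransformer` trait that works on a single in-memory column (`Vec<Option<f64>>`) instead of a DataFusion DataFrame, and the `MeanImputer` and `LogTransform` types are illustrative stand-ins. It only shows how a stateful and a stateless transformer differ.

```rust
/// A single nullable numeric column; a stand-in for a DataFusion DataFrame.
type Column = Vec<Option<f64>>;

/// Hypothetical, simplified analogue of the trait described above.
trait ColumnTransformer {
    /// Learn parameters from the data (a no-op for stateless transformers).
    fn fit(&mut self, data: &Column) -> Result<(), String>;
    /// Apply the transformation and return a new column.
    fn transform(&self, data: &Column) -> Result<Column, String>;
    /// Whether `fit` must be called before `transform`.
    fn is_stateful(&self) -> bool;
}

/// Stateful: replaces missing values with the mean learned in `fit`.
struct MeanImputer {
    mean: Option<f64>,
}

impl MeanImputer {
    fn new() -> Self {
        Self { mean: None }
    }
}

impl ColumnTransformer for MeanImputer {
    fn fit(&mut self, data: &Column) -> Result<(), String> {
        let present: Vec<f64> = data.iter().flatten().copied().collect();
        if present.is_empty() {
            return Err("cannot compute a mean: no non-missing values".to_string());
        }
        self.mean = Some(present.iter().sum::<f64>() / present.len() as f64);
        Ok(())
    }

    fn transform(&self, data: &Column) -> Result<Column, String> {
        // An unfitted transformer has no mean yet, so transforming is an error.
        let mean = self.mean.ok_or("MeanImputer must be fitted before transform")?;
        Ok(data.iter().map(|v| Some(v.unwrap_or(mean))).collect())
    }

    fn is_stateful(&self) -> bool {
        true
    }
}

/// Stateless: applies the natural logarithm; there is nothing to learn.
struct LogTransform;

impl ColumnTransformer for LogTransform {
    fn fit(&mut self, _data: &Column) -> Result<(), String> {
        Ok(())
    }

    fn transform(&self, data: &Column) -> Result<Column, String> {
        data.iter()
            .map(|v| match v {
                Some(x) if *x <= 0.0 => Err(format!("log is undefined for {x}")),
                Some(x) => Ok(Some(x.ln())),
                None => Ok(None),
            })
            .collect()
    }

    fn is_stateful(&self) -> bool {
        false
    }
}
```

In this sketch, calling `transform` on an unfitted `MeanImputer` returns an error, while `LogTransform` can be used right away.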

The figure below shows a high-level overview of how a single Feature Factory transformer works:

![Feature Factory Transformer](assets/transformer_architecture.svg)

> [!IMPORTANT]
> In most cases, to avoid data leakage, the data used for training a transformer must not be the same as the data that
> is going to be transformed.
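
The following sketch illustrates the leakage point with a simplified percentile capper: the caps are learned from the training split only and then applied to unseen data. It is plain Rust and does not use the library. `Caps`, `fit_caps`, `apply_caps`, and the nearest-rank `percentile` helper are hypothetical names chosen for this example, and the library's `Winsorizer` may compute its thresholds differently.

```rust
/// Caps learned from training data (hypothetical struct, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Caps {
    lower: f64,
    upper: f64,
}

/// Nearest-rank percentile of an ascending-sorted, non-empty slice; `q` is in [0, 1].
fn percentile(sorted: &[f64], q: f64) -> f64 {
    let n = sorted.len();
    let rank = (q * n as f64).ceil() as usize;
    sorted[rank.clamp(1, n) - 1]
}

/// "Fit": learn the caps from the training split only.
/// Assumes `lower_q <= upper_q`; returns `None` for an empty training split.
fn fit_caps(train: &[f64], lower_q: f64, upper_q: f64) -> Option<Caps> {
    if train.is_empty() {
        return None;
    }
    let mut sorted = train.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    Some(Caps {
        lower: percentile(&sorted, lower_q),
        upper: percentile(&sorted, upper_q),
    })
}

/// "Transform": apply the learned caps to any split, including unseen data.
fn apply_caps(data: &[f64], caps: Caps) -> Vec<f64> {
    data.iter().map(|&v| v.clamp(caps.lower, caps.upper)).collect()
}
```

Because `fit_caps` never sees the held-out values, an extreme value in the test split cannot shift the learned bounds; it is simply capped by them.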

#### Pipelines

A pipeline chains multiple transformers together. Pipelines are created using the [`make_pipeline`](src/pipeline.rs)
macro, which accepts a list of `(name, transformer)` tuples.
Stateful transformers must be fitted before they're used in a pipeline.

The figure below shows a high-level overview of how a Feature Factory pipeline works:

![Feature Factory Pipeline](assets/pipeline_architecture.svg)

> [!IMPORTANT]
> Currently, to use a stateful transformer in a pipeline, it must be already fitted.
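
As a rough mental model of chaining, the sketch below runs an ordered list of named, already-fitted steps over a column, feeding each step's output into the next. It is a self-contained illustration and not the `make_pipeline` macro: the `Pipeline` struct, the `Step` alias, and `example_pipeline` are hypothetical, and closures over `&[f64]` stand in for fitted transformers operating on DataFrames.

```rust
/// A step maps a column to a new column; a stand-in for a fitted transformer.
type Step = Box<dyn Fn(&[f64]) -> Vec<f64>>;

/// Hypothetical pipeline: an ordered list of `(name, step)` pairs.
struct Pipeline {
    steps: Vec<(String, Step)>,
}

impl Pipeline {
    fn new(steps: Vec<(&str, Step)>) -> Self {
        Self {
            steps: steps
                .into_iter()
                .map(|(name, step)| (name.to_string(), step))
                .collect(),
        }
    }

    /// Run every step in order, feeding each step's output into the next.
    fn transform(&self, input: &[f64]) -> Vec<f64> {
        self.steps
            .iter()
            .fold(input.to_vec(), |data, (_name, step)| step(data.as_slice()))
    }

    /// Names of the steps, in execution order.
    fn step_names(&self) -> Vec<&str> {
        self.steps.iter().map(|(name, _)| name.as_str()).collect()
    }
}

/// Builds a two-step pipeline: cap values at known bounds, then take the natural log.
fn example_pipeline() -> Pipeline {
    // Pretend these bounds were learned earlier by fitting on training data.
    let (lower, upper) = (1.0, 100.0);
    let capper: Step = Box::new(move |col: &[f64]| -> Vec<f64> {
        col.iter().map(|&v| v.clamp(lower, upper)).collect()
    });
    let log: Step = Box::new(|col: &[f64]| -> Vec<f64> {
        col.iter().map(|&v| v.ln()).collect()
    });
    Pipeline::new(vec![("capper", capper), ("log", log)])
}
```

The order of the steps matters here: capping first guarantees that the log step only sees positive values.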

### Examples

Check out the [examples](examples) and [tests](tests) directories for examples of how to use Feature Factory.

### Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for details on how to make a contribution.

### Logo

The mascot of this project is named "Weldon the Penguin".
He is a Rustacean penguin who loves to swim in the sea and play video games, and he is always ready to help you with
your data.

The logo was created using Gimp, ComfyUI, and a Flux Schnell v2 model.

### Licensing

Feature Factory is available under the terms of either of these licenses:

* MIT License ([LICENSE-MIT](LICENSE-MIT))
* Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE))
