https://github.com/floscha/featurefilter
A Python library for removing uninformative variables from datasets
- Host: GitHub
- URL: https://github.com/floscha/featurefilter
- Owner: floscha
- License: mit
- Created: 2019-01-26T20:32:47.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2021-04-19T19:01:51.000Z (about 5 years ago)
- Last Synced: 2025-02-03T12:49:53.924Z (over 1 year ago)
- Topics: feature-engineering, feature-selection
- Language: Python
- Size: 54.7 KB
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
# Featurefilter
[Build status](https://travis-ci.com/floscha/featurefilter)
[Coverage](https://coveralls.io/github/floscha/featurefilter?branch=master)
[Code quality](https://www.codacy.com/app/floscha/featurefilter?utm_source=github.com&utm_medium=referral&utm_content=floscha/featurefilter&utm_campaign=Badge_Grade)
[PyPI](https://pypi.python.org/pypi/featurefilter)
[License: MIT](https://opensource.org/licenses/MIT)
**Featurefilter** is a Python library for removing uninformative variables from datasets.
## Features
- [x] 100% test coverage
- [x] Pandas backend
- [x] Support for scikit-learn pipelines
- [x] Support for scikit-learn selectors
- [ ] PySpark backend (planned for version 0.2)
## Usage Examples
All examples can also be found in the [example notebook](examples.ipynb).
### Remove columns with too many NA values
```python
import numpy as np
import pandas as pd
from featurefilter import NaFilter
df = pd.DataFrame({'A': [0, np.nan, np.nan],
                   'B': [0, 0, np.nan]})
na_filter = NaFilter(max_na_ratio=0.5)
na_filter.columns_to_drop = ['A']
na_filter.fit_transform(df)
```
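To make the `max_na_ratio` threshold concrete, here is a minimal sketch of the same idea in plain pandas. It is not the library's implementation; the strict `>` comparison and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, np.nan, np.nan],
                   'B': [0, 0, np.nan]})

# Fraction of missing values per column: A is 2/3, B is 1/3
na_ratio = df.isna().mean()

# Assumed rule: drop a column when its NA ratio exceeds the threshold
max_na_ratio = 0.5
columns_to_drop = na_ratio[na_ratio > max_na_ratio].index.tolist()
filtered = df.drop(columns=columns_to_drop)
```

With this data only column `A` exceeds the 0.5 threshold, so `filtered` keeps column `B`.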
### Remove columns with too low or high variance
```python
import pandas as pd
from featurefilter import VarianceFilter
df = pd.DataFrame({'A': [0., 1.], 'B': [0., 0.]})
variance_filter = VarianceFilter()
variance_filter.fit_transform(df)
```
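As a rough illustration of variance-based filtering, the sketch below drops constant columns with plain pandas. The `min_variance` threshold and the `<=` comparison are assumptions for this example and may differ from the defaults used by `VarianceFilter`.

```python
import pandas as pd

df = pd.DataFrame({'A': [0., 1.], 'B': [0., 0.]})

# Sample variance per column (pandas uses ddof=1 by default)
variances = df.var()

# Assumed rule: drop columns whose variance does not exceed the minimum
min_variance = 0.0
columns_to_drop = variances[variances <= min_variance].index.tolist()
filtered = df.drop(columns=columns_to_drop)
```

Column `B` is constant, so it carries no information and is the only column removed here.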
### Remove columns with too high correlation to the target variable
```python
import pandas as pd
from featurefilter import TargetCorrelationFilter
df = pd.DataFrame({'A': [0, 0], 'B': [0, 1], 'Y': [0, 1]})
target_correlation_filter = TargetCorrelationFilter(target_column='Y')
target_correlation_filter.fit_transform(df)
```
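The underlying check can be sketched with `DataFrame.corrwith`. This is only an approximation of the idea, not the library's code; the `max_correlation` threshold of 0.9 is a hypothetical value chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 0], 'B': [0, 1], 'Y': [0, 1]})

# Absolute Pearson correlation of each feature with the target.
# The constant column A yields NaN, which never passes the comparison below.
features = df.drop(columns='Y')
correlations = features.corrwith(df['Y']).abs()

# Hypothetical threshold: drop features correlated more strongly than this
max_correlation = 0.9
columns_to_drop = correlations[correlations > max_correlation].index.tolist()
filtered = df.drop(columns=columns_to_drop)
```

Here `B` is identical to `Y`, so it is the only feature flagged for removal.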
### Remove columns using generalized linear models (GLMs)
```python
import pandas as pd
from featurefilter import GLMFilter
df = pd.DataFrame({'A': [0, 0, 1, 1],
                   'B': [0, 1, 0, 1],
                   'Y': [0, 0, 1, 1]})
glm_filter = GLMFilter(target_column='Y', top_features=1)
glm_filter.fit_transform(df)
```
### Remove columns using tree-based models
```python
import pandas as pd
from featurefilter import TreeBasedFilter
df = pd.DataFrame({'A': [0, 0, 1, 1],
                   'B': [0, 1, 0, 1],
                   'Y': ['a', 'a', 'b', 'b']})
tree_based_filter = TreeBasedFilter(target_column='Y',
                                    categorical_target=True,
                                    top_features=1)
tree_based_filter.fit_transform(df)
```
### Remove columns using multiple filters combined with scikit-learn's Pipeline API
```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from featurefilter import NaFilter, VarianceFilter
df = pd.DataFrame({'A': [0, np.nan, np.nan],
                   'B': [0, 0, 0],
                   'C': [0, np.nan, 1]})
pipeline = Pipeline([
    ('na_filter', NaFilter(max_na_ratio=0.5)),
    ('variance_filter', VarianceFilter())
])
pipeline.fit_transform(df)
```
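For intuition about what chaining two filters does, the sketch below applies equivalent steps in sequence using `DataFrame.pipe`. The helper functions `drop_na_columns` and `drop_constant_columns` are hypothetical and not part of featurefilter; their thresholds and comparisons are assumptions.

```python
import numpy as np
import pandas as pd


def drop_na_columns(df, max_na_ratio=0.5):
    """Keep columns whose share of missing values is at most max_na_ratio."""
    na_ratio = df.isna().mean()
    return df.loc[:, na_ratio <= max_na_ratio]


def drop_constant_columns(df, min_variance=0.0):
    """Keep columns whose sample variance is above min_variance."""
    variances = df.var()  # NaN values are skipped by default
    return df.loc[:, variances > min_variance]


df = pd.DataFrame({'A': [0, np.nan, np.nan],
                   'B': [0, 0, 0],
                   'C': [0, np.nan, 1]})

# Each step receives the output of the previous one, as in a Pipeline
result = (df.pipe(drop_na_columns, max_na_ratio=0.5)
            .pipe(drop_constant_columns))
```

The first step removes `A` (two of three values missing) and the second removes the constant column `B`, leaving only `C`.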
### Remove columns using existing selectors provided by scikit-learn
```python
import pandas as pd
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from featurefilter import SklearnWrapper
df = pd.DataFrame({'A': [0, 0, 1, 1],
                   'B': [0, 1, 0, 1],
                   'Y': [0, 0, 1, 1]})
model = RFECV(LinearRegression(),
              min_features_to_select=1,
              cv=3)
selector = SklearnWrapper(model, target_column='Y')
selector.fit_transform(df)
```