Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/pfnet-research/xfeat
Flexible Feature Engineering & Exploration Library using GPUs and Optuna.
- Host: GitHub
- URL: https://github.com/pfnet-research/xfeat
- Owner: pfnet-research
- License: MIT
- Created: 2020-05-29T08:40:10.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2023-11-01T02:12:52.000Z (about 1 year ago)
- Last Synced: 2024-08-02T16:46:20.324Z (3 months ago)
- Language: Python
- Homepage:
- Size: 1.07 MB
- Stars: 363
- Watchers: 29
- Forks: 21
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- my-awesome-starred - pfnet-research/xfeat - Flexible Feature Engineering & Exploration Library using GPUs and Optuna. (Python)
README
# xfeat
**[Slides](_docs/xfeat_slides.pdf)** | **[Tutorial](examples/xfeat_tutorial_notebook.ipynb)** | **[Document](#document)** | **[Installation](#installation)**
Flexible Feature Engineering & Exploration Library using GPUs and [Optuna](https://github.com/optuna/optuna).
xfeat provides sklearn-like transformation classes for feature engineering and exploration. Unlike the sklearn API, xfeat provides a dataframe-in, dataframe-out interface. xfeat supports both [pandas](https://pandas.pydata.org/) and [cuDF](https://github.com/rapidsai/cudf) dataframes. By using cuDF and [CuPy](https://github.com/cupy/cupy), xfeat can generate features 10 to 30 times faster than naive pandas operations.
| ![xfeat_bench_result](./_docs/benchmark_groupby_aggregation_marked.png) | ![xfeat_target_encoding_image](./_docs/benchmark_target_encoding_marked.png) |
|------------------------------------------------------------------|-------|
| Group-by aggregation benchmark ([result](./_docs/groupby_aggregation_benchmark.md)) | Target encoding benchmark ([result](./_docs/target_encoding_benchmark.md)) |

## Document
* [Slides](_docs/xfeat_slides.pdf)
* [Tutorial notebook](examples/xfeat_tutorial_notebook.ipynb)
* [Feature Encoding and Pipelining](./_docs/feature_encoding.md)
* [Target encoding and benchmark result](./_docs/target_encoding_benchmark.md)
* [Group-by aggregation and benchmark result](./_docs/groupby_aggregation_benchmark.md)
* Feature selection with Optuna

More examples are available in the [./examples](examples/) directory.
## Quick Start
xfeat provides a dataframe-in, dataframe-out interface:
![xfeat_arithmetic_combination](./_docs/arithmetic_combination.png)
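For example, an encoder takes a dataframe and returns a dataframe. A minimal sketch with `ArithmeticCombinations` (the toy dataframe below is illustrative, not from the library's docs):

```python
import pandas as pd
from xfeat import ArithmeticCombinations

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Pairwise (r=2) additive combinations of the input columns.
encoder = ArithmeticCombinations(operator="+", r=2)
df_out = encoder.fit_transform(df)  # dataframe in, dataframe out
print(df_out)  # original columns plus the combined-feature column
```

As described above, the same encoder also accepts a cuDF dataframe, in which case the features are computed on the GPU.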
### Feature Engineering
Encoder objects can be chained together with `xfeat.Pipeline`. To avoid repeating the same feature extraction, it is convenient to save the results in the Feather file format.
* More encoder classes available [here](./_docs/feature_encoding.md).
```python
import pandas as pd
from xfeat import Pipeline, SelectNumerical, ArithmeticCombinations

# 2-order arithmetic combinations.
Pipeline(
[
SelectNumerical(),
ArithmeticCombinations(
exclude_cols=["target"], drop_origin=True, operator="+", r=2,
),
]
).fit_transform(pd.read_feather("train_test.ftr")).reset_index(
drop=True
).to_feather(
"feature_arithmetic_combi2.ftr"
)
```

### Target Encoding with cuDF/CuPy
![xfeat_target_encoding_image](./_docs/target_encoding_image.png)
Target encoding can be greatly accelerated with cuDF. Internally, aggregation is computed on the GPU using CuPy.
```python
from sklearn.model_selection import KFold
from xfeat import TargetEncoder

fold = KFold(n_splits=5, shuffle=False)
encoder = TargetEncoder(input_cols=cols, fold=fold)  # `cols`: columns to target-encode

df = cudf.from_pandas(df)  # if cuDF is available.
df_encoded = encoder.fit_transform(df)
```

### Groupby features with cuDF
![xfeat_groupby_agg_image](./_docs/aggregation_image.png)
Benchmark result: [Group-by aggregation and benchmark result](./_docs/groupby_aggregation_benchmark.md).
```python
from xfeat import aggregation

df = cudf.from_pandas(df)  # if cuDF is available.
df_agg = aggregation(df,
group_key="user_id",
group_values=["price", "purchased_amount"],
agg_methods=["sum", "min", "max"]
).to_pandas()
```

### Feature Selection with GBDT feature importance
Example code: [examples/feature_selection_with_gbdt.py](examples/feature_selection_with_gbdt.py)
```python
from xfeat import GBDTFeatureSelector

params = {
"objective": "regression",
"seed": 111,
}
fit_kwargs = {
"num_boost_round": 10,
}

selector = GBDTFeatureSelector(
input_cols=cols,
target_col="target",
threshold=0.5,
lgbm_params=params,
lgbm_fit_kwargs=fit_kwargs,
)
df_selected = selector.fit_transform(df)
print("Selected columns:", selector._selected_cols)
```

### Feature Selection with Optuna
`GBDTFeatureSelector` uses a percentile hyperparameter to select features with the highest scores.
By using Optuna, we can search for the best value of this hyperparameter to maximize the objective.

Example code: [examples/feature_selection_with_gbdt_and_optuna.py](examples/feature_selection_with_gbdt_and_optuna.py)
```python
from functools import partial

import lightgbm as lgb
import optuna

from xfeat import GBDTFeatureExplorer


def objective(df, selector, trial):
    selector.set_trial(trial)
    selector.fit(df)
    input_cols = selector.get_selected_cols()

    # Evaluate the selected columns with LightGBM cross-validation.
    train_set = lgb.Dataset(df[input_cols], label=df["target"])
    scores = lgb.cv(params, train_set, num_boost_round=100, stratified=False, seed=1)
    rmse_score = scores["rmse-mean"][-1]
    return rmse_score


selector = GBDTFeatureExplorer(
    input_cols=input_cols,
    target_col="target",
    fit_once=True,
    threshold_range=(0.6, 1.0),
    lgbm_params=params,
    lgbm_fit_kwargs=fit_kwargs,
)

study = optuna.create_study(direction="minimize")
study.optimize(partial(objective, df_train, selector), n_trials=20)

selector.from_trial(study.best_trial)
print("Selected columns:", selector.get_selected_cols())
```

## Installation
```
$ python setup.py install
```

If you want to use GPUs, cuDF and CuPy are required. See the [cuDF installation guide](https://github.com/rapidsai/cudf#installation).
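The snippets above call `cudf.from_pandas(df)` "if cuDF is available". One way to make that guard explicit (a sketch of ours, not an xfeat API) is a small helper that falls back to pandas when the import fails:

```python
import pandas as pd

# Hypothetical helper: move a dataframe to the GPU when cuDF is installed,
# otherwise keep working with the pandas dataframe.
try:
    import cudf

    def maybe_gpu(df: pd.DataFrame):
        return cudf.from_pandas(df)
except ImportError:
    def maybe_gpu(df: pd.DataFrame):
        return df
```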
### For Developers
```
$ python setup.py test
```