https://github.com/scikit-learn-contrib/category_encoders

A library of sklearn compatible categorical variable encoders

Host: GitHub
URL: https://github.com/scikit-learn-contrib/category_encoders
Owner: scikit-learn-contrib
License: bsd-3-clause
Created: 2015-11-29T19:32:37.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2024-10-01T21:23:20.000Z (6 months ago)
Last Synced: 2024-10-29T14:50:23.067Z (6 months ago)
Language: Python
Homepage: http://contrib.scikit-learn.org/category_encoders/
Size: 42.2 MB
Stars: 2,408
Watchers: 39
Forks: 395
Open Issues: 46
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md
- Code of conduct: CODE_OF_CONDUCT.md

Awesome Lists containing this project

awesome-list - category_encoders - A set of scikit-learn-style transformers for encoding categorical variables into numeric by means of different techniques (Machine Learning Framework / General Purpose Framework)
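awesome-python-machine-learning-resources - GitHub - 25% open · ⏱️ 02.06.2022): (Sklearn utilities)
StarryDivineSky - scikit-learn-contrib/category_encoders - A scikit-learn-compatible library that provides multiple methods for encoding categorical variables as numeric values for use in machine learning models. It includes both unsupervised and supervised encoding methods: the unsupervised methods include One-Hot and Ordinal, and the supervised methods include Target Encoding and LeaveOneOut. The library accepts numpy arrays and pandas dataframes as input and provides configurable options. Users can install it via pip or conda and use its encoders for data preprocessing. (Other: Machine Learning and Deep Learning)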

README

Categorical Encoding Methods
============================

[![Downloads](https://pepy.tech/badge/category-encoders)](https://pepy.tech/project/category-encoders)

[![Downloads](https://pepy.tech/badge/category-encoders/month)](https://pepy.tech/project/category-encoders)

![Test Suite and Linting](https://github.com/scikit-learn-contrib/category_encoders/workflows/Test%20Suite%20and%20Linting/badge.svg)

[![DOI](https://zenodo.org/badge/47077067.svg)](https://zenodo.org/badge/latestdoi/47077067)

A set of scikit-learn-style transformers for encoding categorical variables as numeric values using a variety of techniques.

Important Links
---------------

Documentation: [http://contrib.scikit-learn.org/category_encoders/](http://contrib.scikit-learn.org/category_encoders/)

Encoding Methods
----------------

__Unsupervised:__

 * Backward Difference Contrast [2][3]

 * BaseN [6]

 * Binary [5]

 * Gray [14]

 * Count [10]

 * Hashing [1]

 * Helmert Contrast [2][3]

 * Ordinal [2][3]

 * One-Hot [2][3]

 * Rank Hot [15]

 * Polynomial Contrast [2][3]

 * Sum Contrast [2][3]

__Supervised:__

 * CatBoost [11]

 * Generalized Linear Mixed Model [12] 

 * James-Stein Estimator [9]

 * LeaveOneOut [4]

 * M-estimator [7]

 * Target Encoding [7]

 * Weight of Evidence [8]

 * Quantile Encoder [13]

 * Summary Encoder [13]

Installation
------------

The package requires: `numpy`, `statsmodels`, and `scipy`.

To install the package, execute:

```shell
python setup.py install
```

or 

```shell
pip install category_encoders
```

or

```shell
conda install -c conda-forge category_encoders
```

To install the development version, you may use:

```shell
pip install --upgrade git+https://github.com/scikit-learn-contrib/category_encoders
```

Usage
-----

All of the encoders are fully compatible sklearn transformers, so they can be used in pipelines or in your existing scripts. Supported input formats include numpy arrays and pandas dataframes. If the `cols` parameter isn't passed, all columns with object or pandas categorical data type will be encoded. Please see the docs for transformer-specific configuration options.

Examples
--------

There are two types of encoders: unsupervised and supervised. An unsupervised example:

```python
import pandas as pd
from category_encoders import BinaryEncoder

# prepare some data
# (load_boston was removed in scikit-learn 1.2, so a small illustrative
# frame with the same column names stands in for it here)
X = pd.DataFrame({
    'CHAS': [0, 1, 0, 0, 1, 0],
    'RAD': [1, 2, 3, 4, 5, 24],
    'AGE': [65.2, 78.9, 61.1, 45.8, 54.2, 58.7],
})

# use binary encoding to encode two categorical features
enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X)

# transform the dataset
numeric_dataset = enc.transform(X)
```

And a supervised example:

```python
import pandas as pd
from category_encoders import TargetEncoder

# prepare some data
# (illustrative values; load_boston was removed in scikit-learn 1.2)
X = pd.DataFrame({
    'CHAS': [0, 1, 0, 0, 1, 0, 1, 0],
    'RAD': [1, 2, 3, 1, 2, 3, 1, 2],
})
y = pd.Series([24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1])

# split into training and testing parts
X_train, X_test = X.iloc[:6], X.iloc[6:]
y_train, y_test = y.iloc[:6], y.iloc[6:]

# use target encoding to encode two categorical features
enc = TargetEncoder(cols=['CHAS', 'RAD'])

# transform the datasets
training_numeric_dataset = enc.fit_transform(X_train, y_train)
testing_numeric_dataset = enc.transform(X_test)
```

When transforming the _training_ data with a supervised method, use the `fit_transform()` method rather than `fit().transform()`, because the two _do not_ have to generate the same result. The difference can be observed with the LeaveOneOut encoder, which performs a nested cross-validation on the _training_ data in `fit_transform()` (to decrease over-fitting of the downstream model) but uses all the training data for scoring in `transform()` (to get estimates that are as accurate as possible).

Furthermore, you may benefit from the following wrappers:

 * PolynomialWrapper, which extends supervised encoders to support polynomial (multiclass) targets

 * NestedCVWrapper, which helps to prevent overfitting

Additional examples and benchmarks can be found in the `examples` directory.

Contributing
------------

Category encoders is under active development. If you'd like to be involved, we'd love to have you: check out the CONTRIBUTING.md file or open an issue on the GitHub project to get started.

References
----------

 1. Kilian Weinberger; Anirban Dasgupta; John Langford; Alex Smola; Josh Attenberg (2009). Feature Hashing for Large Scale Multitask Learning. Proc. ICML.

 2. Contrast Coding Systems for categorical variables.  UCLA: Statistical Consulting Group. From https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/.

 3. Gregory Carey (2003). Coding Categorical Variables. From http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf

 4. Owen Zhang - Leave One Out Encoding. From https://datascience.stackexchange.com/questions/10839/what-is-difference-between-one-hot-encoding-and-leave-one-out-encoding

 5. Beyond One-Hot: an exploration of categorical variables. From https://mcginniscommawill.com/posts/2015-11-29-beyond-one-hot-an-exploration-of-categorical-variables/

 6. BaseN Encoding and Grid Search in categorical variables. From https://mcginniscommawill.com/posts/2016-12-18-basen-encoding-grid-search-category-encoders/

 7. Daniele Micci-Barreca (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1. From http://dx.doi.org/10.1145/507533.507538

 8. Weight of Evidence (WOE) and Information Value Explained. From https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html

 9. Empirical Bayes for multiple sample sizes. From http://chris-said.io/2017/05/03/empirical-bayes-for-multiple-sample-sizes/

 10. Simple Count or Frequency Encoding. From https://www.datacamp.com/community/tutorials/encoding-methodologies

 11. Transforming categorical features to numerical features. From https://tech.yandex.com/catboost/doc/dg/concepts/algorithm-main-stages_cat-to-numberic-docpage/

 12. Andrew Gelman and Jennifer Hill (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. From https://faculty.psau.edu.sa/filedownload/doc-12-pdf-a1997d0d31f84d13c1cdc44ac39a8f2c-original.pdf

 13. Carlos Mougan, David Masip, Jordi Nin and Oriol Pujol (2021). Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems. Modeling Decisions for Artificial Intelligence, 2021. Springer International Publishing https://link.springer.com/chapter/10.1007%2F978-3-030-85529-1_14

 14. Gray Encoding. From https://en.wikipedia.org/wiki/Gray_code 

 15. Jacob Buckman, Aurko Roy, Colin Raffel, Ian Goodfellow: Thermometer Encoding: One Hot Way To Resist Adversarial Examples. From https://openreview.net/forum?id=S18Su--CW

 16. Carlos Mougan, Jose Alvarez, Salvatore Ruggieri, and Steffen Staab (2023). Fairness implications of encoding protected categorical attributes. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, AIES '23. From https://arxiv.org/abs/2201.11358
