Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/btrotta/avg-transform

Bayesian group transformations
https://github.com/btrotta/avg-transform

Last synced: 18 days ago
JSON representation

Bayesian group transformations

Host: GitHub
URL: https://github.com/btrotta/avg-transform
Owner: btrotta
Created: 2018-01-20T02:44:55.000Z (almost 7 years ago)
Default Branch: master
Last Pushed: 2018-01-20T03:09:21.000Z (almost 7 years ago)
Last Synced: 2024-11-24T03:26:51.287Z (about 1 month ago)
Language: Python
Size: 1.95 KB
Stars: 1
Watchers: 2
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: Readme.md

Awesome Lists containing this project

README

# Bayesian estimation of group averages, for machine learning feature engineering

In machine learning problems, it is common to have categorical variables with a large number (hundreds or thousands)
of levels. In this case, memory constraints make one-hot encoding these variables impractical. Therefore, a common way
to use such a variable in the model is to create a feature which is the average of the prediction variable (whether binary
or continuous) in each category. (In pandas, this can be done with groupby and transform.)
However, this approach will give misleading estimates of the group mean for groups
where the training set contains only a few samples. In this situation, a high or low average is more likely to occur
just by chance. So we need to adjust the
estimated group average to make it more conservative. We can do this using Bayesian methods. We assume a prior
distribution based on the overall data set, and combine this with the sample data in each group to calculate a
Bayesian posterior estimate of the group average.

This module also adds some other useful functionality which is not available in pandas. There is the option to calculate
the group averages using only a subset of the dataframe (so we can calculate the averages only using the training data,
and avoid leaking the target variable of test data). There is also an
option to exclude the current row from the average, which avoids overfitting.