Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mirri66/xgbmagic
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/mirri66/xgbmagic
- Owner: mirri66
- License: mit
- Created: 2016-04-20T10:32:24.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2023-07-06T21:08:52.000Z (over 1 year ago)
- Last Synced: 2024-04-25T06:42:44.557Z (7 months ago)
- Language: Python
- Size: 7.16 MB
- Stars: 189
- Watchers: 15
- Forks: 26
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Awesome Lists containing this project
README
# xgbmagic
*Pandas dataframe goes in, XGBoost model results come out*

The feature engineering step (creating new features and selectively removing unwanted features) is the most creative and fun part of training a model, whereas what follows is usually a standard data-processing flow.
Once you're done engineering your features, xgbmagic automatically runs a standard workflow for training an XGBoost model on a pandas dataframe:
- performs one-hot encoding for categorical features
- drops uninformative features (no variability, too many missing values, ...)
- trains the model
- plots the most important features in order of importance (a preprocessing sketch follows this list)
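As a rough illustration of the preprocessing xgbmagic automates (a sketch, not its actual implementation), the first two steps might look like this in plain pandas:

```
# Illustrative sketch only, not xgbmagic's internal code.
import pandas as pd

def basic_preprocess(df, categorical_columns, max_missing_fraction=0.5):
    # one-hot encode the categorical features
    df = pd.get_dummies(df, columns=categorical_columns)
    # drop uninformative columns: no variability or too many missing values
    keep = [col for col in df.columns
            if df[col].nunique(dropna=True) > 1
            and df[col].isnull().mean() <= max_missing_fraction]
    return df[keep]
```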
#### New features!

*Iterate faster with smaller samples! Improve accuracy with ensemble learning (bagging)!*
- allows training on random sample(s) - e.g. if you'd like to train on a subset of your data
- allows training multiple times on different subsamples, then returning the average/consensus prediction from all models (bagging)

#### To do
- detect highly correlated columns and remove redundant columns (one possible approach is sketched after this list)
- remove categorical features with too many possible category values (to remove unhelpful features like names and ids)
- parameter tuning with GridSearchCV
- allow custom values for more parameters
- for classification problems, convert labels to 0-n integers (in case they're floats or strings)
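One possible approach to the correlated-columns item (a hypothetical helper, not part of xgbmagic yet):

```
# Hypothetical sketch: drop one column from each pair whose absolute
# Pearson correlation exceeds a threshold.
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.95):
    corr = df.select_dtypes(include=[np.number]).corr().abs()
    # keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```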
## Installation

Install xgboost first:
```
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost; make -j4
cd python-package; sudo python setup.py install
```
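Alternatively, XGBoost is also published on PyPI; if you don't need a source build, the prebuilt package is usually simpler to install:

```
pip install xgboost
```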
Then install xgbmagic:
```
pip install xgbmagic
```

## Documentation
Input parameters for the `Xgb` constructor:
* df (DataFrame): pandas dataframe
* target_column (string): name of column containing the target parameter
* id_column (string): name of column containing IDs
* target_type (string): 'binary' for binary targets (classification), 'multiclass' for multiple classes (classification), 'linear' for continuous targets (linear regression)
* categorical_columns (list of strings): a list of names of columns containing categorical data
* numerical_columns (list of strings): a list of names of columns containing numerical data
* drop_columns (list of strings): a list of names of columns to drop
* verbose (boolean): verbosity of printouts. True = verbose
* sample_fraction (float, 0-1): if this is not 1, a subsample of the data will be used to train the model
* n_samples (int): if this is more than 1, this many models will be trained, each on a different subsample of the data (sample size determined by sample_fraction). Note that if n_samples * sample_fraction is more than 1, the number of models is capped by how many samples the stated sample_fraction allows (e.g. if sample_fraction = 0.2, at most 5 models will be trained)
* prefix (string): prefix for names of the output files

Parameters of `predict(test_df, return_multi_outputs, return_mean_std)`:
* return_multi_outputs (boolean): if True, returns one set of outputs per model trained on a distinct subsample
* return_mean_std (boolean): if True, returns tuple with mean and standard deviation of the model outputs
If both return_multi_outputs and return_mean_std are set to True, return_multi_outputs takes precedence.

## Example
```
import xgbmagic
import pandas as pd

# read the training data
df = pd.read_csv('train.csv')

# for logistic regression, target_type is 'binary'
target_type = 'binary'

# set columns that are categorical, numeric, and to be dropped here
xgb = xgbmagic.Xgb(df, target_column='TARGET', id_column='ID', target_type=target_type, categorical_columns=[], drop_columns=[], numeric_columns=[], num_training_rounds=500, verbose=1, early_stopping_rounds=50)
xgb.train()

# use the model to predict values for the test set
test_df = pd.read_csv('test.csv')
print(xgb.feature_importance())
output = xgb.predict(test_df)

# write to csv
xgb.write_csv('output-xgbmagic.csv')

# save model
xgb.save('xgbmodel.pkl')

# load model
import joblib  # sklearn.externals.joblib has been removed from recent scikit-learn
xgb = joblib.load('xgbmodel.pkl')
```
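For the sampling/bagging options, usage might look like the sketch below. This is an illustration based on the parameters documented above (sample_fraction, n_samples, return_mean_std); check the source for exact behaviour.

```
# Sketch only: train 5 models, each on a random 20% subsample (bagging),
# then get the mean and standard deviation of their predictions.
import pandas as pd
import xgbmagic

df = pd.read_csv('train.csv')
xgb = xgbmagic.Xgb(df, target_column='TARGET', id_column='ID', target_type='binary',
                   sample_fraction=0.2, n_samples=5)
xgb.train()

test_df = pd.read_csv('test.csv')
# return_mean_std=True returns a (mean, standard deviation) tuple per the docs above
mean_pred, std_pred = xgb.predict(test_df, return_mean_std=True)
```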
## Issues
Please report issues and feedback [here](https://github.com/mirri66/xgbmagic/issues)