https://github.com/petersontylerd/mlmachine
mlmachine accelerates machine learning experimentation
data-analysis data-science data-visualization machine-learning python
- Host: GitHub
- URL: https://github.com/petersontylerd/mlmachine
- Owner: petersontylerd
- License: mit
- Created: 2019-02-10T17:24:47.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2021-12-29T01:08:15.000Z (over 3 years ago)
- Last Synced: 2024-11-09T11:49:04.541Z (6 months ago)
- Topics: data-analysis, data-science, data-visualization, machine-learning, python
- Language: Jupyter Notebook
- Homepage:
- Size: 20.6 MB
- Stars: 30
- Watchers: 6
- Forks: 8
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
[](https://badge.fury.io/py/mlmachine)
# mlmachine
"mlmachine is a Python library that organizes and accelerates notebook-based machine learning experiments."
## Table of Contents
- [Novel Functionality](#Novel-Functionality)
- [Example Notebooks](#Example-Notebooks)
- [Articles on Medium](#Articles-on-Medium)
- [Installation](#Installation)
- [Feedback](#Feedback)
- [Acknowledgments](#Acknowledgments)
## Novel Functionality
__Easy, Elegant EDA__
mlmachine creates beautiful and informative EDA panels with ease:
```python
# create EDA panel for all "category" features
for feature in mlmachine_titanic.data.mlm_dtypes["category"]:
    mlmachine_titanic.eda_cat_target_cat_feat(
        feature=feature,
        legend_labels=["Died", "Survived"],
    )
```
__Pandas-in / Pandas-out Pipelines__
mlmachine makes Scikit-learn transformers Pandas-friendly.
For example, simply wrapping the mlmachine utility `PandasTransformer()` around `OneHotEncoder()` keeps the result a `DataFrame`.

__KFold Target Encoding__
mlmachine includes a utility called `KFoldEncoder`, which applies target encoding on categorical features and leverages out-of-fold encoding to prevent target leakage:
```python
from category_encoders import TargetEncoder
from sklearn.model_selection import KFold

# perform 5-fold target encoding with TargetEncoder from the category_encoders library
# (KFoldEncoder is provided by mlmachine)
encoder = KFoldEncoder(
    target=mlmachine_titanic_machine.training_target,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    encoder=TargetEncoder,
)
encoder.fit_transform(mlmachine_titanic_machine.training_features[["Pclass"]])
```
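
To see why out-of-fold encoding prevents leakage, here is a minimal sketch of the technique using plain per-category target means. The function name `out_of_fold_target_encode` and the toy data are assumptions for illustration; this is not `KFoldEncoder`'s implementation, and `TargetEncoder` additionally applies smoothing:

```python
import pandas as pd
from sklearn.model_selection import KFold


def out_of_fold_target_encode(feature, target, cv):
    """Encode each row with the target mean learned from the other folds only."""
    encoded = pd.Series(index=feature.index, dtype=float)
    for fit_idx, apply_idx in cv.split(feature):
        # category -> mean target, computed without the rows being encoded
        means = target.iloc[fit_idx].groupby(feature.iloc[fit_idx]).mean()
        # categories unseen in the fitting folds fall back to the overall mean
        fallback = target.iloc[fit_idx].mean()
        values = feature.iloc[apply_idx].map(means).fillna(fallback)
        encoded.iloc[apply_idx] = values.to_numpy()
    return encoded


# toy stand-in for a categorical feature and a binary target
pclass = pd.Series([1, 1, 2, 2, 3, 3, 1, 2, 3, 3], name="Pclass")
survived = pd.Series([1, 1, 1, 0, 0, 0, 1, 0, 0, 1], name="Survived")

cv = KFold(n_splits=5, shuffle=True, random_state=0)
encoded = out_of_fold_target_encode(pclass, survived, cv)
print(encoded)
```

Because a row's own target value never contributes to its encoding, the encoded feature cannot simply memorize the label.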
__Crowd-sourced Feature Importance & Exhaustive Feature Selection__
mlmachine employs a robust approach to estimating feature importance by using a variety of techniques:
- Tree-based Feature Importance
- Recursive Feature Elimination
- Sequential Forward Selection
- Sequential Backward Selection
- F-value / p-value
- Variance
- Target Correlation

All of this runs in a single call and works with multiple estimators and one or more scoring metrics:
```python
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# instantiate custom models
rf2 = RandomForestClassifier(max_depth=2)
rf4 = RandomForestClassifier(max_depth=4)
rf6 = RandomForestClassifier(max_depth=6)

# estimator list - default XGBClassifier, default
# RandomForestClassifier and three custom models
estimators = [
    XGBClassifier,
    RandomForestClassifier,
    rf2,
    rf4,
    rf6,
]

# instantiate FeatureSelector object
fs = mlmachine_titanic_machine.FeatureSelector(
    data=mlmachine_titanic_machine.training_features,
    target=mlmachine_titanic_machine.training_target,
    estimators=estimators,
)

# run feature importance techniques, use ROC AUC and
# accuracy score metrics and 0 CV folds (where applicable)
feature_selector_summary = fs.feature_selector_suite(
    sequential_scoring=["roc_auc", "accuracy_score"],
    sequential_n_folds=0,
    save_to_csv=True,
)
```
The features are then winnowed away, from least important to most important, through an exhaustive cross-validation procedure in search of an optimal feature subset.
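
The winnowing step can be sketched with scikit-learn alone. This is a simplified illustration, not mlmachine's procedure: it ranks features with a single random forest instead of aggregating several techniques, and the synthetic dataset and variable names are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic data: 8 features, only 3 of them informative
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# rank features from least to most important
importances = model.fit(X, y).feature_importances_
order = np.argsort(importances)

# drop features one at a time, least important first, scoring each subset
results = []
for n_dropped in range(X.shape[1]):  # always keep at least one feature
    kept = np.sort(order[n_dropped:])
    score = cross_val_score(model, X[:, kept], y, cv=5, scoring="accuracy").mean()
    results.append((score, kept))

best_score, best_features = max(results, key=lambda r: r[0])
print(f"best CV accuracy {best_score:.3f} using feature columns {best_features.tolist()}")
```

Each candidate subset is scored by cross-validation, and the subset with the highest mean score is kept.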

__Hyperparameter Tuning with Bayesian Optimization__
mlmachine can perform Bayesian optimization on multiple estimators in one shot, and includes functionality for visualizing model performance and parameter selections:
```python
# generate parameter selection panels for each parameter
mlmachine_titanic_machine.model_param_plot(
    bayes_optim_summary=bayes_optim_summary,
    estimator_class="KNeighborsClassifier",
    estimator_parameter_space=estimator_parameter_space,
    n_iter=100,
)
```
## Example Notebooks
All examples can be viewed [here](https://github.com/petersontylerd/mlmachine/tree/master/notebooks)
[Example Notebook 1](https://github.com/petersontylerd/mlmachine/tree/master/notebooks/mlmachine_part_1.ipynb) - Learn the basics of mlmachine, how to create EDA panels, and how to execute Pandas-friendly Scikit-learn transformations and pipelines.
[Example Notebook 2](https://github.com/petersontylerd/mlmachine/tree/master/notebooks/mlmachine_part_2.ipynb) - Learn how to use mlmachine to assess a dataset's pre-processing needs. See examples of how to use novel functionality, such as `GroupbyImputer()`, `KFoldEncoder()` and `DualTransformer()`.
[Example Notebook 3](https://github.com/petersontylerd/mlmachine/tree/master/notebooks/mlmachine_part_3.ipynb) - Learn how to perform thorough feature importance estimation, followed by an exhaustive, cross-validation-driven feature selection process.
[Example Notebook 4](https://github.com/petersontylerd/mlmachine/tree/master/notebooks/mlmachine_part_4.ipynb) - Learn how to execute hyperparameter tuning with Bayesian optimization for multiple models and multiple parameter spaces in one simple execution.
## Articles on Medium
[mlmachine - Clean ML Experiments, Elegant EDA & Pandas Pipelines](https://towardsdatascience.com/mlmachine-clean-ml-experiments-elegant-eda-pandas-pipelines-daba951dde0a) - Published 4/3/2020
[mlmachine - GroupbyImputer, KFoldEncoder, and Skew Correction](https://towardsdatascience.com/mlmachine-groupbyimputer-kfoldencoder-and-skew-correction-357f202d2212) - Published 4/13/2020
## Installation
__Python Requirements__: 3.6, 3.7
mlmachine uses the latest, or nearly latest, versions of all dependencies, so installing it in a dedicated virtual environment is highly recommended.
_**pyenv**_
Create a new virtual environment:
`$ pyenv virtualenv 3.7.5 mlmachine-env`
Activate your new virtual environment:
`$ pyenv activate mlmachine-env`
Install mlmachine and all dependencies with pip:
`$ pip install mlmachine`
_**anaconda**_
Create a new virtual environment:
`$ conda create --name mlmachine-env python=3.7`
Activate your new virtual environment:
`$ conda activate mlmachine-env`
Install mlmachine and all dependencies with pip:
`$ pip install mlmachine`
## Feedback
Any and all feedback is welcome. Please send me an email at [email protected]
## Acknowledgments
mlmachine stands on the shoulders of many great Python packages:
[catboost](https://github.com/catboost/catboost) | [category_encoders](https://github.com/scikit-learn-contrib/categorical-encoding) | [eif](https://github.com/sahandha/eif) | [hyperopt](https://github.com/hyperopt/hyperopt) | [imbalanced-learn](https://github.com/scikit-learn-contrib/imbalanced-learn) | [jupyter](https://github.com/jupyter/notebook) | [lightgbm](https://github.com/microsoft/LightGBM) | [matplotlib](https://github.com/matplotlib/matplotlib) | [numpy](https://github.com/numpy/numpy) | [pandas](https://github.com/pandas-dev/pandas) | [prettierplot](https://github.com/petersontylerd/prettierplot) | [scikit-learn](https://github.com/scikit-learn/scikit-learn) | [scipy](https://github.com/scipy/scipy) | [seaborn](https://github.com/mwaskom/seaborn) | [shap](https://github.com/slundberg/shap) | [statsmodels](https://github.com/statsmodels/statsmodels) | [xgboost](https://github.com/dmlc/xgboost) |