{"id":13701790,"url":"https://github.com/petersontylerd/mlmachine","last_synced_at":"2025-03-21T12:23:07.201Z","repository":{"id":48403670,"uuid":"170004625","full_name":"petersontylerd/mlmachine","owner":"petersontylerd","description":"mlmachine accelerates machine learning experimentation","archived":false,"fork":false,"pushed_at":"2021-12-29T01:08:15.000Z","size":21649,"stargazers_count":30,"open_issues_count":3,"forks_count":8,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-11-09T11:49:04.541Z","etag":null,"topics":["data-analysis","data-science","data-visualization","machine-learning","python"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/petersontylerd.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-02-10T17:24:47.000Z","updated_at":"2024-03-29T00:35:39.000Z","dependencies_parsed_at":"2022-09-26T16:50:33.430Z","dependency_job_id":null,"html_url":"https://github.com/petersontylerd/mlmachine","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/petersontylerd%2Fmlmachine","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/petersontylerd%2Fmlmachine/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/petersontylerd%2Fmlmachine/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/petersontylerd%2Fmlmachine/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/petersontylerd","download_url":"https://c
odeload.github.com/petersontylerd/mlmachine/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244795519,"owners_count":20511521,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","data-science","data-visualization","machine-learning","python"],"created_at":"2024-08-02T20:01:58.104Z","updated_at":"2025-03-21T12:23:07.196Z","avatar_url":"https://github.com/petersontylerd.png","language":"Jupyter Notebook","funding_links":[],"categories":["Jupyter Notebook"],"sub_categories":[],"readme":"[![PyPI version](https://badge.fury.io/py/mlmachine.svg)](https://badge.fury.io/py/mlmachine)\n\n# mlmachine\n\n\u003ci\u003e\"mlmachine is a Python library that organizes and accelerates notebook-based machine learning experiments.\"\u003c/i\u003e\n\n## Table of Contents\n\n- [Novel Functionality](#Novel-Functionality)\n- [Example Notebooks](#Example-Notebooks)\n- [Articles on Medium](#Articles-on-Medium)\n- [Installation](#Installation)\n- [Feedback](#Feedback)\n- [Acknowledgments](#Acknowledgments)\n\n\n## Novel Functionality\n\n__Easy, Elegant EDA__\n\nmlmachine creates beautiful and informative EDA panels with ease:\n\n```python\n# create EDA panel for all \"category\" features\nfor feature in mlmachine_titanic.data.mlm_dtypes[\"category\"]:\n    mlmachine_titanic.eda_cat_target_cat_feat(\n        feature=feature,\n        legend_labels=[\"Died\",\"Survived\"],\n    )\n```\n![alt text](/notebooks/images/eda_loop.gif \"EDA loop\")\n\u003cbr\u003e\u003cbr\u003e\n\n\n\n__Pandas-in / Pandas-out 
Pipelines__\n\nmlmachine makes Scikit-learn transformers Pandas-friendly.\n\nHere's an example. See how simply wrapping the mlmachine utility `PandasTransformer()` around `OneHotEncoder()` maintains our `DataFrame`:\n\n![alt text](/notebooks/images/p1_pandastransformer.jpeg \"Pandas Pipeline\")\n\u003cbr\u003e\u003cbr\u003e\n\n\n\n__KFold Target Encoding__\n\nmlmachine includes a utility called `KFoldEncoder`, which applies target encoding on categorical features and leverages out-of-fold encoding to prevent target leakage:\n\n```python\n# perform 5-fold target encoding with TargetEncoder from the category_encoders library\nencoder = KFoldEncoder(\n    target=mlmachine_titanic_machine.training_target,\n    cv=KFold(n_splits=5, shuffle=True, random_state=0),\n    encoder=TargetEncoder,\n)\nencoder.fit_transform(mlmachine_titanic_machine.training_features[[\"Pclass\"]])\n```\n\n![alt text](/notebooks/images/kfold.jpeg \"Pandas Pipeline\")\n\u003cbr\u003e\u003cbr\u003e\n\n\n\n__Crowd-sourced Feature Importance \u0026 Exhaustive Feature Selection__\n\nmlmachine employs a robust approach to estimating feature importance by using a variety of techniques:\n\n- Tree-based Feature Importance\n- Recursive Feature Elimination\n- Sequential Forward Selection\n- Sequential Backward Selection\n- F-value / p-value\n- Variance \n- Target Correlation\n\nThis occurs with one simple execution, and operates on multiple estimators and/or models, and one or more scoring metrics:\n\n```python\n# instantiate custom models\nrf2 = RandomForestClassifier(max_depth=2)\nrf4 = RandomForestClassifier(max_depth=4)\nrf6 = RandomForestClassifier(max_depth=6)\n\n# estimator list - default XGBClassifier, default\n# RandomForestClassifier and three custom models\nestimators = [\n    XGBClassifier,\n    RandomForestClassifier,\n    rf2,\n    rf4,\n    rf6,\n]\n\n# instantiate FeatureSelector object\nfs = mlmachine_titanic_machine.FeatureSelector(\n    data=mlmachine_titanic_machine.training_features,\n 
   target=mlmachine_titanic_machine.training_target,\n    estimators=estimators,\n)\n\n# run feature importance techniques, use ROC AUC and\n# accuracy score metrics and 0 CV folds (where applicable)\nfeature_selector_summary = fs.feature_selector_suite(\n    sequential_scoring=[\"roc_auc\",\"accuracy_score\"],\n    sequential_n_folds=0,\n    save_to_csv=True,\n)\n```\n\nThen the features are winnowed away, from least important to most important, through an exhaustive cross-validation procedure in search of an optimum feature subset:\n\n![alt text](/notebooks/images/feature_selection.jpg \"Pandas Pipeline\")\n\n\n\u003cbr\u003e\u003cbr\u003e\n\n\n\n__Hyperparameter Tuning with Bayesian Optimization__\n\nmlmachine can perform Bayesian optimization on multiple estimators in one shot, and includes functionality for visualizing model performance and parameter selections:\n\n```python\n# generate parameter selection panels for each parameter\nmlmachine_titanic_machine.model_param_plot(\n    bayes_optim_summary=bayes_optim_summary,\n    estimator_class=\"KNeighborsClassifier\",\n    estimator_parameter_space=estimator_parameter_space,\n    n_iter=100,\n)\n```\n![alt text](/notebooks/images/param_loop.gif \"EDA loop\")\n\u003cbr\u003e\u003cbr\u003e\n\n\n\n\n## Example Notebooks\n\nAll examples can be viewed [here](https://github.com/petersontylerd/mlmachine/tree/master/notebooks)\n\n[Example Notebook 1](https://github.com/petersontylerd/mlmachine/tree/master/notebooks/mlmachine_part_1.ipynb) - Learn the basics of mlmachine, how to create EDA panels, and how to execute Pandas-friendly Scikit-learn transformations and pipelines.\n\n[Example Notebook 2](https://github.com/petersontylerd/mlmachine/tree/master/notebooks/mlmachine_part_2.ipynb) - Learn how to use mlmachine to assess a dataset's pre-processing needs. 
See examples of how to use novel functionality, such as `GroupbyImputer()`, `KFoldEncoder()` and `DualTransformer()`.\n\n[Example Notebook 3](https://github.com/petersontylerd/mlmachine/tree/master/notebooks/mlmachine_part_3.ipynb) - Learn how to perform thorough feature importance estimation, followed by an exhaustive, cross-validation-driven feature selection process.\n\n[Example Notebook 4](https://github.com/petersontylerd/mlmachine/tree/master/notebooks/mlmachine_part_4.ipynb) - Learn how to execute hyperparameter tuning with Bayesian optimization for multiple models and multiple parameter spaces in one simple execution.\n\n\n\n## Articles on Medium\n\n[mlmachine - Clean ML Experiments, Elegant EDA \u0026 Pandas Pipelines](https://towardsdatascience.com/mlmachine-clean-ml-experiments-elegant-eda-pandas-pipelines-daba951dde0a) - Published 4/3/2020\n\n[mlmachine - GroupbyImputer, KFoldEncoder, and Skew Correction](https://towardsdatascience.com/mlmachine-groupbyimputer-kfoldencoder-and-skew-correction-357f202d2212) - Published 4/13/2020\n\n\n\n## Installation\n\n__Python Requirements__: 3.6, 3.7\n\nmlmachine uses the latest, or almost latest, versions of all dependencies. Therefore, it is highly recommended that mlmachine is installed in a virtual environment.\n\n_**pyenv**_\n\nCreate a new virtual environment:\n\n`$ pyenv virtualenv 3.7.5 mlmachine-env`\n\nActivate your new virtual environment:\n\n`$ pyenv activate mlmachine-env`\n\nInstall mlmachine and all dependencies using pip:\n\n`$ pip install mlmachine`\n\n_**anaconda**_\n\nCreate a new virtual environment:\n\n`$ conda create --name mlmachine-env python=3.7`\n\nActivate your new virtual environment:\n\n`$ conda activate mlmachine-env`\n\nInstall mlmachine and all dependencies using pip:\n\n`$ pip install mlmachine`\n\n## Feedback\n\nAny and all feedback is welcome. 
Please send me an email at petersontylerd@gmail.com\n\n## Acknowledgments\n\nmlmachine stands on the shoulders of many great Python packages:\n\n[catboost](https://github.com/catboost/catboost) | [category_encoders](https://github.com/scikit-learn-contrib/categorical-encoding) | [eif](https://github.com/sahandha/eif) | [hyperopt](https://github.com/hyperopt/hyperopt) | [imbalanced-learn](https://github.com/scikit-learn-contrib/imbalanced-learn) | [jupyter](https://github.com/jupyter/notebook) | [lightgbm](https://github.com/microsoft/LightGBM) | [matplotlib](https://github.com/matplotlib/matplotlib) | [numpy](https://github.com/numpy/numpy) | [pandas](https://github.com/pandas-dev/pandas) | [prettierplot](https://github.com/petersontylerd/prettierplot) | [scikit-learn](https://github.com/scikit-learn/scikit-learn) | [scipy](https://github.com/scipy/scipy) | [seaborn](https://github.com/mwaskom/seaborn) | [shap](https://github.com/slundberg/shap) | [statsmodels](https://github.com/statsmodels/statsmodels) | [xgboost](https://github.com/dmlc/xgboost) |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpetersontylerd%2Fmlmachine","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpetersontylerd%2Fmlmachine","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpetersontylerd%2Fmlmachine/lists"}