https://github.com/mitchmedeiros/mlcompare
Quickly compare machine learning models across libraries and datasets
- Host: GitHub
- URL: https://github.com/mitchmedeiros/mlcompare
- Owner: MitchMedeiros
- License: mit
- Created: 2024-06-30T07:42:28.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2024-09-06T18:42:39.000Z (5 months ago)
- Last Synced: 2024-10-10T08:21:02.834Z (4 months ago)
- Topics: huggingface-datasets, kaggle, openml, pytorch, scikit-learn, xgboost
- Language: Python
- Homepage: https://mlcompare.readthedocs.io/en/stable/api_reference
- Size: 15.5 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: HISTORY.md
- License: LICENSE.txt
README
MLCompare is a Python package for running model comparison pipelines, with the aim of being both simple and flexible. It supports multiple popular ML libraries, retrieval from multiple online dataset repositories, common data processing steps, and results visualization. Additionally, it allows for using your own models and datasets within the pipelines.
Libraries
- Scikit-learn
- XGBoost
Datasets
- Kaggle
- OpenML
- Hugging Face
- locally saved
Data Processing
- train-test split
- drop columns
- handle NaNs: drop | forward-fill | backward-fill
- encoders: OneHot | Ordinal | Target | Label
- scalers: Standard | MinMax | MaxAbs | Robust
- transformers: Quantile | Power | Normalizer
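To make the three NaN-handling options concrete, here is a minimal pure-Python sketch of what drop, forward-fill, and backward-fill mean for a single column. It is not MLCompare's implementation; the function names and the strategy strings are hypothetical and chosen only to mirror the list above.

```python
import math


def _is_nan(value):
    """True only for float NaN values."""
    return isinstance(value, float) and math.isnan(value)


def forward_fill(values):
    """Replace each NaN with the most recent non-NaN value before it.

    Leading NaNs have nothing to copy from, so they are left as NaN.
    """
    filled, last_seen, has_seen = [], None, False
    for value in values:
        if _is_nan(value) and has_seen:
            filled.append(last_seen)
        else:
            filled.append(value)
            if not _is_nan(value):
                last_seen, has_seen = value, True
    return filled


def handle_nans(values, strategy):
    """Apply one of three hypothetical strategy names to a list of values."""
    if strategy == "drop":
        return [value for value in values if not _is_nan(value)]
    if strategy == "forward-fill":
        return forward_fill(values)
    if strategy == "backward-fill":
        # Backward-fill is forward-fill applied to the reversed sequence.
        return forward_fill(values[::-1])[::-1]
    raise ValueError(f"unknown strategy: {strategy!r}")


column = [1.0, float("nan"), float("nan"), 4.0]
print(handle_nans(column, "drop"))           # [1.0, 4.0]
print(handle_nans(column, "forward-fill"))   # [1.0, 1.0, 1.0, 4.0]
print(handle_nans(column, "backward-fill"))  # [1.0, 4.0, 4.0, 4.0]
```

Dropping shortens the column, while the two fill strategies keep its length, which matters when rows must stay aligned with a target column.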
Installing
It is recommended to create a new virtual environment. Example with Conda:
```console
conda create -n compare_env python==3.11.9
conda activate compare_env
```
Install this library with pip:
```console
pip install mlcompare
```
On macOS, both XGBoost and LightGBM require `libomp`, which can be installed with Homebrew:
```console
brew install libomp
```
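After installation, a quick way to confirm that the packages are visible to the active environment is to look them up without importing them. The sketch below uses only the standard library; `check_installed` is a hypothetical helper, and `sklearn` and `xgboost` are the import names of Scikit-learn and XGBoost.

```python
from importlib.util import find_spec


def check_installed(module_names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: find_spec(name) is not None for name in module_names}


if __name__ == "__main__":
    for name, found in check_installed(["mlcompare", "sklearn", "xgboost"]).items():
        print(f"{name}: {'found' if found else 'missing'}")
```

A `missing` entry usually means the wrong environment is active or the install step did not complete.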
A Simple Example
To run a pipeline with multiple datasets and models, create one list of dictionaries for the datasets and another for the models, then pass both to a pipeline function.
The example below downloads one dataset from OpenML and one from Kaggle, one-hot encodes several columns of the Kaggle dataset, and trains and evaluates a Random Forest model and an XGBoost model on both datasets.
```python
import mlcompare

# Each dataset dictionary says where the data comes from and how to process it.
datasets = [
    {
        "type": "openml",
        "id": 8,
        "target": "drinks",
    },
    {
        "type": "kaggle",
        "user": "gorororororo23",
        "dataset": "plant-growth-data-classification",
        "file": "plant_growth_data.csv",
        "target": "Growth_Milestone",
        "oneHotEncode": ["Soil_Type", "Water_Frequency", "Fertilizer_Type"],
    },
]

# Each model dictionary names a library and a model class, with optional parameters.
models = [
    {
        "library": "sklearn",
        "name": "RandomForestRegressor",
    },
    {
        "library": "xgboost",
        "name": "XGBRegressor",
        # Non-default parameter values for this model.
        "params": {"max_leaves": 40, "n_estimators": 200},
    },
]

mlcompare.full_pipeline(datasets, models, "regression")
```
Here the XGBoost model is given non-default parameter values through its `params` entry.
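Because every model is trained on every dataset, the number of runs grows as the product of the two lists. The following sketch, which is not part of MLCompare, checks a pair of configuration lists for the keys that appear in every entry of the example and enumerates the resulting dataset and model combinations. The helper names and the sets of required keys are assumptions inferred only from that example; the library's actual validation rules may differ.

```python
from itertools import product

# Assumed minimum keys, inferred only from the example configuration.
REQUIRED_DATASET_KEYS = {"type", "target"}
REQUIRED_MODEL_KEYS = {"library", "name"}


def check_keys(configs, required, label):
    """Raise ValueError if any dictionary lacks one of the required keys."""
    for index, config in enumerate(configs):
        missing = sorted(required - set(config))
        if missing:
            raise ValueError(f"{label} {index} is missing keys: {missing}")


def plan_runs(datasets, models):
    """Return one (dataset type, model name) pair per planned run."""
    check_keys(datasets, REQUIRED_DATASET_KEYS, "dataset")
    check_keys(models, REQUIRED_MODEL_KEYS, "model")
    return [
        (dataset["type"], f'{model["library"]}.{model["name"]}')
        for dataset, model in product(datasets, models)
    ]


example_datasets = [
    {"type": "openml", "id": 8, "target": "drinks"},
    {"type": "kaggle", "target": "Growth_Milestone"},
]
example_models = [
    {"library": "sklearn", "name": "RandomForestRegressor"},
    {"library": "xgboost", "name": "XGBRegressor"},
]

for dataset_type, model_name in plan_runs(example_datasets, example_models):
    print(f"{dataset_type}: {model_name}")
```

Two datasets and two models give four runs. Checking the dictionaries up front surfaces a misspelled key before any dataset is downloaded.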
Planned Additions
Version 1.3
- LightGBM support
- CatBoost support
- Model results graphing and visualization
- Improved documentation
- Support for presplit data
Version 1.4
- PyTorch support
- TensorFlow support
- Additional dataset sources
- Built-in model and dataset collections for quick testing of similar model types/datasets
- Optional pipeline caching
- Optional trained model saving
Version 1.5
- S3 support