MLCompare

Quickly compare machine learning models across libraries and datasets

https://github.com/mitchmedeiros/mlcompare

Topics: huggingface-datasets, kaggle, openml, pytorch, scikit-learn, xgboost


[MLCompare logo; badges: supported Python versions, PyPI version and license, total downloads, Read the Docs, GitHub Actions workflow status (including macOS unit tests), code coverage]

MLCompare is a Python package for running model comparison pipelines, with the aim of being both simple and flexible. It supports multiple popular ML libraries, retrieval from multiple online dataset repositories, common data processing steps, and results visualization. Additionally, it allows for using your own models and datasets within the pipelines.


| Libraries | Datasets | Data Processing |
| --- | --- | --- |
| Scikit-learn | Kaggle | train-test split |
| XGBoost | OpenML | drop columns |
| | Hugging Face | handle NaNs: drop \| forward-fill \| backward-fill |
| | locally saved | encoders: OneHot \| Ordinal \| Target \| Label |
| | | scalers: Standard \| MinMax \| MaxAbs \| Robust |
| | | transformers: Quantile \| Power \| Normalizer |
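For reference, the processing steps above correspond to standard pandas/scikit-learn operations. A minimal sketch of a few of them (illustrative column names, not the mlcompare API):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with hypothetical column names, for illustration only
df = pd.DataFrame({
    "Soil_Type": ["loam", "clay", "loam", "sandy"],
    "Sunlight_Hours": [6.0, 4.5, None, 7.2],
    "Growth_Milestone": [1, 0, 1, 0],
})

df = df.ffill().bfill()  # handle NaNs: forward-fill, then backward-fill any leading gaps

X = df.drop(columns=["Growth_Milestone"])  # drop columns / separate out the target
y = df["Growth_Milestone"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)  # train-test split

encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(X_train[["Soil_Type"]])  # OneHot encoder
scaled = StandardScaler().fit_transform(X_train[["Sunlight_Hours"]])  # Standard scaler
```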

Installing

It is recommended to create a new virtual environment. Example with Conda:

```console
conda create -n compare_env python==3.11.9
conda activate compare_env
```

Install this library with pip:

```console
pip install mlcompare
```
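To confirm the installation and check the installed version:

```console
pip show mlcompare
```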

Note that on macOS, both XGBoost and LightGBM require `libomp`. It can be installed with Homebrew:

```console
brew install libomp
```

A Simple Example

To run a pipeline with multiple datasets and models, create a list of dictionaries for each and pass them to a pipeline function.

The example below downloads one dataset each from OpenML and Kaggle, one-hot encodes several columns of the Kaggle dataset, and trains and evaluates a Random Forest and an XGBoost model on both.

```python
import mlcompare

datasets = [
    {
        "type": "openml",
        "id": 8,
        "target": "drinks",
    },
    {
        "type": "kaggle",
        "user": "gorororororo23",
        "dataset": "plant-growth-data-classification",
        "file": "plant_growth_data.csv",
        "target": "Growth_Milestone",
        "oneHotEncode": ["Soil_Type", "Water_Frequency", "Fertilizer_Type"],
    },
]

models = [
    {
        "library": "sklearn",
        "name": "RandomForestRegressor",
    },
    {
        "library": "xgboost",
        "name": "XGBRegressor",
        "params": {"num_leaves": 40, "n_estimators": 200},
    },
]

mlcompare.full_pipeline(datasets, models, "regression")
```

In the case of the XGBoost model, some non-default parameter values were supplied via `params`.
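These values are presumably forwarded to the underlying model constructor, so the second model entry would correspond roughly to the following (a sketch assuming `params` is unpacked into the constructor; XGBoost's scikit-learn wrapper accepts extra keyword arguments):

```python
from xgboost import XGBRegressor

# Roughly what the pipeline would build for the second model entry,
# assuming "params" is unpacked into the constructor as keyword arguments.
model = XGBRegressor(num_leaves=40, n_estimators=200)
```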

Planned Additions

Version 1.3

  • LightGBM support
  • CatBoost support
  • Model results graphing and visualization
  • Improved documentation
  • Support for pre-split data

Version 1.4

  • PyTorch support
  • TensorFlow support
  • Additional dataset sources
  • Built-in model and dataset collections for quick testing of similar model types/datasets
  • Optional pipeline caching
  • Optional trained model saving

Version 1.5

  • S3 support