MLCompare

Quickly compare machine learning models across libraries and datasets

https://github.com/mitchmedeiros/mlcompare

Topics: huggingface-datasets, kaggle, openml, pytorch, scikit-learn, xgboost


[MLCompare logo; badges: supported Python versions, PyPI version and license, total downloads, Read the Docs, GitHub Actions workflow status (including macOS unit tests), code coverage]

MLCompare is a Python package for running model comparison pipelines, with the aim of being both simple and flexible. It supports multiple popular ML libraries, retrieval from multiple online dataset repositories, common data processing steps, and results visualization. Additionally, it allows for using your own models and datasets within the pipelines.


| Libraries | Datasets | Data Processing |
| --- | --- | --- |
| Scikit-learn | Kaggle | train-test split |
| XGBoost | OpenML | drop columns |
| | Hugging Face | handle NaNs: drop \| forward-fill \| backward-fill |
| | locally saved | encoders: OneHot \| Ordinal \| Target \| Label |
| | | scalers: Standard \| MinMax \| MaxAbs \| Robust |
| | | transformers: Quantile \| Power \| Normalizer |
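For reference, the processing steps above correspond to standard pandas/scikit-learn operations. A minimal sketch of a few of them (illustrative column names, not the mlcompare API):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with hypothetical column names, for illustration only
df = pd.DataFrame({
    "Soil_Type": ["loam", "clay", "loam", "sandy"],
    "Sunlight_Hours": [6.0, 4.5, None, 7.2],
    "Growth_Milestone": [1, 0, 1, 0],
})

df = df.ffill().bfill()  # handle NaNs: forward-fill, then backward-fill any leading gaps

X = df.drop(columns=["Growth_Milestone"])  # drop columns / separate out the target
y = df["Growth_Milestone"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)  # train-test split

encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(X_train[["Soil_Type"]])  # OneHot encoder
scaled = StandardScaler().fit_transform(X_train[["Sunlight_Hours"]])  # Standard scaler
```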

Installing

It is recommended to create a new virtual environment. Example with Conda:

```console
conda create -n compare_env python==3.11.9
conda activate compare_env
```

Install this library with pip:

```console
pip install mlcompare
```
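To confirm the installation and check the installed version:

```console
pip show mlcompare
```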

Note that on macOS, both XGBoost and LightGBM require `libomp`. It can be installed with Homebrew:

```console
brew install libomp
```

A Simple Example

To run a pipeline with multiple datasets and models, create a list of dictionaries for each and pass them to a pipeline function.

The example below downloads one dataset each from OpenML and Kaggle, one-hot encodes several columns of the Kaggle dataset, and trains and evaluates a Random Forest and an XGBoost model on both.

```python
import mlcompare

datasets = [
    {
        "type": "openml",
        "id": 8,
        "target": "drinks",
    },
    {
        "type": "kaggle",
        "user": "gorororororo23",
        "dataset": "plant-growth-data-classification",
        "file": "plant_growth_data.csv",
        "target": "Growth_Milestone",
        "oneHotEncode": ["Soil_Type", "Water_Frequency", "Fertilizer_Type"],
    },
]

models = [
    {
        "library": "sklearn",
        "name": "RandomForestRegressor",
    },
    {
        "library": "xgboost",
        "name": "XGBRegressor",
        "params": {"num_leaves": 40, "n_estimators": 200},
    },
]

mlcompare.full_pipeline(datasets, models, "regression")
```

In the case of the XGBoost model, some non-default parameter values were supplied via `params`.
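These values are presumably forwarded to the underlying model constructor, so the second model entry would correspond roughly to the following (a sketch assuming `params` is unpacked into the constructor; XGBoost's scikit-learn wrapper accepts extra keyword arguments):

```python
from xgboost import XGBRegressor

# Roughly what the pipeline would build for the second model entry,
# assuming "params" is unpacked into the constructor as keyword arguments.
model = XGBRegressor(num_leaves=40, n_estimators=200)
```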

Planned Additions

Version 1.3

  • LightGBM support
  • CatBoost support
  • Model results graphing and visualization
  • Improved documentation
  • Support for pre-split data

Version 1.4

  • PyTorch support
  • TensorFlow support
  • Additional dataset sources
  • Built-in model and dataset collections for quick testing of similar model types/datasets
  • Optional pipeline caching
  • Optional trained model saving

Version 1.5

  • S3 support