https://github.com/filipspl/optuml
Optuna-optimized ML methods, with scikit-learn like API
- Host: GitHub
- URL: https://github.com/filipspl/optuml
- Owner: filipsPL
- License: MIT
- Created: 2024-09-12T08:09:13.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-09-27T13:31:35.000Z (over 1 year ago)
- Last Synced: 2025-03-19T13:18:44.547Z (12 months ago)
- Topics: hyperparameter-optimization, hyperparameter-tuning, machine-learning, optuna, python, python-module, scikit-learn
- Language: Python
- Size: 87.9 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# OptuML: Hyperparameter Optimization for Machine Learning Algorithms using Optuna
```
⣰⡁ ⡀⣀ ⢀⡀ ⣀⣀ ⢀⡀ ⣀⡀ ⣰⡀ ⡀⢀ ⣀⣀ ⡇ ⠄ ⣀⣀ ⣀⡀ ⢀⡀ ⡀⣀ ⣰⡀ ⡎⢱ ⣀⡀ ⣰⡀ ⠄ ⣀⣀ ⠄ ⣀⣀ ⢀⡀ ⡀⣀
⢸ ⠏ ⠣⠜ ⠇⠇⠇ ⠣⠜ ⡧⠜ ⠘⠤ ⠣⠼ ⠇⠇⠇ ⠣ ⠇ ⠇⠇⠇ ⡧⠜ ⠣⠜ ⠏ ⠘⠤ ⠣⠜ ⡧⠜ ⠘⠤ ⠇ ⠇⠇⠇ ⠇ ⠴⠥ ⠣⠭ ⠏
```
`OptuML` (*Optu*na + *ML*) is a Python module providing hyperparameter optimization for machine learning algorithms using the [Optuna](https://optuna.org/) framework. The module offers a scikit-learn compatible API with enhanced features for robust optimization.
[CI: python-package](https://github.com/filipsPL/optuml/actions/workflows/python-package.yml) [CI: python-pip](https://github.com/filipsPL/optuml/actions/workflows/python-pip.yml) [PyPI](https://pypi.org/project/optuml/) [DOI](https://doi.org/10.5281/zenodo.17305963)
```
Input optuml train Predict
┌─────────────────┐ ┌──────────────────────────────────┐ ┌─────────────────────────────┐
│X_train, y_train ┼────► clf = Optimizer(algorithm="SVC") ├───► y_pred = clf.predict(X_test)│
└─────────────────┘ │ clf.fit(X_train, y_train) │ │ │
┌─────────────────┐ └─▲────────────────────────────────┘ └─────────────────────────▲───┘
│ML algorithm ├──────┘ │
└─────────────────┘ X_test───┘
```
## Key Features
- **Broad Algorithm Support**: A curated set of scikit-learn classifiers and regressors, plus CatBoost and XGBoost
- **Full Scikit-learn Compatibility**: Seamless integration with pipelines, cross-validation, and all sklearn tools
- **Robust Optimization**: Powered by Optuna with early stopping, timeout protection, and parallel execution
- **Type-Safe Design**: Separate optimizers for classification and regression with proper type checking
- **Production Ready**: Cross-platform compatibility, comprehensive error handling, and extensive validation
- **Flexible Configuration**: Control every aspect of the optimization process
## Installation
### Option A: pip (recommended)
```bash
pip install optuml
```
or upgrade:
```bash
pip install optuml --upgrade
```
### Option B: Manual installation
```bash
# Install required dependencies
pip install optuna scikit-learn numpy pandas
# Optional: Install additional algorithms
pip install catboost xgboost
# Download the module
wget https://raw.githubusercontent.com/filipsPL/optuml/main/optuml/optuml.py
```
## Quick Start
### Classification Example
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from optuml import Optimizer
# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create and train optimizer
clf = Optimizer(
    algorithm="RandomForestClassifier",
    n_trials=50,
    cv=5,
    scoring="accuracy",
    random_state=42,
    show_progress_bar=True
)
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
# View results
print(f"Accuracy: {accuracy:.3f}")
print(f"Best parameters: {clf.best_params_}")
print(f"Optimization took: {clf.study_time_:.2f} seconds")
print(f"Trials completed: {clf.n_trials_completed_}")
```
### Regression Example
```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from optuml import Optimizer
# Load data
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create and train optimizer
reg = Optimizer(
    algorithm="XGBRegressor",
    n_trials=100,
    cv=5,
    scoring="r2",
    early_stopping_patience=10,  # Stop if no improvement for 10 trials
    n_jobs=-1,                   # Use all CPU cores for CV
    verbose=True
)
reg.fit(X_train, y_train)
# Evaluate
y_pred = reg.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.3f}")
```
## Supported Algorithms
### Classification Algorithms
| Algorithm | Description | Key Features |
| ------------------------ | ------------------------------- | ----------------------------------------- |
| `SVC` | Support Vector Classifier | Non-linear kernels, probability estimates |
| `LogisticRegression` | Logistic Regression | L1/L2/Elastic-Net regularization |
| `KNeighborsClassifier` | k-Nearest Neighbors | Distance weighting, various metrics |
| `RandomForestClassifier` | Random Forest | Feature importance, OOB score |
| `AdaBoostClassifier` | AdaBoost | SAMME/SAMME.R algorithms |
| `MLPClassifier` | Neural Network | Multiple architectures, early stopping |
| `GaussianNB` | Gaussian Naive Bayes | Fast, probabilistic |
| `QDA` | Quadratic Discriminant Analysis | Non-linear boundaries |
| `DecisionTreeClassifier` | Decision Tree | Multiple criteria, pruning |
| `CatBoostClassifier`* | CatBoost | Categorical features, GPU support |
| `XGBClassifier`* | XGBoost | Regularization, missing values |
### Regression Algorithms
| Algorithm | Description | Key Features |
| ----------------------- | ------------------------- | ------------------------ |
| `SVR` | Support Vector Regression | Epsilon-insensitive loss |
| `LinearRegression` | Linear Regression | Simple, interpretable |
| `KNeighborsRegressor` | k-Nearest Neighbors | Local regression |
| `RandomForestRegressor` | Random Forest | Reduces overfitting |
| `AdaBoostRegressor` | AdaBoost | Sequential learning |
| `MLPRegressor` | Neural Network | Non-linear patterns |
| `DecisionTreeRegressor` | Decision Tree | Non-parametric |
| `CatBoostRegressor`* | CatBoost | Handles categoricals |
| `XGBRegressor`* | XGBoost | High performance |
*Optional dependencies (install separately)
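Whether the starred algorithms are usable depends on the optional packages being importable. A stdlib-only check (a convenience sketch, not part of OptuML's API) reports which backends are present before you request a starred algorithm:

```python
import importlib.util

def available_optional_backends():
    """Report which optional boosting backends are importable."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in ("catboost", "xgboost")
    }

print(available_optional_backends())  # e.g. {'catboost': True, 'xgboost': False}
```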
## Advanced Features
### Early Stopping
Stop optimization when no improvement is observed:
```python
optimizer = Optimizer(
    algorithm="XGBClassifier",
    n_trials=1000,
    early_stopping_patience=20  # Stop after 20 trials without improvement
)
```
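OptuML's internals are not shown here, but patience-based stopping of this kind can be expressed as a plain Optuna callback. The sketch below (a hypothetical `PatienceCallback`, not OptuML code) stops a study once `patience` trials have elapsed since the best trial:

```python
class PatienceCallback:
    """Stop an Optuna study after `patience` trials without a new best trial.

    Usage sketch:
        study.optimize(objective, n_trials=1000,
                       callbacks=[PatienceCallback(20)])
    """

    def __init__(self, patience):
        self.patience = patience

    def __call__(self, study, trial):
        # Optuna invokes callbacks as callback(study, trial) after each trial.
        if trial.number - study.best_trial.number >= self.patience:
            study.stop()
```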
### Parallel Cross-Validation
Speed up optimization using multiple CPU cores:
```python
optimizer = Optimizer(
    algorithm="RandomForestClassifier",
    n_trials=100,
    cv=10,
    n_jobs=-1  # Use all available cores
)
```
### Custom Scoring Metrics
Use any scikit-learn compatible scoring metric:
```python
optimizer = Optimizer(
    algorithm="SVC",
    scoring="roc_auc",                    # For classification
    # scoring="neg_mean_squared_error",   # For regression
    # scoring="f1_weighted",              # For imbalanced classes
)
```
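Beyond the built-in metric strings, scikit-learn can wrap any metric function into a scorer with `make_scorer`. If OptuML forwards `scoring` to scikit-learn's cross-validation machinery (an assumption, not verified here), such a callable should work as well:

```python
from sklearn.metrics import fbeta_score, make_scorer

# F2 weights recall higher than precision; make_scorer turns the metric
# function into a scorer with the (estimator, X, y) interface that
# scikit-learn's cross-validation utilities expect.
f2_scorer = make_scorer(fbeta_score, beta=2)

# Assumption: callables are accepted wherever a scoring string is.
# optimizer = Optimizer(algorithm="SVC", scoring=f2_scorer)
```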
### Timeout Protection
Set time limits for optimization:
```python
optimizer = Optimizer(
    algorithm="MLPClassifier",
    timeout=300,    # Total optimization timeout (5 minutes)
    cv_timeout=30,  # Per-trial timeout (30 seconds)
    n_trials=1000   # Will stop at timeout even if trials remain
)
```
### Access to Optuna Study
Get detailed optimization information:
```python
# After fitting
optimizer.fit(X_train, y_train)
# Access the Optuna study object
study = optimizer.study_
print(f"Best trial: {study.best_trial.number}")
print(f"Best value: {study.best_value:.4f}")
# Plot optimization history (requires plotly)
import optuna.visualization as vis
fig = vis.plot_optimization_history(study)
fig.show()
# Plot parameter importances
fig = vis.plot_param_importances(study)
fig.show()
```
### Pipeline Integration
Full compatibility with scikit-learn pipelines:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Create pipeline with OptuML
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('optimizer', Optimizer(algorithm="SVC", n_trials=50))
])
# Use like any sklearn pipeline
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
```
### Type-Specific Optimizers
For more control, use the specific optimizer classes:
```python
from optuml import ClassifierOptimizer, RegressorOptimizer
# Classifier with all classifier-specific methods
clf = ClassifierOptimizer(
    algorithm="RandomForestClassifier",
    n_trials=100
)
clf.fit(X_train, y_train)
probas = clf.predict_proba(X_test)
decision = clf.decision_function(X_test) # If supported
# Regressor with regression-specific defaults
reg = RegressorOptimizer(
    algorithm="RandomForestRegressor",
    n_trials=100,
    scoring="r2"  # Default for regressors
)
```
## API Reference
### Main Classes
#### `Optimizer`
Universal optimizer that automatically selects between classification and regression.
#### `ClassifierOptimizer`
Specialized optimizer for classification algorithms with methods like `predict_proba()` and `decision_function()`.
#### `RegressorOptimizer`
Specialized optimizer for regression algorithms with appropriate default scoring metrics.
### Common Parameters
| Parameter | Type | Default | Description |
| ------------------------- | ---------- | ---------- | ------------------------------------------ |
| `algorithm` | str | required | ML algorithm to optimize |
| `n_trials` | int | 100 | Number of optimization trials |
| `cv` | int | 5 | Cross-validation folds |
| `scoring` | str/None | Auto* | Scoring metric for CV |
| `direction` | str | "maximize" | Optimization direction |
| `timeout` | float/None | None | Total optimization timeout (seconds) |
| `cv_timeout` | float | 120 | Single CV evaluation timeout |
| `random_state` | int/None | None | Random seed for reproducibility |
| `n_jobs` | int | 1 | Parallel jobs for CV (-1 for all cores) |
| `early_stopping_patience` | int/None | None | Trials without improvement before stopping |
| `verbose` | bool/int | False | Verbosity level |
| `show_progress_bar` | bool | False | Show optimization progress |
*Auto defaults: "accuracy" for classifiers, "r2" for regressors
### Methods
| Method | Description | Available For |
| ---------------------- | ---------------------------------- | ---------------- |
| `fit(X, y)` | Optimize hyperparameters and train | All |
| `predict(X)` | Make predictions | All |
| `score(X, y)` | Evaluate model performance | All |
| `predict_proba(X)` | Predict class probabilities | Classifiers |
| `decision_function(X)` | Get decision values | Some classifiers |
| `get_params()` | Get optimizer parameters | All |
| `set_params(**params)` | Set optimizer parameters | All |
### Attributes (after fitting)
| Attribute | Description |
| --------------------- | ---------------------------------- |
| `best_estimator_` | Trained model with best parameters |
| `best_params_` | Best hyperparameters found |
| `best_score_` | Best cross-validation score |
| `study_` | Optuna study object |
| `study_time_`         | Total optimization time (seconds)  |
| `n_trials_completed_` | Number of completed trials |
| `classes_` | Class labels (classifiers only) |
| `n_features_in_` | Number of input features |
| `feature_names_in_` | Feature names (if available) |
## Troubleshooting
### Issue: "No successful trials completed"
**Solution**: Increase `cv_timeout` or reduce `cv` folds:
```python
optimizer = Optimizer(algorithm="SVC", cv_timeout=300, cv=3)
```
### Issue: CatBoost/XGBoost not available
**Solution**: Install optional dependencies:
```bash
pip install catboost xgboost
```
### Issue: Optimization takes too long
**Solutions**:
1. Use parallel CV: `n_jobs=-1`
2. Set timeout: `timeout=600`
3. Use early stopping: `early_stopping_patience=10`
4. Reduce trials: `n_trials=50`
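The four levers combine naturally. As a sketch, here they are gathered into one keyword set (all parameter names come from the Common Parameters table) to be passed as `Optimizer(**fast_config)`:

```python
# Speed-oriented settings, to be passed as Optimizer(**fast_config).
fast_config = {
    "algorithm": "RandomForestClassifier",  # example algorithm
    "n_jobs": -1,                   # 1. parallel CV on all cores
    "timeout": 600,                 # 2. hard 10-minute budget (seconds)
    "early_stopping_patience": 10,  # 3. stop when improvement stalls
    "n_trials": 50,                 # 4. fewer trials overall
}
```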
### Issue: Memory errors with large datasets
**Solutions**:
1. Use algorithms with lower memory footprint (e.g., `LogisticRegression` instead of `SVC`)
2. Reduce CV folds
3. Use `SGDClassifier` or `SGDRegressor` (if added to supported algorithms)
## Best Practices
1. **Start with fewer trials**: Begin with `n_trials` in the 20-50 range for exploration, then increase for the final optimization
2. **Use appropriate scoring metrics**:
- Imbalanced classification: `"f1_weighted"`, `"roc_auc"`
- Regression: `"r2"`, `"neg_mean_squared_error"`
3. **Enable early stopping** for large trial counts:
```python
Optimizer(n_trials=1000, early_stopping_patience=20)
```
4. **Set random state** for reproducibility:
```python
Optimizer(random_state=42)
```
5. **Use parallel processing** for faster optimization:
```python
Optimizer(n_jobs=-1)
```
## Citation
If you use OptuML in your research, please cite:
```bibtex
@software{stefaniak_optuml_2024,
  author    = {Filip Stefaniak},
  title     = {OptuML: Hyperparameter Optimization for Multiple Machine Learning Algorithms using Optuna},
  year      = {2024},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.17305963},
  url       = {https://doi.org/10.5281/zenodo.17305963}
}
```