https://github.com/stitchsages/implyo
An advanced imputation library compatible with mixed type data with a focus on performance and high accuracy, with advanced imputation algorithms for numeric and categorical variables.
imputation imputation-algorithm imputation-methods knn machine-learning pandas pandas-dataframe pip python python3 random-forest scikit-learn
- Host: GitHub
- URL: https://github.com/stitchsages/implyo
- Owner: stitchsages
- License: mit
- Created: 2025-05-14T22:49:25.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-16T13:28:46.000Z (about 1 year ago)
- Last Synced: 2025-12-24T09:21:37.269Z (6 months ago)
- Topics: imputation, imputation-algorithm, imputation-methods, knn, machine-learning, pandas, pandas-dataframe, pip, python, python3, random-forest, scikit-learn
- Language: Python
- Homepage:
- Size: 69.3 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Implyo: Advanced Missing Value Imputation Library
Implyo is a powerful Python library for handling missing values in mixed-type data, with a focus on performance, accuracy, and uncertainty quantification. It provides a collection of advanced imputation algorithms that can handle both numeric and categorical variables efficiently.
## Features
### Core Imputation Algorithms
- **KNN Imputer**: Fast and efficient k-nearest neighbors imputation with support for mixed data types
- **MICE (Iterative Imputer)**: Multiple Imputation by Chained Equations with various estimator options
- **Random Forest Imputer**: Tree-based imputation with uncertainty quantification
- **XGBoost Imputer**: Gradient boosting based imputation with advanced features
- **LightGBM Imputer**: Light gradient boosting based imputation with high performance
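To make the first entry in the list above concrete, here is a minimal sketch of k-nearest-neighbours imputation for numeric data. It is not Implyo's implementation: the function name `knn_impute` is hypothetical, it uses plain NumPy, and it assumes a distance computed only over the coordinates that both rows have observed.

```python
import numpy as np

def knn_impute(X, n_neighbors=2):
    """Fill NaNs in a 2-D float array with the mean of the k nearest rows."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    for i in np.flatnonzero(missing.any(axis=1)):
        # Compare row i with every row using only coordinates observed in both.
        shared = ~missing[i] & ~missing
        diff = np.where(shared, X - X[i], 0.0)
        n_shared = shared.sum(axis=1)
        with np.errstate(divide="ignore", invalid="ignore"):
            dist = np.sqrt((diff ** 2).sum(axis=1) / n_shared)
        dist[n_shared == 0] = np.inf  # nothing in common: unusable neighbour
        dist[i] = np.inf              # a row is never its own neighbour
        for j in np.flatnonzero(missing[i]):
            # Donors must have column j observed and a finite distance.
            donors = np.flatnonzero(~missing[:, j] & np.isfinite(dist))
            if donors.size == 0:
                continue  # no donor available; leave the cell as NaN
            nearest = donors[np.argsort(dist[donors])[:n_neighbors]]
            filled[i, j] = X[nearest, j].mean()
    return filled

X = np.array([
    [1.0, 2.0],
    [1.1, np.nan],
    [5.0, 6.0],
    [0.9, 2.2],
])
print(knn_impute(X, n_neighbors=2))
```

Categorical columns would need a different distance and a majority vote instead of a mean; the sketch leaves that out.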
### Key Features
- **Mixed Data Type Support**: Handle both numeric and categorical variables seamlessly
- **Uncertainty Quantification**: Get prediction intervals for imputed values
- **Parallel Processing**: Efficient handling of large datasets
- **Early Stopping**: Automatic convergence detection
- **Feature Importance**: Track which features are most important for imputation
- **Missing Value Indicators**: Optional indicators for missing value patterns
- **Comprehensive Testing**: Extensive test coverage for all imputers
- **Benchmarking Tools**: Compare performance across different imputers
## Installation
```bash
pip install implyo
```
For development installation:
```bash
git clone https://github.com/stitchsages/implyo.git
cd implyo
pip install -e ".[dev]"
```
## Quick Start
```python
import pandas as pd
import numpy as np
from implyo import XGBoostImputer, LightGBMImputer, KNNImputer

# Create a sample dataset with missing values
data = pd.DataFrame({
    'numeric1': [1, 2, np.nan, 4, 5],
    'numeric2': [1.1, np.nan, 3.3, 4.4, 5.5],
    'categorical': ['a', 'b', 'c', np.nan, 'e']
})

# Initialize the imputer
imputer = XGBoostImputer(
    n_estimators=100,
    categorical_features=['categorical'],
    uncertainty_quantile=0.95,  # Get prediction intervals
    random_state=42
)

# Fit and transform the data
X_imputed = imputer.fit_transform(data)

# Get uncertainty intervals
intervals = imputer.uncertainty_intervals_

# Get feature importances
importances = imputer.feature_importances_
```
## Advanced Usage
### Uncertainty Quantification
All tree-based imputers (Random Forest, XGBoost, LightGBM) support uncertainty quantification:
```python
from implyo import RandomForestImputer

imputer = RandomForestImputer(
    uncertainty_quantile=0.95,  # 95% prediction intervals
    n_estimators=100,
    random_state=42
)
X_imputed = imputer.fit_transform(data)
intervals = imputer.uncertainty_intervals_

# Access intervals for a specific column
lower, upper = intervals['numeric1']
```
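This README does not say how the intervals are derived. One common approach for tree ensembles is to take quantiles over the predictions of the individual trees, and the sketch below shows that idea with NumPy only. The function `prediction_interval` and the simulated per-tree predictions are assumptions for illustration, not part of Implyo's API.

```python
import numpy as np

def prediction_interval(member_predictions, coverage=0.95):
    """Return (lower, upper) bounds per value from ensemble member predictions.

    member_predictions has shape (n_members, n_values): one row per tree,
    one column per imputed cell.
    """
    preds = np.asarray(member_predictions, dtype=float)
    tail = (1.0 - coverage) / 2.0
    lower, upper = np.quantile(preds, [tail, 1.0 - tail], axis=0)
    return lower, upper

rng = np.random.default_rng(42)
# Hypothetical predictions from 100 trees for 3 imputed cells.
tree_preds = rng.normal(loc=[3.0, 10.0, -1.0], scale=0.5, size=(100, 3))
lower, upper = prediction_interval(tree_preds, coverage=0.95)
print(lower, upper)
```

Intervals built this way describe the spread of the ensemble, so they are not guaranteed to be calibrated; checking their empirical coverage on held-out values is a sensible extra step.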
### Parallel Processing
All imputers support parallel processing for faster computation:
```python
imputer = XGBoostImputer(
    n_jobs=-1,  # Use all available cores
    n_estimators=100,
    random_state=42
)
```
### Feature Importance
Tree-based imputers provide feature importance information:
```python
imputer = LightGBMImputer(
    n_estimators=100,
    random_state=42
)
imputer.fit_transform(data)

# Get feature importances for each imputed variable
importances = imputer.feature_importances_
```
### Missing Value Indicators
Add binary indicators for missing value patterns:
```python
imputer = KNNImputer(
    add_indicator=True,  # Add missing value indicators
    n_neighbors=5
)
X_imputed = imputer.fit_transform(data)
```
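For readers unfamiliar with the idea, a missing value indicator is a 0/1 column recording where a value was absent before imputation. The pandas sketch below shows the concept; the helper `add_missing_indicators` and the `_missing` suffix are hypothetical, and the column layout Implyo produces with `add_indicator=True` may differ.

```python
import numpy as np
import pandas as pd

def add_missing_indicators(df, suffix="_missing"):
    """Append one 0/1 column for each input column that has missing values."""
    mask = df.isna()
    cols = [c for c in df.columns if mask[c].any()]
    indicators = mask[cols].astype(int).add_suffix(suffix)
    return pd.concat([df, indicators], axis=1)

data = pd.DataFrame({
    "numeric1": [1, 2, np.nan, 4, 5],
    "numeric2": [1.1, 2.2, 3.3, 4.4, 5.5],
    "categorical": ["a", "b", "c", np.nan, "e"],
})
with_flags = add_missing_indicators(data)
print(with_flags)
```

The indicators have to be computed before the values are filled in, because the missingness pattern is no longer recoverable afterwards.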
## Benchmarking
The package includes comprehensive benchmarking tools to compare different imputers:
```python
from implyo.benchmarks import run_benchmark

# Run benchmarks with different configurations
results = run_benchmark(
    n_samples=1000,
    n_numeric_features=5,
    n_categorical_features=3,
    missing_ratio=0.2,
    n_repeats=3
)
print(results)
```
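What `run_benchmark` measures internally is not described here. A typical way to score an imputer, sketched below under that caveat, is to hide a random share of known cells, impute them, and compare against the hidden truth. The helpers `mask_and_score` and `column_mean_impute` are hypothetical and use NumPy only; the sketch assumes the input array has no missing values of its own.

```python
import numpy as np

def mask_and_score(X, impute_fn, missing_ratio=0.2, seed=0):
    """Hide a random share of cells, impute them, and return RMSE on those cells."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    hidden = rng.random(X.shape) < missing_ratio
    X_missing = np.where(hidden, np.nan, X)
    X_imputed = impute_fn(X_missing)
    errors = X_imputed[hidden] - X[hidden]
    return float(np.sqrt(np.mean(errors ** 2)))

def column_mean_impute(X):
    """Baseline imputer: replace NaNs with the mean of the observed column values."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)

rng = np.random.default_rng(1)
X_true = rng.normal(size=(1000, 5))
rmse = mask_and_score(X_true, column_mean_impute, missing_ratio=0.2)
print(f"Column-mean baseline RMSE: {rmse:.3f}")
```

Any function that maps an array with NaNs to a filled array can be passed as `impute_fn`, which makes the column-mean baseline a useful reference point for more advanced imputers.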
## Performance
Implyo's imputers are optimized for performance:
- **KNN Imputer**: Faster than scikit-learn's `KNNImputer` (use the benchmarking tools above to measure the difference on your own data)
- **XGBoost Imputer**: Efficient handling of large datasets
- **LightGBM Imputer**: High performance with low memory usage
- **Random Forest Imputer**: Balanced performance and accuracy
- **MICE**: Flexible and robust for complex missing patterns
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Citation
If you use Implyo in your research, please cite:
```bibtex
@software{implyo2024,
  author = {Darren Wei},
  title = {Implyo: Advanced Missing Value Imputation Library},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/stitchsages/implyo}
}
```
## Roadmap
- [ ] Add more advanced imputation algorithms
- [ ] Support for time series data
- [ ] Integration with deep learning models
- [ ] Web-based visualization tools
- [ ] Distributed computing support
- [ ] GPU acceleration for large datasets