https://github.com/maifeeulasad/accelera
A framework for performing large matrix operations on memory-constrained GPUs through intelligent chunking and CPU-GPU memory management.
https://github.com/maifeeulasad/accelera
Last synced: 2 months ago
JSON representation
A framework for performing large matrix operations on memory-constrained GPUs through intelligent chunking and CPU-GPU memory management.
- Host: GitHub
- URL: https://github.com/maifeeulasad/accelera
- Owner: maifeeulasad
- License: mit
- Created: 2025-10-02T11:26:43.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-10-04T09:33:02.000Z (9 months ago)
- Last Synced: 2025-12-21T22:54:06.282Z (6 months ago)
- Language: Python
- Homepage: https://pypi.org/project/accelera/
- Size: 87.9 KB
- Stars: 3
- Watchers: 1
- Forks: 1
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Accelera - Memory-Efficient Matrix Operations Framework
A framework for performing large matrix operations on memory-constrained GPUs through intelligent chunking and CPU-GPU memory management.
[](https://www.python.org/downloads/)
[](https://pytorch.org/)
[](https://developer.nvidia.com/cuda-toolkit)
## ๐ Problem Statement
When working with large matrices on GPUs with limited VRAM, operations like matrix multiplication can cause **Out-of-Memory (OOM) errors**. Accelera solves this by:
- **๐งฉ Breaking large operations into smaller chunks**
- **๐พ Intelligently offloading intermediate results to CPU/RAM**
- **๐ Dynamically managing GPU memory**
- **๐ฏ Providing a seamless API for large matrix operations**
## โจ Features
- **๐ค Automatic chunking** for matrix operations
- **๐ง Dynamic memory management** between GPU and CPU
- **โก CUDA-optimized** for NVIDIA GPUs
- **๐ Configurable chunk sizes** based on available VRAM
- **๐ Progress tracking** for long-running operations
- **๐ Memory usage monitoring**
- **๐ Multiple input types** (PyTorch tensors, NumPy arrays)
## ๐โโ๏ธ Quick Start
```python
import accelera as acc
# Initialize with automatic VRAM detection
engine = acc.MatrixEngine(auto_detect_memory=True)
# Perform large matrix multiplication that might cause OOM on small GPUs
A = acc.Matrix.random((10000, 8000)) # 10k x 8k matrix (~305 MB)
B = acc.Matrix.random((8000, 12000)) # 8k x 12k matrix (~366 MB)
# This will automatically chunk and manage memory
C = engine.matmul(A, B) # Result: 10k x 12k matrix (~458 MB)
print(f"โ
Success! Result shape: {C.shape}")
```
### ๐ฏ Real-world Example
```python
# Scenario: Training a large neural network layer on a 4GB GPU
import accelera as acc
engine = acc.MatrixEngine()
# Large weight matrix (would normally cause OOM)
weights = acc.Matrix.randn((20000, 15000)) # ~1.1 GB
inputs = acc.Matrix.randn((15000, 8000)) # ~457 MB
# Forward pass - automatically chunked if needed
output = engine.matmul(weights, inputs) # ~610 MB result
# Check memory usage
memory_info = engine.get_memory_info()
print(f"GPU utilization: {memory_info['gpu_utilization']:.1f}%")
```
## ๐ฆ Installation
### Requirements
- **Python 3.8+**
- **PyTorch 2.0+** with CUDA support
- **NVIDIA GPU** with CUDA drivers
- **Sufficient CPU RAM** for temporary storage
### Install
```bash
# Clone the repository
git clone https://github.com/maifeeulasad/accelera
cd accelera
# Install dependencies
pip install -r requirements.txt
# Install in development mode
pip install -e .
# Verify installation
make verify
```
## ๐ ๏ธ Usage Examples
### Basic Operations
```python
import accelera as acc
import numpy as np
# Initialize engine
engine = acc.MatrixEngine(auto_detect_memory=True, enable_progress=True)
# Matrix multiplication
A = acc.Matrix.randn((5000, 4000))
B = acc.Matrix.randn((4000, 6000))
C = engine.matmul(A, B)
# Element-wise operations
X = acc.Matrix.randn((3000, 4000))
Y = acc.Matrix.randn((3000, 4000))
# Addition
Z1 = engine.add(X, Y)
# Element-wise multiplication
Z2 = engine.multiply(X, Y)
# Works with NumPy arrays and PyTorch tensors too!
A_np = np.random.randn(1000, 800).astype(np.float32)
B_np = np.random.randn(800, 1200).astype(np.float32)
C_from_numpy = engine.matmul(A_np, B_np)
```
### Advanced Configuration
```python
# Custom chunking strategy
engine = acc.MatrixEngine(
chunking_strategy='adaptive', # 'row', 'tile', 'adaptive'
chunk_size=1024, # Manual chunk size
enable_progress=True # Show progress bars
)
# Manual memory management
engine.set_chunk_size(512) # Smaller chunks for limited memory
engine.enable_auto_memory_detection(False) # Disable auto-detection
engine.cleanup() # Force GPU memory cleanup
# Memory monitoring
memory_info = engine.get_memory_info()
print(f"GPU Memory: {memory_info['gpu_available_gb']:.2f}GB available")
print(f"CPU Memory: {memory_info['cpu_available_gb']:.2f}GB available")
```
## ๐ Performance Comparison
Run the benchmark to see how Accelera performs on your system:
```bash
# Run full benchmark suite
make benchmark
# Test specific matrix size
python examples/benchmark.py --custom-size 4000 3000 5000
# Quick demo
make demo
```
## ๐ Project Structure
```
accelera/
โโโ accelera/ # Core framework
โ โโโ __init__.py # Main package exports
โ โโโ engine.py # MatrixEngine - main API
โ โโโ matrix.py # Matrix wrapper class
โ โโโ memory_manager.py # GPU/CPU memory management
โ โโโ chunking.py # Chunking strategies
โ โโโ config.py # Configuration and logging
โโโ examples/ # Usage examples
โ โโโ basic_usage.py # Basic operations demo
โ โโโ advanced_usage.py # Advanced features demo
โ โโโ benchmark.py # Performance benchmarking
โโโ tests/ # Unit tests
โ โโโ test_accelera.py # Comprehensive test suite
โโโ DOCUMENTATION.md # Detailed documentation
โโโ requirements.txt # Python dependencies
โโโ setup.py # Package setup
โโโ Makefile # Development commands
```
## ๐งช Running Examples
```bash
# Basic usage example
python examples/basic_usage.py
# Advanced features demonstration
python examples/advanced_usage.py
# Performance benchmarking
python examples/benchmark.py
# Or use make commands
make examples
make benchmark
```
## ๐ง Development
```bash
# Install development dependencies
make dev-install
# Run tests
make test
# Run linting
make lint
# Format code
make format
# Clean build artifacts
make clean
```
## ๐ Documentation
- **[Complete Documentation](DOCUMENTATION.md)** - Detailed API reference and usage guide
- **[Examples](examples/)** - Practical usage examples
- **[Tests](tests/)** - Unit tests and integration tests
## ๐ฏ Use Cases
- **๐ง Deep Learning**: Training large neural networks on consumer GPUs
- **๐ฌ Scientific Computing**: Large matrix operations in research
- **๐ Data Processing**: Batch processing of large datasets
- **๐ฎ Computer Graphics**: Large transformation matrices
- **๐ Financial Modeling**: Risk calculations with large covariance matrices
## โ ๏ธ System Requirements
- **NVIDIA GPU** (optional)
- **CUDA** (not sure about minimum version)
## ๐ค Contributing
Following the guidelines in [`claude.md`](claude.md):
1. **Fork the repository**
2. **Create a feature branch**: `git checkout -b feature-name`
3. **Follow the coding standards**: Small commits, clear intent, boring solutions
4. **Add tests** for new functionality
5. **Submit a pull request** with clear description
## ๐ License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## ๐ Acknowledgments
- **PyTorch** team for the excellent tensor library
- **NVIDIA** for CUDA and GPU computing
- **Community** feedback and contributions
---
**๐ก Pro Tip**: Start with the basic example, then explore advanced features. The framework is designed to be simple by default but powerful when needed!
```python
# Get started in 3 lines
import accelera as acc
engine = acc.MatrixEngine()
result = engine.matmul(large_matrix_A, large_matrix_B)
```