https://github.com/dellano54/extreme-matmul
Faster matmul than NumPy for Intel CPUs (Xeon)
- Host: GitHub
- URL: https://github.com/dellano54/extreme-matmul
- Owner: dellano54
- License: MIT
- Created: 2025-06-08T19:50:56.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-06-08T22:22:37.000Z (6 months ago)
- Last Synced: 2025-06-08T22:28:51.339Z (6 months ago)
- Topics: matmul, matrix, matrix-library, matrix-multiplication, numpy, numpy-arrays, numpy-library, numpy-matrix, pytorch, pytorch-implementation, pytorch-lightning
- Language: C
- Homepage:
- Size: 0 Bytes
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Metadata Files:
- Readme: README.md
- License: LICENSE
README
# ⚡ extreme_matmul
Fast matrix multiplication in Python using Intel MKL
*Because sometimes NumPy just isn't fast enough.*
## What's This About?
Ever found yourself waiting around for large matrix multiplications to finish? Yeah, me too. That's why I built **Extreme MatMul** - a custom Python extension that leverages Intel's Math Kernel Library (MKL) to make matrix multiplication blazingly fast.
This project started as a deep dive into understanding how high-performance computing libraries work under the hood. What I discovered was pretty eye-opening: with the right approach, we can achieve some serious performance gains over standard NumPy operations.
## Performance Results
Here's what really matters - the numbers don't lie:
### Large Matrices (8164 × 8164 × 2048)
```
===============================================
extreme_matmul.fast_matmul : 1.74 seconds ⚡
NumPy @ operator : 7.14 seconds 🐌
torch optim : 1.97 seconds
===============================================
```
### Smaller Matrices (128 × 256 × 256)
```
===============================================
extreme_matmul.fast_matmul : 0.35ms ⚡
NumPy @ operator : 19.25ms
torch optim : 9.96ms
===============================================
```
(Tests were run on an Intel(R) Xeon(R) CPU @ 2.00GHz)
**That's roughly 4x faster than NumPy for large matrices and 55x faster for smaller ones!**
## How It Works
The magic happens through a few key optimizations:
1. **Intelligent Algorithm Selection**: For tiny matrices (≤32×32), we use a simple triple-loop implementation that's actually faster due to reduced overhead. For everything else, we call Intel MKL's optimized BLAS routines.
2. **Direct MKL Integration**: Instead of going through NumPy's abstraction layers, we talk directly to Intel's Math Kernel Library - the same engine that powers many scientific computing applications (see the sketch after this list).
3. **Memory Layout Optimization**: The code ensures data is contiguous in memory and properly aligned for SIMD operations.
4. **Flexible Broadcasting**: Supports 1D, 2D, and 3D arrays with proper broadcasting rules, just like NumPy.
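As an illustration of point 2, here is a minimal sketch of the kind of `cblas_sgemm` call the extension relies on. The helper name `sgemm_rowmajor` and the exact argument handling are illustrative assumptions rather than the repository's actual code; it assumes row-major, contiguous float32 buffers:

```c
#include <mkl.h>  /* Intel MKL: provides cblas_sgemm */

/* Compute C = A * B for row-major, contiguous float32 buffers:
 * A is m x k, B is k x n, C is m x n. */
static void sgemm_rowmajor(const float *A, const float *B, float *C,
                           int m, int n, int k)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0f,       /* alpha: take A*B as-is   */
                A, k,       /* lda = k for row-major A */
                B, n,       /* ldb = n for row-major B */
                0.0f,       /* beta: overwrite C       */
                C, n);      /* ldc = n for row-major C */
}
```

With row-major layout, the leading dimensions are simply the row lengths of each matrix, which is why contiguity matters before the call is made.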
## Installation
### Prerequisites
You'll need Intel MKL installed on your system. Here's how to get everything set up:
```bash
# Update your system
sudo apt update
sudo apt install -y gpg-agent wget
# Add Intel's repository
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
# Install Intel MKL
sudo apt update
sudo apt install intel-oneapi-mkl-devel
# Set up the environment and install
source /opt/intel/oneapi/setvars.sh
pip install git+https://github.com/dellano54/extreme-matmul.git
```
### Build from Source
```bash
git clone https://github.com/dellano54/extreme-matmul.git
cd extreme-matmul
source /opt/intel/oneapi/setvars.sh # Make sure MKL is in your environment
python setup.py build_ext --inplace
```
## Usage
It's designed to be a drop-in replacement for NumPy's matrix multiplication:
```python
import numpy as np
import extreme_matmul
# Create some test matrices
A = np.random.rand(1000, 1000).astype(np.float32)
B = np.random.rand(1000, 1000).astype(np.float32)
# Use extreme_matmul instead of np.matmul
result = extreme_matmul.matmul(A, B)
# That's it! Same API, much faster performance
```
### Supported Operations
- **Vector × Vector**: Dot product
- **Matrix × Vector**: Matrix-vector multiplication
- **Matrix × Matrix**: Standard matrix multiplication
- **Batch Operations**: 3D arrays with batch dimensions (see the batched sketch after this list)
- **Mixed Dimensions**: Flexible broadcasting like NumPy
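For the 3D batch case, one straightforward mapping onto MKL is an `sgemm` call per batch slice. The sketch below is written under that assumption (the extension may instead use MKL's batched interface), and `batched_sgemm` is an illustrative name:

```c
#include <stddef.h>
#include <mkl.h>

/* Batched multiply: A is (batch, m, k), B is (batch, k, n), C is (batch, m, n),
 * all contiguous float32. One sgemm call per batch slice. */
static void batched_sgemm(const float *A, const float *B, float *C,
                          int batch, int m, int n, int k)
{
    for (int b = 0; b < batch; b++) {
        const float *Ab = A + (size_t)b * m * k;  /* b-th slice of A */
        const float *Bb = B + (size_t)b * k * n;  /* b-th slice of B */
        float       *Cb = C + (size_t)b * m * n;  /* b-th slice of C */
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0f, Ab, k, Bb, n, 0.0f, Cb, n);
    }
}
```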
### Important Notes
- **Float32 Only**: Currently optimized for 32-bit floating point operations
- **Memory Requirements**: Arrays are converted to contiguous format if needed
- **Array Limits**: Supports 1D to 3D arrays (batch processing for 3D)
## Running the Benchmarks
Want to see the performance difference yourself?
```bash
python benchmark.py
```
This will run the same tests I used to generate the performance numbers above. The benchmark tests both large and small matrix scenarios to show how the algorithm selection works.
## Technical Deep Dive
### Algorithm Selection Strategy
The code uses a size-based heuristic to choose between algorithms:
- **Tiny matrices** (≤32×32): Simple triple-loop implementation
- **Larger matrices**: Intel MKL's `cblas_sgemm` with full optimizations (see the dispatch sketch below)
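The sketch below illustrates that dispatch. The 32×32 cutoff and the naive inner loop follow the description above; the function names (`naive_matmul`, `matmul_dispatch`) are illustrative, not the extension's actual symbols:

```c
#include <mkl.h>

/* Simple triple loop for tiny matrices, where the overhead of an MKL call
 * outweighs the arithmetic itself. */
static void naive_matmul(const float *A, const float *B, float *C,
                         int m, int n, int k)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}

/* Size-based dispatch: tiny inputs take the simple loop, everything else
 * goes to MKL's optimized sgemm. */
static void matmul_dispatch(const float *A, const float *B, float *C,
                            int m, int n, int k)
{
    if (m <= 32 && n <= 32 && k <= 32)
        naive_matmul(A, B, C, m, n, k);
    else
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0f, A, k, B, n, 0.0f, C, n);
}
```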
### Memory Management
- Automatic conversion to contiguous arrays when needed
- Proper reference counting to prevent memory leaks
- Aligned memory access for optimal SIMD performance
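A common NumPy C-API pattern that covers the first two points is sketched below. `PyArray_FROM_OTF` returns a new reference (making a contiguous, aligned copy only when needed) that the caller must `Py_DECREF`. The helper name and the choice to request a float32 cast here are assumptions for illustration; the real extension may reject non-float32 inputs instead of converting them:

```c
#include <Python.h>
#include <numpy/arrayobject.h>  /* requires import_array() in module init */

/* Get a C-contiguous, aligned float32 array from an input object.
 * Returns a new reference the caller must Py_DECREF, or NULL on error. */
static PyArrayObject *as_contiguous_f32(PyObject *obj)
{
    return (PyArrayObject *)PyArray_FROM_OTF(
        obj, NPY_FLOAT32,
        NPY_ARRAY_C_CONTIGUOUS | NPY_ARRAY_ALIGNED);
}

/* Usage inside a matmul entry point:
 *   PyArrayObject *a = as_contiguous_f32(obj_a);
 *   if (!a) return NULL;   // exception already set
 *   ... call into MKL ...
 *   Py_DECREF(a);          // release our reference, no leaks
 */
```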
### Error Handling
Comprehensive input validation (sketched below), including:
- Type checking (float32 requirement)
- Dimension compatibility verification
- Memory allocation error handling
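A hedged sketch of what such validation can look like for the 2D case; `check_inputs` is an illustrative helper name, not the repository's actual code:

```c
/* Validate two inputs before multiplying: float32 dtype, 1-3 dimensions,
 * and matching inner dimensions for the 2D case.
 * Returns 0 on success, -1 with a Python exception set on failure. */
static int check_inputs(PyArrayObject *a, PyArrayObject *b)
{
    if (PyArray_TYPE(a) != NPY_FLOAT32 || PyArray_TYPE(b) != NPY_FLOAT32) {
        PyErr_SetString(PyExc_TypeError, "inputs must be float32 arrays");
        return -1;
    }
    if (PyArray_NDIM(a) < 1 || PyArray_NDIM(a) > 3 ||
        PyArray_NDIM(b) < 1 || PyArray_NDIM(b) > 3) {
        PyErr_SetString(PyExc_ValueError, "only 1D to 3D arrays are supported");
        return -1;
    }
    if (PyArray_NDIM(a) == 2 && PyArray_NDIM(b) == 2 &&
        PyArray_DIM(a, 1) != PyArray_DIM(b, 0)) {
        PyErr_SetString(PyExc_ValueError, "inner dimensions do not match");
        return -1;
    }
    return 0;
}
```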
## Why This Matters
This project demonstrates several important concepts:
1. **The Power of Specialized Libraries**: MKL is heavily optimized for Intel processors with years of engineering behind it
2. **Algorithm Selection**: Sometimes simpler algorithms win for small inputs due to reduced overhead
3. **C Extensions**: How to write efficient Python extensions that rival compiled languages
4. **Memory Layout**: The importance of cache-friendly data structures
## Limitations & Future Work
- **Platform Dependency**: Currently requires Intel MKL (Linux/Intel processors)
- **Data Type Limitation**: Only supports float32 (could be extended)
- **GPU Support**: No CUDA implementation yet (interesting future direction)
## Contributing
Found a bug or have an idea for improvement? Feel free to open an issue or submit a pull request. This started as a learning project, but I'm always interested in making it better!
## License
MIT License - feel free to use this in your own projects.
---
*Built with curiosity and a need for speed. If you find this useful, consider giving it a star! ⭐*