# ⚡ extreme_matmul
Fast matrix multiplication in Python using Intel MKL

*Because sometimes NumPy just isn't fast enough.*

## What's This About?

Ever found yourself waiting around for large matrix multiplications to finish? Yeah, me too. That's why I built **Extreme MatMul** - a custom Python extension that leverages Intel's Math Kernel Library (MKL) to make matrix multiplication blazingly fast.

This project started as a deep dive into understanding how high-performance computing libraries work under the hood. What I discovered was pretty eye-opening: with the right approach, we can achieve some serious performance gains over standard NumPy operations.

## Performance Results

Here's what really matters - the numbers don't lie:

### Large Matrices (8164 × 8164 × 2048)
```
===============================================
extreme_matmul.fast_matmul : 1.74 seconds ⚡
NumPy @ operator : 7.14 seconds 🐌
torch optim : 1.97 seconds
===============================================
```

### Smaller Matrices (128 × 256 × 256)
```
===============================================
extreme_matmul.fast_matmul : 0.35ms ⚡
NumPy @ operator : 19.25ms
torch optim : 9.96ms
===============================================
```

(Tests were run on an Intel(R) Xeon(R) CPU @ 2.00GHz.)

**That's roughly 4x faster than NumPy for large matrices and 55x faster for smaller ones!**

## How It Works

The magic happens through a few key optimizations:

1. **Intelligent Algorithm Selection**: For tiny matrices (≤32×32), we use a simple triple-loop implementation that's actually faster due to reduced overhead. For everything else, we call Intel MKL's optimized BLAS routines (see the sketch after this list).

2. **Direct MKL Integration**: Instead of going through NumPy's abstraction layers, we talk directly to Intel's Math Kernel Library - the same engine that powers many scientific computing applications.

3. **Memory Layout Optimization**: The code ensures data is contiguous in memory and properly aligned for SIMD operations.

4. **Flexible Broadcasting**: Supports 1D, 2D, and 3D arrays with proper broadcasting rules, just like NumPy.
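
To make that call path concrete, here's a rough Python sketch of the input normalization and size-based dispatch. This is only an illustration: the real extension does all of this in C against `cblas_sgemm`, so `scipy.linalg.blas.sgemm` stands in for the MKL call, and the function name and exact threshold check here are illustrative.

```python
import numpy as np
from scipy.linalg.blas import sgemm  # stand-in for MKL's cblas_sgemm

TINY = 32  # size threshold mentioned above for the naive path

def matmul_sketch(a, b):
    # Normalize inputs: the extension works on float32, C-contiguous data
    a = np.ascontiguousarray(a, dtype=np.float32)
    b = np.ascontiguousarray(b, dtype=np.float32)

    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"

    # Tiny matrices: a plain triple loop beats the overhead of a BLAS call
    if m <= TINY and n <= TINY and k <= TINY:
        out = np.zeros((m, n), dtype=np.float32)
        for i in range(m):
            for j in range(n):
                for p in range(k):
                    out[i, j] += a[i, p] * b[p, j]
        return out

    # Everything else: hand off to optimized BLAS (MKL in the real extension)
    return sgemm(1.0, a, b)
```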

## Installation

### Prerequisites
You'll need Intel MKL installed on your system. Here's how to get everything set up:

```bash
# Update your system
sudo apt update
sudo apt install -y gpg-agent wget

# Add Intel's repository
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list

# Install Intel MKL
sudo apt update
sudo apt install intel-oneapi-mkl-devel

# Set up the environment and install
source /opt/intel/oneapi/setvars.sh
pip install git+https://github.com/dellano54/extreme-matmul.git
```
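
If the install succeeds, a quick smoke test confirms the extension loads and the MKL runtime libraries resolve. This is just a sketch, assuming the package exposes `matmul` as shown in the Usage section below:

```python
# Smoke test: the import fails if the MKL runtime isn't on the loader path
import numpy as np
import extreme_matmul

# 64x64 is above the tiny-matrix threshold, so this exercises the MKL path
a = np.random.rand(64, 64).astype(np.float32)
b = np.random.rand(64, 64).astype(np.float32)
print(extreme_matmul.matmul(a, b).shape)
```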

### Build from Source
```bash
git clone https://github.com/dellano54/extreme-matmul.git
cd extreme-matmul
source /opt/intel/oneapi/setvars.sh # Make sure MKL is in your environment
python setup.py build_ext --inplace
```

## Usage

It's designed to be a drop-in replacement for NumPy's matrix multiplication:

```python
import numpy as np
import extreme_matmul

# Create some test matrices
A = np.random.rand(1000, 1000).astype(np.float32)
B = np.random.rand(1000, 1000).astype(np.float32)

# Use extreme_matmul instead of np.matmul
result = extreme_matmul.matmul(A, B)

# That's it! Same API, much faster performance
```
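
Because BLAS and NumPy can accumulate float32 sums in different orders, results won't be bit-identical; a tolerance-based comparison is the right sanity check. A minimal example (tolerances here are illustrative):

```python
import numpy as np
import extreme_matmul

A = np.random.rand(512, 512).astype(np.float32)
B = np.random.rand(512, 512).astype(np.float32)

# Compare against NumPy with a tolerance to allow for float32 rounding differences
assert np.allclose(extreme_matmul.matmul(A, B), A @ B, rtol=1e-4, atol=1e-5)
```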

### Supported Operations

- **Vector × Vector**: Dot product
- **Matrix × Vector**: Matrix-vector multiplication
- **Matrix × Matrix**: Standard matrix multiplication
- **Batch Operations**: 3D arrays with batch dimensions
- **Mixed Dimensions**: Flexible broadcasting like NumPy
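
For example (a sketch, assuming the result shapes follow NumPy's `matmul` broadcasting rules, which the extension aims to mirror):

```python
import numpy as np
import extreme_matmul

x = np.random.rand(256).astype(np.float32)
y = np.random.rand(256).astype(np.float32)
M = np.random.rand(128, 256).astype(np.float32)
A = np.random.rand(8, 64, 32).astype(np.float32)
B = np.random.rand(8, 32, 16).astype(np.float32)

dot   = extreme_matmul.matmul(x, y)   # vector x vector -> dot product
mv    = extreme_matmul.matmul(M, x)   # matrix x vector -> shape (128,)
batch = extreme_matmul.matmul(A, B)   # batched 3D      -> shape (8, 64, 16)
```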

### Important Notes

- **Float32 Only**: Currently optimized for 32-bit floating point operations
- **Memory Requirements**: Arrays are converted to contiguous format if needed
- **Array Limits**: Supports 1D to 3D arrays (batch processing for 3D)
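
In practice that means converting your data once, up front, rather than on every call. A small sketch: only float32 is supported, so cast before calling; contiguity is handled automatically, but doing it yourself avoids a hidden copy per call.

```python
import numpy as np
import extreme_matmul

A64 = np.random.rand(512, 512)        # NumPy defaults to float64
B_T = np.random.rand(512, 512).T      # a transposed view is not C-contiguous

# Cast to float32 and make C-contiguous once, before the hot loop
A = np.ascontiguousarray(A64, dtype=np.float32)
B = np.ascontiguousarray(B_T, dtype=np.float32)

C = extreme_matmul.matmul(A, B)
```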

## Running the Benchmarks

Want to see the performance difference yourself?

```bash
python benchmark.py
```

This will run the same tests I used to generate the performance numbers above. The benchmark tests both large and small matrix scenarios to show how the algorithm selection works.
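
If you'd rather time it inline, a minimal harness looks something like this (this is not the contents of `benchmark.py`, just a sketch you can adapt; sizes and repeat counts are illustrative):

```python
import time
import numpy as np
import extreme_matmul

A = np.random.rand(2048, 2048).astype(np.float32)
B = np.random.rand(2048, 2048).astype(np.float32)

def bench(fn, *args, repeats=5):
    fn(*args)  # warm-up so one-time setup costs aren't counted
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats

print(f"extreme_matmul : {bench(extreme_matmul.matmul, A, B):.4f} s")
print(f"numpy @        : {bench(np.matmul, A, B):.4f} s")
```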

## Technical Deep Dive

### Algorithm Selection Strategy
The code uses a size-based heuristic to choose between algorithms:
- **Tiny matrices** (≤32×32): Simple triple-loop implementation
- **Larger matrices**: Intel MKL's `cblas_sgemm` with full optimizations

### Memory Management
- Automatic conversion to contiguous arrays when needed
- Proper reference counting to prevent memory leaks
- Aligned memory access for optimal SIMD performance

### Error Handling
Comprehensive input validation including:
- Type checking (float32 requirement)
- Dimension compatibility verification
- Memory allocation error handling

## Why This Matters

This project demonstrates several important concepts:

1. **The Power of Specialized Libraries**: MKL is heavily optimized for Intel processors with years of engineering behind it
2. **Algorithm Selection**: Sometimes simpler algorithms win for small inputs due to reduced overhead
3. **C Extensions**: How to write efficient Python extensions that rival compiled languages
4. **Memory Layout**: The importance of cache-friendly data structures

## Limitations & Future Work

- **Platform Dependency**: Currently requires Intel MKL (Linux/Intel processors)
- **Data Type Limitation**: Only supports float32 (could be extended)
- **GPU Support**: No CUDA implementation yet (interesting future direction)

## Contributing

Found a bug or have an idea for improvement? Feel free to open an issue or submit a pull request. This started as a learning project, but I'm always interested in making it better!

## License

MIT License - feel free to use this in your own projects.

---

*Built with curiosity and a need for speed. If you find this useful, consider giving it a star! ⭐*