# torchPACE
PyTorch C++ and CUDA extension for PACE's Piecewise Polynomial Approximation (PwPA), a Transformer non-linearity acceleration engine.

## Introduction
This extension integrates PwPA CUDA kernels for both AoS (Array of Structures) and SoA (Structure of Arrays) coefficient layouts, using a simple loop-unrolling technique.
More details [here](extra/README.md).
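
For intuition, here is a minimal pure-PyTorch sketch of the computation a PwPA kernel performs. The tensor shapes and the AoS coefficient order (constant term first) are assumptions for illustration, not the extension's actual internals:

```python
import torch

def pwpa_reference(x, coeffs, partition_points):
    """Piecewise polynomial evaluation. Assumes AoS coeffs of shape
    (num_partitions, degree + 1) ordered (c0, c1, ..., cd) and sorted
    partition_points of shape (num_partitions + 1,)."""
    # Locate each element's partition: partition_points[i] <= x < partition_points[i+1].
    idx = torch.bucketize(x, partition_points, right=True) - 1
    idx = idx.clamp(0, coeffs.shape[0] - 1)  # clamp out-of-range inputs to edge partitions
    c = coeffs[idx]                          # (N, degree + 1) coefficients per element
    # Horner's scheme: ((cd * x + c{d-1}) * x + ...) * x + c0
    y = c[:, -1]
    for k in range(c.shape[1] - 2, -1, -1):
        y = y * x + c[:, k]
    return y
```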

## Setup
Built with [PyPA/Build](https://github.com/pypa/build), but pip or a similar tool also works.

To build:
```text
python -m build -n
```

To install:
```text
pip install dist\<wheel_name>.whl
```

To test:
```text
python test\extension_test.py
```

To use:
```python
import torch_pace
...
# base kernel
y = torch_pace.ops._pwpa(x, coeffs, partition_points, AoS=True)
# optimized kernel
y = torch_pace.ops.pwpa(x, coeffs, partition_points, AoS=True)
# AoS to SoA coefficients rearrangement
coeffs_soa = torch_pace.ops.aos2soa(coeffs, degree)
# optimized kernel with SoA coefficient layout
y = torch_pace.ops.pwpa(x, coeffs_soa, partition_points, AoS=False)
```
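
As a quick sanity check of the call pattern above, here is a hypothetical smoke test: a degree-1, two-partition approximation represents ReLU exactly within the partition range, so the output is easy to verify. The coefficient ordering (constant term first) is an assumption:

```python
import torch
import torch_pace

# Two linear pieces: y = 0 on [-10, 0), y = x on [0, 10). Exact for ReLU in range.
partition_points = torch.tensor([-10.0, 0.0, 10.0], device="cuda")
coeffs = torch.tensor([[0.0, 0.0],   # left partition:  c0 = 0, c1 = 0
                       [0.0, 1.0]],  # right partition: c0 = 0, c1 = 1
                      device="cuda")

x = torch.linspace(-5, 5, 1024, device="cuda")
y = torch_pace.ops.pwpa(x, coeffs, partition_points, AoS=True)
print(torch.allclose(y, torch.relu(x)))  # expect True if the layout assumption holds
```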

> [!IMPORTANT]
> Requirements:
> - torch>=2.4 with CUDA enabled (mine is 2.5.1+cu118)
> - CUDA toolkit (mine is 11.7)
> - Python>=3.8 (mine is 3.12.8)

## Examples

This is the output of running [approximation_test.py](test/approximation_test.py):
![image](https://github.com/user-attachments/assets/01ecdbec-d232-4e9e-99f5-f5d38cadfeb3)

> [!NOTE]
> [approximation_test.py](test/approximation_test.py) uses simple uniform partitioning, which divides the X-value range into equal parts (sketched below).
> More sophisticated partitioning strategies may account for slope trends, yielding more accurate approximations where the function changes more rapidly.
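
A sketch of that uniform scheme: generate equally spaced partition points over the input range and least-squares fit one polynomial per partition. The `uniform_fit` helper is hypothetical, not part of this repo or the test:

```python
import torch

def uniform_fit(fn, x_min, x_max, num_partitions, degree, samples=64):
    """Fit per-partition polynomial coefficients (c0..cd, AoS layout) to fn
    over equal-width partitions of [x_min, x_max]."""
    partition_points = torch.linspace(x_min, x_max, num_partitions + 1)
    coeffs = torch.empty(num_partitions, degree + 1)
    for i in range(num_partitions):
        xs = torch.linspace(partition_points[i].item(),
                            partition_points[i + 1].item(), samples)
        # Vandermonde matrix [1, x, x^2, ...] and least-squares solve for c0..cd.
        V = torch.stack([xs ** k for k in range(degree + 1)], dim=1)
        coeffs[i] = torch.linalg.lstsq(V, fn(xs).unsqueeze(1)).solution.squeeze(1)
    return coeffs, partition_points

# Example: degree-3 fit of GELU on [-4, 4] with 8 uniform partitions.
coeffs, pts = uniform_fit(torch.nn.functional.gelu, -4.0, 4.0, 8, 3)
```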

## ToDo
A brief list of things to do or fix in this extension:
- [x] PyTorch Half type support
- [ ] Extension Benchmark on non-linearities in plain CUDA code
- [ ] Extension Benchmark on PyTorch non-linearities
- [ ] ILP (Instruction-Level Parallelism) integration
- [x] aos2soa function
- [ ] soa2aos function
- [ ] CUDA SIMD intrinsics analysis for the float16 (PyTorch Half) type
- [ ] PyTorch neural net example

## Credits

Extension backbone inspired by [this tutorial](https://github.com/pytorch/extension-cpp).

## Authors

[Marco Sangiorgi](https://github.com/SangioAI)

*© 2025*