Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/sangioai/torchpace
PyTorch CUDA/C++ extension of PACE: Transformer non-linearity accelerator engine.
- Host: GitHub
- URL: https://github.com/sangioai/torchpace
- Owner: SangioAI
- License: apache-2.0
- Created: 2025-01-06T19:57:55.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2025-01-30T16:17:10.000Z (13 days ago)
- Last Synced: 2025-01-30T16:35:57.101Z (13 days ago)
- Topics: cuda, pytorch, transformer
- Language: Cuda
- Homepage:
- Size: 51.8 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# torchPACE
PyTorch C++ and CUDA extension for PACE's Piecewise Polynomial Approximation (PwPA), a Transformer non-linearity acceleration engine.

## Introduction
This extension integrates PwPA CUDA kernels for both AoS and SoA coefficient data structures, using a simple unrolling technique.
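To make the two layouts concrete, here is an illustrative plain-Python sketch. The coefficient ordering, the `aos2soa` rearrangement, and the scalar Horner evaluation below are assumptions for illustration only, not the extension's actual internals (which run as CUDA kernels):

```python
def aos2soa(coeffs, degree):
    """Rearrange a flat AoS coefficient list into SoA order (illustrative).

    AoS: [p0_c0, p0_c1, ..., p0_cd, p1_c0, ...]  (one partition's coefficients together)
    SoA: [p0_c0, p1_c0, ..., pN_c0, p0_c1, ...]  (coefficient k of every partition together)
    """
    n_coeffs = degree + 1
    n_parts = len(coeffs) // n_coeffs
    return [coeffs[p * n_coeffs + k] for k in range(n_coeffs) for p in range(n_parts)]


def pwpa_scalar(x, coeffs, partition_points, degree):
    """Evaluate a piecewise polynomial at x from AoS coefficients (Horner's rule)."""
    # Pick the partition whose left boundary is the largest one <= x.
    idx = 0
    for i, p in enumerate(partition_points):
        if x >= p:
            idx = i
    base = idx * (degree + 1)
    # Horner's rule: c0 + x*(c1 + x*(c2 + ...))
    y = 0.0
    for c in reversed(coeffs[base:base + degree + 1]):
        y = y * x + c
    return y
```

For example, with two degree-1 partitions split at 0 (`coeffs = [0, 1, 1, 2]`, i.e. `y = x` below 0 and `y = 1 + 2x` above), `pwpa_scalar(2.0, coeffs, [-10.0, 0.0], 1)` yields `5.0`.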
More details [here](extra/README.md).

## Setup
Built with [PyPA/Build](https://github.com/pypa/build), but you can use pip or a similar tool.

To build:
```shell
python -m build -n
```
To install:
```shell
pip install dist\
```

To test:
```shell
python test\extension_test.py
```

To use:
```python
import torch_pace
...
# base kernel
y = torch_pace.ops._pwpa(x, coeffs, partition_points, AoS=True)
# optimized kernel
y = torch_pace.ops.pwpa(x, coeffs, partition_points, AoS=True)
# AoS to SoA coefficients rearrangement
coeffs_soa = torch_pace.ops.aos2soa(coeffs, degree)
# optimized kernel with SoA coefficients' data structure
y = torch_pace.ops.pwpa(x, coeffs_soa, partition_points, AoS=False)
```

> [!Important]
> Requirements:
> - torch>=2.4 with CUDA enabled (mine is 2.5.1+cu118)
> - CUDA toolkit (mine is 11.7)
> - Python>=3.8 (mine is 3.12.8)

## Examples
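As a companion to the plotted test below, here is a minimal stand-alone sketch in plain Python (not the repository's actual script) of the same idea: uniform partitioning of the input range, a simple degree-1 fit per partition, and a worst-case error check against `tanh`. All function names here are hypothetical:

```python
import math


def uniform_partitions(lo, hi, n):
    """Uniform partitioning: divide [lo, hi] into n equal parts."""
    step = (hi - lo) / n
    return [lo + i * step for i in range(n + 1)]


def fit_linear(f, a, b):
    """Degree-1 coefficients (c0, c1) of the line through (a, f(a)) and (b, f(b))."""
    c1 = (f(b) - f(a)) / (b - a)
    c0 = f(a) - c1 * a
    return c0, c1


def max_error(f, lo, hi, n_parts, n_samples=1000):
    """Worst-case error of the piecewise-linear approximation over [lo, hi]."""
    pts = uniform_partitions(lo, hi, n_parts)
    segs = [fit_linear(f, pts[i], pts[i + 1]) for i in range(n_parts)]
    worst = 0.0
    for s in range(n_samples + 1):
        x = lo + (hi - lo) * s / n_samples
        # Index of the partition containing x (clamped at the right edge).
        i = min(int((x - lo) / (hi - lo) * n_parts), n_parts - 1)
        c0, c1 = segs[i]
        worst = max(worst, abs((c0 + c1 * x) - f(x)))
    return worst
```

With more (or higher-degree) partitions the worst-case error drops, which is what the plot below visualizes for the actual kernels.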
This is the output of running [approximation_test.py](test/approximation_test.py):
![image](https://github.com/user-attachments/assets/01ecdbec-d232-4e9e-99f5-f5d38cadfeb3)

> [!Note]
> [approximation_test.py](test/approximation_test.py) uses a simple uniform partitioning which divides the X-value range into equal parts.
> More sophisticated partitioning strategies may account for slope trends, yielding more accurate approximations where the function changes most rapidly.

## ToDo
A brief list of things to do or fix in this extension:
- [x] PyTorch Half type support
- [ ] Extension Benchmark on non-linearities in plain CUDA code
- [ ] Extension Benchmark on PyTorch non-linearities
- [ ] ILP (Instruction-Level Parallelism) integration
- [x] aos2soa function
- [ ] soa2aos function
- [ ] CUDA SIMD intrinsics analysis for float16 (PyTorch Half) type
- [ ] PyTorch neural net example

## Credits
Extension backbone inspired by [this tutorial](https://github.com/pytorch/extension-cpp).
## Authors
[Marco Sangiorgi](https://github.com/SangioAI)
*2025©*