# torchPACE
PyTorch C++ and CUDA extension for PACE's Piecewise Polynomial Approximation (PwPA), a Transformer non-linearity acceleration engine.

## Introduction
This extension integrates PwPA CUDA kernels for both AoS (Array of Structures) and SoA (Structure of Arrays) coefficient layouts, using a simple loop-unrolling technique.
More details [here](extra/README.md).
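
For intuition, here is a minimal pure-PyTorch sketch of the computation a PwPA kernel performs. The tensor shapes and the AoS coefficient order (constant term first) are assumptions for illustration, not the extension's actual internals:

```python
import torch

def pwpa_reference(x, coeffs, partition_points):
    """Piecewise polynomial evaluation. Assumes AoS coeffs of shape
    (num_partitions, degree + 1) ordered (c0, c1, ..., cd) and sorted
    partition_points of shape (num_partitions + 1,)."""
    # Locate each element's partition: partition_points[i] <= x < partition_points[i+1].
    idx = torch.bucketize(x, partition_points, right=True) - 1
    idx = idx.clamp(0, coeffs.shape[0] - 1)  # clamp out-of-range inputs to edge partitions
    c = coeffs[idx]                          # (N, degree + 1) coefficients per element
    # Horner's scheme: ((cd * x + c{d-1}) * x + ...) * x + c0
    y = c[:, -1]
    for k in range(c.shape[1] - 2, -1, -1):
        y = y * x + c[:, k]
    return y
```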

## Setup
Built with [PyPA/Build](https://github.com/pypa/build), but pip or a similar tool also works.

To build:
```text
python -m build -n
```

To install:
```text
pip install dist\<wheel_name>.whl
```

To test:
```text
python test\extension_test.py
```

To use:
```python
import torch_pace
...
# base kernel
y = torch_pace.ops._pwpa(x, coeffs, partition_points, AoS=True)
# optimized kernel
y = torch_pace.ops.pwpa(x, coeffs, partition_points, AoS=True)
# AoS to SoA coefficients rearrangement
coeffs_soa = torch_pace.ops.aos2soa(coeffs, degree)
# optimized kernel with SoA coefficient layout
y = torch_pace.ops.pwpa(x, coeffs_soa, partition_points, AoS=False)
```
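
As a quick sanity check of the call pattern above, here is a hypothetical smoke test: a degree-1, two-partition approximation represents ReLU exactly within the partition range, so the output is easy to verify. The coefficient ordering (constant term first) is an assumption:

```python
import torch
import torch_pace

# Two linear pieces: y = 0 on [-10, 0), y = x on [0, 10). Exact for ReLU in range.
partition_points = torch.tensor([-10.0, 0.0, 10.0], device="cuda")
coeffs = torch.tensor([[0.0, 0.0],   # left partition:  c0 = 0, c1 = 0
                       [0.0, 1.0]],  # right partition: c0 = 0, c1 = 1
                      device="cuda")

x = torch.linspace(-5, 5, 1024, device="cuda")
y = torch_pace.ops.pwpa(x, coeffs, partition_points, AoS=True)
print(torch.allclose(y, torch.relu(x)))  # expect True if the layout assumption holds
```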

> [!IMPORTANT]
> Requirements:
> - torch>=2.4 with CUDA enabled (mine is 2.5.1+cu118)
> - CUDA toolkit (mine is 11.7)
> - Python>=3.8 (mine is 3.12.8)

## Examples

This is the output of running [approximation_test.py](test/approximation_test.py):
![image](https://github.com/user-attachments/assets/01ecdbec-d232-4e9e-99f5-f5d38cadfeb3)

> [!NOTE]
> [approximation_test.py](test/approximation_test.py) uses simple uniform partitioning, which divides the X-value range into equal parts (sketched below).
> More sophisticated partitioning strategies may account for slope trends, yielding more accurate approximations where the function changes more rapidly.
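
A sketch of that uniform scheme: generate equally spaced partition points over the input range and least-squares fit one polynomial per partition. The `uniform_fit` helper is hypothetical, not part of this repo or the test:

```python
import torch

def uniform_fit(fn, x_min, x_max, num_partitions, degree, samples=64):
    """Fit per-partition polynomial coefficients (c0..cd, AoS layout) to fn
    over equal-width partitions of [x_min, x_max]."""
    partition_points = torch.linspace(x_min, x_max, num_partitions + 1)
    coeffs = torch.empty(num_partitions, degree + 1)
    for i in range(num_partitions):
        xs = torch.linspace(partition_points[i].item(),
                            partition_points[i + 1].item(), samples)
        # Vandermonde matrix [1, x, x^2, ...] and least-squares solve for c0..cd.
        V = torch.stack([xs ** k for k in range(degree + 1)], dim=1)
        coeffs[i] = torch.linalg.lstsq(V, fn(xs).unsqueeze(1)).solution.squeeze(1)
    return coeffs, partition_points

# Example: degree-3 fit of GELU on [-4, 4] with 8 uniform partitions.
coeffs, pts = uniform_fit(torch.nn.functional.gelu, -4.0, 4.0, 8, 3)
```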

## ToDo
A brief list of things to do or fix in this extension:
- [x] PyTorch Half type support
- [ ] Extension Benchmark on non-linearities in plain CUDA code
- [ ] Extension Benchmark on PyTorch non-linearities
- [ ] ILP (Instruction-Level Parallelism) integration
- [x] aos2soa function
- [ ] soa2aos function
- [ ] CUDA SIMD intrinsics analysis for the float16 (PyTorch Half) type
- [ ] PyTorch neural net example

## Credits

Extension backbone inspired by [this tutorial](https://github.com/pytorch/extension-cpp).

## Authors

[Marco Sangiorgi](https://github.com/SangioAI)

*© 2025*