Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/fattorib/thunderkittens-simple-gemm
Simple Tensorcore GEMM in ThunderKittens
https://github.com/fattorib/thunderkittens-simple-gemm
cuda gemm gpu thunderkittens
Last synced: 4 days ago
JSON representation
Simple Tensorcore GEMM in ThunderKittens
- Host: GitHub
- URL: https://github.com/fattorib/thunderkittens-simple-gemm
- Owner: fattorib
- License: apache-2.0
- Created: 2025-02-12T13:43:24.000Z (8 days ago)
- Default Branch: main
- Last Pushed: 2025-02-12T13:53:21.000Z (8 days ago)
- Last Synced: 2025-02-16T10:17:13.246Z (4 days ago)
- Topics: cuda, gemm, gpu, thunderkittens
- Language: Cuda
- Homepage:
- Size: 13.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# ThunderKittens - Simple GEMM
This repo contains a performant tensorcore GEMM kernel written in ThunderKittens (*and another slower kernel lol*). For square matrices, the 4-warp, 128x128x32 kernel is within ~98% of cuBLAS and Triton. Thunderkittens is quite nice to use, and while it includes a few example GEMM kernels, these 1) use H100 specific features (WGMMA) and 2) use the author's load-compute-store-finish (LCSF) programming model. This repo intends to provide an example of a simple GEMM kernel that is still fast.
## Benchmarks
Benchmarks performed on an 4096x4096x4096 problem with bfloat16 inputs and float accumulation on an RTX 4070. Triton kernel is taken from [here](https://triton-lang.org/main/getting-started/tutorials/03-matrix-multiplication.html#sphx-glr-getting-started-tutorials-03-matrix-multiplication-py):
| Kernel | TFLOPs |
|----------------------------|--------|
| ThunderKittens (this repo) | 61.1 |
| cuBLAS | 61.4 |
| Triton | 62.2 |# Compile
## Setup
Clone repo with:
```bash
git clone --recurse-submodules https://github.com/fattorib/tk-simple-gemm.git
```This code has been tested in the following environment:
- gcc 11.4.0
- nvcc 12.6
- RTX 4070
- ubuntu22.04All development work was performed in the `nvidia/cuda:12.6.3-cudnn-devel-ubuntu22.04` docker image.
## Build
To build the GEMM kernel (defaults to 128x128x32 kernel), run:```bash
make gemm
```to run the kernel and benchmark it, run:
```
./gemm.bin
```your output should be something like:
```bash
Problem Size: 4096 x 4096 x 4096
Total Elapsed Time: 0.225039s
TFLOP/s 61.0734
```# Citations
```bibtex
@misc{spector2024thunderkittenssimplefastadorable,
title={ThunderKittens: Simple, Fast, and Adorable AI Kernels},
author={Benjamin F. Spector and Simran Arora and Aaryan Singhal and Daniel Y. Fu and Christopher Ré},
year={2024},
eprint={2410.20399},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2410.20399},
}
```