SystemVerilog Implementation of Nvidia's CUDA/Tensor Core GEMM Operations
- Host: GitHub
- URL: https://github.com/nikhilrout/thegemmcoreproject
- Owner: NikhilRout
- License: BSD-3-Clause
- Created: 2025-02-25T19:29:00.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-08-14T05:24:50.000Z (about 2 months ago)
- Last Synced: 2025-08-14T07:18:31.568Z (about 2 months ago)
- Topics: cuda, floating-point, gemm, gpgpu, hybrid-precision-training, sparse-matrix, systolic-array, tensorcore, tpu
- Language: Verilog
- Size: 17.9 MB
- Stars: 5
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# TheGEMMCoreProject
SystemVerilog implementation of Nvidia's SIMT CUDA, Hybrid-Precision Tensor Core, and Google's Systolic Array TPU MXU GEMM Operations.
These modules do not attempt to emulate the actual microarchitecture that executes CUDA/Tensor Core instructions; instead, they simply perform the same operations for direct use in FPGA designs. Check out my work on the Vortex GPGPU's [Tensor Core Unit (TCU) extension's DRL Floating Point RTL backend](https://github.com/vortexgpgpu/vortex/tree/bug_fixes/hw/rtl/tcu) for a more optimized, realistic microarchitecture implementation.
## Tensor Core Versions
### TensorCore v0: Volta Architecture [FP16MUL FP32ADD]
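A Volta-style core multiplies FP16 operands and accumulates in FP32, so the datapath has to widen FP16 values to FP32 before accumulation. Below is a minimal, illustrative sketch of that upconversion; the module and signal names are mine, not the repo's, and denormal FP16 inputs are flushed to zero to keep it short.

```systemverilog
// Sketch only: FP16 -> FP32 upconversion for an FP16MUL/FP32ADD datapath.
// Denormal FP16 inputs are flushed to zero in this simplified version.
module fp16_to_fp32 (
    input  logic [15:0] fp16_in,
    output logic [31:0] fp32_out
);
    logic       sign;
    logic [4:0] exp16;
    logic [9:0] man16;

    always_comb begin
        sign  = fp16_in[15];
        exp16 = fp16_in[14:10];
        man16 = fp16_in[9:0];

        if (exp16 == 5'h1F) begin
            // Inf/NaN: saturate the exponent, carry the mantissa across
            fp32_out = {sign, 8'hFF, man16, 13'b0};
        end else if (exp16 == 5'h00) begin
            // Zero (denormals flushed to zero in this sketch)
            fp32_out = {sign, 31'b0};
        end else begin
            // Rebias exponent: FP32 bias 127 - FP16 bias 15 = 112
            fp32_out = {sign, {3'b0, exp16} + 8'd112, man16, 13'b0};
        end
    end
endmodule
```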
### TensorCore v1: Ampere Architecture [TF32MUL FP32ADD / BF16MUL FP32ADD] + Fine-Grained Structured Sparsity
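For context: TF32 keeps FP32's 8-bit exponent but truncates the mantissa to 10 bits, while BF16 is simply the top 16 bits of an FP32 word, so both narrowings are pure bit selects. Fine-grained 2:4 structured sparsity stores only 2 nonzeros out of every group of 4, plus 2-bit metadata per kept value to pick the matching dense operand. The sketch below uses hypothetical module names (not the repo's RTL), and the metadata encoding shown is an assumption; the repo's layout may differ.

```systemverilog
// Sketch only: operand narrowing applied to FP32 inputs.
// TF32 = {sign, exp[7:0], man[9:0]}; BF16 = top 16 bits of FP32.
module fp32_narrow (
    input  logic [31:0] fp32_in,
    output logic [18:0] tf32_out,
    output logic [15:0] bf16_out
);
    assign tf32_out = {fp32_in[31], fp32_in[30:23], fp32_in[22:13]};
    assign bf16_out = fp32_in[31:16];
endmodule

// Sketch only: 2:4 structured-sparsity operand gather. Of every 4
// A-operands, only 2 nonzeros are stored; 2-bit metadata per kept value
// selects the B-operand it pairs with from the dense group of 4.
module sparse24_select #(parameter int W = 19) (
    input  logic [W-1:0] b_group [4],  // dense group of 4 B operands
    input  logic [1:0]   meta   [2],   // positions of the 2 kept A values
    output logic [W-1:0] b_sel  [2]    // B operands paired with kept A
);
    assign b_sel[0] = b_group[meta[0]];
    assign b_sel[1] = b_group[meta[1]];
endmodule
```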
### TensorCore v2: Hopper Architecture [FP8(E5M2/E4M3)MUL FP16ADD]
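Both FP8 formats widen cheaply to FP16: E5M2 shares FP16's 5-bit exponent and bias (15), so only the mantissa needs padding, while E4M3 (bias 7) needs a +8 exponent rebias. A minimal sketch of that upconversion follows, with names of my own choosing rather than the repo's; denormals are flushed to zero, and E4M3's all-ones NaN encoding is not special-cased here.

```systemverilog
// Sketch only: FP8 (E5M2/E4M3) -> FP16 upconversion ahead of an
// FP8MUL/FP16ADD datapath. Denormals flushed; E4M3 NaN not special-cased.
module fp8_to_fp16 (
    input  logic [7:0]  fp8_in,
    input  logic        is_e4m3,   // 1: E4M3 input, 0: E5M2 input
    output logic [15:0] fp16_out
);
    logic       sign;
    logic [4:0] exp5;   // exponent widened to FP16's 5 bits
    logic [9:0] man10;  // mantissa padded to FP16's 10 bits

    always_comb begin
        sign = fp8_in[7];
        if (is_e4m3) begin
            // E4M3: 4-bit exponent (bias 7) -> rebias by +8 for FP16 (bias 15)
            exp5  = (fp8_in[6:3] == 4'h0) ? 5'd0 : {1'b0, fp8_in[6:3]} + 5'd8;
            man10 = (fp8_in[6:3] == 4'h0) ? 10'd0 : {fp8_in[2:0], 7'b0};
        end else begin
            // E5M2: exponent field and bias match FP16, so it maps across
            exp5  = fp8_in[6:2];
            man10 = (fp8_in[6:2] == 5'h0) ? 10'd0 : {fp8_in[1:0], 8'b0};
        end
        fp16_out = {sign, exp5, man10};
    end
endmodule
```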