SystemVerilog Implementation of Nvidia's CUDA/Tensor Core GEMM Operations
- Host: GitHub
- URL: https://github.com/nikhilrout/thegemmcoreproject
- Owner: NikhilRout
- License: BSD-3-Clause
- Created: 2025-02-25T19:29:00.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-08-14T05:24:50.000Z (about 2 months ago)
- Last Synced: 2025-08-14T07:18:31.568Z (about 2 months ago)
- Topics: cuda, floating-point, gemm, gpgpu, hybrid-precision-training, sparse-matrix, systolic-array, tensorcore, tpu
- Language: Verilog
- Size: 17.9 MB
- Stars: 5
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# TheGEMMCoreProject
SystemVerilog implementation of Nvidia's SIMT CUDA, Hybrid-Precision Tensor Core, and Google's Systolic Array TPU MXU GEMM Operations.
These modules do not attempt to emulate the actual microarchitecture that executes CUDA/Tensor Core instructions; instead, they simply perform the same operations for direct use in FPGA designs. Check out my work on the Vortex GPGPU's [Tensor Core Unit (TCU) extension's DRL Floating Point RTL backend](https://github.com/vortexgpgpu/vortex/tree/bug_fixes/hw/rtl/tcu) for a more optimized, realistic microarchitecture implementation.
## Tensor Core Versions
### TensorCore v0: Volta Architecture [FP16MUL FP32ADD]
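A Volta-style core multiplies FP16 operands and accumulates in FP32, so the datapath has to widen FP16 values to FP32 before accumulation. Below is a minimal, illustrative sketch of that upconversion; the module and signal names are mine, not the repo's, and denormal FP16 inputs are flushed to zero to keep it short.

```systemverilog
// Sketch only: FP16 -> FP32 upconversion for an FP16MUL/FP32ADD datapath.
// Denormal FP16 inputs are flushed to zero in this simplified version.
module fp16_to_fp32 (
    input  logic [15:0] fp16_in,
    output logic [31:0] fp32_out
);
    logic       sign;
    logic [4:0] exp16;
    logic [9:0] man16;

    always_comb begin
        sign  = fp16_in[15];
        exp16 = fp16_in[14:10];
        man16 = fp16_in[9:0];

        if (exp16 == 5'h1F) begin
            // Inf/NaN: saturate the exponent, carry the mantissa across
            fp32_out = {sign, 8'hFF, man16, 13'b0};
        end else if (exp16 == 5'h00) begin
            // Zero (denormals flushed to zero in this sketch)
            fp32_out = {sign, 31'b0};
        end else begin
            // Rebias exponent: FP32 bias 127 - FP16 bias 15 = 112
            fp32_out = {sign, {3'b0, exp16} + 8'd112, man16, 13'b0};
        end
    end
endmodule
```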
### TensorCore v1: Ampere Architecture [TF32MUL FP32ADD / BF16MUL FP32ADD] + Fine-Grained Structured Sparsity
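For context: TF32 keeps FP32's 8-bit exponent but truncates the mantissa to 10 bits, while BF16 is simply the top 16 bits of an FP32 word, so both narrowings are pure bit selects. Fine-grained 2:4 structured sparsity stores only 2 nonzeros out of every group of 4, plus 2-bit metadata per kept value to pick the matching dense operand. The sketch below uses hypothetical module names (not the repo's RTL), and the metadata encoding shown is an assumption; the repo's layout may differ.

```systemverilog
// Sketch only: operand narrowing applied to FP32 inputs.
// TF32 = {sign, exp[7:0], man[9:0]}; BF16 = top 16 bits of FP32.
module fp32_narrow (
    input  logic [31:0] fp32_in,
    output logic [18:0] tf32_out,
    output logic [15:0] bf16_out
);
    assign tf32_out = {fp32_in[31], fp32_in[30:23], fp32_in[22:13]};
    assign bf16_out = fp32_in[31:16];
endmodule

// Sketch only: 2:4 structured-sparsity operand gather. Of every 4
// A-operands, only 2 nonzeros are stored; 2-bit metadata per kept value
// selects the B-operand it pairs with from the dense group of 4.
module sparse24_select #(parameter int W = 19) (
    input  logic [W-1:0] b_group [4],  // dense group of 4 B operands
    input  logic [1:0]   meta   [2],   // positions of the 2 kept A values
    output logic [W-1:0] b_sel  [2]    // B operands paired with kept A
);
    assign b_sel[0] = b_group[meta[0]];
    assign b_sel[1] = b_group[meta[1]];
endmodule
```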
### TensorCore v2: Hopper Architecture [FP8(E5M2/E4M3)MUL FP16ADD]
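Both FP8 formats widen cheaply to FP16: E5M2 shares FP16's 5-bit exponent and bias (15), so only the mantissa needs padding, while E4M3 (bias 7) needs a +8 exponent rebias. A minimal sketch of that upconversion follows, with names of my own choosing rather than the repo's; denormals are flushed to zero, and E4M3's all-ones NaN encoding is not special-cased here.

```systemverilog
// Sketch only: FP8 (E5M2/E4M3) -> FP16 upconversion ahead of an
// FP8MUL/FP16ADD datapath. Denormals flushed; E4M3 NaN not special-cased.
module fp8_to_fp16 (
    input  logic [7:0]  fp8_in,
    input  logic        is_e4m3,   // 1: E4M3 input, 0: E5M2 input
    output logic [15:0] fp16_out
);
    logic       sign;
    logic [4:0] exp5;   // exponent widened to FP16's 5 bits
    logic [9:0] man10;  // mantissa padded to FP16's 10 bits

    always_comb begin
        sign = fp8_in[7];
        if (is_e4m3) begin
            // E4M3: 4-bit exponent (bias 7) -> rebias by +8 for FP16 (bias 15)
            exp5  = (fp8_in[6:3] == 4'h0) ? 5'd0 : {1'b0, fp8_in[6:3]} + 5'd8;
            man10 = (fp8_in[6:3] == 4'h0) ? 10'd0 : {fp8_in[2:0], 7'b0};
        end else begin
            // E5M2: exponent field and bias match FP16, so it maps across
            exp5  = fp8_in[6:2];
            man10 = (fp8_in[6:2] == 5'h0) ? 10'd0 : {fp8_in[1:0], 8'b0};
        end
        fp16_out = {sign, exp5, man10};
    end
endmodule
```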