# Awesome GEMM [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)

> **Welcome to Awesome GEMM!**
> A curated and continually evolving list of frameworks, libraries, tutorials, and tools for optimizing **General Matrix Multiply (GEMM)** operations. Whether you're a beginner eager to learn the fundamentals, a developer optimizing performance-critical code, or a researcher pushing the limits of hardware, this repository is your launchpad to mastery.

---
## Why GEMM Matters
General Matrix Multiply is at the core of a wide range of computational tasks: from scientific simulations and signal processing to modern AI workloads like neural network training and inference. Efficiently implementing and optimizing GEMM can lead to dramatic performance improvements across entire systems.
**This repository is a comprehensive resource for:**
- **Students & Beginners:** Learn the basics and theory of matrix multiplication.
- **Engineers & Developers:** Discover frameworks, libraries, and tools to optimize GEMM on CPUs, GPUs, and specialized hardware.
- **Researchers & Performance Experts:** Explore cutting-edge techniques, research papers, and advanced optimization strategies.

---
## Quickstart & Highlights
If you're new and just want to dive in, start here:
- **For Beginners:**
  - [NumPy](https://github.com/numpy/numpy) (CPU, Python) - The go-to library for basic matrix operations (see the minimal sketch after this list).
  - [How To Optimize GEMM](https://github.com/flame/how-to-optimize-gemm) - A step-by-step guide to improving performance from a naive implementation.
- **For GPU Developers:**
  - [NVIDIA cuBLAS](https://developer.nvidia.com/cublas) - Highly optimized BLAS for NVIDIA GPUs.
  - [NVIDIA CUTLASS](https://github.com/NVIDIA/cutlass) - Templates and building blocks to write your own CUDA GEMM kernels.
- **For Low-Precision & AI Workloads:**
  - [FBGEMM](https://github.com/pytorch/FBGEMM) (Meta) - Specialized low-precision GEMM for server inference.
  - [gemmlowp](https://github.com/google/gemmlowp) (Google) - Low-precision (integer) GEMM for efficient ML inference.
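
To ground the terminology before diving in: "GEMM" means the full BLAS-style update `C <- alpha * A @ B + beta * C`, not just a bare product. A minimal NumPy sketch (the shapes here are illustrative, not from any benchmark):

```python
# The GEMM contract, C <- alpha * A @ B + beta * C, in NumPy.
# NumPy dispatches the product to whatever BLAS backend it was built with.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 128))
B = rng.standard_normal((128, 512))
C = rng.standard_normal((256, 512))

alpha, beta = 2.0, 0.5
C = alpha * (A @ B) + beta * C
print(C.shape)  # (256, 512)
```

---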
## Table of Contents
- [Fundamental Theories and Concepts](#fundamental-theories-and-concepts)
- [General Optimization Techniques](#general-optimization-techniques)
- [Frameworks and Development Tools](#frameworks-and-development-tools)
- [Libraries](#libraries)
  - [CPU Libraries](#cpu-libraries)
  - [GPU Libraries](#gpu-libraries)
  - [Cross-Platform Libraries](#cross-platform-libraries)
  - [Language-Specific Libraries](#language-specific-libraries)
- [Debugging and Profiling Tools](#debugging-and-profiling-tools)
- [Learning Resources](#learning-resources)
  - [University Courses & Tutorials](#university-courses--tutorials)
  - [Selected Papers](#selected-papers)
  - [Blogs](#blogs)
- [Example Implementations](#example-implementations)
- [Contributions](#contributions)
- [License](#license)

---
## Fundamental Theories and Concepts
- **What is GEMM?** (a naive reference implementation follows this list)
  - [General Matrix Multiply (Intel)](https://www.intel.com/content/dam/develop/external/us/en/documents/intel-ocl-gemm.pdf) - Intro from Intel.
  - [Spatial-lang GEMM](https://spatial-lang.org/gemm) - High-level overview.
- **Matrix Multiplication Algorithms:**
  - [Strassen's Algorithm](https://en.wikipedia.org/wiki/Strassen_algorithm) - Faster asymptotic complexity for large matrices.
  - [Winograd's Algorithm](https://en.wikipedia.org/wiki/Winograd_algorithm) - Reduced multiplication count for improved performance.
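
As a baseline for everything above: the classical algorithm is three nested loops doing `2*m*n*k` flops, and Strassen-style schemes trade some of those multiplications for extra additions. A deliberately naive Python sketch of the GEMM contract:

```python
# Naive triple-loop GEMM: C <- alpha * A @ B + beta * C.
# O(m*n*k) work, no blocking, no vectorization -- the starting point
# every optimization tutorial in this list improves on.
def gemm_naive(alpha, A, B, beta, C):
    m, k = len(A), len(A[0])
    n = len(B[0])
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C
```

---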
## General Optimization Techniques
- [How To Optimize GEMM](https://github.com/flame/how-to-optimize-gemm) - Hands-on optimization guide.
- [GEMM: From Pure C to SSE Optimized Micro Kernels](https://www.mathematik.uni-ulm.de/~lehn/sghpc/gemm/index.html) - Detailed tutorial on going from naive to vectorized implementations. (A toy cache-blocking sketch follows this list.)
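
A central technique in both guides is cache blocking (tiling): compute C in small sub-blocks so the operands stay resident in cache and each loaded element is reused many times. A loop-structure sketch (Python for readability; the tile size of 64 is an illustrative assumption, real kernels tune it per cache level):

```python
# Cache-blocked (tiled) GEMM: C += A @ B, computed tile by tile.
# Shows the loop nest only; production kernels do this in C/asm with SIMD.
import numpy as np

def gemm_blocked(A, B, C, tile=64):
    m, k = A.shape
    _, n = B.shape
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for p0 in range(0, k, tile):
                # Each small update reuses in-cache tiles of A and B.
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, p0:p0 + tile]
                    @ B[p0:p0 + tile, j0:j0 + tile]
                )
    return C
```

---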
## Frameworks and Development Tools
- [BLIS: A modular framework for building high-performance BLAS-like libraries](https://github.com/flame/blis) (BSD-3-Clause)
- [BLISlab: Educational framework for experimenting with BLIS-like GEMM algorithms](https://github.com/flame/blislab)
- [Tensile: AMD ROCm JIT compiler for GPU kernels, specializing in GEMM and tensor contractions](https://github.com/ROCm/Tensile) (MIT)
- [Tile Language: A concise DSL designed to streamline development of high-performance GPU/CPU kernels like GEMM](https://github.com/tile-ai/tilelang) (MIT)

---
## Libraries
### CPU Libraries
- [BLASFEO: Optimized for small- to medium-sized dense matrices](https://github.com/giaf/blasfeo) (BSD-2-Clause)
- [blis_apple: BLIS optimized for Apple M1](https://github.com/xrq-phys/blis_apple) (BSD-3-Clause)
- [FBGEMM: Meta's CPU GEMM for optimized server inference](https://github.com/pytorch/FBGEMM) (BSD-3-Clause)
- [gemmlowp: Google's low-precision GEMM library](https://github.com/google/gemmlowp) (Apache-2.0)
- [Intel MKL: Highly optimized math routines for Intel CPUs](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html) (Intel Proprietary)
- [libFLAME: High-performance dense linear algebra library](https://github.com/flame/libflame) (BSD-3-Clause)
- [LIBXSMM: Specializing in small/micro GEMM kernels](https://github.com/hfp/libxsmm) (BSD-3-Clause)
- [OpenBLAS: Optimized BLAS implementation based on GotoBLAS2](https://github.com/xianyi/OpenBLAS) (BSD-3-Clause)

### GPU Libraries
- [BitBLAS: Mixed-precision BLAS operations on GPUs](https://github.com/microsoft/BitBLAS) (MIT)
- [BitBLAS-Benchmark](https://github.com/LeiWang1999/bitblas-benchmark)
- [clBLAS: BLAS functions on OpenCL for portability](https://github.com/clMathLibraries/clBLAS) (Apache-2.0)
- [CLBlast: Tuned OpenCL BLAS library](https://github.com/CNugteren/CLBlast) (Apache-2.0)
- [hipBLAS: BLAS for AMD GPU platforms (ROCm)](https://github.com/ROCm/hipBLAS) (MIT)
- [hipBLASLt: Lightweight BLAS library on ROCm](https://github.com/ROCm/hipBLASLt) (MIT)
- [NVIDIA cuBLAS: Highly tuned BLAS for NVIDIA GPUs](https://developer.nvidia.com/cublas) (NVIDIA License)
- [NVIDIA cuDNN: Deep learning primitives, including GEMM](https://developer.nvidia.com/cudnn) (NVIDIA License)
- [NVIDIA cuSPARSE: Sparse matrix computations on NVIDIA GPUs](https://developer.nvidia.com/cusparse) (NVIDIA License)
- [NVIDIA CUTLASS: Template library for CUDA GEMM kernels](https://github.com/NVIDIA/cutlass) (BSD-3-Clause)
- [TiledCUDA: Kernel template library designed to elevate CUDA Cโs level of abstraction for processing tiles](https://github.com/TiledTensor/TiledCUDA)
- [TileFusion: Simplifying Kernel Fusion with Tile Processing](https://github.com/microsoft/TileFusion) (MIT)

### Cross-Platform Libraries
- [ARM Compute Library: Optimized for ARM platforms](https://github.com/ARM-software/ComputeLibrary) (Apache-2.0/MIT)
- [CUSP: C++ templates for sparse linear algebra](https://github.com/cusplibrary/cusplibrary) (Apache-2.0)
- [CUV: C++/Python for CUDA-based vector/matrix ops](https://github.com/deeplearningais/CUV)
- [Ginkgo: High-performance linear algebra on many-core systems](https://github.com/ginkgo-project/ginkgo) (BSD-3-Clause)
- [LAPACK: Foundational linear algebra routines](https://www.netlib.org/lapack/) (BSD-3-Clause)
- [MAGMA: High-performance linear algebra on GPUs and multicore CPUs](https://github.com/icl-utk-edu/magma) (BSD-3-Clause)
- [oneDNN (MKL-DNN): Cross-platform deep learning primitives with optimized GEMM](https://github.com/oneapi-src/oneDNN) (Apache-2.0)
- [viennacl-dev: OpenCL-based linear algebra library](https://github.com/viennacl/viennacl-dev) (MIT)

### Language-Specific Libraries
**Python:** (see the direct BLAS-call sketch after the Julia list)
- [JAX](https://github.com/google/jax) (Apache-2.0)
- [NumPy](https://github.com/numpy/numpy) (BSD-3-Clause)
- [PyTorch](https://github.com/pytorch/pytorch) (BSD-3-Clause)
- [SciPy](https://github.com/scipy/scipy) (BSD-3-Clause)
- [TensorFlow](https://github.com/tensorflow/tensorflow) (Apache-2.0) & [XLA](https://www.tensorflow.org/xla)

**C++:**
- [Armadillo](https://arma.sourceforge.net/) (Apache-2.0/MIT)
- [Blaze](https://bitbucket.org/blaze-lib/blaze/) (BSD-3-Clause)
- [Boost uBlas](https://www.boost.org/doc/libs/release/libs/numeric/ublas/) (Boost License)
- [Eigen](https://gitlab.com/libeigen/eigen) (MPL2)

**Julia:**
- [BLIS.jl](https://github.com/JuliaLinearAlgebra/BLIS.jl) (BSD-3-Clause)
- [GemmKernels.jl](https://github.com/JuliaGPU/GemmKernels.jl) (BSD-3-Clause)
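
The Python libraries above all lean on an optimized `*gemm` routine under the hood; SciPy exposes the BLAS layer directly, which is handy for seeing the GEMM contract without any framework on top. A small sketch (assuming a standard SciPy install with its bundled BLAS):

```python
# Calling BLAS dgemm directly through SciPy's low-level wrappers:
# computes C = alpha * A @ B + beta * C.
import numpy as np
from scipy.linalg import blas

A = np.ones((4, 3))
B = np.ones((3, 5))
C = np.zeros((4, 5))

out = blas.dgemm(alpha=1.0, a=A, b=B, beta=0.0, c=C)
print(np.allclose(out, A @ B))  # True
```

---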
## Debugging and Profiling Tools
**Intel Tools:**
- [Intel Advisor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/advisor.html)
- [Intel VTune Profiler](https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html)
- [unitrace](https://github.com/intel/pti-gpu/tree/master/tools/unitrace)

**NVIDIA Tools:**
- [Nsight Compute](https://developer.nvidia.com/nsight-compute)
- [Nsight Systems](https://developer.nvidia.com/nsight-systems)
- [Nsight Visual Studio Edition](https://developer.nvidia.com/nsight-visual-studio-edition)
- [nvprof](https://docs.nvidia.com/cuda/profiler-users-guide/)

**ROCm Tools:**
- [ROCm Profiler (rocprofiler)](https://github.com/ROCm/rocprofiler)

**Others:**
- [Extrae](https://tools.bsc.es/extrae)
- [FPChecker](https://github.com/LLNL/FPChecker)
- [gprof](https://sourceware.org/binutils/docs/gprof/)
- [gprofng](https://sourceware.org/binutils/docs/gprofng.html)
- [HPCToolkit](https://gitlab.com/hpctoolkit/hpctoolkit)
- [LIKWID](https://github.com/RRZE-HPC/likwid)
- [MegPeak](https://github.com/MegEngine/MegPeak)
- [Perf (Linux)](https://perf.wiki.kernel.org/)
- [TAU](https://www.cs.uoregon.edu/research/tau/home.php)
- [VAMPIR](https://vampir.eu/)
- [Valgrind (Memcheck)](https://valgrind.org/docs/manual/mc-manual.html)
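
Before reaching for any of these profilers, a useful first-order check is achieved throughput: a GEMM of shape (m, n, k) performs `2*m*n*k` floating-point operations, so GFLOP/s falls out of a simple timer. A minimal harness sketch (sizes are illustrative):

```python
# First-order GEMM throughput check: GFLOP/s = 2*m*n*k / elapsed seconds.
# A sanity check to compare against hardware peak, not a profiler substitute.
import time
import numpy as np

m, n, k = 2048, 2048, 2048
A = np.random.rand(m, k).astype(np.float32)
B = np.random.rand(k, n).astype(np.float32)

A @ B  # warm-up: thread-pool spin-up, page faults, caches
t0 = time.perf_counter()
C = A @ B
dt = time.perf_counter() - t0
print(f"{2 * m * n * k / dt / 1e9:.1f} GFLOP/s")
```

---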
## Learning Resources
### University Courses & Tutorials
- [CUDATutorial](https://github.com/PaddleJitLab/CUDATutorial)
- [GPU MODE YouTube Channel](https://www.youtube.com/channel/UCJgIbYl6C5no72a0NUAPcTA)
- [HLS Tutorial & Deep Learning Accelerator Lab1](https://courses.cs.washington.edu/courses/cse599s/18sp/hw/1.html)
- [HPC Garage](https://github.com/hpcgarage)
- [MIT OCW: 6.172 Performance Engineering](https://ocw.mit.edu/courses/6-172-performance-engineering-of-software-systems-fall-2018/)
- [MIT: Optimizing Matrix Multiplication (6.172 Lecture Notes)](https://ocw.mit.edu/courses/6-172-performance-engineering-of-software-systems-fall-2018/6-172-fall-2018/lecture-notes/)
- [NJIT: Optimize Matrix Multiplication](https://web.njit.edu/~apv6/courses/hw1.html)
- [Optimizing Matrix Multiplication using SIMD and Parallelization](https://ocw.mit.edu/courses/6-172-performance-engineering-of-software-systems-fall-2018/6-172-fall-2018/lecture-notes/MIT6_172F18_lec5.pdf)
- [ORNL: CUDA C++ Exercise: Basic Linear Algebra Kernels: GEMM Optimization Strategies](https://bluewaters.ncsa.illinois.edu/liferay-content/image-gallery/content/BLA-final)
- [Purdue: Optimizing Matrix Multiplication](https://www.cs.purdue.edu/homes/grr/cs250/lab6-cache/optimizingMatrixMultiplication.pdf)
- [Stanford: BLAS-level CPU Performance in 100 Lines of C](https://cs.stanford.edu/people/shadjis/blas.html)
- [UC Berkeley: CS267 Parallel Computing](https://sites.google.com/lbl.gov/cs267-spr2023)
- [UCSB CS 240A: Applied Parallel Computing](https://sites.cs.ucsb.edu/~tyang/class/240a17/refer.html)
- [UT Austin: LAFF-On Programming for High Performance](https://www.cs.utexas.edu/users/flame/laff/pfhp/index.html)

### Selected Papers
- [BLIS: A Framework for Rapidly Instantiating BLAS Functionality (2015)](https://dl.acm.org/doi/10.1145/2764454)
- [Anatomy of High-Performance Many-Threaded Matrix Multiplication (2014)](https://ieeexplore.ieee.org/document/6877334)
- [Model-driven BLAS Performance on Loongson (2012)](https://ieeexplore.ieee.org/document/6413635)
- [High-performance Implementation of the Level-3 BLAS (2008)](https://dl.acm.org/doi/10.1145/1377603.1377607)
- [Anatomy of High-Performance Matrix Multiplication (2008)](https://dl.acm.org/doi/10.1145/1356052.1356053)

### Blogs
- [A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library](https://research.colfax-intl.com/nvidia-hopper-flashattention-2/)
- [Building a FAST Matrix Multiplication Algorithm](https://v0dro.in/blog/2018/05/01/building-a-fast-matrix-multiplication-algorithm/)
- [CUDA GEMM Optimization](https://github.com/leimao/CUDA-GEMM-Optimization)
- [CUDA Learn Notes](https://github.com/DefTruth/CUDA-Learn-Notes)
- [CUTLASS Tutorial: Efficient GEMM kernel designs with Pipelining](https://research.colfax-intl.com/cutlass-tutorial-design-of-a-gemm-kernel/)
- [CUTLASS Tutorial: Fast Matrix-Multiplication with WGMMA on NVIDIA® Hopper™ GPUs](https://research.colfax-intl.com/cutlass-tutorial-wgmma-hopper/)
- [CUTLASS Tutorial: Persistent Kernels and Stream-K](https://research.colfax-intl.com/cutlass-tutorial-persistent-kernels-and-stream-k/)
- [Deep Dive on CUTLASS Ping-Pong GEMM Kernel](https://pytorch.org/blog/cutlass-ping-pong-gemm-kernel/)
- [Developing CUDA Kernels for GEMM on NVIDIA Hopper Architecture using CUTLASS](https://research.colfax-intl.com/nvidia-hopper-gemm-cutlass/)
- [Distributed GEMM - A novel CUTLASS-based implementation of Tensor Parallelism for NVLink-enabled systems](https://blog.shi-labs.com/distributed-gemm-88be6a481e2b)
- [Epilogue Fusion in CUTLASS with Epilogue Visitor Trees](https://research.colfax-intl.com/epilogue_visitor_tree/)
- [Fast Multidimensional Matrix Multiplication on CPU from Scratch](https://siboehm.com/articles/22/Fast-MMM-on-CPU)
- [Matrix Multiplication Background Guide (NVIDIA)](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html)
- [Matrix Multiplication on CPU](https://marek.ai/matrix-multiplication-on-cpu.html)
- [Matrix-Matrix Product Experiments with BLAZE](https://www.mathematik.uni-ulm.de/~lehn/test_blaze/index.html)
- [Mixed-Input GEMM Kernel Optimized for NVIDIA Hopper GPUs](https://neuralmagic.com/blog/introducing-machete-a-mixed-input-gemm-kernel-optimized-for-nvidia-hopper-gpus/)
- [Mixed-input matrix multiplication performance optimizations](https://research.google/blog/mixed-input-matrix-multiplication-performance-optimizations/)
- [How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: A Worklog](https://siboehm.com/articles/22/CUDA-MMM)
- [Outperforming cuBLAS on H100: a Worklog](https://cudaforfun.substack.com/p/outperforming-cublas-on-h100-a-worklog)
- [Optimizing Matrix Multiplication](https://coffeebeforearch.github.io/2020/06/23/mmul.html)
- [Optimizing Matrix Multiplication: Cache + OpenMP](https://www.mgaillard.fr/2020/08/29/matrix-multiplication-optimizing.html)
- [perf-book by Denis Bakhvalov](https://github.com/dendibakh/perf-book)
- [Tuning Matrix Multiplication (GEMM) for Intel GPUs](https://www.ibiblio.org/e-notes/webgl/gpu/mul/intel.htm)
- [Why GEMM is at the heart of deep learning](https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/)

---
## Example Implementations
- [chgemm: Int8 GEMM implementations](https://github.com/tpoisonooo/chgemm)
- [CoralGemm: AMD high-performance GEMM implementations](https://github.com/AMD-HPC/CoralGemm) (MIT)
- [CUDA-INT8-GEMM](https://github.com/jundaf2/CUDA-INT8-GEMM)
- [cuda-sgemm](https://github.com/nicolaswilde/cuda-sgemm)
- [cute_gemm](https://github.com/weishengying/cute_gemm)
- [Cute-Learning](https://github.com/DD-DuDa/Cute-Learning) (MIT)
- [CUTLASS-based Grouped GEMM: Efficient grouped GEMM operations](https://github.com/tgale96/grouped_gemm) (Apache-2.0)
- [CUTLASS GEMM](https://github.com/Bruce-Lee-LY/cutlass_gemm) (BSD-3-Clause)
- [DeepBench](https://github.com/baidu-research/DeepBench) (Apache-2.0)
- [how-to-optimize-gemm (row-major matmul)](https://github.com/tpoisonooo/how-to-optimize-gemm) (GPLv3)
- [NVIDIA_SGEMM_PRACTICE: Step-by-step optimization of CUDA SGEMM](https://github.com/wangzyon/NVIDIA_SGEMM_PRACTICE)
- [Optimizing-SGEMM-on-NVIDIA-Turing-GPUs](https://github.com/yzhaiustc/Optimizing-SGEMM-on-NVIDIA-Turing-GPUs) (GPLv3)
- [SGEMM_CUDA: Step-by-Step Optimization](https://github.com/siboehm/SGEMM_CUDA) (MIT)
- [simple-gemm](https://github.com/williamfgc/simple-gemm) (MIT)
- [TK-GEMM: a Triton FP8 GEMM kernel using SplitK parallelization](https://pytorch.org/blog/accelerating-llama3/)
- [Toy HGEMM (Tensor Cores with MMA/WMMA)](https://github.com/DefTruth/hgemm-tensorcores-mma) (GPLv3)
- [xGeMM: Accelerated General (FP32) Matrix Multiplication](https://github.com/tgautam03/xGeMM) (MIT)

---
## Contributions
We welcome and encourage contributions! You can help by:
- Adding new libraries, tools, or tutorials.
- Submitting performance benchmarks or example implementations.
- Improving documentation or correcting errors.

Submit a pull request or open an issue to get started!
---
## License
This repository is licensed under the [MIT License](LICENSE).
---
*By maintaining this curated list, we hope to empower the community to learn, implement, and optimize GEMM efficiently. Thanks for visiting, and happy computing!*