Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-gemm
A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software
https://github.com/jssonx/awesome-gemm
Last synced: about 15 hours ago
Frameworks and Development Tools
Libraries
GPU Libraries
- NVIDIA CUTLASS: Template library for CUDA GEMM kernels (BSD-3-Clause)
- NVIDIA cuBLAS: Highly tuned BLAS for NVIDIA GPUs
- NVIDIA cuSPARSE: Sparse matrix computations on NVIDIA GPUs
- TileFusion: Simplifying Kernel Fusion with Tile Processing
- clBLAS: BLAS functions on OpenCL for portability (Apache-2.0)
- CLBlast: Tuned OpenCL BLAS library (Apache-2.0)
- NVIDIA cuDNN: Deep learning primitives, including GEMM
- hipBLAS: BLAS for AMD GPU platforms (ROCm)
- hipBLASLt: Lightweight BLAS library on ROCm
- BitBLAS: Mixed-precision BLAS operations on GPUs
- BitBLAS-Benchmark
- TiledCUDA: Kernel template library designed to elevate CUDA C's level of abstraction for processing tiles
Cross-Platform Libraries
- CUSP: C++ templates for sparse linear algebra (Apache-2.0)
- CUV: C++/Python for CUDA-based vector/matrix ops
- LAPACK: Foundational linear algebra routines (BSD-3-Clause)
- ARM Compute Library: Optimized for ARM platforms (Apache-2.0/MIT)
- MAGMA: High-performance linear algebra on GPUs and multicore CPUs (BSD-3-Clause)
- oneDNN (MKL-DNN): Cross-platform deep learning primitives with optimized GEMM (Apache-2.0)
- viennacl-dev: OpenCL-based linear algebra library
- Ginkgo: High-performance linear algebra on many-core systems (BSD-3-Clause)
Language-Specific Libraries
- BLIS.jl (BSD-3-Clause)
- Armadillo (Apache-2.0/MIT)
- Boost uBlas
- NumPy (BSD-3-Clause)
- SciPy (BSD-3-Clause)
- TensorFlow (Apache-2.0) & [XLA](https://www.tensorflow.org/xla)
- JAX (Apache-2.0)
- PyTorch (BSD-3-Clause)
- GemmKernels.jl (BSD-3-Clause)
- Eigen
CPU Libraries
- blis_apple: BLIS optimized for Apple M1 (BSD-3-Clause)
- libFLAME: High-performance dense linear algebra library (BSD-3-Clause)
- OpenBLAS: Optimized BLAS implementation based on GotoBLAS2 (BSD-3-Clause)
- Intel MKL: Highly optimized math routines for Intel CPUs
- FBGEMM: Meta's CPU GEMM for optimized server inference (BSD-3-Clause)
- gemmlowp: Google's low-precision GEMM library (Apache-2.0)
- BLASFEO: Optimized for small- to medium-sized dense matrices (BSD-2-Clause)
- LIBXSMM: Specializing in small/micro GEMM kernels (BSD-3-Clause)
Libraries
- OpenBLAS
- Eigen
- MAGMA (Matrix Algebra on GPU and Multicore Architectures) - Next-generation linear algebra libraries for heterogeneous computing.
- NumPy
- TensorFlow - Open-source software library for machine learning.
- PyTorch - Open-source software library for machine learning.
- ViennaCL - Open-source linear algebra library for computations on many-core architectures (GPUs, MIC) and multi-core CPUs. The library is written in C++ and supports CUDA, OpenCL, and OpenMP (including switches at runtime).
- Blaze
GPU Libraries
- cutlass_fpA_intB_gemm [`Apache-2.0`](https://github.com/tlc-pack/cutlass_fpA_intB_gemm/blob/main/LICENSE)
- DGEMM on Int8 Tensor Core
- hipBLAS-common
- OpenAI GEMM
- ArrayFire - General-purpose GPU library that simplifies GPU computing with high-level functions, including matrix operations. [`BSD-3-Clause`](https://github.com/arrayfire/arrayfire/blob/master/LICENSE)
CPU Libraries
- Xianyi Zhang
- libFLAME - High-performance dense linear algebra library. [`BSD-3-Clause`](https://github.com/flame/libflame/blob/master/LICENSE.txt)
- OpenBLAS [`BSD-3-Clause`](https://github.com/xianyi/OpenBLAS/blob/develop/LICENSE)
Language-Specific Libraries
Debugging and Profiling Tools
Learning Resources
Selected Papers
- High-performance Implementation of the Level-3 BLAS (2008)
- Anatomy of High-Performance Many-Threaded Matrix Multiplication (2014)
- Model-driven BLAS Performance on Loongson (2012)
- BLIS: A Framework for Rapidly Instantiating BLAS Functionality (2015)
- Anatomy of High-Performance Matrix Multiplication (2008)
Blogs
- perf-book by Denis Bakhvalov
- Matrix Multiplication on CPU
- Tuning Matrix Multiplication (GEMM) for Intel GPUs
- Building a FAST Matrix Multiplication Algorithm
- Matrix Multiplication Background Guide (NVIDIA)
- Outperforming cuBLAS on H100: a Worklog
- Deep Dive on CUTLASS Ping-Pong GEMM Kernel
- CUTLASS Tutorial: Efficient GEMM kernel designs with Pipelining
- Fast Multidimensional Matrix Multiplication on CPU from Scratch
- Distributed GEMM - A novel CUTLASS-based implementation of Tensor Parallelism for NVLink-enabled systems
- A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library
- CUTLASS Tutorial: Fast Matrix-Multiplication with WGMMA on NVIDIA® Hopper™ GPUs
- Developing CUDA Kernels for GEMM on NVIDIA Hopper Architecture using CUTLASS
- Epilogue Fusion in CUTLASS with Epilogue Visitor Trees
- Matrix-Matrix Product Experiments with BLAZE
- Mixed-input matrix multiplication performance optimizations
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: A Worklog
- Mixed-Input GEMM Kernel Optimized for NVIDIA Hopper GPUs
- Optimizing Matrix Multiplication
- Optimizing Matrix Multiplication: Cache + OpenMP
- CUDA Learn Notes
- CUDA GEMM Optimization
- Why GEMM is at the heart of deep learning
- CUTLASS Tutorial: Persistent Kernels and Stream-K
University Courses & Tutorials
- HLS Tutorial & Deep Learning Accelerator Lab1
- UCSB CS 240A: Applied Parallel Computing
- UC Berkeley: CS267 Parallel Computing
- ORNL: CUDA C++ Exercise: Basic Linear Algebra Kernels: GEMM Optimization Strategies
- Stanford: BLAS-level CPU Performance in 100 Lines of C
- Purdue: Optimizing Matrix Multiplication
- NJIT: Optimize Matrix Multiplication
- MIT: Optimizing Matrix Multiplication (6.172 Lecture Notes)
- UT Austin: LAFF-On Programming for High Performance
- MIT OCW: 6.172 Performance Engineering
- Optimizing Matrix Multiplication using SIMD and Parallelization
- HPC Garage
- CUDATutorial
Example Implementations
Blogs
- simple-gemm
- chgemm: Int8 GEMM implementations
- Toy HGEMM (Tensor Cores with MMA/WMMA)
- Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
- xGeMM: Accelerated General (FP32) Matrix Multiplication
- SGEMM_CUDA: Step-by-Step Optimization
- TK-GEMM: a Triton FP8 GEMM kernel using SplitK parallelization
- CUTLASS-based Grouped GEMM: Efficient grouped GEMM operations (Apache-2.0)
- CoralGemm: AMD high-performance GEMM implementations
- how-to-optimize-gemm (row-major matmul)
- DeepBench (Apache-2.0)
- CUDA-INT8-GEMM
- cuda-sgemm
- cute_gemm
- Cute-Learning
- CUTLASS GEMM (BSD-3-Clause)
- NVIDIA_SGEMM_PRACTICE: Step-by-step optimization of CUDA SGEMM
Example Implementations
Other Resources
- YHs_Sample
- GEMM
- GEMM Optimization with LIBXSMM [`BSD-3-Clause`](https://github.com/libxsmm/libxsmm/blob/main/LICENSE.md)
Fundamental Theories and Concepts
- Spatial-lang GEMM - High-level overview.
- General Matrix Multiply (Intel) - Intro from Intel.
- Strassen's Algorithm - Faster asymptotic complexity for large matrices.
- Winograd's Algorithm - Reduced multiplication count for improved performance.
General Optimization Techniques
- GEMM: From Pure C to SSE Optimized Micro Kernels - Detailed tutorial on going from naive to vectorized implementations.
- How To Optimize GEMM - Hands-on optimization guide.
Frameworks and Development Tools
Development Software: Debugging and Profiling
- gprof
- FPChecker - Detects floating-point accuracy problems.
- HPCToolkit
Language-Specific Libraries
- Perf - level metrics. [`GPLv2`](https://github.com/torvalds/linux/blob/master/COPYING)
- gprofng-gui (GPL-3.0)
- nvprof - Command-line profiler for CUDA applications. [`NVIDIA End User License Agreement`](https://docs.nvidia.com/cuda/eula/index.html)
Learning Resources
University Courses & Tutorials
Blogs
- The OpenBLAS Project and Matrix Multiplication Optimization
- OpenBLAS GEMM from Scratch
- The Proper Approach to CUDA for Beginners: How to Optimize GEMM
- ARMv7 4x4kernel Optimization Practice
- Distributed GEMM - A novel CUTLASS-based implementation of Tensor Parallelism for NVLink-enabled systems
- GEMM Caching
Other Resources
University Courses & Tutorials
Other Learning Resources
Programming Languages