awesome-gemm
A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software.
https://github.com/jssonx/awesome-gemm
General Optimization Techniques
- How To Optimize GEMM
- GEMM: From Pure C to SSE Optimized Micro Kernels - An in-depth look at optimizing GEMM, from basic C to SSE-optimized micro-kernels (a minimal sketch of this progression follows this list).
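Both tutorials above start from the same baseline, so a minimal sketch may help frame them: a naive triple-loop GEMM in plain C and a loop-interchanged variant that is usually the first optimization step. This is an illustrative sketch only, not code taken from either guide; row-major storage and the C += A * B convention are assumptions made here.

```c
/* Naive ijk ordering: the innermost loop walks down a column of B
 * (stride N in row-major storage), which is cache-unfriendly. */
void gemm_naive(int M, int N, int K,
                const float *A, const float *B, float *C) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float acc = C[i * N + j];
            for (int k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}

/* ikj ordering: B and C are now accessed with unit stride in the inner
 * loop, which is typically the first big win before blocking and SIMD. */
void gemm_ikj(int M, int N, int K,
              const float *A, const float *B, float *C) {
    for (int i = 0; i < M; ++i)
        for (int k = 0; k < K; ++k) {
            float a = A[i * K + k];
            for (int j = 0; j < N; ++j)
                C[i * N + j] += a * B[k * N + j];
        }
}
```

From here the linked guides proceed to blocking, packing, and vectorized micro-kernels, which is broadly the progression both of them follow.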
Frameworks
- BLIS - A portable software framework for instantiating high-performance BLAS-like dense linear algebra libraries.
- BLISlab - A sandbox for learning how to optimize BLIS-like GEMM algorithms.
- SHPC at UT Austin (formerly FLAME)
Libraries
- NVIDIA CUTLASS 3.3
- Google gemmlowp: a small self-contained low-precision GEMM library - Low-precision GEMM optimization by Google.
- OpenBLAS - An optimized BLAS library based on GotoBLAS2 (see the CBLAS call sketch after this list).
- cutlass_fpA_intB_gemm
- CUSP
- CUV
- Eigen
- MAGMA (Matrix Algebra on GPU and Multicore Architectures) - Next-generation linear algebra libraries for heterogeneous computing.
- LAPACK
- Xianyi Zhang - Founder of the OpenBLAS project.
- NumPy
- TensorFlow - Open-source software library for machine learning.
- PyTorch - Open-source software library for machine learning.
- NVIDIA cuBLAS
- NVIDIA cuSPARSE
- libFLAME
- ViennaCL - A free open-source linear algebra library for computations on many-core architectures (GPUs, MIC) and multi-core CPUs. The library is written in C++ and supports CUDA, OpenCL, and OpenMP (including switches at runtime).
- Boost uBlas
- Armadillo
- Blaze
- ARM Compute Library - Low-level machine learning functions optimized for Arm® Cortex®-A, Arm® Neoverse® and Arm® Mali™ GPU architectures.
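Most of the CPU libraries above (OpenBLAS, and BLIS through its BLAS/CBLAS compatibility layer) are called through the standard CBLAS interface, so a minimal call sketch is included here for orientation. The `cblas_sgemm` routine itself is standard; the `-lopenblas` link line is an assumption about an OpenBLAS-style installation.

```c
/* Minimal SGEMM call through the CBLAS interface.
 * Build e.g.: cc sgemm_demo.c -O2 -lopenblas
 * Computes C = alpha * A * B + beta * C with row-major matrices. */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    enum { M = 2, N = 2, K = 3 };
    float A[M * K] = {1, 2, 3,
                      4, 5, 6};
    float B[K * N] = { 7,  8,
                       9, 10,
                      11, 12};
    float C[M * N] = {0};

    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0f,      /* alpha */
                A, K,      /* A and its leading dimension */
                B, N,      /* B and its leading dimension */
                0.0f,      /* beta */
                C, N);     /* C and its leading dimension */

    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j)
            printf("%6.1f ", C[i * N + j]);
        printf("\n");
    }
    return 0;
}
```

The GPU libraries (cuBLAS, CUTLASS) follow the same GEMM parameter conventions but default to column-major layouts, so leading dimensions are chosen differently there.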
Development Software: Debugging and Profiling
- MegPeak - A tool for measuring the peak compute throughput of processors.
- Memcheck (Valgrind)
- Intel VTune Profiler
- gprof
- FPChecker - A tool for detecting floating-point accuracy problems.
- HPCToolkit
Selected Papers
- High-performance implementation of the level-3 BLAS
- Anatomy of High-Performance Many-Threaded Matrix Multiplication
- Model-driven Level 3 BLAS Performance Optimization on Loongson 3A Processor
- BLIS: A Framework for Rapidly Instantiating BLAS Functionality
- Anatomy of high-performance matrix multiplication
Blogs
- Step by step optimization of cuda sgemm
- Optimizing Matrix Multiplication
- GEMM caching
- Matrix Multiplication on CPU
- Optimizing matrix multiplication: cache + OpenMP (a minimal sketch of this approach follows this list)
- Tuning matrix multiplication (GEMM) for Intel GPUs
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
- Building a FAST matrix multiplication algorithm
- Matrix-Matrix Product Experiments with BLAZE
- The OpenBLAS Project and Matrix Multiplication Optimization
- OpenBLAS gemm from scratch
- The Proper Approach to CUDA for Beginners: How to Optimize GEMM
- ARMv7 4x4 Kernel Optimization Practice
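Several of the posts above, the cache + OpenMP one in particular, revolve around the same two ideas: tile the loops so each working set fits in cache, and parallelize over independent tiles of C. A minimal sketch of that combination follows; the block size of 64 and the `-O3 -fopenmp` build flags are illustrative assumptions, not values taken from any of the posts.

```c
/* Cache-blocked, OpenMP-parallel GEMM sketch: C += A * B, row-major,
 * C is M x N, A is M x K, B is K x N. Compile with e.g. -O3 -fopenmp.
 * BLK = 64 is an illustrative guess, not a tuned value. */
#define BLK 64

static inline int min_int(int a, int b) { return a < b ? a : b; }

void gemm_blocked_omp(int M, int N, int K,
                      const float *A, const float *B, float *C) {
    /* Each (ii, jj) pair owns a distinct tile of C, so threads never
     * write the same elements and no synchronization is needed. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int ii = 0; ii < M; ii += BLK)
        for (int jj = 0; jj < N; jj += BLK)
            for (int kk = 0; kk < K; kk += BLK)
                /* Multiply one BLK x BLK tile of A by one tile of B,
                 * accumulating into the (ii, jj) tile of C. */
                for (int i = ii; i < min_int(ii + BLK, M); ++i)
                    for (int k = kk; k < min_int(kk + BLK, K); ++k) {
                        float a = A[i * K + k];
                        for (int j = jj; j < min_int(jj + BLK, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

The CUDA-focused posts apply the same tiling idea, with shared memory playing the role of the cache and thread blocks playing the role of the OpenMP threads.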
Other Learning Resources
Tiny Examples
- SGEMM_CUDA - Step-by-step optimization of matrix multiplication, implemented in CUDA (a sketch of the usual verify-and-time harness around such kernels follows this list).
- simple-gemm
- YHs_Sample
- how-to-optimize-gemm - Row-major matmul optimization tutorial.
- GEMM
- BLIS.jl - Low-level Julia wrapper for the BLIS typed interface.
- blis_apple
- DGEMM on Int8 Tensor Core
- chgemm
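The tiny examples above generally wrap their kernels in the same kind of harness: compute a reference result with an established BLAS, check the maximum difference, and report GFLOP/s. A minimal sketch of such a harness follows; it assumes the `gemm_ikj` kernel from the earlier sketch in this list and uses `cblas_sgemm` (e.g. from OpenBLAS) as the reference, so both the kernel name and the build line are assumptions.

```c
/* Verify-and-time harness sketch for a hand-written SGEMM kernel.
 * Build e.g.: cc harness.c kernels.c -O2 -lopenblas -lm */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <cblas.h>

/* Kernel under test, from the earlier sketch in this list (assumed). */
void gemm_ikj(int M, int N, int K,
              const float *A, const float *B, float *C);

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    const int M = 512, N = 512, K = 512;
    float *A     = malloc(sizeof(float) * M * K);
    float *B     = malloc(sizeof(float) * K * N);
    float *C     = calloc(M * N, sizeof(float));  /* kernel under test */
    float *C_ref = calloc(M * N, sizeof(float));  /* reference result  */

    for (int i = 0; i < M * K; ++i) A[i] = (float)rand() / RAND_MAX;
    for (int i = 0; i < K * N; ++i) B[i] = (float)rand() / RAND_MAX;

    /* Reference: C_ref = A * B via CBLAS. */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0f, A, K, B, N, 0.0f, C_ref, N);

    /* Kernel under test, timed. */
    double t0 = now_sec();
    gemm_ikj(M, N, K, A, B, C);
    double t1 = now_sec();

    /* Maximum absolute difference against the reference. */
    float max_diff = 0.0f;
    for (int i = 0; i < M * N; ++i) {
        float d = fabsf(C[i] - C_ref[i]);
        if (d > max_diff) max_diff = d;
    }

    double gflops = 2.0 * M * N * K / (t1 - t0) / 1e9;
    printf("max |diff| = %g, %.2f GFLOP/s\n", max_diff, gflops);

    free(A); free(B); free(C); free(C_ref);
    return 0;
}
```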
Fundamental Theories and Concepts
University Courses & Tutorials
Lecture Notes
Keywords
blis (4), gemm (4), matrix-multiplication (4), cuda (4), code-optimization (2), blas (2), armv7 (2), gemm-optimization (2), cpp (2), lapacke (1), lapack (1), nvidia (1), gpu (1), deep-learning-library (1), deep-learning (1), optimization (1), matrix-library (1), matrix-functions (1), matrix-calculations (1), matrix (1), linear-algebra-library (1), linear-algebra (1), hpc (1), high-performance-computing (1), high-performance (1), blas-libraries (1), gotoblas (1), sve (1), simd (1), opencl (1), neural-network (1), neon (1), machine-learning (1), linux (1), computer-vision (1), armv8 (1), arm (1), android (1), aarch64 (1), tensorcores (1), tensorcore (1), mixed-precision (1), wrapper (1), matrix-multiplications (1), julia (1), c (1), vulkan (1), ptx (1), int4 (1), cuda-kernel (1)