Projects in Awesome Lists tagged with gemm
A curated list of projects in awesome lists tagged with gemm.
https://github.com/opennmt/ctranslate2
Fast inference engine for Transformer models
avx avx2 cpp cuda deep-learning deep-neural-networks gemm inference intrinsics machine-translation mkl neon neural-machine-translation onednn openmp opennmt parallel-computing quantization thrust transformer-models
Last synced: 14 May 2025
https://github.com/xlite-dev/cuda-learn-notes
📚Modern CUDA Learn Notes: 200+ Tensor/CUDA Cores Kernels🎉, HGEMM, FA2 via MMA and CuTe, 98~100% TFLOPS of cuBLAS/FA2.
cuda cuda-kernels cuda-programming cuda-toolkit cudnn cutlass flash-attention flash-mla gemm gemv hgemm
Last synced: 15 Apr 2025
https://github.com/OpenNMT/CTranslate2
Fast inference engine for Transformer models
avx avx2 cpp cuda deep-learning deep-neural-networks gemm inference intrinsics machine-translation mkl neon neural-machine-translation onednn openmp opennmt parallel-computing quantization thrust transformer-models
Last synced: 02 Apr 2025
https://github.com/xlite-dev/CUDA-Learn-Notes
📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
cuda cuda-kernels cuda-programming cuda-toolkit cudnn cutlass flash-attention flash-mla gemm gemv hgemm
Last synced: 26 Mar 2025
https://github.com/DefTruth/CUDA-Learn-Notes
📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
cuda cuda-kernels cuda-programming cuda-toolkit cudnn cutlass flash-attention flash-mla gemm gemv hgemm
Last synced: 20 Mar 2025
https://github.com/flame/how-to-optimize-gemm
blis code-optimization gemm gotoblas matrix-multiplication
Last synced: 15 May 2025
https://github.com/cnugteren/clblast
Tuned OpenCL BLAS
blas blas-libraries clblas gemm gpu matrix-multiplication opencl
Last synced: 14 May 2025
https://github.com/CNugteren/CLBlast
Tuned OpenCL BLAS
blas blas-libraries clblas gemm gpu matrix-multiplication opencl
Last synced: 15 Mar 2025
https://github.com/flame/blislab
BLISlab: A Sandbox for Optimizing GEMM
blis code-optimization gemm matrix-multiplication
Last synced: 04 Apr 2025
https://github.com/Bruce-Lee-LY/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.
cublas cuda gemm gpu hgemm matrix-multiply nvidia tensor-core
Last synced: 14 May 2025
https://github.com/bruce-lee-ly/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.
cublas cuda gemm gpu hgemm matrix-multiply nvidia tensor-core
Last synced: 05 Apr 2025
https://github.com/mratsim/laser
The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers
assembler blas compiler-optimization convolution deep-learning gemm high-performance-computing jit matrix-multiplication openmp parallel runtime-cpu-detection simd tensor
Last synced: 08 Apr 2025
https://github.com/rocm/tensile
Stretching GPU performance for GEMMs and tensor contractions.
amd assembly auto-tuning blas dnn gemm gpu gpu-acceleration gpu-computing hip machine-learning matrix-multiplication neural-networks opencl python radeon tensor-contraction tensors
Last synced: 05 Apr 2025
https://github.com/ROCm/Tensile
Stretching GPU performance for GEMMs and tensor contractions.
amd assembly auto-tuning blas dnn gemm gpu gpu-acceleration gpu-computing hip machine-learning matrix-multiplication neural-networks opencl python radeon tensor-contraction tensors
Last synced: 30 Nov 2024
https://github.com/Bruce-Lee-LY/cuda_hgemv
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
cublas cuda cuda-core gemm gemv gpu hgemm hgemv matrix-multiply nvidia tensor-core
Last synced: 14 May 2025
https://github.com/enp1s0/ozimmu
FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme
cuda gemm mixed-precision tensorcore tensorcores
Last synced: 09 Apr 2025
https://github.com/rocm/hipblaslt
hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditional BLAS library
amd assembly blas gemm gpu-computing hip machine-learning matrix-multiplication rocm
Last synced: 04 Apr 2025
https://github.com/enp1s0/ozIMMU
FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme
cuda gemm mixed-precision tensorcore tensorcores
Last synced: 04 Apr 2025
https://github.com/bruce-lee-ly/cuda_hgemv
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
cublas cuda cuda-core gemm gemv gpu hgemm hgemv matrix-multiply nvidia tensor-core
Last synced: 19 Dec 2024
https://github.com/eth-cscs/spla
Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceleration.
cuda gemm linear-algebra mpi rocm
Last synced: 14 Apr 2025
https://github.com/iVishalr/GEMM
Fast matrix multiplication implementation in the C programming language. The algorithm is similar to what NumPy uses to compute dot products.
c gemm gemm-optimization matrix-multiplication
Last synced: 04 Apr 2025
https://github.com/szagoruyko/openai-gemm.pytorch
PyTorch bindings for openai-gemm
Last synced: 03 Jan 2025
https://github.com/bruce-lee-ly/cutlass_gemm
Multiple GEMM operators built with CUTLASS to support LLM inference.
cublas cublaslt cutlass gemm gpu llm matrix-multiply nvidia tensor-core
Last synced: 13 Apr 2025
https://github.com/bruce-lee-ly/cuda_back2back_hgemm
Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.
back2back-gemm back2back-hgemm cublas cuda fused-gemm fused-hgemm gemm gpu hgemm matrix-multiply nvidia tensor-core
Last synced: 13 Apr 2025
https://github.com/enp1s0/cumpsgemm
Fast SGEMM emulation on Tensor Cores
cuda fp32 gemm gpu half-precision mixed-precision tensorcore tensorcores
Last synced: 09 Apr 2025
https://github.com/xiaosong9905/dgemm-knl
DGEMM on KNL, achieving 75% of MKL performance
dgemm gemm high-performance hpc linear-algebra x86
Last synced: 15 May 2025
https://github.com/dev0x13/gemm-benchmark-2023
Benchmarks for some modern (2023) high-performance floating-point GEMM implementations compared to Mojo language
Last synced: 21 Mar 2025
https://github.com/zhangge6/how-to-optimize-playground
High-performance computing (HPC) demos written since I was a freshman.
Last synced: 06 Apr 2025
https://github.com/jiegec/sgemm-optimize
Optimization of sgemm in Kunpeng platform
Last synced: 01 Apr 2025
https://github.com/andreytkachenko/yarblas
Yet another Rust BLAS
blas gemm machine-learning math rust rust-lang
Last synced: 21 Mar 2025
https://github.com/daedalus/aiscripts
Scripts created with AI from white papers and publications
ai bloomfilter cassowary-algorithm cuckoo-filter feistel-network gemm gpt karatsuba karatsuba-matrix-multiplication llm meta-gradient-descent noprop phylolm whitepaper zeedlm zeromerge
Last synced: 05 May 2025
https://github.com/jerinphilip/mozintgemm
Wrapper around intgemm (x86_64) and ruy (ARM) to switch between both based on architecture and provide a fast matrix multiplication backend for Mozilla Firefox's translation feature.
Last synced: 20 Feb 2025
https://github.com/tensorbfs/cutropicalgemm.jl
The fastest Tropical number matrix multiplication on GPU
Last synced: 07 Apr 2025
https://github.com/fattorib/thunderkittens-simple-gemm
Simple Tensorcore GEMM in ThunderKittens
Last synced: 14 Apr 2025
https://github.com/teambipartite/csc485b-202409-a4
High-throughput data-parallel GEMM implementations in CUDA using CUDA cores and Tensor Cores
Last synced: 19 Feb 2025