Projects in Awesome Lists tagged with gemm

https://github.com/opennmt/ctranslate2

Fast inference engine for Transformer models

avx avx2 cpp cuda deep-learning deep-neural-networks gemm inference intrinsics machine-translation mkl neon neural-machine-translation onednn openmp opennmt parallel-computing quantization thrust transformer-models

Last synced: 14 May 2025

https://github.com/xlite-dev/cuda-learn-notes

📚Modern CUDA Learn Notes: 200+ Tensor/CUDA Cores Kernels🎉, HGEMM, FA2 via MMA and CuTe, 98~100% TFLOPS of cuBLAS/FA2.

cuda cuda-kernels cuda-programming cuda-toolkit cudnn cutlass flash-attention flash-mla gemm gemv hgemm

Last synced: 15 Apr 2025

https://github.com/OpenNMT/CTranslate2

Fast inference engine for Transformer models

avx avx2 cpp cuda deep-learning deep-neural-networks gemm inference intrinsics machine-translation mkl neon neural-machine-translation onednn openmp opennmt parallel-computing quantization thrust transformer-models

Last synced: 02 Apr 2025

https://github.com/xlite-dev/CUDA-Learn-Notes

📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).

cuda cuda-kernels cuda-programming cuda-toolkit cudnn cutlass flash-attention flash-mla gemm gemv hgemm

Last synced: 26 Mar 2025

https://github.com/DefTruth/CUDA-Learn-Notes

📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).

cuda cuda-kernels cuda-programming cuda-toolkit cudnn cutlass flash-attention flash-mla gemm gemv hgemm

Last synced: 20 Mar 2025

https://github.com/flame/how-to-optimize-gemm

blis code-optimization gemm gotoblas matrix-multiplication

Last synced: 15 May 2025

https://github.com/cnugteren/clblast

Tuned OpenCL BLAS

blas blas-libraries clblas gemm gpu matrix-multiplication opencl

Last synced: 14 May 2025

https://github.com/CNugteren/CLBlast

Tuned OpenCL BLAS

blas blas-libraries clblas gemm gpu matrix-multiplication opencl

Last synced: 15 Mar 2025

https://github.com/flame/blislab

BLISlab: A Sandbox for Optimizing GEMM

blis code-optimization gemm matrix-multiplication

Last synced: 04 Apr 2025

https://github.com/Bruce-Lee-LY/cuda_hgemm

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.

cublas cuda gemm gpu hgemm matrix-multiply nvidia tensor-core

Last synced: 14 May 2025

https://github.com/bruce-lee-ly/cuda_hgemm

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.

cublas cuda gemm gpu hgemm matrix-multiply nvidia tensor-core

Last synced: 05 Apr 2025

https://github.com/mratsim/laser

The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers

assembler blas compiler-optimization convolution deep-learning gemm high-performance-computing jit matrix-multiplication openmp parallel runtime-cpu-detection simd tensor

Last synced: 08 Apr 2025

https://github.com/rocm/tensile

Stretching GPU performance for GEMMs and tensor contractions.

amd assembly auto-tuning blas dnn gemm gpu gpu-acceleration gpu-computing hip machine-learning matrix-multiplication neural-networks opencl python radeon tensor-contraction tensors

Last synced: 05 Apr 2025

https://github.com/ROCm/Tensile

Stretching GPU performance for GEMMs and tensor contractions.

amd assembly auto-tuning blas dnn gemm gpu gpu-acceleration gpu-computing hip machine-learning matrix-multiplication neural-networks opencl python radeon tensor-contraction tensors

Last synced: 30 Nov 2024

https://github.com/yui0/slibs

Single file libraries for C/C++

aac alsa ascii audio blas c codec encoder flac gemm glsl gpgpu kms m4a math mp3 mp4 mpeg opencl single-header-lib

Last synced: 08 May 2025

https://github.com/Bruce-Lee-LY/cuda_hgemv

Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.

cublas cuda cuda-core gemm gemv gpu hgemm hgemv matrix-multiply nvidia tensor-core

Last synced: 14 May 2025

https://github.com/enp1s0/ozimmu

FP64-equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme

cuda gemm mixed-precision tensorcore tensorcores

Last synced: 09 Apr 2025

https://github.com/rocm/hipblaslt

hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionality beyond that of a traditional BLAS library

amd assembly blas gemm gpu-computing hip machine-learning matrix-multiplication rocm

Last synced: 04 Apr 2025

https://github.com/enp1s0/ozIMMU

FP64-equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme

cuda gemm mixed-precision tensorcore tensorcores

Last synced: 04 Apr 2025

https://github.com/bruce-lee-ly/cuda_hgemv

Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.

cublas cuda cuda-core gemm gemv gpu hgemm hgemv matrix-multiply nvidia tensor-core

Last synced: 19 Dec 2024

https://github.com/eth-cscs/spla

Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceleration.

cuda gemm linear-algebra mpi rocm

Last synced: 14 Apr 2025

https://github.com/iVishalr/GEMM

Fast matrix multiplication implementation in the C programming language. The algorithm is similar to what NumPy uses to compute dot products.

c gemm gemm-optimization matrix-multiplication

Last synced: 04 Apr 2025

https://github.com/szagoruyko/openai-gemm.pytorch

PyTorch bindings for openai-gemm

gemm pytorch

Last synced: 03 Jan 2025

https://github.com/bruce-lee-ly/cutlass_gemm

Multiple GEMM operators built with CUTLASS to support LLM inference.

cublas cublaslt cutlass gemm gpu llm matrix-multiply nvidia tensor-core

Last synced: 13 Apr 2025

https://github.com/bruce-lee-ly/cuda_back2back_hgemm

Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.

back2back-gemm back2back-hgemm cublas cuda fused-gemm fused-hgemm gemm gpu hgemm matrix-multiply nvidia tensor-core

Last synced: 13 Apr 2025

https://github.com/enp1s0/cumpsgemm

Fast SGEMM emulation on Tensor Cores

cuda fp32 gemm gpu half-precision mixed-precision tensorcore tensorcores

Last synced: 09 Apr 2025

https://github.com/xiaosong9905/dgemm-knl

DGEMM on KNL, achieving 75% of MKL performance

dgemm gemm high-performance hpc linear-algebra x86

Last synced: 15 May 2025

https://github.com/coderonion/moblas

BLAS (Basic Linear Algebra Subprograms) library written in the Mojo programming language.

blas blis cublas cuda eigen fortran gemm gonum hpc lapack linear-algebra math mkl mojo numpy openblas pytorch scientific-computing simd tensor

Last synced: 16 Jan 2025

https://github.com/codingonion/moblas

BLAS (Basic Linear Algebra Subprograms) library written in the Mojo programming language.

blas blis cublas cuda eigen fortran gemm gonum hpc lapack linear-algebra math mkl mojo numpy openblas pytorch scientific-computing simd tensor

Last synced: 25 Apr 2025

https://github.com/dev0x13/gemm-benchmark-2023

Benchmarks of several modern (2023) high-performance floating-point GEMM implementations compared with the Mojo language

benchmark gemm mojo

Last synced: 21 Mar 2025

https://github.com/zhangge6/how-to-optimize-playground

High-performance computing (HPC) demos written since I was a freshman.

cuda gemm x86

Last synced: 06 Apr 2025

https://github.com/jiegec/sgemm-optimize

Optimization of SGEMM on the Kunpeng platform

gemm kunpeng

Last synced: 01 Apr 2025

https://github.com/andreytkachenko/yarblas

Yet another Rust BLAS

blas gemm machine-learning math rust rust-lang

Last synced: 21 Mar 2025

https://github.com/daedalus/aiscripts

AI-generated scripts based on white papers and publications

ai bloomfilter cassowary-algorithm cuckoo-filter feistel-network gemm gpt karatsuba karatsuba-matrix-multiplication llm meta-gradient-descent noprop phylolm whitepaper zeedlm zeromerge

Last synced: 05 May 2025

https://github.com/xiaosong9905/cuda-v100-kernels

CUDA Kernels on V100

cuda gemm gpu hpc reduce scan sgemm transpose

Last synced: 15 May 2025

https://github.com/jerinphilip/mozintgemm

Wrapper around intgemm (x86_64) and ruy (ARM) to switch between both based on architecture and provide a fast matrix multiplication backend for Mozilla Firefox's translation feature.

arm gemm wrapper x86-64

Last synced: 20 Feb 2025