An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with gemm

A curated list of projects in awesome lists tagged with gemm.

https://github.com/xlite-dev/cuda-learn-notes

📚Modern CUDA Learn Notes: 200+ Tensor/CUDA Cores Kernels🎉, HGEMM, FA2 via MMA and CuTe, 98~100% TFLOPS of cuBLAS/FA2.

cuda cuda-kernels cuda-programming cuda-toolkit cudnn cutlass flash-attention flash-mla gemm gemv hgemm

Last synced: 15 Apr 2025

https://github.com/xlite-dev/CUDA-Learn-Notes

📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).

cuda cuda-kernels cuda-programming cuda-toolkit cudnn cutlass flash-attention flash-mla gemm gemv hgemm

Last synced: 26 Mar 2025

https://github.com/DefTruth/CUDA-Learn-Notes

📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).

cuda cuda-kernels cuda-programming cuda-toolkit cudnn cutlass flash-attention flash-mla gemm gemv hgemm

Last synced: 20 Mar 2025

https://github.com/flame/blislab

BLISlab: A Sandbox for Optimizing GEMM

blis code-optimization gemm matrix-multiplication

Last synced: 04 Apr 2025

https://github.com/Bruce-Lee-LY/cuda_hgemm

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.

cublas cuda gemm gpu hgemm matrix-multiply nvidia tensor-core

Last synced: 14 May 2025

https://github.com/bruce-lee-ly/cuda_hgemm

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.

cublas cuda gemm gpu hgemm matrix-multiply nvidia tensor-core

Last synced: 05 Apr 2025

https://github.com/mratsim/laser

The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers

assembler blas compiler-optimization convolution deep-learning gemm high-performance-computing jit matrix-multiplication openmp parallel runtime-cpu-detection simd tensor

Last synced: 08 Apr 2025

https://github.com/Bruce-Lee-LY/cuda_hgemv

Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.

cublas cuda cuda-core gemm gemv gpu hgemm hgemv matrix-multiply nvidia tensor-core

Last synced: 14 May 2025

https://github.com/enp1s0/ozimmu

FP64-equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme

cuda gemm mixed-precision tensorcore tensorcores

Last synced: 09 Apr 2025

https://github.com/rocm/hipblaslt

hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditional BLAS library

amd assembly blas gemm gpu-computing hip machine-learning matrix-multiplication rocm

Last synced: 04 Apr 2025

https://github.com/enp1s0/ozIMMU

FP64-equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme

cuda gemm mixed-precision tensorcore tensorcores

Last synced: 04 Apr 2025

https://github.com/bruce-lee-ly/cuda_hgemv

Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.

cublas cuda cuda-core gemm gemv gpu hgemm hgemv matrix-multiply nvidia tensor-core

Last synced: 19 Dec 2024

https://github.com/eth-cscs/spla

Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceleration.

cuda gemm linear-algebra mpi rocm

Last synced: 14 Apr 2025

https://github.com/iVishalr/GEMM

Fast matrix multiplication implementation in the C programming language. The algorithm is similar to what NumPy uses to compute dot products.

c gemm gemm-optimization matrix-multiplication

Last synced: 04 Apr 2025

https://github.com/szagoruyko/openai-gemm.pytorch

PyTorch bindings for openai-gemm

gemm pytorch

Last synced: 03 Jan 2025

https://github.com/bruce-lee-ly/cutlass_gemm

Multiple GEMM operators built with CUTLASS to support LLM inference.

cublas cublaslt cutlass gemm gpu llm matrix-multiply nvidia tensor-core

Last synced: 13 Apr 2025

https://github.com/bruce-lee-ly/cuda_back2back_hgemm

Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.

back2back-gemm back2back-hgemm cublas cuda fused-gemm fused-hgemm gemm gpu hgemm matrix-multiply nvidia tensor-core

Last synced: 13 Apr 2025

https://github.com/enp1s0/cumpsgemm

Fast SGEMM emulation on Tensor Cores

cuda fp32 gemm gpu half-precision mixed-precision tensorcore tensorcores

Last synced: 09 Apr 2025

https://github.com/xiaosong9905/dgemm-knl

DGEMM on KNL, achieving 75% of MKL performance

dgemm gemm high-performance hpc linear-algebra x86

Last synced: 15 May 2025

https://github.com/coderonion/moblas

BLAS (Basic Linear Algebra Subprograms) library written in the Mojo programming language.

blas blis cublas cuda eigen fortran gemm gonum hpc lapack linear-algebra math mkl mojo numpy openblas pytorch scientific-computing simd tensor

Last synced: 16 Jan 2025

https://github.com/codingonion/moblas

BLAS (Basic Linear Algebra Subprograms) library written in the Mojo programming language.

blas blis cublas cuda eigen fortran gemm gonum hpc lapack linear-algebra math mkl mojo numpy openblas pytorch scientific-computing simd tensor

Last synced: 25 Apr 2025

https://github.com/dev0x13/gemm-benchmark-2023

Benchmarks for some modern (2023) high-performance floating-point GEMM implementations compared with the Mojo language

benchmark gemm mojo

Last synced: 21 Mar 2025

https://github.com/zhangge6/how-to-optimize-playground

High-performance computing (HPC) demos written since I was a freshman.

cuda gemm x86

Last synced: 06 Apr 2025

https://github.com/jiegec/sgemm-optimize

Optimization of SGEMM on the Kunpeng platform

gemm kunpeng

Last synced: 01 Apr 2025

https://github.com/jerinphilip/mozintgemm

Wrapper around intgemm (x86_64) and ruy (ARM) to switch between both based on architecture and provide a fast matrix multiplication backend for Mozilla Firefox's translation feature.

arm gemm wrapper x86-64

Last synced: 20 Feb 2025

https://github.com/tensorbfs/cutropicalgemm.jl

The fastest Tropical number matrix multiplication on GPU

cuda gemm tropical-algebra

Last synced: 07 Apr 2025

https://github.com/fattorib/thunderkittens-simple-gemm

Simple Tensorcore GEMM in ThunderKittens

cuda gemm gpu thunderkittens

Last synced: 14 Apr 2025

https://github.com/teambipartite/csc485b-202409-a4

High-throughput data-parallel GEMM implementations in CUDA using CUDA cores and Tensor Cores

cuda data-parallelism gemm

Last synced: 19 Feb 2025