Projects in Awesome Lists tagged with tensor-cores
A curated list of projects in awesome lists tagged with tensor-cores.
https://github.com/deftruth/ffpa-attn-mma
📚FFPA(Split-D): Yet another Faster Flash Prefill Attention with O(1) SRAM complexity for large headdim (D > 256), ~2x↑🎉 vs SDPA EA.
attention cuda deepseek deepseek-r1 deepseek-v3 flash-attention flash-mla fused-mla mla mlsys sdpa tensor-cores
Last synced: 06 Apr 2025
https://github.com/xlite-dev/ffpa-attn-mma
📚FFPA(Split-D): Yet another Faster Flash Prefill Attention with O(1) GPU SRAM complexity for headdim > 256, ~2x↑🎉 vs SDPA EA.
attention cuda deepseek deepseek-r1 deepseek-v3 flash-attention flash-mla fused-mla mla mlsys sdpa tensor-cores
Last synced: 30 Mar 2025
https://github.com/DefTruth/ffpa-attn-mma
📚[WIP] FFPA: Yet another Faster Flash Prefill Attention with O(1)⚡️GPU SRAM complexity for headdim > 256, 1.8x~3x↑🎉 faster vs SDPA EA.
attention cuda flash-attention mlsys sdpa tensor-cores
Last synced: 27 Jan 2025
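The FFPA entries in this list (ffpa-attn-mma and cuffpa-py) revolve around flash-attention-style prefill: stream over K/V tiles with an online softmax so the on-chip working set stays O(1) in sequence length. As a hedged illustration only, and not FFPA's code (which uses shared-memory tiling and Tensor Core MMAs), the sketch below shows that recurrence as a naive CUDA kernel with one thread per query row; the 256-element local accumulator is a limit of this toy, not of FFPA, which targets headdim > 256.

```cuda
// Naive reference for the online-softmax recurrence that flash-prefill
// attention kernels build on: each thread owns one query row, streams over
// K/V once, and keeps only a running max, a running sum and an output
// accumulator -- O(1) extra state per query regardless of seq_len.
// This is NOT FFPA's implementation, just the math it relies on.
#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_HEAD_DIM 256  // accumulator bound for this toy kernel only

__global__ void attention_online_softmax(const float* Q, const float* K,
                                         const float* V, float* O,
                                         int seq_len, int head_dim) {
    int q = blockIdx.x * blockDim.x + threadIdx.x;  // one query row per thread
    if (q >= seq_len || head_dim > MAX_HEAD_DIM) return;

    const float scale = rsqrtf((float)head_dim);
    float m = -INFINITY;      // running max of the logits
    float l = 0.0f;           // running sum of exp(logit - m)
    float acc[MAX_HEAD_DIM];  // running (unnormalized) output row
    for (int d = 0; d < head_dim; ++d) acc[d] = 0.0f;

    for (int k = 0; k < seq_len; ++k) {
        float s = 0.0f;       // s = (q . k) * scale
        for (int d = 0; d < head_dim; ++d)
            s += Q[q * head_dim + d] * K[k * head_dim + d];
        s *= scale;

        float m_new = fmaxf(m, s);
        float alpha = expf(m - m_new);  // rescales previously accumulated terms
        float p     = expf(s - m_new);  // weight of the current key
        l = l * alpha + p;
        for (int d = 0; d < head_dim; ++d)
            acc[d] = acc[d] * alpha + p * V[k * head_dim + d];
        m = m_new;
    }
    for (int d = 0; d < head_dim; ++d)
        O[q * head_dim + d] = acc[d] / l;
}

int main() {
    const int seq_len = 512, head_dim = 128;
    size_t bytes = (size_t)seq_len * head_dim * sizeof(float);
    float *hQ = (float*)malloc(bytes), *hO = (float*)malloc(bytes);
    for (int i = 0; i < seq_len * head_dim; ++i) hQ[i] = 0.01f * (i % 97);

    float *dQ, *dK, *dV, *dO;
    cudaMalloc(&dQ, bytes); cudaMalloc(&dK, bytes);
    cudaMalloc(&dV, bytes); cudaMalloc(&dO, bytes);
    cudaMemcpy(dQ, hQ, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dK, hQ, bytes, cudaMemcpyHostToDevice);  // reuse hQ as K and V
    cudaMemcpy(dV, hQ, bytes, cudaMemcpyHostToDevice);

    attention_online_softmax<<<(seq_len + 127) / 128, 128>>>(dQ, dK, dV, dO,
                                                             seq_len, head_dim);
    cudaMemcpy(hO, dO, bytes, cudaMemcpyDeviceToHost);
    printf("O[0][0] = %f\n", hO[0]);

    cudaFree(dQ); cudaFree(dK); cudaFree(dV); cudaFree(dO);
    free(hQ); free(hO);
    return 0;
}
```

Production kernels replace the inner dot products with warp-level MMA tiles over shared memory and vectorized global loads; the recurrence itself stays the same.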
https://github.com/deftruth/cuhgemm-py
⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak⚡️ performance.
Last synced: 09 Jan 2025
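For readers new to the WMMA API that repos like cuhgemm-py start from, here is a minimal, hedged sketch (not the project's code): one warp computes a single 16x16 FP16 tile of C with FP32 accumulation. Such repos then add shared-memory staging, double buffering, raw MMA PTX, and CuTe layouts on top of this baseline. M, N, K are assumed to be multiples of 16.

```cuda
// Minimal HGEMM tile kernel using the CUDA WMMA API: one warp computes one
// 16x16 tile of C = A * B, accumulating in FP32. A is MxK row-major, B is
// KxN column-major, C is MxN row-major. Launch with one 32-thread block
// (one warp) per output tile.
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <mma.h>
#include <stdio.h>
using namespace nvcuda;

constexpr int WMMA_M = 16, WMMA_N = 16, WMMA_K = 16;

__global__ void wmma_hgemm(const half* A, const half* B, float* C,
                           int M, int N, int K) {
    int tile_m = blockIdx.y;   // tile row index
    int tile_n = blockIdx.x;   // tile column index

    wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    for (int k = 0; k < K; k += WMMA_K) {
        // A tile at (tile_m*16, k), leading dimension K (row-major).
        // B tile at (k, tile_n*16), leading dimension K (column-major).
        const half* a_ptr = A + tile_m * WMMA_M * K + k;
        const half* b_ptr = B + tile_n * WMMA_N * K + k;
        wmma::load_matrix_sync(a_frag, a_ptr, K);
        wmma::load_matrix_sync(b_frag, b_ptr, K);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    float* c_ptr = C + tile_m * WMMA_M * N + tile_n * WMMA_N;
    wmma::store_matrix_sync(c_ptr, c_frag, N, wmma::mem_row_major);
}

int main() {
    const int M = 256, N = 256, K = 256;
    half *A, *B; float *C;                        // unified memory for brevity
    cudaMallocManaged(&A, (size_t)M * K * sizeof(half));
    cudaMallocManaged(&B, (size_t)K * N * sizeof(half));
    cudaMallocManaged(&C, (size_t)M * N * sizeof(float));
    for (int i = 0; i < M * K; ++i) A[i] = __float2half(0.5f);
    for (int i = 0; i < K * N; ++i) B[i] = __float2half(2.0f);

    dim3 grid(N / WMMA_N, M / WMMA_M);
    wmma_hgemm<<<grid, 32>>>(A, B, C, M, N, K);   // one warp per 16x16 tile
    cudaDeviceSynchronize();
    // With these constant inputs every element should be 0.5 * 2.0 * K = 256.
    printf("C[0] = %f (expected %d)\n", C[0], K);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```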
https://github.com/DefTruth/cuffpa-py
📚[WIP] FFPA: Yet another Faster Flash Prefill Attention with O(1)🎉GPU SRAM complexity for headdim > 256, ~1.5x🎉 faster than SDPA EA.
attention cuda flash-attention mlsys sdpa tensor-cores
Last synced: 08 Jan 2025
https://github.com/tgautam03/tgemm
General Matrix Multiplication using NVIDIA Tensor Cores
cuda-kernels cuda-programming gpu-computing gpu-programming matrix-multiplication nvidia-cuda nvidia-gpu nvidia-tensor-cores sgemm tensor-cores
Last synced: 15 Apr 2025
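Hand-written Tensor Core GEMMs like tgemm's kernels are typically validated and benchmarked against cuBLAS. The snippet below is a hedged reference harness, not part of tgemm: cublasGemmEx with FP16 inputs and FP32 accumulation, a path that recent cuBLAS versions on Volta-and-newer GPUs typically service with Tensor Cores automatically. Link with -lcublas.

```cuda
// cuBLAS mixed-precision GEMM baseline: FP16 A/B, FP32 C, FP32 accumulation.
// Used here only as the reference that custom Tensor Core kernels are
// compared against; cuBLAS is column-major, but with constant inputs the
// layout does not affect the value check.
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    const int M = 1024, N = 1024, K = 1024;
    half *A, *B; float *C;
    cudaMallocManaged(&A, (size_t)M * K * sizeof(half));
    cudaMallocManaged(&B, (size_t)K * N * sizeof(half));
    cudaMallocManaged(&C, (size_t)M * N * sizeof(float));
    for (int i = 0; i < M * K; ++i) A[i] = __float2half(1.0f);
    for (int i = 0; i < K * N; ++i) B[i] = __float2half(1.0f);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                 &alpha, A, CUDA_R_16F, M,
                         B, CUDA_R_16F, K,
                 &beta,  C, CUDA_R_32F, M,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    cudaDeviceSynchronize();
    // With all-ones inputs every element of C should equal K.
    printf("C[0] = %f (expected %d)\n", C[0], K);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```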
https://github.com/DefTruth/hgemm-tensorcores-mma
⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA PTX, and CuTe APIs. 🎉🎉
Last synced: 06 Dec 2024
https://github.com/neuraladitya/neural_network_c
Neural Network C is an advanced neural network implementation in pure C, optimized for high performance on CPUs and NVIDIA GPUs.
artificial-intelligence bayesian-optimization c-programming convolutional-neural-networks cuda deep-learning encryption gpu-computing high-performance-computing machine-learning mpi multi-gpu neural-network openmp parallel-computing quantization real-time-monitoring secure-computing tensor-cores transformers
Last synced: 29 Mar 2025
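As a hedged, minimal illustration of the kind of GPU building block such a pure-C network needs, and not code from Neural_Network_C itself, the sketch below implements a dense-layer forward pass y = ReLU(Wx + b) as a plain CUDA C kernel with one thread per output neuron; the dimensions and fill values are arbitrary.

```cuda
// Fully connected forward pass y = ReLU(W*x + b), one thread per output
// neuron, W stored row-major (out_dim x in_dim). A real implementation would
// batch inputs, tile W through shared memory, and (on tensor-core GPUs) use
// FP16/MMA paths; this is only an illustration.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void dense_forward(const float* W, const float* x, const float* b,
                              float* y, int in_dim, int out_dim) {
    int o = blockIdx.x * blockDim.x + threadIdx.x;  // one output neuron
    if (o >= out_dim) return;
    float acc = b[o];
    for (int i = 0; i < in_dim; ++i)
        acc += W[o * in_dim + i] * x[i];
    y[o] = acc > 0.0f ? acc : 0.0f;                 // ReLU
}

int main(void) {
    const int in_dim = 784, out_dim = 128;
    float *W, *x, *b, *y;
    cudaMallocManaged(&W, (size_t)out_dim * in_dim * sizeof(float));
    cudaMallocManaged(&x, in_dim * sizeof(float));
    cudaMallocManaged(&b, out_dim * sizeof(float));
    cudaMallocManaged(&y, out_dim * sizeof(float));
    for (int i = 0; i < out_dim * in_dim; ++i) W[i] = 0.001f;
    for (int i = 0; i < in_dim; ++i) x[i] = 1.0f;
    for (int i = 0; i < out_dim; ++i) b[i] = 0.5f;

    dense_forward<<<(out_dim + 127) / 128, 128>>>(W, x, b, y, in_dim, out_dim);
    cudaDeviceSynchronize();
    printf("y[0] = %f (expected about %f)\n", y[0], 0.5f + 0.001f * in_dim);
    cudaFree(W); cudaFree(x); cudaFree(b); cudaFree(y);
    return 0;
}
```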