Projects in Awesome Lists tagged with tensor-cores
A curated list of projects in awesome lists tagged with tensor-cores.
https://github.com/deftruth/ffpa-attn-mma
📚FFPA(Split-D): Yet another Faster Flash Prefill Attention with O(1) SRAM complexity for large headdim (D > 256), ~2x↑🎉 vs SDPA EA.
attention cuda deepseek deepseek-r1 deepseek-v3 flash-attention flash-mla fused-mla mla mlsys sdpa tensor-cores
Last synced: 06 Apr 2025
https://github.com/xlite-dev/ffpa-attn-mma
📚FFPA(Split-D): Yet another Faster Flash Prefill Attention with O(1) GPU SRAM complexity for headdim > 256, ~2x↑🎉 vs SDPA EA.
attention cuda deepseek deepseek-r1 deepseek-v3 flash-attention flash-mla fused-mla mla mlsys sdpa tensor-cores
Last synced: 30 Mar 2025
https://github.com/DefTruth/ffpa-attn-mma
📚[WIP] FFPA: Yet another Faster Flash Prefill Attention with O(1)⚡️GPU SRAM complexity for headdim > 256, 1.8x~3x↑🎉 faster vs SDPA EA.
attention cuda flash-attention mlsys sdpa tensor-cores
Last synced: 27 Jan 2025
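The FFPA entries in this list (ffpa-attn-mma and cuffpa-py) revolve around flash-attention-style prefill: stream over K/V tiles with an online softmax so the on-chip working set stays O(1) in sequence length. As a hedged illustration only, and not FFPA's code (which uses shared-memory tiling and Tensor Core MMAs), the sketch below shows that recurrence as a naive CUDA kernel with one thread per query row; the 256-element local accumulator is a limit of this toy, not of FFPA, which targets headdim > 256.

```cuda
// Naive reference for the online-softmax recurrence that flash-prefill
// attention kernels build on: each thread owns one query row, streams over
// K/V once, and keeps only a running max, a running sum and an output
// accumulator -- O(1) extra state per query regardless of seq_len.
// This is NOT FFPA's implementation, just the math it relies on.
#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_HEAD_DIM 256  // accumulator bound for this toy kernel only

__global__ void attention_online_softmax(const float* Q, const float* K,
                                         const float* V, float* O,
                                         int seq_len, int head_dim) {
    int q = blockIdx.x * blockDim.x + threadIdx.x;  // one query row per thread
    if (q >= seq_len || head_dim > MAX_HEAD_DIM) return;

    const float scale = rsqrtf((float)head_dim);
    float m = -INFINITY;      // running max of the logits
    float l = 0.0f;           // running sum of exp(logit - m)
    float acc[MAX_HEAD_DIM];  // running (unnormalized) output row
    for (int d = 0; d < head_dim; ++d) acc[d] = 0.0f;

    for (int k = 0; k < seq_len; ++k) {
        float s = 0.0f;       // s = (q . k) * scale
        for (int d = 0; d < head_dim; ++d)
            s += Q[q * head_dim + d] * K[k * head_dim + d];
        s *= scale;

        float m_new = fmaxf(m, s);
        float alpha = expf(m - m_new);  // rescales previously accumulated terms
        float p     = expf(s - m_new);  // weight of the current key
        l = l * alpha + p;
        for (int d = 0; d < head_dim; ++d)
            acc[d] = acc[d] * alpha + p * V[k * head_dim + d];
        m = m_new;
    }
    for (int d = 0; d < head_dim; ++d)
        O[q * head_dim + d] = acc[d] / l;
}

int main() {
    const int seq_len = 512, head_dim = 128;
    size_t bytes = (size_t)seq_len * head_dim * sizeof(float);
    float *hQ = (float*)malloc(bytes), *hO = (float*)malloc(bytes);
    for (int i = 0; i < seq_len * head_dim; ++i) hQ[i] = 0.01f * (i % 97);

    float *dQ, *dK, *dV, *dO;
    cudaMalloc(&dQ, bytes); cudaMalloc(&dK, bytes);
    cudaMalloc(&dV, bytes); cudaMalloc(&dO, bytes);
    cudaMemcpy(dQ, hQ, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dK, hQ, bytes, cudaMemcpyHostToDevice);  // reuse hQ as K and V
    cudaMemcpy(dV, hQ, bytes, cudaMemcpyHostToDevice);

    attention_online_softmax<<<(seq_len + 127) / 128, 128>>>(dQ, dK, dV, dO,
                                                             seq_len, head_dim);
    cudaMemcpy(hO, dO, bytes, cudaMemcpyDeviceToHost);
    printf("O[0][0] = %f\n", hO[0]);

    cudaFree(dQ); cudaFree(dK); cudaFree(dV); cudaFree(dO);
    free(hQ); free(hO);
    return 0;
}
```

Production kernels replace the inner dot products with warp-level MMA tiles over shared memory and vectorized global loads; the recurrence itself stays the same.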
https://github.com/deftruth/cuhgemm-py
⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak⚡️ performance.
Last synced: 09 Jan 2025
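For readers new to the WMMA API that repos like cuhgemm-py start from, here is a minimal, hedged sketch (not the project's code): one warp computes a single 16x16 FP16 tile of C with FP32 accumulation. Such repos then add shared-memory staging, double buffering, raw MMA PTX, and CuTe layouts on top of this baseline. M, N, K are assumed to be multiples of 16.

```cuda
// Minimal HGEMM tile kernel using the CUDA WMMA API: one warp computes one
// 16x16 tile of C = A * B, accumulating in FP32. A is MxK row-major, B is
// KxN column-major, C is MxN row-major. Launch with one 32-thread block
// (one warp) per output tile.
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <mma.h>
#include <stdio.h>
using namespace nvcuda;

constexpr int WMMA_M = 16, WMMA_N = 16, WMMA_K = 16;

__global__ void wmma_hgemm(const half* A, const half* B, float* C,
                           int M, int N, int K) {
    int tile_m = blockIdx.y;   // tile row index
    int tile_n = blockIdx.x;   // tile column index

    wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    for (int k = 0; k < K; k += WMMA_K) {
        // A tile at (tile_m*16, k), leading dimension K (row-major).
        // B tile at (k, tile_n*16), leading dimension K (column-major).
        const half* a_ptr = A + tile_m * WMMA_M * K + k;
        const half* b_ptr = B + tile_n * WMMA_N * K + k;
        wmma::load_matrix_sync(a_frag, a_ptr, K);
        wmma::load_matrix_sync(b_frag, b_ptr, K);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    float* c_ptr = C + tile_m * WMMA_M * N + tile_n * WMMA_N;
    wmma::store_matrix_sync(c_ptr, c_frag, N, wmma::mem_row_major);
}

int main() {
    const int M = 256, N = 256, K = 256;
    half *A, *B; float *C;                        // unified memory for brevity
    cudaMallocManaged(&A, (size_t)M * K * sizeof(half));
    cudaMallocManaged(&B, (size_t)K * N * sizeof(half));
    cudaMallocManaged(&C, (size_t)M * N * sizeof(float));
    for (int i = 0; i < M * K; ++i) A[i] = __float2half(0.5f);
    for (int i = 0; i < K * N; ++i) B[i] = __float2half(2.0f);

    dim3 grid(N / WMMA_N, M / WMMA_M);
    wmma_hgemm<<<grid, 32>>>(A, B, C, M, N, K);   // one warp per 16x16 tile
    cudaDeviceSynchronize();
    // With these constant inputs every element should be 0.5 * 2.0 * K = 256.
    printf("C[0] = %f (expected %d)\n", C[0], K);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```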
https://github.com/DefTruth/cuffpa-py
📚[WIP] FFPA: Yet another Faster Flash Prefill Attention with O(1)🎉GPU SRAM complexity for headdim > 256, ~1.5x🎉 faster than SDPA EA.
attention cuda flash-attention mlsys sdpa tensor-cores
Last synced: 08 Jan 2025
https://github.com/tgautam03/tgemm
General Matrix Multiplication using NVIDIA Tensor Cores
cuda-kernels cuda-programming gpu-computing gpu-programming matrix-multiplication nvidia-cuda nvidia-gpu nvidia-tensor-cores sgemm tensor-cores
Last synced: 15 Apr 2025
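Hand-written Tensor Core GEMMs like tgemm's kernels are typically validated and benchmarked against cuBLAS. The snippet below is a hedged reference harness, not part of tgemm: cublasGemmEx with FP16 inputs and FP32 accumulation, a path that recent cuBLAS versions on Volta-and-newer GPUs typically service with Tensor Cores automatically. Link with -lcublas.

```cuda
// cuBLAS mixed-precision GEMM baseline: FP16 A/B, FP32 C, FP32 accumulation.
// Used here only as the reference that custom Tensor Core kernels are
// compared against; cuBLAS is column-major, but with constant inputs the
// layout does not affect the value check.
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    const int M = 1024, N = 1024, K = 1024;
    half *A, *B; float *C;
    cudaMallocManaged(&A, (size_t)M * K * sizeof(half));
    cudaMallocManaged(&B, (size_t)K * N * sizeof(half));
    cudaMallocManaged(&C, (size_t)M * N * sizeof(float));
    for (int i = 0; i < M * K; ++i) A[i] = __float2half(1.0f);
    for (int i = 0; i < K * N; ++i) B[i] = __float2half(1.0f);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                 &alpha, A, CUDA_R_16F, M,
                         B, CUDA_R_16F, K,
                 &beta,  C, CUDA_R_32F, M,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    cudaDeviceSynchronize();
    // With all-ones inputs every element of C should equal K.
    printf("C[0] = %f (expected %d)\n", C[0], K);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```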
https://github.com/DefTruth/hgemm-tensorcores-mma
⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA PTX, and CuTe APIs. 🎉🎉
Last synced: 06 Dec 2024
https://github.com/neuraladitya/neural_network_c
Neural Network C is an advanced neural network implementation in pure C, optimized for high performance on CPUs and NVIDIA GPUs.
artificial-intelligence bayesian-optimization c-programming convolutional-neural-networks cuda deep-learning encryption gpu-computing high-performance-computing machine-learning mpi multi-gpu neural-network openmp parallel-computing quantization real-time-monitoring secure-computing tensor-cores transformers
Last synced: 29 Mar 2025
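As a hedged, minimal illustration of the kind of GPU building block such a pure-C network needs, and not code from Neural_Network_C itself, the sketch below implements a dense-layer forward pass y = ReLU(Wx + b) as a plain CUDA C kernel with one thread per output neuron; the dimensions and fill values are arbitrary.

```cuda
// Fully connected forward pass y = ReLU(W*x + b), one thread per output
// neuron, W stored row-major (out_dim x in_dim). A real implementation would
// batch inputs, tile W through shared memory, and (on tensor-core GPUs) use
// FP16/MMA paths; this is only an illustration.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void dense_forward(const float* W, const float* x, const float* b,
                              float* y, int in_dim, int out_dim) {
    int o = blockIdx.x * blockDim.x + threadIdx.x;  // one output neuron
    if (o >= out_dim) return;
    float acc = b[o];
    for (int i = 0; i < in_dim; ++i)
        acc += W[o * in_dim + i] * x[i];
    y[o] = acc > 0.0f ? acc : 0.0f;                 // ReLU
}

int main(void) {
    const int in_dim = 784, out_dim = 128;
    float *W, *x, *b, *y;
    cudaMallocManaged(&W, (size_t)out_dim * in_dim * sizeof(float));
    cudaMallocManaged(&x, in_dim * sizeof(float));
    cudaMallocManaged(&b, out_dim * sizeof(float));
    cudaMallocManaged(&y, out_dim * sizeof(float));
    for (int i = 0; i < out_dim * in_dim; ++i) W[i] = 0.001f;
    for (int i = 0; i < in_dim; ++i) x[i] = 1.0f;
    for (int i = 0; i < out_dim; ++i) b[i] = 0.5f;

    dense_forward<<<(out_dim + 127) / 128, 128>>>(W, x, b, y, in_dim, out_dim);
    cudaDeviceSynchronize();
    printf("y[0] = %f (expected about %f)\n", y[0], 0.5f + 0.001f * in_dim);
    cudaFree(W); cudaFree(x); cudaFree(b); cudaFree(y);
    return 0;
}
```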