Projects in Awesome Lists tagged with flash-attention
A curated list of projects in awesome lists tagged with flash-attention.
https://github.com/qwenlm/qwen
The official repo of Qwen (通义千问), the chat and pretrained large language models proposed by Alibaba Cloud.
chinese flash-attention large-language-models llm natural-language-processing pretrained-models
Last synced: 14 May 2025
https://github.com/ymcui/Chinese-LLaMA-Alpaca-2
Chinese LLaMA-2 & Alpaca-2 LLMs, phase 2 of the Chinese-LLaMA-Alpaca project, with 64K long-context models.
64k alpaca alpaca-2 alpaca2 flash-attention large-language-models llama llama-2 llama2 llm nlp rlhf yarn
Last synced: 24 Mar 2025
https://github.com/internlm/internlm
Official release of InternLM series (InternLM, InternLM2, InternLM2.5, InternLM3).
chatbot chinese fine-tuning-llm flash-attention gpt large-language-model llm long-context pretrained-models rlhf
Last synced: 14 May 2025
https://github.com/deftruth/cuda-learn-notes
📚LeetCUDA: modern CUDA learning notes with PyTorch for beginners 🐑; 200+ CUDA/Tensor Core kernels, HGEMM, FA-2 MMA, etc. 🔥
cuda cuda-12 cuda-cpp cuda-demo cuda-kernel cuda-kernels cuda-library cuda-toolkit flash-attention hgemm learn-cuda leet-cuda
Last synced: 14 May 2025
https://github.com/deftruth/awesome-llm-inference
📖A curated list of awesome LLM/VLM inference papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc. 🎉🎉
awesome-llm deepseek deepseek-r1 deepseek-v3 flash-attention flash-attention-3 flash-mla llm-inference minimax-01 mla paged-attention tensorrt-llm vllm
Last synced: 04 Apr 2025
https://github.com/moonshotai/moba
MoBA: Mixture of Block Attention for Long-Context LLMs
flash-attention llm llm-serving llm-training moe pytorch transformer
Last synced: 14 May 2025
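As the description suggests, the idea is to route each query to a small top-k set of key/value blocks rather than the full sequence. Below is a minimal PyTorch sketch of that block-selection step (single head, causal masking omitted; the function and parameter names are invented for illustration, and this is not the MoBA codebase):

```python
# Hypothetical sketch of mixture-of-block attention; not the MoBA implementation.
import torch

def block_sparse_attention(q, k, v, block_size=64, top_k=4):
    # q, k, v: (seq_len, dim); seq_len assumed divisible by block_size
    seq_len, dim = k.shape
    n_blocks = seq_len // block_size
    # Gate: score each query against the mean key of every block.
    block_keys = k.view(n_blocks, block_size, dim).mean(dim=1)          # (n_blocks, dim)
    gate = q @ block_keys.T                                             # (seq_len, n_blocks)
    top = gate.topk(min(top_k, n_blocks), dim=-1).indices               # (seq_len, top_k)
    # Gather only the selected blocks for each query and attend within them.
    k_sel = k.view(n_blocks, block_size, dim)[top].reshape(seq_len, -1, dim)
    v_sel = v.view(n_blocks, block_size, dim)[top].reshape(seq_len, -1, dim)
    attn = torch.softmax((q.unsqueeze(1) @ k_sel.transpose(1, 2)) / dim**0.5, dim=-1)
    return (attn @ v_sel).squeeze(1)                                    # (seq_len, dim)
```

Scoring queries against per-block mean keys is just one simple gating choice; the trained gating and the fused FlashAttention-style kernels are what the repo itself provides.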
https://github.com/internlm/internevo
InternEvo is an open-sourced lightweight training framework that aims to support model pre-training without the need for extensive dependencies.
910b deepspeed-ulysses flash-attention gemma internlm internlm2 llama3 llava llm-framework llm-training multi-modal pipeline-parallelism pytorch ring-attention sequence-parallelism tensor-parallelism transformers-models zero3
Last synced: 14 Apr 2025
https://github.com/deftruth/ffpa-attn-mma
📚FFPA(Split-D): Yet another Faster Flash Prefill Attention with O(1) SRAM complexity for large headdim (D > 256), ~2x↑🎉 vs SDPA EA.
attention cuda deepseek deepseek-r1 deepseek-v3 flash-attention flash-mla fused-mla mla mlsys sdpa tensor-cores
Last synced: 06 Apr 2025
https://github.com/coincheung/gdgpt
Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed pipeline mode. Faster than ZeRO/ZeRO++/FSDP.
baichuan2-7b bloom chatglm3-6b deepspeed flash-attention full-finetune llama2 llm mixtral-8x7b model-parallization nlp pipeline pytorch
Last synced: 07 May 2025
https://github.com/DAMO-NLP-SG/Inf-CLIP
💣💣 The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A super memory-efficient CLIP training scheme.
clip contrastive-learning flash-attention infinite-batch-size memory-efficient ring-attention
Last synced: 28 Mar 2025
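The "memory barrier" in the title is the batch-by-batch similarity matrix that a standard CLIP loss materializes. A rough PyTorch sketch of the tiling idea (image-to-text direction only; this is not the Inf-CL code, and the real savings also require recomputation in the backward pass, which this sketch omits):

```python
# Hypothetical tiled CLIP loss; not the Inf-CL implementation.
import torch
import torch.nn.functional as F

def tiled_clip_loss(img_emb, txt_emb, temperature=0.07, tile=1024):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    batch = img_emb.shape[0]
    labels = torch.arange(batch, device=img_emb.device)
    loss = img_emb.new_zeros(())
    # Only a (tile x batch) slice of the similarity matrix exists per step.
    for start in range(0, batch, tile):
        rows = slice(start, min(start + tile, batch))
        logits = img_emb[rows] @ txt_emb.T / temperature
        loss = loss + F.cross_entropy(logits, labels[rows], reduction="sum")
    return loss / batch
```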
https://github.com/bruce-lee-ly/decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA cores for the decoding stage of LLM inference.
cuda cuda-core decoding-attention flash-attention flashinfer flashmla gpu gqa inference large-language-model llm mha mla mqa multi-head-attention nvidia
Last synced: 05 May 2025
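In the decode stage a single new query token attends over the whole KV cache, and with GQA/MQA several query heads share one KV head. A hypothetical PyTorch reference of that step, for orientation only (the repo itself provides optimized CUDA kernels):

```python
# Hypothetical GQA decode-step attention reference; not this repo's kernels.
import torch

def gqa_decode_step(q, k_cache, v_cache):
    # q:       (num_q_heads, head_dim)            -- one new token
    # k_cache: (num_kv_heads, seq_len, head_dim)
    # v_cache: (num_kv_heads, seq_len, head_dim)
    num_q_heads, head_dim = q.shape
    num_kv_heads = k_cache.shape[0]
    group = num_q_heads // num_kv_heads            # query heads per KV head
    q = q.view(num_kv_heads, group, head_dim)
    scores = torch.einsum("hgd,hsd->hgs", q, k_cache) / head_dim**0.5
    probs = torch.softmax(scores, dim=-1)
    out = torch.einsum("hgs,hsd->hgd", probs, v_cache)
    return out.reshape(num_q_heads, head_dim)
```

With group = 1 this reduces to plain MHA decoding, and with num_kv_heads = 1 it is MQA.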
https://github.com/bruce-lee-ly/flash_attention_inference
Benchmarks the performance of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
cuda cutlass flash-attention flash-attention-2 gpu inference large-language-model llm mha multi-head-attention nvidia tensor-core
Last synced: 13 Apr 2025
https://github.com/erfanzar/jax-flash-attn2
A flexible and efficient implementation of Flash Attention 2.0 for JAX, supporting multiple backends (GPU/TPU/CPU) and platforms (Triton/Pallas/JAX).
flash-attention flash-attention-2 jax pallas
Last synced: 12 Apr 2025
https://github.com/kyegomez/flashmha
A simple PyTorch implementation of Flash Multi-Head Attention.
artificial-intelligence artificial-neural-networks attention attention-mechanisms attentionisallyouneed flash-attention gpt4 transformer
Last synced: 07 May 2025
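For comparison, stock PyTorch (2.3+) already routes scaled_dot_product_attention to a FlashAttention kernel on supported GPUs. A minimal usage sketch, assuming a CUDA build with the flash backend available:

```python
# Minimal SDPA usage; PyTorch selects the FlashAttention kernel when it applies.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

batch, heads, seq, dim = 2, 8, 1024, 64
q = torch.randn(batch, heads, seq, dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict dispatch to the flash backend (raises if it is unsupported here).
with sdpa_kernel([SDPBackend.FLASH_ATTENTION]):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```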
https://github.com/kklemon/flashperceiver
Fast and memory-efficient PyTorch implementation of the Perceiver with FlashAttention.
attention-mechanism deep-learning flash-attention nlp perceiver transformer
Last synced: 16 May 2025
https://github.com/kreasof-ai/homunculus-project
Long-term project about a custom AI architecture, consisting of cutting-edge machine-learning techniques such as Flash-Attention, Grouped-Query-Attention, ZeRO-Infinity, BitNet, etc.
bitnet deep-learning flash-attention jupyter-notebook large-language-models low-rank-adaptation machine-learning python pytorch pytorch-lightning transformer vision-transformer
Last synced: 17 Dec 2024
https://github.com/masterskepticista/gpt2
Training GPT-2 on FineWeb-Edu in JAX/Flax
fineweb flash-attention flax gpt2 jax
Last synced: 06 Mar 2025
https://github.com/dcarpintero/pangolin-guard
Open, Lightweight Model for AI Guardrails
ai-safety alternating-attention bert-model fine-tuning flash-attention huggingface-transformers modernbert natural-language-processing prompt-guard
Last synced: 04 Apr 2025
https://github.com/lukasdrews97/dumblellm
Decoder-only LLM trained on the Harry Potter books.
byte-pair-encoding flash-attention grouped-query-attention large-language-model rotary-position-embedding transformer
Last synced: 05 Apr 2025
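The rotary-position-embedding ingredient mentioned in the tags is straightforward to reproduce: each pair of channels in the query/key vectors is rotated by a position-dependent angle. A standalone sketch (rotate-half variant, hypothetical names, not taken from this repo):

```python
# Hypothetical rotary position embedding (RoPE) sketch; not this repo's code.
import torch

def apply_rope(x, base=10000.0):
    # x: (seq_len, num_heads, head_dim), head_dim even
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # Per-channel frequencies base^(-2i/d), angles = position * frequency.
    freqs = base ** (-torch.arange(half, dtype=torch.float32, device=x.device) / half)
    angles = torch.arange(seq_len, dtype=torch.float32, device=x.device)[:, None] * freqs
    cos = angles.cos()[:, None, :].to(x.dtype)   # (seq, 1, half)
    sin = angles.sin()[:, None, :].to(x.dtype)
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) channel pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```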