Projects in Awesome Lists tagged with flash-attention
A curated list of projects in awesome lists tagged with flash-attention.
https://github.com/qwenlm/qwen
The official repo of Qwen (通义千问), the chat and pretrained large language models proposed by Alibaba Cloud.
chinese flash-attention large-language-models llm natural-language-processing pretrained-models
Last synced: 14 May 2025
https://github.com/ymcui/Chinese-LLaMA-Alpaca-2
Chinese LLaMA-2 & Alpaca-2 LLMs, phase 2 of the Chinese-LLaMA-Alpaca project, with 64K long-context models.
64k alpaca alpaca-2 alpaca2 flash-attention large-language-models llama llama-2 llama2 llm nlp rlhf yarn
Last synced: 24 Mar 2025
https://github.com/internlm/internlm
Official release of InternLM series (InternLM, InternLM2, InternLM2.5, InternLM3).
chatbot chinese fine-tuning-llm flash-attention gpt large-language-model llm long-context pretrained-models rlhf
Last synced: 14 May 2025
https://github.com/deftruth/cuda-learn-notes
📚LeetCUDA: modern CUDA learning notes with PyTorch for beginners 🐑; 200+ CUDA/Tensor Core kernels, HGEMM, FA-2 MMA, etc. 🔥
cuda cuda-12 cuda-cpp cuda-demo cuda-kernel cuda-kernels cuda-library cuda-toolkit flash-attention hgemm learn-cuda leet-cuda
Last synced: 14 May 2025
https://github.com/deftruth/awesome-llm-inference
📖A curated list of awesome LLM/VLM inference papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc. 🎉🎉
awesome-llm deepseek deepseek-r1 deepseek-v3 flash-attention flash-attention-3 flash-mla llm-inference minimax-01 mla paged-attention tensorrt-llm vllm
Last synced: 04 Apr 2025
https://github.com/moonshotai/moba
MoBA: Mixture of Block Attention for Long-Context LLMs
flash-attention llm llm-serving llm-training moe pytorch transformer
Last synced: 14 May 2025
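As the description suggests, the idea is to route each query to a small top-k set of key/value blocks rather than the full sequence. Below is a minimal PyTorch sketch of that block-selection step (single head, causal masking omitted; the function and parameter names are invented for illustration, and this is not the MoBA codebase):

```python
# Hypothetical sketch of mixture-of-block attention; not the MoBA implementation.
import torch

def block_sparse_attention(q, k, v, block_size=64, top_k=4):
    # q, k, v: (seq_len, dim); seq_len assumed divisible by block_size
    seq_len, dim = k.shape
    n_blocks = seq_len // block_size
    # Gate: score each query against the mean key of every block.
    block_keys = k.view(n_blocks, block_size, dim).mean(dim=1)          # (n_blocks, dim)
    gate = q @ block_keys.T                                             # (seq_len, n_blocks)
    top = gate.topk(min(top_k, n_blocks), dim=-1).indices               # (seq_len, top_k)
    # Gather only the selected blocks for each query and attend within them.
    k_sel = k.view(n_blocks, block_size, dim)[top].reshape(seq_len, -1, dim)
    v_sel = v.view(n_blocks, block_size, dim)[top].reshape(seq_len, -1, dim)
    attn = torch.softmax((q.unsqueeze(1) @ k_sel.transpose(1, 2)) / dim**0.5, dim=-1)
    return (attn @ v_sel).squeeze(1)                                    # (seq_len, dim)
```

Scoring queries against per-block mean keys is just one simple gating choice; the trained gating and the fused FlashAttention-style kernels are what the repo itself provides.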
https://github.com/internlm/internevo
InternEvo is an open-sourced lightweight training framework that aims to support model pre-training without the need for extensive dependencies.
910b deepspeed-ulysses flash-attention gemma internlm internlm2 llama3 llava llm-framework llm-training multi-modal pipeline-parallelism pytorch ring-attention sequence-parallelism tensor-parallelism transformers-models zero3
Last synced: 14 Apr 2025
https://github.com/deftruth/ffpa-attn-mma
📚FFPA(Split-D): Yet another Faster Flash Prefill Attention with O(1) SRAM complexity for large headdim (D > 256), ~2x↑🎉 vs SDPA EA.
attention cuda deepseek deepseek-r1 deepseek-v3 flash-attention flash-mla fused-mla mla mlsys sdpa tensor-cores
Last synced: 06 Apr 2025
https://github.com/coincheung/gdgpt
Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed pipeline mode. Faster than ZeRO/ZeRO++/FSDP.
baichuan2-7b bloom chatglm3-6b deepspeed flash-attention full-finetune llama2 llm mixtral-8x7b model-parallization nlp pipeline pytorch
Last synced: 07 May 2025
https://github.com/DAMO-NLP-SG/Inf-CLIP
💣💣 The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A super memory-efficient CLIP training scheme.
clip contrastive-learning flash-attention infinite-batch-size memory-efficient ring-attention
Last synced: 28 Mar 2025
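The "memory barrier" in the title is the batch-by-batch similarity matrix that a standard CLIP loss materializes. A rough PyTorch sketch of the tiling idea (image-to-text direction only; this is not the Inf-CL code, and the real savings also require recomputation in the backward pass, which this sketch omits):

```python
# Hypothetical tiled CLIP loss; not the Inf-CL implementation.
import torch
import torch.nn.functional as F

def tiled_clip_loss(img_emb, txt_emb, temperature=0.07, tile=1024):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    batch = img_emb.shape[0]
    labels = torch.arange(batch, device=img_emb.device)
    loss = img_emb.new_zeros(())
    # Only a (tile x batch) slice of the similarity matrix exists per step.
    for start in range(0, batch, tile):
        rows = slice(start, min(start + tile, batch))
        logits = img_emb[rows] @ txt_emb.T / temperature
        loss = loss + F.cross_entropy(logits, labels[rows], reduction="sum")
    return loss / batch
```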
https://github.com/bruce-lee-ly/decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA cores for the decoding stage of LLM inference.
cuda cuda-core decoding-attention flash-attention flashinfer flashmla gpu gqa inference large-language-model llm mha mla mqa multi-head-attention nvidia
Last synced: 05 May 2025
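In the decode stage a single new query token attends over the whole KV cache, and with GQA/MQA several query heads share one KV head. A hypothetical PyTorch reference of that step, for orientation only (the repo itself provides optimized CUDA kernels):

```python
# Hypothetical GQA decode-step attention reference; not this repo's kernels.
import torch

def gqa_decode_step(q, k_cache, v_cache):
    # q:       (num_q_heads, head_dim)            -- one new token
    # k_cache: (num_kv_heads, seq_len, head_dim)
    # v_cache: (num_kv_heads, seq_len, head_dim)
    num_q_heads, head_dim = q.shape
    num_kv_heads = k_cache.shape[0]
    group = num_q_heads // num_kv_heads            # query heads per KV head
    q = q.view(num_kv_heads, group, head_dim)
    scores = torch.einsum("hgd,hsd->hgs", q, k_cache) / head_dim**0.5
    probs = torch.softmax(scores, dim=-1)
    out = torch.einsum("hgs,hsd->hgd", probs, v_cache)
    return out.reshape(num_q_heads, head_dim)
```

With group = 1 this reduces to plain MHA decoding, and with num_kv_heads = 1 it is MQA.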
https://github.com/bruce-lee-ly/flash_attention_inference
Benchmarks the performance of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
cuda cutlass flash-attention flash-attention-2 gpu inference large-language-model llm mha multi-head-attention nvidia tensor-core
Last synced: 13 Apr 2025
https://github.com/erfanzar/jax-flash-attn2
A flexible and efficient implementation of Flash Attention 2.0 for JAX, supporting multiple backends (GPU/TPU/CPU) and platforms (Triton/Pallas/JAX).
flash-attention flash-attention-2 jax pallas
Last synced: 12 Apr 2025
https://github.com/kyegomez/flashmha
A simple PyTorch implementation of Flash Multi-Head Attention.
artificial-intelligence artificial-neural-networks attention attention-mechanisms attentionisallyouneed flash-attention gpt4 transformer
Last synced: 07 May 2025
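For comparison, stock PyTorch (2.3+) already routes scaled_dot_product_attention to a FlashAttention kernel on supported GPUs. A minimal usage sketch, assuming a CUDA build with the flash backend available:

```python
# Minimal SDPA usage; PyTorch selects the FlashAttention kernel when it applies.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

batch, heads, seq, dim = 2, 8, 1024, 64
q = torch.randn(batch, heads, seq, dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict dispatch to the flash backend (raises if it is unsupported here).
with sdpa_kernel([SDPBackend.FLASH_ATTENTION]):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```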
https://github.com/kklemon/flashperceiver
Fast and memory-efficient PyTorch implementation of the Perceiver with FlashAttention.
attention-mechanism deep-learning flash-attention nlp perceiver transformer
Last synced: 16 May 2025
https://github.com/kreasof-ai/homunculus-project
Long-term project about a custom AI architecture, consisting of cutting-edge machine-learning techniques such as Flash-Attention, Grouped-Query-Attention, ZeRO-Infinity, BitNet, etc.
bitnet deep-learning flash-attention jupyter-notebook large-language-models low-rank-adaptation machine-learning python pytorch pytorch-lightning transformer vision-transformer
Last synced: 17 Dec 2024
https://github.com/masterskepticista/gpt2
Training GPT-2 on FineWeb-Edu in JAX/Flax
fineweb flash-attention flax gpt2 jax
Last synced: 06 Mar 2025
https://github.com/dcarpintero/pangolin-guard
Open, Lightweight Model for AI Guardrails
ai-safety alternating-attention bert-model fine-tuning flash-attention huggingface-transformers modernbert natural-language-processing prompt-guard
Last synced: 04 Apr 2025
https://github.com/lukasdrews97/dumblellm
Decoder-only LLM trained on the Harry Potter books.
byte-pair-encoding flash-attention grouped-query-attention large-language-model rotary-position-embedding transformer
Last synced: 05 Apr 2025
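The rotary-position-embedding ingredient mentioned in the tags is straightforward to reproduce: each pair of channels in the query/key vectors is rotated by a position-dependent angle. A standalone sketch (rotate-half variant, hypothetical names, not taken from this repo):

```python
# Hypothetical rotary position embedding (RoPE) sketch; not this repo's code.
import torch

def apply_rope(x, base=10000.0):
    # x: (seq_len, num_heads, head_dim), head_dim even
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # Per-channel frequencies base^(-2i/d), angles = position * frequency.
    freqs = base ** (-torch.arange(half, dtype=torch.float32, device=x.device) / half)
    angles = torch.arange(seq_len, dtype=torch.float32, device=x.device)[:, None] * freqs
    cos = angles.cos()[:, None, :].to(x.dtype)   # (seq, 1, half)
    sin = angles.sin()[:, None, :].to(x.dtype)
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) channel pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```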