Awesome-LLM-Long-Context-Modeling
📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥
https://github.com/Xnhyacinth/Awesome-LLM-Long-Context-Modeling
📜 Papers
2. Efficient Attention
- **SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning.**
- **SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning.**
- **Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning.**
- **HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading.**
- **Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression.**
- **KVCrush: Key value cache size-reduction using similarity in head-behaviour.**
- **Slim attention: cut your context memory in half without loss of accuracy -- K-cache is all you need for MHA.** [GitHub](https://github.com/OpenMachine-ai/transformer-tricks)
- **Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction.**
- **Softmax Attention with Constant Cost per Token.**
- **Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length.**
- **Unlocking the Secrets of Linear Complexity Sequence Model from A Unified Perspective.**
- **Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention.**
- **Attention as an RNN.**
- **You Only Scan Once: Efficient Multi-dimension Sequential Modeling with LightNet.**
- **Learning to (Learn at Test Time): RNNs with Expressive Hidden States.** [GitHub](https://github.com/test-time-training/ttt-lm-pytorch)
- **Gated Slot Attention for Efficient Linear-Time Sequence Modeling.** [GitHub](https://github.com/sustcsonglin/flash-linear-attention)
- **Self-attention Does Not Need O(n^2) Memory.**
- **Faster Causal Attention Over Large Sequences Through Sparse Flash Attention.**
- **TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer.**
- **FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.** [GitHub](https://github.com/Dao-AILab/flash-attention)
- **Efficient LLM Inference with Kcache.**
- **You Only Cache Once: Decoder-Decoder Architectures for Language Models.**
- **Fast Transformer Decoding: One Write-Head is All You Need.**
- **Layer-Condensed KV Cache for Efficient Inference of Large Language Models.**
- **GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.** _…-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai._ Arxiv 2023.
- **PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference.**
- **Reducing Transformer Key-Value Cache Size with Cross-Layer Attention.**
- **Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression.** _…-Feng Gao, Wayne Xin Zhao, Yipeng Ma, Tao Wang, Ji-Rong Wen._ Arxiv 2024.
- **MiniCache: KV Cache Compression in Depth Dimension for Large Language Models.**
- **PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling.**
- **Effectively Compress KV Heads for LLM.**
- **Beyond KV Caching: Shared Attention for Efficient LLMs.**
- **A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression.**
- **Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks.**
- **Optimizing KV Cache Eviction in LLMs: Adaptive Allocation for Enhanced Budget Utilization.**
- **PQCache: Product Quantization-based KVCache for Long Context LLM Inference.**
- **LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference.**
- **Farewell to Length Extrapolation, a Training-Free Infinite Context with Finite Attention Scope.**
- **RazorAttention: Efficient KV Cache Compression Through Retrieval Heads.**
- **Cross-layer Attention Sharing for Large Language Models.**
- **NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time.**
- **Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters.**
- **FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision.**
- **ThinK: Thinner Key Cache by Query-Driven Pruning.**
- **A2SF: Accumulative Attention Scoring with Forgetting Factor for Token Pruning in Transformer Decoder.** _…-rae Jo, Dongkun Shin._ Arxiv 2024. [GitHub](https://github.com/Dirac-Notation/A2SF)
- **CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios.**
- **UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference.**
- **LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy.**
- **In-context KV-Cache Eviction for LLMs via Attention-Gate.**
- **A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference.**
- **MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection.**
- **VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration.**
- **EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models.**
- **MagicPIG: LSH Sampling for Efficient LLM Generation.** [GitHub](https://github.com/Infini-AI-Lab/MagicPIG)
- **Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning.**
- **Star Attention: Efficient LLM Inference over Long Sequences.** [GitHub](https://github.com/NVIDIA/Star-Attention)
- **When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training.**
- **Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity.**
- **Squeezed Attention: Accelerating Long Context Length LLM Inference.**
- **Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache.**
- **TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection.**
- **Generating Long Sequences with Sparse Transformers.**
- **Blockwise self-attention for long document understanding.** _…-tau Yih, Sinong Wang, Jie Tang._ EMNLP 2020. [GitHub](https://github.com/xptree/BlockBERT)
- **Longformer: The Long-Document Transformer.**
- **ETC: Encoding Long and Structured Inputs in Transformers.**
- **Big Bird: Transformers for Longer Sequences.** [GitHub](https://github.com/google-research/bigbird)
- **Reformer: The efficient transformer.** [GitHub](https://github.com/lucidrains/reformer-pytorch)
- **Sparse Sinkhorn Attention.** _…-Cheng Juan._ ICML 2020. [GitHub](https://github.com/lucidrains/sinkhorn-transformer)
- **Sparse and continuous attention mechanisms.**
- **Efficient Long-Text Understanding with Short-Text Models.**
- **Parallel Context Windows for Large Language Models.** _…-Brown, Yoav Shoham._ ACL 2023. [GitHub](https://github.com/AI21Labs/Parallel-Context-Windows)
- **Efficient Content-Based Sparse Attention with Routing Transformers.** [GitHub](https://github.com/lucidrains/routing-transformer)
- **LongT5: Efficient text-to-text transformer for long sequences.** _…-Hsuan Sung, Yinfei Yang._ NAACL 2022. [GitHub](https://github.com/google-research/longt5)
- **Unlimiformer: Long-Range Transformers with Unlimited Length Input.**
- **LONGNET: Scaling Transformers to 1,000,000,000 Tokens.**
- **Blockwise Parallel Transformer for Long Context Large Models.** [GitHub](https://github.com/lhao499/llm_large_context)
- **MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers.** [GitHub](https://github.com/lucidrains/MEGABYTE-pytorch)
- **Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers.**
- **Long-range Language Modeling with Self-retrieval.**
- **Max-Margin Token Selection in Attention Mechanism.**
- **Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers.**
- **Sparse Token Transformer with Attention Back Tracking.**
- **Training-Free Long-Context Scaling of Large Language Models.**
- **LongHeads: Multi-Head Attention is Secretly a Long Context Processor.**
- **Empower Your Model with Longer and Better Context Comprehension.** [GitHub](https://github.com/yileijin/attention-transition)
- **Ring Attention with Blockwise Transformers for Near-Infinite Context.**
- **HyperAttention: Long-context Attention in Near-Linear Time.**
- **Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention.**
- **Sequence can Secretly Tell You What to Discard.**
- **HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning.** [GitHub](https://github.com/DeepAuto-AI/hip-attention)
- **MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression.**
- **Selective Attention Improves Transformer.**
- **Taking a Deep Breath: Enhancing Language Modeling of Large Language Models with Sentinel Tokens.**
- **Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers.**
- **Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention.**
- **FltLM: An Intergrated Long-Context Large Language Model for Effective Context Filtering and Understanding.**
- **Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix.**
- **Extra Global Attention Designation Using Keyword Detection in Sparse Transformer Architectures.**
- **Selective Attention: Enhancing Transformer through Principled Context Control.** _…-Chowdhury, Jiasi Chen, Samet Oymak._ NeurIPS 2024.
- **Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention.** [GitHub](https://github.com/idiap/fast-transformers)
- **SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs.** _…-Hay So, Ting Cao, Fan Yang, Mao Yang._ Arxiv 2024. [GitHub](https://github.com/microsoft/SeerAttention)
- **Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations.**
- **Rethinking attention with performers.** [GitHub](https://github.com/lucidrains/performer-pytorch)
- **Luna: Linear unified nested attention.** [GitHub](https://github.com/sooftware/luna-transformer)
- **Fnet: Mixing tokens with fourier transforms.** _…-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon._ Arxiv 2021. [GitHub](https://github.com/jaketae/fnet)
- **Random Feature Attention.** [GitHub](https://github.com/Noahs-ARK/RFA)
- **Gated Linear Attention Transformers with Hardware-Efficient Training.**
- **Simple linear attention language models balance the recall-throughput tradeoff.**
- **Latent Attention for Linear Time Transformers.**
- **Linear Attention Sequence Parallelism.**
- **HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing.** _…-Anne Hartley, Brian Gravelle, Furong Huang, Cornelia Fermüller, Yiannis Aloimonos._ Arxiv 2024.
- **LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid.** [GitHub](https://github.com/OpenSparseLLMs/Linear-MoE)
- **TriangleMix: A Lossless and Efficient Attention Pattern for Long Context Prefilling.**
- **CaliDrop: KV Cache Compression with Calibration.**
- **Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving.** [GitHub](https://github.com/LLMkvsys/rethink-kv-compression)
- **SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching.**
- **LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important.** [GitHub](https://github.com/AI-Lab-China-Merchants-Bank/LagKV)
- **CompressKV: Semantic Retrieval Heads Know What Tokens are Not Important Before Generation.** [GitHub](https://github.com/TUDa-HWAI/CompressKV)
- **LeanK: Learnable K Cache Channel Pruning for Efficient Decoding.**
- **Trainable Dynamic Mask Sparse Attention.** [GitHub](https://github.com/SmallDoges/flash-dmattn)
- **X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression.**
- **LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference.**
- **KV-Distill: Nearly Lossless Learnable Context Compression for LLMs.** [GitHub](https://github.com/vnchari/kv-distill)
- **Radar: Fast Long-Context Decoding for Any Transformer.**
- **PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention.**
- **RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression.** _…-An Tsai, Zhiding Yu, Alexey Tumanov._ Arxiv 2025.
- **Tensor Product Attention Is All You Need.** _…-Chih Yao._ Arxiv 2025. [GitHub](https://github.com/tensorgi/T6)
- **GTA: Grouped-head latenT Attention.**
- **ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition.**
- **Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification.**
- **Multi-head Temporal Latent Attention.** [GitHub](https://github.com/D-Keqi/mtla)
- **HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference.**
- **CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences.**
- **Cross-Self KV Cache Pruning for Efficient Vision-Language Inference.**
- **Ltri-LLM: Streaming Long Context Inference for LLMs with Training-Free Dynamic Triangular Attention Pattern.**
- **Rectified Sparse Attention.**
- **SeerAttention-R: Sparse Attention Adaptation for Long Reasoning.** _…-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang._ Arxiv 2025. [GitHub](https://github.com/microsoft/SeerAttention)
- **Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models.** _…-Hao Chiang, Zhongxin Guo, Chen Lin, Kun Kuang, Wenjie Li, Yelong Shen, Jian Jiao, Peng Cheng, Mao Yang._ Arxiv 2025.
- **Twilight: Adaptive Attention Sparsity with Hierarchical Top-p Pruning.**
- **Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query.**
- **R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration.** _…-Wen Chang, Jiuxiang Gu, Zhen Dong, Anima Anandkumar, Abedelkadir Asi, Junjie Hu._ Arxiv 2025. [GitHub](https://github.com/Zefan-Cai/R-KV)
- **Inference-Time Hyper-Scaling with KV Cache Compression.**
- **SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers.**
- **HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs.**
- **FAEDKV: Infinite-Window Fourier Transform for Unbiased KV Cache Compression.**
- **EvolKV: Evolutionary KV Cache Compression for LLM Inference.**
- **LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation.** _…-Tu._ Arxiv 2025.
- **UniGist: Towards General and Hardware-aligned Sequence-level Long Context Compression.**
- **EpiCache: Episodic KV Cache Management for Long Conversational Question Answering.**
- **KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing.**
- **Lossless KV Cache Compression to 2%.**
- **TokenButler: Token Importance is Predictable.** _…-Chih Chang, Nilesh Jain, Mohamed S. Abdelfattah._ Arxiv 2025. [GitHub](https://github.com/abdelfattah-lab/TokenButler)
- **Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference.**
- **Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval.**
- **GraphKV: Breaking the Static Selection Paradigm with Graph-Based KV Cache Eviction.**
- **SnapKV: LLM Knows What You are Looking for Before Generation.**
- **Core Context Aware Attention for Long Context Language Modeling.**
- **AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning.** [GitHub](https://github.com/LaVi-Lab/AIM)
- **ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression.**
- **BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching.**
- **DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs.**
- **AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference.**
- **TreeKV: Smooth Key-Value Cache Compression with Tree Structures.**
- **LoLA: Low-Rank Linear Attention With Sparse Caching.**
- **dKV-Cache: The Cache for Diffusion Language Models.** [GitHub](https://github.com/horseee/dKV-Cache)
- **SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training.**
- **TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization.**
- **Mustafar: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference.**
- **Scale-invariant Attention.**
- **Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing.**
- **The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs.** [GitHub](https://github.com/PiotrNawrot/sparse-frontier)
- **SageAttention2++: A More Efficient Implementation of SageAttention2.** [GitHub](https://github.com/thu-ml/SageAttention)
- **KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference.**
- **MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models.**
- **CAOTE: KV Caching through Attention Output Error based Token Eviction.**
- **LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention.** [GitHub](https://github.com/mit-han-lab/omniserve)
- **Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding.** [GitHub](https://github.com/kostyanoob/top-theta-attention)
- **Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective.**
- **KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference.** _…-Ling Zhen, Wulong Liu, Yiwu Yao, Sinno Jialin Pan, Mingxuan Yuan._ Arxiv 2025.
- **Online Scheduling for LLM Inference with KV Cache Constraints.**
- **BalanceKV: KV Cache Compression through Discrepancy Theory.**
- **TransMLA: Multi-Head Latent Attention Is All You Need.**
- **Inference-time sparse attention with asymmetric indexing.** _…-Emmanuel Mazaré, Gergely Szilvasy, Maria Lomeli, Francisco Massa, Naila Murray, Hervé Jégou, Matthijs Douze._ Arxiv 2025.
- **MoBA: Mixture of Block Attention for Long-Context LLMs.**
- **A2ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization.**
- **EMS: Adaptive Evict-then-Merge Strategy for Head-wise KV Cache Compression Based on Global-Local Importance.**
- **ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty.**
- **SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator.**
- **XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference.**
- **SparseAccelerate: Efficient Long-Context Inference for Mid-Range GPUs.**
- **SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs.** _…-Hong Deng, Jing Han._ Arxiv 2025.
- **KVShare: Semantic-Aware Key-Value Cache Sharing for Efficient Large Language Model Inference.**
- **xKV: Cross-Layer SVD for KV-Cache Compression.** _…-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, Mohamed S. Abdelfattah._ Arxiv 2025. [GitHub](https://github.com/abdelfattah-lab/xKV)
- **WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference.**
- **BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache.** [GitHub](https://github.com/DD-DuDa/BitDecoding)
- **Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression.**
- **Efficient Streaming Language Models with Attention Sinks.** [GitHub](https://github.com/mit-han-lab/streaming-llm)
- **Cost-Optimal Grouped-Query Attention for Long-Context LLMs.** [GitHub](https://github.com/THUNLP/cost-optimal-gqa)
- **TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention.**
- **Masked language modeling for proteins via linearly scalable long-context transformers.**
- **Linformer: Self-attention with linear complexity.** [GitHub](https://github.com/lucidrains/linear-attention-transformer)
- **Efficient Memory Management for Large Language Model Serving with PagedAttention.** [GitHub](https://github.com/vllm-project/vllm)
- **More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression.**
- **Boosting Long-Context Information Seeking via Query-Guided Activation Refilling.**
- **SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation.** [GitHub](https://github.com/Linking-ai/SCOPE)
- **KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse.** [GitHub](https://github.com/UCSB-NLP-Chang/KVLink)
- **FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference.**
- **SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention.** _…-Ling, Yu Xianzhi, Liu Wulong, Yuan Mingxuan._ Arxiv 2025.
- **Round Attention: A Novel Round-Level Attention Mechanism to Accelerate LLM Inference.**
- **DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance.**
- **Neural Attention Search.**
- **Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention.**
- **QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache.**
- **APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs.**
- **AdaSplash: Adaptive Sparse Flash Attention.** [GitHub](https://github.com/deep-spin/adasplash)
- **InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU.** [GitHub](https://github.com/DeepAuto-AI/hip-attention)
- **MoM: Linear Sequence Modeling with Mixture-of-Memories.**
- **AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference.**
- **Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs.** [GitHub](https://github.com/ryansynk/topk-decoding)
- **CoKV: Optimizing KV Cache Allocation via Cooperative Game.**
- **RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models.** _…-Yiu Yau, Hoi-To Wai, Yang (Katie) Zhao, Dongyeop Kang, Youngsuk Park, Mingyi Hong._ Arxiv 2025.
- **MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference.** [GitHub](https://github.com/AIoT-MLSys-Lab/MEDA)
- **FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference.**
- **QuickLLaMA: Query-aware Inference Acceleration for Large Language Models.** [GitHub](https://github.com/dvlab-research/Q-LLM)
- **MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention.** _…-Yew Lin, Yuqing Yang, Lili Qiu._ Arxiv 2024. [GitHub](https://github.com/microsoft/MInference)
- **MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding.** _…-Hsu Yen, Beidi Chen._ Arxiv 2024. [GitHub](https://github.com/Infini-AI-Lab/MagicDec/)
- **RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval.**
- **Landmark Attention: Random-Access Infinite Context Length for Transformers.** [GitHub](https://github.com/epfml/landmark-attention)
- **Fovea Transformer: Efficient Long-Context Modeling with Structured Fine-to-Coarse Attention.** [GitHub](https://github.com/ZiweiHe/Fovea-Transformer)
- **SinkLoRA: Enhanced Efficiency and Chat Capabilities for Long-Context Large Language Models.** [GitHub](https://github.com/Dexter-GT-86/SinkLoRA)
- **Neurocache: Efficient Vector Retrieval for Long-range Language Modeling.**
- **LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation.**
- **Weighted Grouped Query Attention in Transformers.**
- **When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models.** [GitHub](https://github.com/GATECH-EIC/Linearized-LLM)
- **Hierarchical Neural Network Approaches for Long Document Classification.**
- **Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference.** [GitHub](https://github.com/mit-han-lab/Quest)
- **Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters.**
- **CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling.** _…-Antoine Rondeau, Yang Gao, Jackie Chi Kit Cheung._ Arxiv 2024.
- **D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models.**
- **LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference.** [GitHub](https://github.com/SUSTechBruce/LOOK-M)
- **Training-Free Exponential Extension of Sliding Window Context with Cascading KV Cache.**
- **InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference.**
- **CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs.**
- **Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction.** _…-Phi Nguyen, Yingyu Liang, Shafiq Joty._ Arxiv 2024. [GitHub](https://github.com/SalesforceAIResearch/GemFilter)
- **Inference-Friendly Models With MixAttention.**
- **KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head.** [GitHub](https://github.com/IsaacRe/vllm-kvcompress)
- **Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads.**
- **InfiniPot: Infinite Context Processing on Memory-Constrained LLMs.**
- **DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads.** [GitHub](https://github.com/mit-han-lab/duo-attention)
- **SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction.** [GitHub](https://github.com/sail-sg/SimLayerKV)
- **Long Sequence Modeling with Attention Tensorization: From Sequence to Tensor Learning.**
- **ZeroMerge: Parameter-Free KV Cache Compression for Memory-Efficient Long-Context LLMs.** [GitHub](https://github.com/SusCom-Lab/ZeroMerge)
- **ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference.** _…-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen._ Arxiv 2024. [GitHub](https://github.com/bytedance/ShadowKV)
- **BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference.** [GitHub](https://github.com/JunqiZhao888/buzz-llm)
- **Recycled Attention: Efficient inference for long-context language models.** [GitHub](https://github.com/carriex/recycled-attention)
- **Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs.** [GitHub](https://github.com/JT-Ushio/MHA2MLA)
- **AKVQ-VL: Attention-Aware KV Cache Adaptive 2-Bit Quantization for Vision-Language Models.**
- **ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference.**
- **FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation.** _…-Joon Kim._ Arxiv 2025. [GitHub](https://github.com/dongwonjo/FastKV)
- **Can LLMs Maintain Fundamental Abilities under KV Cache Compression?**
- **Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation.**
- **Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization.** _…-Young Kim, Jongse Park._ Arxiv 2025.
- **Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference.**
- **PromptDistill: Query-based Selective Token Retention in Intermediate Layers for Efficient Large Language Model Inference.**
- **SQuat: Subspace-orthogonal KV Cache Quantization.**
- **FlowKV: A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware Scheduling.**
- **Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving.**
- **OmniKV: Dynamic Context Selection for Efficient Long-Context LLMs.**
- **XAttention: Block Sparse Attention with Antidiagonal Scoring.** [GitHub](https://github.com/mit-han-lab/x-attention)
- **EDiT: Efficient Diffusion Transformers with Linear Compressed Attention.**
- **WeightedKV: Attention Scores Weighted Key-Value Cache Merging for Large Language Models.**
- **Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs.**
- **EliteKV: Scalable KV Cache Compression via RoPE Frequency Selection and Joint Low-Rank Projection.**
- **Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving.**
- **Exploring the Limits of KV Cache Compression in Visual Autoregressive Transformers.**
- **Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs.**
- **BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference.** _…-Venkata, Murali Emani, Mahmut Kandemir, Venkatram Vishwanath._ Arxiv 2025.
- **TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering.**
- **DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration.** [GitHub](https://github.com/HanzhiZhang-Ulrica/DAM)
- **Efficient Long-Context LLM Inference via KV Cache Clustering.**
- **Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache.**
- **Multipole Attention for Efficient Long Context Reasoning.**
- **Lag-Relative Sparse Attention In Long Context Training.**
- **Mixture of Weight-shared Heterogeneous Group Attention Experts for Dynamic Token-wise KV Optimization.** _…-zhong Xu, Xitong Gao._ Arxiv 2025.
- **Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?** [GitHub](https://github.com/princeton-pli/PruLong)
- **LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning.**
- **KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction.** _…-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song._ Arxiv 2025. [GitHub](https://github.com/snu-mllab/KVzip)
- **FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension.**
- **Fast and Simplex: 2-Simplicial Attention in Triton.**
- **XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization.**
- **Sparse Attention across Multiple-context KV Cache.**
- **Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning.**
- **Retrospective Sparse Attention for Efficient Long-Context Generation.** _…-Joon Kim._ Arxiv 2025.
- **KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs.**
- **PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs.** [GitHub](https://github.com/thu-nics/PM-KVQ)
- **VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models.**
- **Hardware-Efficient Attention for Fast Decoding.** [GitHub](https://github.com/Dao-AILab/grouped-latent-attention)
- **Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion.** _…-sun Seo, Zhiru Zhang, Udit Gupta._ Arxiv 2025.
- **ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration.**
- **AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models.**
- **Homogeneous Keys, Heterogeneous Values: Exploiting Local KV Cache Asymmetry for Long-Context LLMs.**
- **Paged Attention Meets FlexAttention: Unlocking Long-Context Efficiency in Deployed Inference.**
- **KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache.**
- **Latent Multi-Head Attention for Small Language Models.**
- **AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference.**
- **CommVQ: Commutative Vector Quantization for KV Cache Compression.** [UMass-Embodied-AGI/CommVQ](https://github.com/UMass-Embodied-AGI/CommVQ)
- **Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing.**
- **Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores.**
- **Think Clearly: Improving Reasoning via Redundant Token Pruning.**
- **LoopServe: An Adaptive Dual-phase LLM Inference Acceleration System for Multi-Turn Dialogues.**
- **LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models.** [GATECH-EIC/LaCache](https://github.com/GATECH-EIC/LaCache)
- **KVCompose: Efficient Structured KV Cache Compression with Composite Tokens.**
- **OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja's Rule.** - Yu Chen._ Arxiv 2025.
- **ProxyAttn: Guided Sparse Attention via Representative Heads.**
-
9. Compress
- **AttnComp: Attention-Guided Adaptive Context Compression for Retrieval-Augmented Generation.**
- **Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models.** [IntelLabs/Hardware-Aware-Automated-Machine-Learning](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning)
- **TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models.**
- **AdaSVD: Adaptive Singular Value Decomposition for Large Language Models.**
- **DeepSeek-OCR: Contexts Optical Compression.** [deepseek-ai/DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR)
- **See the Text: From Tokenization to Visual Reading.**
- **Saliency-driven Dynamic Token Pruning for Large Language Models.**
- **DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models.**
- **EFPC: Towards Efficient and Flexible Prompt Compression.** - Hao Cao, Yangsong Wang, Shuzheng Hao, Zhenxing Li, Chengjun Zhan, Sichao Liu, Yi-Qi Hu._ Arxiv 2025.
- **AttentionRAG: Attention-Guided Context Pruning in Retrieval-Augmented Generation.**
- **Token Dynamics: Towards Efficient and Dynamic Video Token Representation for Video Large Language Models.**
- **Understanding and Improving Information Preservation in Prompt Compression for LLMs.**
- **Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck.**
- **DeRS: Towards Extremely Efficient Upcycled Mixture-of-Experts Models.**
- **SVD-LLM V2: Optimizing Singular Value Truncation for Large Language Model Compression.** [AIoT-MLSys-Lab/SVD-LLM](https://github.com/AIoT-MLSys-Lab/SVD-LLM)
- **Learning to Compress Prompt in Natural Language Formats.** - Neng Chuang, Tianwei Xing, Chia-Yuan Chang, Zirui Liu, Xun Chen, Xia Hu._ Arxiv 2024.
- **LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression.** - Yew Lin, H. Vicky Zhao, Lili Qiu, Dongmei Zhang._ Arxiv 2024. [microsoft/LLMLingua](https://github.com/microsoft/LLMLingua)
- **PCToolkit: A Unified Plug-and-Play Prompt Compression Toolkit of Large Language Models.** [3DAgentWorld/Toolkit-for-Prompt-Compression](https://github.com/3DAgentWorld/Toolkit-for-Prompt-Compression)
- **LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models.** - Yew Lin, Yuqing Yang, Lili Qiu._ Arxiv 2023. [microsoft/LLMLingua](https://github.com/microsoft/LLMLingua)
- **Compressing Context to Enhance Inference Efficiency of Large Language Models.**
- **LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression.** - Yew Lin, Yuqing Yang, Lili Qiu._ Arxiv 2023. [microsoft/LLMLingua](https://github.com/microsoft/LLMLingua)
- **Compressed Context Memory for Online Language Model Interaction.** - Hyun Kim, Junyoung Yeom, Sangdoo Yun, Hyun Oh Song._ ICLR 2024. [snu-mllab/context-memory](https://github.com/snu-mllab/context-memory)
- **PROMPT-SAW: Leveraging Relation-Aware Graphs for Textual Prompt Compression.**
- **Training LLMs over Neurally Compressed Text.** - Dickstein, Noah Constant._ Arxiv 2024.
- **Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models.**
- **Adapting LLMs for Efficient Context Processing through Soft Prompt Compression.**
- **Imagination Augmented Generation: Learning to Imagine Richer Context for Question Answering over Large Language Models.**
- **System 2 Attention (is something you might need too).**
- **Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon.**
- **Flexibly Scaling Large Language Models Contexts Through Extensible Tokenization.**
- **Say More with Less: Understanding Prompt Learning Behaviors through Gist Compression.** [OpenMatch/Gist-COCO](https://github.com/OpenMatch/Gist-COCO)
- **DSFormer: Effective Compression of Text-Transformers by Dense-Sparse Weight Factorization.**
- **xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token.** - Qing Chen, Furu Wei, Huishuai Zhang, Dongyan Zhao._ Arxiv 2024. [Hannibal046/xRAG](https://github.com/Hannibal046/xRAG)
- **Evaluating Zero-Shot Long-Context LLM Compression.**
- **In-Context Learning State Vector with Inner and Momentum Optimization.** [HITsz-TMG/ICL-State-Vector](https://github.com/HITsz-TMG/ICL-State-Vector)
- **SelfCP: Compressing Long Prompt to 1/12 Using the Frozen Large Language Model Itself.**
- **Compressing Long Context for Enhancing RAG with AMR-based Concept Distillation.**
- **Improving Long Text Understanding with Knowledge Distilled from Summarization Model.**
- **OpenBA-V2: Reaching 77.3% High Compression Ratio with Fast Multi-Stage Pruning.** [OpenNLG/OpenBA-v2](https://github.com/OpenNLG/OpenBA-v2)
- **XL3M: A Training-free Framework for LLM Length Extension Based on Segment-wise Inference.**
- **Recurrent Context Compression: Efficiently Expanding the Context Window of LLM.** [WUHU-G/RCC_Transformer](https://github.com/WUHU-G/RCC_Transformer)
- **Compressing Lengthy Context With UltraGist.** [namespace-Pt/UltraGist](https://github.com/namespace-Pt/UltraGist)
- **Your Transformer is Secretly Linear.** [AIRI-Institute/LLM-Microscope](https://github.com/AIRI-Institute/LLM-Microscope)
- **In-Context Former: Lightning-fast Compressing Context for Large Language Model.**
- **UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs.** [wenhaoli-xmu/UIO-LLMs](https://github.com/wenhaoli-xmu/UIO-LLMs)
- **PromptIntern: Saving Inference Costs by Internalizing Recurrent Prompt during Large Language Model Fine-tuning.**
- **AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.** - Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han._ MLSys 2024 Best Paper Award. [mit-han-lab/llm-awq](https://github.com/mit-han-lab/llm-awq)
- **InstructCMP: Length Control in Sentence Compression through Instruction-based Large Language Models.** - Do, Jingun Kwon, Hidetaka Kamigaito, Manabu Okumura._ Arxiv 2024. [JuseonDo/InstructCMP](https://github.com/JuseonDo/InstructCMP)
- **Concise and Precise Context Compression for Tool-Using Language Models.**
- **Context Embeddings for Efficient Answer Generation in RAG.**
- **Fundamental Limits of Prompt Compression: A Rate-Distortion Framework for Black-Box Language Models.**
- **QUITO: Accelerating Long-Context Reasoning through Query-Guided Context Compression.**
- **SentenceVAE: Faster, Longer and More Accurate Inference with Next-sentence Prediction for Large Language Models.**
- **QUITO-X: An Information Bottleneck-based Compression Algorithm with Cross-Attention.**
- **AdaComp: Extractive Context Compression with Adaptive Predictor for Retrieval-Augmented Large Language Models.**
- **Characterizing Prompt Compression Methods for Long Context Inference.**
- **Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference.**
- **Familiarity-aware Evidence Compression for Retrieval Augmented Generation.** [luka-group/FaviComp](https://github.com/luka-group/FaviComp)
- **TACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning.**
- **Parse Trees Guided LLM Prompt Compression.**
- **FineZip: Pushing the Limits of Large Language Models for Practical Lossless Text Compression.**
- **Perception Compressor: A training-free prompt compression method in long context scenarios.** - Tao Zheng._ Arxiv 2024.
- **From Reading to Compressing: Exploring the Multi-document Reader for Prompt Compression.**
- **Selection-p: Self-Supervised Task-Agnostic Prompt Compression for Faithfulness and Transferability.** - Yan Yeung._ EMNLP 2024.
- **Style-Compress: An LLM-Based Prompt Compression Framework Considering Task-Specific Styles.**
- **Adapting Language Models to Compress Contexts.** [princeton-nlp/AutoCompressors](https://github.com/princeton-nlp/AutoCompressors)
- **Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference.**
- **ProCut: LLM Prompt Compression via Attribution Estimation.**
- **Towards Extreme Pruning of LLMs with Plug-and-Play Mixed Sparsity.**
- **Limits of KV Cache Compression for Tensor Attention based Autoregressive Transformers.**
- **ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning.**
- **Merging Feed-Forward Sublayers for Compressed Transformers.** [nverma1/merging-ffs-compression](https://github.com/nverma1/merging-ffs-compression/)
- **Efficient Long CoT Reasoning in Small Language Models.**
- **DTRNet: Dynamic Token Routing Network to Reduce Quadratic Costs in Transformers.**
- **SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression.**
- **Hybrid Latent Reasoning via Reinforcement Learning.**
- **Activation-Guided Consensus Merging for Large Language Models.**
- **ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations.**
- **Efficient LLMs with AMP: Attention Heads and MLP Pruning.**
- **ParamΔ for Direct Weight Mixing: Post-Train Large Language Model at Zero Cost.**
- **Dynamic Compressing Prompts for Efficient Inference of Large Language Models.**
- **ACoRN: Noise-Robust Abstractive Compression in Retrieval-Augmented Language Models.** - Whan Lee._ Arxiv 2025.
- **MOOSComp: Improving Lightweight Long-Context Compressor via Mitigating Over-Smoothing and Incorporating Outlier Scores.**
- **An Empirical Study on Prompt Compression for Large Language Models.** [3DAgentWorld/Toolkit-for-Prompt-Compression](https://github.com/3DAgentWorld/Toolkit-for-Prompt-Compression)
- **Token Sequence Compression for Efficient Multimodal Computing.**
- **QwenLong-CPRS: Towards ∞-LLMs with Dynamic Context Optimization.** [Tongyi-Zhiwen/QwenLong-CPRS](https://github.com/Tongyi-Zhiwen/QwenLong-CPRS)
- **Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models.** [xuyang-liu16/VidCom2](https://github.com/xuyang-liu16/VidCom2)
- **Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning.**
- **70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float.**
- **From Large to Super-Tiny: End-to-End Optimization for Cost-Efficient LLMs.**
- **ImPart: Importance-Aware Delta-Sparsification for Improved Model Compression and Merging in LLMs.**
- **Not All Tokens Are What You Need In Thinking.** [Faustrazor/Not-All-Thinking-Tokens](https://github.com/Faustrazor/Not-All-Thinking-Tokens)
- **FlashThink: An Early Exit Method For Efficient Reasoning.**
- **On-Device Qwen2.5: Efficient LLM Inference with Model Compression and Hardware Acceleration.**
- **Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning.** - Joon Kim._ Arxiv 2025. [jiwonsong-dev/ReasoningPathCompression](https://github.com/jiwonsong-dev/ReasoningPathCompression)
- **TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling.** - Ling Zhen, Yaoyuan Wang, Xianzhi Yu, Wulong Liu, Xiaosong Li, Mingxuan Yuan._ Arxiv 2025.
- **Efficient Reasoning via Chain of Unconscious Thought.** [Rohan-GRH/CoUT](https://github.com/Rohan-GRH/CoUT)
- **A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression.**
- **Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers.** [GATECH-EIC/DiffRatio-MoD](https://github.com/GATECH-EIC/DiffRatio-MoD)
- **Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs.**
- **Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models.** - Wei Xie, Pandeng Li, Liming Zhao, Longxiang Tang, Yun Zheng, Chuanbin Liu, Hongtao Xie._ CVPR 2025. [lntzm/HICom](https://github.com/lntzm/HICom)
- **Retaining Key Information under High Compression Ratios: Query-Guided Compressor for LLMs.**
- **IDEA Prune: An Integrated Enlarge-and-Prune Pipeline in Generative Language Model Pretraining.**
- **Position-Aware Depth Decay Decoding (D3): Boosting Large Language Model Inference Efficiency.**
- **System-1.5 Reasoning: Traversal in Language and Latent Spaces with Dynamic Shortcuts.**
- **SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval.** [speechprune.github.io](https://speechprune.github.io/)
- **FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing.**
- **CSR: Achieving 1 Bit Key-Value Cache via Sparse Representation.**
- **EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation.**
- **Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference.**
- **PISCO: Pretty Simple Compression for Retrieval-Augmented Generation.**
- **Provence: efficient and robust context pruning for retrieval-augmented generation.** [naver/provence-reranker-debertav3-v1](https://huggingface.co/naver/provence-reranker-debertav3-v1)
- **FlexiGPT: Pruning and Extending Large Language Models with Low-Rank Weight Sharing.** - Heng Lin, Shikhar Tuli, Haris Jeelani, Shangqian Gao, Yilin Shen, Hongxia Jin, Yen-Chang Hsu._ NAACL 2025.
- **TensorLLM: Tensorising Multi-Head Attention for Enhanced Reasoning and Compression in LLMs.**
- **You Only Prune Once: Designing Calibration-Free Model Compression With Policy Learning.**
- **QuEST: Stable Training of LLMs with 1-Bit Weights and Activations.** [IST-DASLab/QuEST](https://github.com/IST-DASLab/QuEST)
- **DarwinLM: Evolutionary Structured Pruning of Large Language Models.**
- **Hyper Compressed Fine-Tuning of Large Foundation Models with Quantum Inspired Adapters.**
- **Contextual Compression Encoding for Large Language Models: A Novel Framework for Multi-Layered Parameter Space Pruning.**
- **Forget the Data and Fine-Tuning! Just Fold the Network to Compress.** [nanguoyu/model-folding-universal](https://github.com/nanguoyu/model-folding-universal)
- **NestQuant: Nested Lattice Quantization for Matrix Products and LLMs.**
- **Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models.**
- **When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models.**
- **Optimizing Singular Spectrum for Large Language Model Compression.** - Hsuan Yang._ Arxiv 2025.
- **Rotate, Clip, and Partition: Towards W2A4KV4 Quantization by Integrating Rotation and Learnable Non-uniform Quantizer.**
- **Activation-Informed Merging of Large Language Models.**
- **Adapt-Pruner: Adaptive Structural Pruning for Efficient Small Language Model Training.**
- **LLM-Pruner: On the Structural Pruning of Large Language Models.** [horseee/LLM-Pruner](https://github.com/horseee/LLM-Pruner)
- **Knowing When to Stop: Dynamic Context Cutoff for Large Language Models.** [ruoyuxie/when-to-stop](https://github.com/ruoyuxie/when-to-stop)
- **LCIRC: A Recurrent Compression Approach for Efficient Long-form Context and Query Dependent Modeling in LLMs.**
- **DAST: Context-Aware Compression in LLMs via Dynamic Allocation of Soft Tokens.** - tao Zheng._ Arxiv 2025.
- **Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?.**
- **Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity.**
- **LLoCO: Learning Long Contexts Offline.**
- **In-context Autoencoder for Context Compression in a Large Language Model.** - Qing Chen, Furu Wei._ ICLR 2024. [getao/icae](https://github.com/getao/icae)
- **LoCoCo: Dropping In Convolutions for Long Context Compression.** [VITA-Group/LoCoCo](https://github.com/VITA-Group/LoCoCo)
- **Compressing Large Language Models by Streamlining the Unimportant Layer.**
- **Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization.**
- **Efficiently Editing Mixture-of-Experts Models with Compressed Experts.** [yifei-he/Compressed-Experts](https://github.com/yifei-he/Compressed-Experts)
- **Large Language Model Compression via the Nested Activation-Aware Decomposition.**
- **DAC: A Dynamic Attention-aware Approach for Task-Agnostic Prompt Compression.**
- **Compression Laws for Large Language Models.**
- **FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression.**
- **Delta Decompression for MoE-based LLMs Compression.**
- **The Lottery LLM Hypothesis, Rethinking What Abilities Should LLM Compression Preserve?.**
- **Compressing Language Models for Specialized Domains.** [mlsw/domain-compression](https://github.com/mlsw/domain-compression)
- **Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners.**
- **Optimizing Length Compression in Large Reasoning Models.** [zxiangx/LC-R1](https://github.com/zxiangx/LC-R1)
- **METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding.**
- **Beyond Hard and Soft: Hybrid Context Compression for Balancing Local and Global Information Retention.**
- **Breaking the Compression Ceiling: Data-Free Pipeline for Ultra-Efficient Delta Compression.**
- **Can Pruning Improve Reasoning? Revisiting Long-CoT Compression with Capability in Mind for Better Reasoning.**
- **Efficient Dynamic Clustering-Based Document Compression for Retrieval-Augmented-Generation.** [Tsinghua-dhy/EDC-2-RAG](https://github.com/Tsinghua-dhy/EDC-2-RAG)
- **SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression.**
- **OrthoRank: Token Selection via Sink Token Orthogonality for Efficient LLM inference.**
- **When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks.**
- **Thanos: A Block-wise Pruning Algorithm for Efficient Large Language Model Compression.**
- **One Shot vs. Iterative: Rethinking Pruning Strategies for Model Compression.** [janumiko/pruning-benchmark](https://github.com/janumiko/pruning-benchmark)
- **LLM Compression: How Far Can We Go in Balancing Size and Performance?.**
- **DSPC: Dual-Stage Progressive Compression Framework for Efficient Long-Context Reasoning.**
- **Lossless Token Sequence Compression via Meta-Tokens.**
- **Reasoning Efficiently Through Adaptive Chain-of-Thought Compression: A Self-Optimizing Framework.**
- **Simple Context Compression: Mean-Pooling and Multi-Ratio Training.** [lil-lab/simple-context-compression](https://github.com/lil-lab/simple-context-compression)
- **From Long to Short: LLMs Excel at Trimming Own Reasoning Chains.**
- **From Long to Lean: Performance-aware and Adaptive Chain-of-Thought Compression via Multi-round Refinement.**
- **R-Capsule: Compressing High-Level Plans for Efficient Large Language Model Reasoning.**
- **The Thinking Spectrum: An Empirical Study of Tunable Reasoning in LLMs through Model Merging.**
- **LongCodeZip: Compress Long Context for Code Language Models.**
- 
- **The Secret Sauce behind 100K context window in LLMs: all tricks in one place.**
- **The Memory Capacity of Attention.**
- **Generalizing an LLM from 8k to 1M Context using Qwen-Agent.**
- **FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision.**
-
10. Long Video and Image
- **SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding.**
- **Accelerating Vision Transformers with Adaptive Patch Sizes.**
- **ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models.**
- **LongLive: Real-time Interactive Long Video Generation.**
- **SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference.**
- **Video-LLMs with Temporal Visual Screening.**
- **Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering.**
- **MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs.**
- **Atlas: Multi-Scale Attention Improves Long Context Image Modeling.**
- **LongVILA: Scaling Long-Context Visual Language Models for Long Videos.**
- **DreamFactory: Pioneering Multi-Scene Long Video Generation with a Multi-Agent Framework.**
- **Bridging Episodes and Semantics: A Novel Framework for Long-Form Video Understanding.** - Fong Yeh, Min-Hung Chen, Hung-Ting Su, Winston H. Hsu, Shang-Hong Lai._ ECCV 2024 Workshop. [joslefaure/HERMES](https://github.com/joslefaure/HERMES)
- **EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture.** [aigc-apps/EasyAnimate](https://github.com/aigc-apps/EasyAnimate)
- **VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos.** - Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal._ Arxiv 2024.
- **PostDoc: Generating Poster from a Long Multimodal Document Using Deep Submodular Optimization.**
- **Towards Event-oriented Long Video Understanding.** - Rong Wen._ Arxiv 2024. [RUCAIBox/Event-Bench](https://github.com/RUCAIBox/Event-Bench)
- **KeyVideoLLM: Towards Large-scale Video Keyframe Selection.**
- **An End-to-End Speech Summarization Using Large Language Model.**
- **OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding.**
- **Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies.** - Ting Su, Chun-Tong Chao, Ya-Ching Hsu, Xudong Lin, Yulei Niu, Hung-Yi Lee, Winston H. Hsu._ Arxiv 2024. [ander1119/TiM](https://github.com/ander1119/TiM)
- **Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models.**
- **SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation.** - Wei Chang, Lingjie Li, Kevin Lin, Chung-Ching Lin, Jianfeng Wang, Zhengyuan Yang, Yingnian Wu, Lijuan Wang._ Arxiv 2024. [slowfast-vgen/slowfast-vgen](https://github.com/slowfast-vgen/slowfast-vgen)
- **MATE: Meet At The Embedding -- Connecting Images with Long Texts.**
- **mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models.** [X-PLUG/mPLUG-Owl](https://github.com/X-PLUG/mPLUG-Owl)
- **LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation.**
- **VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges.** [bigai-nlco/VideoLLaMB](https://github.com/bigai-nlco/VideoLLaMB)
- **Longer is (Not Necessarily) Stronger: Punctuated Long-Sequence Training for Enhanced Speech Recognition and Translation.**
- **LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture.**
- **VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models.**
- **T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs.**
- **Temporal Preference Optimization for Long-Form Video Understanding.** - Levy._ Arxiv 2025. [ruili33/TPO](https://github.com/ruili33/TPO)
- **Latent Swap Joint Diffusion for Long-Form Audio Generation.** [swapforward.github.io](https://swapforward.github.io/)
- **A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models.** - Sheng Zeng, Yunheng Li, Qilong Wang, Peng-Tao Jiang, Zuxuan Wu, Ming-Ming Cheng, Qibin Hou._ Arxiv 2025. [HVision-NKU/GlimpsePrune](https://github.com/HVision-NKU/GlimpsePrune)
- **Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference.** [wkfdb/Free-MoRef](https://github.com/wkfdb/Free-MoRef)
- **E-VRAG: Enhancing Long Video Understanding with Resource-Efficient Retrieval Augmented Generation.**
- **HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models.**
- **Representation Shift: Unifying Token Compression with FlashAttention.** [mlvlab/Representation-Shift](https://github.com/mlvlab/Representation-Shift)
- **Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding.**
- **Owl-1: Omni World Model for Consistent Long Video Generation.** [huang-yh/Owl](https://github.com/huang-yh/Owl)
- **DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models.** [KD-TAO/DyCoke](https://github.com/KD-TAO/DyCoke)
- **LightVLM: Acceleraing Large Multimodal Models with Pyramid Token Merging and KV Cache Compression.**
- **Eye Gaze Tells You Where to Compute: Gaze-Driven Efficient VLMs.**
- **ADMIRE: ADaptive method to enhance Multiple Image REsolutions in text-rich multi-image understanding.** [Alipay-Med/admire](https://github.com/Alipay-Med/admire)
- **Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors.**
- **Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding.**
- **Multimodal Long Video Modeling Based on Temporal Dynamic Context.** [Hoar012/TDC-Video](https://github.com/Hoar012/TDC-Video)
- **MALT Diffusion: Memory-Augmented Latent Transformers for Any-Length Video Generation.**
- **ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding.** [SCZwangxiao/video-ReTaKe](https://github.com/SCZwangxiao/video-ReTaKe)
- **LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token.**
- **VCA: Video Curious Agent for Long Video Understanding.**
- **Enhancing Multi-Text Long Video Generation Consistency without Tuning: Time-Frequency Analysis, Prompt Alignment, and Theory.**
- **ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos.**
- **Adaptive Keyframe Sampling for Long Video Understanding.**
- **VideoRoPE: What Makes for Good Video Rotary Position Embedding?.**
- **InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding.**
- **MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference.**
- **EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models.**
- **DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding.**
- **Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens.** [UMass-Embodied-AGI/Mirage](https://github.com/UMass-Embodied-AGI/Mirage)
- **Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs.**
- **Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing.**
- **AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding.** [SCZwangxiao/video-FlexReduc](https://github.com/SCZwangxiao/video-FlexReduc)
- **DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding.**
- **Growing a Twig to Accelerate Large Vision-Language Models.**
- **VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization.**
- **AdaptInfer: Adaptive Token Pruning for Vision-Language Model Inference with Dynamical Text Guidance.**
- **Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models.**
- **Episodic Memory Representation for Long-form Video Understanding.**
- **TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding.** [Hui-design/TSPO](https://github.com/Hui-design/TSPO)
- **Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning.**
- **Dense Video Understanding with Gated Residual Tokenization.**
- **MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs.**
- **FOCUS: Efficient Keyframe Selection for Long Video Understanding.** [NUS-HPC-AI-Lab/FOCUS](https://github.com/NUS-HPC-AI-Lab/FOCUS)
- **FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding.**
- **SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodel LLMs.**
- **StreamingTOM: Streaming Token Compression for Efficient Video Understanding.**
- **LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding.**
- **VisiPruner: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs.** [EIT-NLP/VisiPruner](https://github.com/EIT-NLP/VisiPruner)
- **VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs.** - Jun Zha._ Arxiv 2025. [JulietChoo/VisionSelector](https://github.com/JulietChoo/VisionSelector)
- **MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding.** - Hao Wu, Enmin Zhou, Junxiao Shen._ Arxiv 2025.
- **Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow.**
- **When Thinking Drifts: Evidential Grounding for Robust Video Reasoning.**
- **Video-in-the-Loop: Span-Grounded Long Video QA with Interleaved Reasoning.**
- **A.I.R.: Enabling Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering.**
- **VideoNSA: Native Sparse Attention Scales Video Understanding.** [Espere-1119-Song/VideoNSA](https://github.com/Espere-1119-Song/VideoNSA)
- **From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding.**
- **Video Panels for Long Video Understanding.**
- **Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding.**
- **Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models.**
- **StreamForest: Efficient Online Video Understanding with Persistent Event Memory.**
- **FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning.** - Wei Chang, Hang Wu, Yujun Cai._ Arxiv 2025.
- **SPIKE-RL: Video-LLMs meet Bayesian Surprise.** [GitHub](https://github.com/sahithyaravi/SPIKE-RL)
- **VTPerception-R1: Enhancing Multimodal Reasoning via Explicit Visual and Textual Perceptual Grounding.** [GitHub](https://github.com/yizhuoDi/VTPerceprion-R1)
- **LOVE-R1: Advancing Long Video Understanding with an Adaptive Zoom-in Mechanism via Multi-Step Reasoning.** - Ming Li, Xihan Wei, Xiaohua Xie, Wei-Shi Zheng._ Arxiv 2025.
- **FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting.**
- **D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition.** [GitHub](https://github.com/hukcc/D-CoDe)
- **StreamingVLM: Real-Time Understanding for Infinite Video Streams.** [GitHub](https://github.com/mit-han-lab/streaming-vlm)
- **Variation-aware Vision Token Dropping for Faster Large Vision-Language Models.** [GitHub](https://github.com/xuyang-liu16/V2Drop)
- **Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding.** [GitHub](https://github.com/xiaoqian-shen/Vgent)
-
6. Long Term Memory
- **M+: Extending MemoryLLM with Scalable Long-Term Memory.**
- **Can Memory-Augmented Language Models Generalize on Reasoning-in-a-Haystack Tasks?.** - Yun Ko, Sihui Dai, Georgios Kollias, Subhajit Chaudhury, Aurelie Lozano._ Arxiv 2025.
- **LM2: Large Memory Models.** [GitHub](https://github.com/convergence-ai/lm2)
- **Position: Episodic Memory is the Missing Piece for Long-Term LLM Agents.**
- **MEMORYLLM: Towards Self-Updatable Large Language Models.** [GitHub](https://github.com/wangyu-ustc/MemoryLLM)
- **EpMAN: Episodic Memory AttentioN for Generalizing to Longer Contexts.**
- **Unleashing Infinite-Length Input Capacity for Large-scale Language Models with Self-Controlled Memory System.**
- **MemoryBank: Enhancing Large Language Models with Long-Term Memory.** [GitHub](https://github.com/zhongwanjun/MemoryBank-SiliconFriend)
- **Improve Long-term Memory Learning Through Rescaling the Error Temporally.**
- **CreDes: Causal Reasoning Enhancement and Dual-End Searching for Solving Long-Range Reasoning Problems using LLMs.**
- **Commonsense-augmented Memory Construction and Management in Long-term Conversations via Context-aware Persona Refinement.** - iunn Ong, Seoyeon Kim, Dongha Lee, Jinyoung Yeo._ Arxiv 2024.
- **Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models.**
- **Empowering Working Memory for Large Language Model Agents.**
- **Evolving Large Language Model Assistant with Long-Term Conditional Memory.**
- **StreamingDialogue: Prolonged Dialogue Learning via Long Context Compression with Minimal Losses.** - Nan Li, Quan Tu, Cunli Mao, Zhengtao Yu, Ji-Rong Wen, Rui Yan._ Arxiv 2024.
- **A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts.** - Huei Lee, Xinyun Chen, Hiroki Furuta, John Canny, Ian Fischer._ Arxiv 2024.
- **Steering Conversational Large Language Models for Long Emotional Support Conversations.**
- **SirLLM: Streaming Infinite Retentive LLM.**
- **Towards LifeSpan Cognitive Systems.**
- **MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent.** - Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, Hao Zhou._ Arxiv 2025. [GitHub](https://github.com/BytedTsinghua-SIA/MemAgent)
- **CarMem: Enhancing Long-Term Memory in LLM Voice Assistants through Category-Bounding.**
- **SPAR: Personalized Content-Based Recommendation via Long Engagement Attention.** - Mageed, Sinong Wang, Rong Jin, Sem Park, Ning Yao, Bo Long._ Arxiv 2024.
- **Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations.**
- **Prompts As Programs: A Structure-Aware Approach to Efficient Compile-Time Prompt Optimization.**
- **HMT: Hierarchical Memory Transformer for Long Context Language Processing.** [GitHub](https://github.com/OswaldHe/HMT-pytorch)
- **Toward Conversational Agents with Context and Time Sensitive Long-term Memory.**
- **Position Debiasing Fine-Tuning for Causal Perception in Long-Term Dialogue.** - Ling Mao, Wenfeng Xie, Dangyang Chen._ Arxiv 2024.
- **Enhancing Long-Term Memory using Hierarchical Aggregate Tree for Retrieval Augmented Generation.**
- **HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model.**
- **InfiniteICL: Breaking the Limit of Context Window Size via Long Short-term Memory Transformation.**
- **Cognitive Memory in Large Language Models.**
-
5. Length Extrapolation
- **Scalable-Softmax Is Superior for Attention.**
- **Rope to Nope and Back Again: A New Hybrid Attention Strategy.**
- **A Training-Free Length Extrapolation Approach for LLMs: Greedy Attention Logit Interpolation (GALI).**
- **LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation.**
- **Unveiling Simplicities of Attention: Adaptive Long-Context Head Identification.** - Paz, Kartik Ahuja._ Arxiv 2025.
- **The Rotary Position Embedding May Cause Dimension Inefficiency in Attention Heads for Long-Distance Retrieval.** - Rui Chiang, Dani Yogatama._ Arxiv 2025.
- **RoFormer: Enhanced Transformer with Rotary Position Embedding.**
- **Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.**
- **KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation.** - Chung Chi, Ting-Han Fan, Peter J. Ramadge, Alexander I. Rudnicky._ Arxiv 2022.
- **Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis.** - Chung Chi, Ting-Han Fan, Alexander I. Rudnicky, Peter J. Ramadge._ ACL 2023.
- **Focused Transformer: Contrastive Training for Context Scaling.**
- **Exploring Transformer Extrapolation.**
- **A Length-Extrapolatable Transformer.**
- **The Impact of Positional Encoding on Length Generalization in Transformers.** [GitHub](https://github.com/McGill-NLP/length-generalization)
- **LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models.** [GitHub](https://github.com/kyegomez/LM-Infinite)
- **Extending Context Window of Large Language Models via Positional Interpolation.**
- **YaRN: Efficient Context Window Extension of Large Language Models.**
- **Scaling Laws of RoPE-based Extrapolation.**
- **PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training.** [GitHub](https://github.com/dwzhu-pku/PoSE)
- **LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models.** [GitHub](https://github.com/dvlab-research/LongLoRA)
- **Attention Alignment and Flexible Positional Embeddings Improve Transformer Length Extrapolation.** - Chung Chi, Ting-Han Fan, Alexander I. Rudnicky._ Arxiv 2023. [GitHub](https://github.com/chijames/Attention-Alignment-Transformer-Length-Extrapolation)
- **CoCA: Fusing position embedding with Collinear Constrained Attention for fine-tuning free context window extending.** [GitHub](https://github.com/codefuse-ai/Collinear-Constrained-Attention)
- **Structured Packing in LLM Training Improves Long Context Utilization.**
- **E^2-LLM: Efficient and Extreme Length Extension of Large Language Models.**
- **LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens.**
- **CLEX: Continuous Length Extrapolation for Large Language Models.** [GitHub](https://github.com/DAMO-NLP-SG/CLEX)
- **Resonance RoPE: Improving Context Length Generalization of Large Language Models.**
- **Can't Remember Details in Long Documents? You Need Some R&R.** [GitHub](https://github.com/casetext/r-and-r)
- **Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding.** [GitHub](https://github.com/VITA-Group/Ms-PoE)
- **InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory.**
- **Effective Long-Context Scaling of Foundation Models.**
- **Fewer Truncations Improve Language Modeling.**
- **Naive Bayes-based Context Extension for Large Language Models.** [GitHub](https://github.com/amurtadha/NBCE-master)
- **In-Context Pretraining: Language Modeling Beyond Document Boundaries.** - tau Yih, Mike Lewis._ ICLR 2024 Spotlight. [GitHub](https://github.com/swj0419/in-context-pretraining)
- **Long Context Alignment with Short Instructions and Synthesized Positions.**
- **Length Generalization of Causal Transformers without Position Encoding.**
- **Extending Llama-3's Context Ten-Fold Overnight.**
- **xLSTM: Extended Long Short-Term Memory.**
- **3D-RPE: Enhancing Long-Context Modeling Through 3D Rotary Position Encoding.**
- **LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models.** - Kiong Ng, Zhiwei Jiang, Bryan Hooi._ Arxiv 2024. [GitHub](https://github.com/zhiyuanhubj/LongRecipe)
- **ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities.**
- **E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning.**
- **Human-like Episodic Memory for Infinite Context LLMs.** - Ammar, Jun Wang._ Arxiv 2024.
- **Efficient Long-range Language Modeling with Self-supervised Causal Retrieval.**
- **A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts.**
- **Differential Transformer.**
- **Why Does the Effective Context Length of LLMs Fall Short?.**
- **LOGO -- Long cOntext aliGnment via efficient preference Optimization.**
- **Selecting Influential Samples for Long Context Alignment via Homologous Models' Guidance and Contextual Awareness Measurement.**
- **Two are better than one: Context window extension with multi-grained self-injection.**
- **LongReward: Improving Long-context Large Language Models with AI Feedback.**
- **HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation.**
- **Large Language Models Can Self-Improve in Long-context Reasoning.**
- **Circuit Complexity Bounds for RoPE-based Transformer Architecture.**
- **Transformers Can Do Arithmetic with the Right Embeddings.**
- **What is Wrong with Perplexity for Long-context Language Modeling?.** [GitHub](https://github.com/PKU-ML/LongPPL)
- **Adjoint sharding for very long context training of state space models.**
- **An Efficient Recipe for Long Context Extension via Middle-Focused Positional Encoding.** [GitHub](https://github.com/bigai-nlco/cream)
- **SWAN-GPT: An Efficient and Scalable Approach for Long-Context Language Modeling.** - Ping Hsieh, Shantanu Acharya, Somshubra Majumdar, Fei Jia, Samuel Kriman, Simeng Sun, Dima Rekesh, Boris Ginsburg._ Arxiv 2025.
- **DINT Transformer.**
- **Causal Attention with Lookahead Keys.**
- **Positional Encoding via Token-Aware Phase Attention.**
- **Data Engineering for Scaling Language Models to 128K Context.** [GitHub](https://github.com/FranxYao/Long-Context-Data-Engineering)
- **Transformers Can Achieve Length Generalization But Not Robustly.**
- **Long-Context Language Modeling with Parallel Context Encoding.** [GitHub](https://github.com/princeton-nlp/CEPE)
- **Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference.**
- **HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models.**
- **Explicitly Encoding Structural Symmetry is Key to Length Generalization in Arithmetic Tasks.**
- **Mixture of In-Context Experts Enhance LLMs' Long Context Awareness.**
- **Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count.** [GitHub](https://github.com/HanseulJo/position-coupling)
- **Breaking the Stage Barrier: A Novel Single-Stage Approach to Long Context Extension for Large Language Models.**
- **Attention Entropy is a Key Factor: An Analysis of Parallel Context Encoding with Full-attention-based Pre-trained Language Models.**
- **DCIS: Efficient Length Extrapolation of LLMs via Divide-and-Conquer Scaling Factor Search.**
- **LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data.** [GitHub](https://github.com/IDEA-FinAI/LongFaith)
- **Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers.**
- **Scaling Instruction-Tuned LLMs to Million-Token Contexts via Hierarchical Synthetic Data Generation.**
- **Effective Length Extrapolation via Dimension-Wise Positional Embeddings Manipulation.**
- **Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs.** - Woo Ha, Jinwoo Shin._ ICLR 2024. [GitHub](https://github.com/alinlab/HOMER)
- **SELF: Self-Extend the Context Length With Logistic Growth Function.** [GitHub](https://github.com/alexeipc/SELF-LLM)
- **LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs.**
- **Layer-Specific Scaling of Positional Encodings for Superior Long-Context Modeling.**
- **LeMo: Enabling LEss Token Involvement for MOre Context Fine-tuning.**
- **NExtLong: Toward Effective Long-Context Training without Long Documents.**
- **SEAL: Scaling to Emphasize Attention for Long-Context Retrieval.** - gyu Jin, Younghyun Cho, Eunhyeok Park._ Arxiv 2025.
- **Information Entropy Invariance: Enhancing Length Extrapolation in Attention Mechanisms.** [GitHub](https://github.com/HT-NEKO/InfoScale)
- **Forgetting Transformer: Softmax Attention with a Forget Gate.** [GitHub](https://github.com/zhixuan-lin/forgetting-transformer)
- **Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning.** [GitHub](https://github.com/NJUNLP/context-synthesis)
- **WildLong: Synthesizing Realistic Long-Context Instruction Data at Scale.** - Qing Chen, Wei Lu, Furu Wei._ Arxiv 2025.
- **Randomized Positional Encodings Boost Length Generalization of Transformers.** - Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, Joel Veness._ ACL 2023. [GitHub](https://github.com/google-deepmind/randomized_positional_encodings)
- **LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning.** - Yuan Chang, Huiyuan Chen, Xia Hu._ Arxiv 2024.
- **Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache.**
- **Extending LLMs' Context Window with 100 Samples.** [GitHub](https://github.com/GAIR-NLP/Entropy-ABF)
- **With Greater Text Comes Greater Necessity: Inference-Time Training Helps Long Text Generation.** [GitHub](https://github.com/TemporaryLoRA/Temp-LoRA)
- **Two Stones Hit One Bird: Bilevel Positional Encoding for Better Length Extrapolation.**
- **Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens.** [GitHub](https://github.com/liujch1998/infini-gram)
- **DAPE: Data-Adaptive Positional Encoding for Length Extrapolation.** [GitHub](https://github.com/chuanyang-Zheng/DAPE)
- **Contextual Position Encoding: Learning to Count What's Important.**
- **Quest: Query-centric Data Synthesis Approach for Long-context Scaling of Large Language Model.**
- **Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure.** [GitHub](https://github.com/HanseulJo/position-coupling)
- **LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models.**
- **Scaling Granite Code Models to 128K Context.** - Hong Dang, Yan Koyfman, Atin Sood, Rogerio Feris, Nirmit Desai, David D. Cox, Ruchir Puri, Rameswar Panda._ Arxiv 2024. [GitHub](https://github.com/ibm-granite/granite-code-models)
- **Efficient Solutions For An Intriguing Failure of LLMs: Long Context Window Does Not Mean LLMs Can Analyze Long Sequences Flawlessly.**
- **FocusLLM: Scaling LLM's Context by Parallel Decoding.**
- **Untie the Knots: An Efficient Data Augmentation Strategy for Long-Context Pre-Training in Language Models.** [GitHub](https://github.com/rgtjf/Untie-the-Knots)
- **PEAR: Position-Embedding-Agnostic Attention Re-weighting Enhances Retrieval-Augmented Generation with Zero Inference Overhead.** [GitHub](https://github.com/TTArch/PEAR-RAG)
- **Extending Context Window of Large Language Models from a Distributional Perspective.**
- **How to Train Long-Context Language Models (Effectively).** [GitHub](https://github.com/princeton-nlp/ProLong)
- **DAPE V2: Process Attention Score as Feature Map for Length Extrapolation.** [GitHub](https://github.com/chuanyang-Zheng/DAPE)
- **LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization.** [GitHub](https://github.com/DAMO-NLP-SG/LongPO)
- **ParallelComp: Parallel Long-Context Compressor for Length Extrapolation.**
- **LongAttn: Selecting Long-context Training Data via Token-level Attention.** [GitHub](https://github.com/Lyun0912-wu/LongAttn)
- **Sliding Window Attention Training for Efficient Large Language Models.** [Code](https://anonymous.4open.science/r/SWAT-attention/README.md)
- **LongRoPE2: Near-Lossless LLM Context Window Scaling.**
- **ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs.**
- **Pause-Tuning for Long-Context Comprehension: A Lightweight Approach to LLM Attention Recalibration.** - PauseTokens-7357)
- **Arctic Long Sequence Training: Scalable And Efficient Training For Multi-Million Token Sequences.** - parallelism/README.md)
- **Long-Context Generalization with Sparse Attention.** [GitHub](https://github.com/deep-spin/bigfish)
- **Token Weighting for Long-Range Language Modeling.** [GitHub](https://github.com/UKPLab/naacl2025-token-weighting)
- **From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models.** [Project page](https://ultralong.github.io/)
- **Long-Short Alignment for Effective Long-Context Modeling in LLMs.** [GitHub](https://github.com/PKU-ML/LongShortAlignment)
- **Mitigating Posterior Salience Attenuation in Long-Context LLMs with Positional Contrastive Decoding.**
- **Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation.**
- **Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models.** [GitHub](https://github.com/OpenNLPLab/lightning-attention)
-
4. State Space Models
- **S2TX: Cross-Attention Multi-Scale State-Space Transformer for Time Series Forecasting.**
- **SSMLoRA: Enhancing Low-Rank Adaptation with State Space Model.**
- **CacheMamba: Popularity Prediction for Mobile Edge Caching Networks via Selective State Spaces.** - Meybodi, Arash Mohammadi._ Arxiv 2025.
- **Mamba: Linear-Time Sequence Modeling with Selective State Spaces.** [GitHub](https://github.com/state-spaces/mamba)
- **MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts.**
- **MambaByte: Token-free Selective State Space Model.**
- **LOCOST: State-Space Models for Long Document Abstractive Summarization.**
- **State Space Models as Foundation Models: A Control Theoretic Overview.**
- **Jamba: A Hybrid Transformer-Mamba Language Model.** - Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avashalom Manevich, Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, Yoav Shoham._ Arxiv 2024.
- **Robustifying State-space Models for Long Sequences via Approximate Diagonalization.**
- **Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality.** [GitHub](https://github.com/state-spaces/mamba)
- **Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling.**
- **MambaForGCN: Enhancing Long-Range Dependency with State Space Model and Kolmogorov-Arnold Networks for Aspect-Based Sentiment Analysis.**
- **Discrete Diffusion Language Model for Long Text Summarization.**
- **ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2.** [GitHub](https://github.com/WenjunHuang94/ML-Mamba)
- **Jamba-1.5: Hybrid Transformer-Mamba Models at Scale.**
- **SpikingSSMs: Learning Long Sequences with Sparse and Parallel Spiking State Space Models.**
- **ReMamba: Equip Mamba with Effective Long-Sequence Modeling.**
- **Stuffed Mamba: State Collapse and State Capacity of RNN-Based Long-Context Modeling.** [GitHub](https://github.com/thunlp/stuffed-mamba)
- **Taipan: Efficient and Expressive State Space Language Models with Selective Attention.**
- **Rethinking Token Reduction for State Space Models.**
- **B'MOJO: Hybrid State Space Realizations of Foundation Models with Eidetic and Fading Memory.**
- **Attamba: Attending To Multi-Token States.** [GitHub](https://github.com/abdelfattah-lab/attamba)
- **Zamba: A Compact 7B SSM Hybrid Model.**
- **Dynamic Chunking for End-to-End Hierarchical Sequence Modeling.**
- **Zebra-Llama: Towards Extremely Efficient Hybrid Models.**
- **M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models.** - Ding Li, Daniele Paliotta, Daniel Ritter, Alexander M. Rush, Tri Dao._ Arxiv 2025. [GitHub](https://github.com/jxiw/M1)
- **LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement.** [GitHub](https://github.com/GATECH-EIC/LongMamba)
- **Sparsified State-Space Models are Efficient Highway Networks.**
- **Gated Delta Networks: Improving Mamba2 with Delta Rule.**
- **RWKV-X: A Linear Complexity Hybrid Language Model.** [GitHub](https://github.com/howard-hou/RWKV-X)
- **Don't Pay Attention.**
- **A Systematic Analysis of Hybrid Linear Attention.** - Jie Zhu, Steven Abreu, Yong Shan, Taylor Kergan, Yuqi Pan, Yuhong Chou, Zheng Li, Ge Zhang, Wenhao Huang, Jason Eshraghian._ Arxiv 2025.
- **Mamba Modulation: On the Length Generalization of Mamba.**
- **Gated Associative Memory: A Parallel O(N) Architecture for Efficient Sequence Modeling.**
-
7. RAG and ICL
- **CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation.** - Hui Lee, Eunhwan Park, Donghoon Han, Seung-Hoon Na._ Arxiv 2025.
- **Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning.**
- **Lost in the Passage: Passage-level In-context Learning Does Not Necessarily Need a "Passage".**
- **ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation.**
- **Focus Directions Make Your Language Models Pay More Attention to Relevant Contexts.**
- **FReM: A Flexible Reasoning Mechanism for Balancing Quick and Slow Thinking in Long-Context Question Answering.** - Fai Wong._ Arxiv 2025.
- **Feature-Adaptive and Data-Scalable In-Context Learning.** [GitHub](https://github.com/jiahaozhenbang/FADS-ICL)
- **KG-RAG: Bridging the Gap Between Knowledge and Creativity.** [GitHub](https://github.com/dsanmart/KG-RAG)
- **HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models.** [GitHub](https://github.com/OSU-NLP-Group/HippoRAG)
- **Implicit In-context Learning.**
- **Are Long-LLMs A Necessity For Long-Context Tasks?.**
- **Walking Down the Memory Maze: Beyond Context Limit through Interactive Reading.**
- **Attendre: Wait To Attend By Retrieval With Evicted Queries in Memory-Based Transformers for Long Context Processing.**
- **BGE Landmark Embedding: A Chunking-Free Embedding Method For Retrieval Augmented Long-Context Large Language Models.**
- **Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity.** [GitHub](https://github.com/starsuzi/Adaptive-RAG)
- **RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation.** - Min Chan, Chunpu Xu, Ruibin Yuan, Hongyin Luo, Wei Xue, Yike Guo, Jie Fu._ Arxiv 2024. [GitHub](https://github.com/chanchimin/RQ-RAG)
- **Improving Retrieval Augmented Open-Domain Question-Answering with Vectorized Contexts.**
- **Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation.**
- **Multi-view Content-aware Indexing for Long Document Retrieval.**
- **Retrieval Head Mechanistically Explains Long-Context Factuality.**
- **FlashBack: Efficient Retrieval-Augmented Language Modeling for Long Context Inference.**
- **MemoRAG: Moving towards Next-Gen RAG Via Memory-Inspired Knowledge Discovery.**
- **In Defense of RAG in the Era of Long-Context Language Models.**
- **You Only Use Reactive Attention Slice For Long Context Retrieval.**
- **SMART-RAG: Selection using Determinantal Matrices for Augmented Retrieval.**
- **Lighter And Better: Towards Flexible Context Adaptation For Retrieval Augmented Generation.**
- **Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection.** - Chen Gu, Caitlin Sikora, Ho Ko, Yinxiao Liu, Chu-Cheng Lin, Lei Shu, Liangchen Luo, Lei Meng, Bang Liu, Jindong Chen._ Arxiv 2024.
- **Is In-Context Learning Sufficient for Instruction Following in LLMs?.** [GitHub](https://github.com/tml-epfl/icl-alignment)
- **FragRel: Exploiting Fragment-level Relations in the External Memory of Large Language Models.**
- **Multi-Head RAG: Solving Multi-Aspect Problems with LLMs.**
- **Demonstration Notebook: Finding the Most Suited In-Context Learning Example from Interactions.**
- **Retrieval Meets Reasoning: Dynamic In-Context Editing for Long-Text Understanding.**
- **FoRAG: Factuality-optimized Retrieval Augmented Generation for Web-enhanced Long-form Question Answering.** [Hugging Face](https://huggingface.co/forag)
- **LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs.** [GitHub](https://github.com/TIGER-AI-Lab/LongRAG)
- **Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning.**
- **From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data.**
- **Memory3: Language Modeling with Explicit Memory.**
- **Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting.** - Yu Lee, Tomas Pfister._ Arxiv 2024.
- **Writing in the Margins: Better Inference Pattern for Long Context Retrieval.** [GitHub](https://github.com/writer/writing-in-the-margins)
- **Retrieve, Summarize, Plan: Advancing Multi-hop Question Answering with an Iterative Approach.**
- **Large Language Models Know What Makes Exemplary Contexts.** [GitHub](https://github.com/ruyue0001/RL-ICL)
- **MemLong: Memory-Augmented Retrieval for Long Text Modeling.**
- **LLM×MapReduce: Simplified Long-Sequence Processing using Large Language Models.**
- **Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models.**
- **SEGMENT+: Long Text Processing with Short-Context Language Models.** [GitHub](https://github.com/WeiShi-9/segmentplus)
- **Graph of Records: Boosting Retrieval Augmented Generation for Long-context Summarization with Graphs.** [GitHub](https://github.com/ulab-uiuc/GoR)
- **ChuLo: Chunk-Level Key Information Representation for Long Document Processing.**
- **TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text.**
- **Enhancing Long Context Performance in LLMs Through Inner Loop Query Mechanism.**
- **LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering.**
- **Reducing Distraction in Long-Context Language Models by Focused Learning.**
- **Sliding Windows Are Not the End: Exploring Full Ranking with Long-Context Large Language Models.**
- **Revisiting In-Context Learning with Long Context Language Models.**
- **Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models.**
- **Dynamic Chunking and Selection for Reading Comprehension of Ultra-Long Context in Large Language Models.** [GitHub](https://github.com/ECNU-Text-Computing/DCS)
- **R^2AG: Incorporating Retrieval Information into Retrieval Augmented Generation.**
- **Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation.**
- **OkraLong: A Flexible Retrieval-Augmented Framework for Long-Text Query Processing.**
- **MuDAF: Long-Context Multi-Document Attention Focusing through Contrastive Learning on Attention Heads.**
- **Can Few-shot Work in Long-Context? Recycling the Context to Generate Demonstrations.**
- **Making Long-Context Language Models Better Multi-Hop Reasoners.** [GitHub](https://github.com/LaVi-Lab/LongContextReasoner)
- **RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation.** [GitHub](https://github.com/amazon-science/RAGChecker)
- **Bridging Context Gaps: Leveraging Coreference Resolution for Long Contextual Understanding.**
- **ALR2: A Retrieve-then-Reason Framework for Long-context Question Answering.**
- **Inference Scaling for Long-Context Retrieval Augmented Generation.**
- **GARLIC: LLM-Guided Dynamic Progress Control with Hierarchical Weighted Graph for Long Document QA.**
- **Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG.**
- **Efficient Many-Shot In-Context Learning with Dynamic Block-Sparse Attention.** - Jou Li, Yilin Zhang, Graham Neubig, Amanda Bertsch._ Arxiv 2025. [GitHub](https://github.com/millix19/dbsa)
- **Long Context Modeling with Ranked Memory-Augmented Retrieval.**
- **Tuning LLMs by RAG Principles: Towards LLM-native Memory.**
- **Conflict-Aware Soft Prompting for Retrieval-Augmented Generation.**
-
11. Benchmark and Evaluation
- **LR2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems.** [Hugging Face](https://huggingface.co/spaces/UltraRonin/LR2Bench)
- **UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios.**
- **LOT: A Story-Centric Benchmark for Evaluating Chinese Long Text Understanding and Generation.** [GitHub](https://github.com/thu-coai/LOT-LongLM)
- **U-NIAH: Unified RAG and LLM Evaluation for Long Context Needle-In-A-Haystack.** [GitHub](https://github.com/Tongji-KGLLM/U-NIAH)
- **One ruler to measure them all: Benchmarking multilingual long-context language models.**
- **LUQ: Long-text Uncertainty Quantification for LLMs.**
- **Long-context LLMs Struggle with Long In-context Learning.** [GitHub](https://github.com/TIGER-AI-Lab/LongICLBench)
- **CLAPNQ: Cohesive Long-form Answers from Passages in Natural Questions for RAG systems.**
- **XL2Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies.** [GitHub](https://github.com/nuaa-nlp/XL2Bench)
- **SCROLLS: Standardized CompaRison Over Long Language Sequences.** [GitHub](https://github.com/tau-nlp/scrolls)
- **MuLD: The Multitask Long Document Benchmark.**
- **Lost in the Middle: How Language Models Use Long Contexts.** [GitHub](https://github.com/nelson-liu/lost-in-the-middle)
- **L-Eval: Instituting Standardized Evaluation for Long Context Language Models.**
- **LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding.**
- **Content Reduction, Surprisal and Information Density Estimation for Long Documents.**
- **The Impact of Reasoning Step Length on Large Language Models.**
- **LongHealth: A Question Answering Benchmark with Long Clinical Documents.** - Baptiste Excoffier, Matthieu Ortala, Alexander Löser, Hugo JWL. Aerts, Jakob Nikolas Kather, Daniel Truhn, Keno Bressem._ Arxiv 2024.
- **∞Bench: Extending Long Context Evaluation Beyond 100K Tokens.**
- **DocFinQA: A Long-Context Financial Reasoning Dataset.** _[…] Kedziorski, Viet Dac Lai, Chris Tanner._ Arxiv 2024.
- **Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models.** [GitHub](https://github.com/alonj/Same-Task-More-Tokens)
- **Evaluating Very Long-Term Conversational Memory of LLM Agents.** _[…] Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, Yuwei Fang._ Arxiv 2024. [GitHub](https://github.com/snap-research/LoCoMo)
- **LongFin: A Multimodal Document Understanding Model for Long Financial Domain Documents.**
- **PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models.**
- **Long-form evaluation of model editing.**
- **In Search of Needles in a 10M Haystack: Recurrent Memory Finds What LLMs Miss.**
- **Needle in a haystack - pressure testing llms.**
- **In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss.**
- **LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K.**
- **Language Models as Science Tutors.** _[…] Jie Zhu, Zhiyong Jason Ren, Sanjeev Arora, Danqi Chen._ Arxiv 2024. [GitHub](https://github.com/princeton-nlp/LM-Science-Tutor)
- **Counting-Stars: A Simple, Efficient, and Reasonable Strategy for Evaluating Long-Context Large Language Models.** [GitHub](https://github.com/nick7nlp/Counting-Stars)
- **NovelQA: A Benchmark for Long-Range Novel Question Answering.**
- **Long-form factuality in large language models.** [GitHub](https://github.com/google-deepmind/long-form-factuality)
- **CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models.**
- **LongEmbed: Extending Embedding Models for Long Context Retrieval.** [GitHub](https://github.com/dwzhu-pku/LongEmbed)
- **Make Your LLM Fully Utilize the Context.** _[…] Guang Lou._ Arxiv 2024. [GitHub](https://github.com/microsoft/FILM)
- **Many-shot Jailbreaking.**
- **Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors.**
- **S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models.**
- **In-Context Learning with Long-Context Models: An In-Depth Exploration.** [GitHub](https://github.com/abertsch72/long-context-icl)
- **Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks.** [GitHub](https://github.com/open-compass/Ada-LEval)
- **RULER: What's the Real Context Size of Your Long-Context Language Models?.** _[…] Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Boris Ginsburg._ Arxiv 2024. [GitHub](https://github.com/hsiehjackson/RULER)
- **Language Models Need Inductive Biases to Count Inductively.**
- **BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack.**
- **Can Many-Shot In-Context Learning Help Long-Context LLM Judges? See More, Judge Better!.**
- **What Kinds of Tokens Benefit from Distant Text? An Analysis on Long Context Language Modeling.**
- **Analyzing Temporal Complex Events with Large Language Models? A Benchmark towards Temporal, Long Context Understanding.** _[…] Seng Chua._ Arxiv 2024.
- **DOLOMITES: Domain-Specific Long-Form Methodical Tasks.**
- **Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis.**
- **FinTextQA: A Dataset for Long-form Financial Question Answering.**
- **A Multi-Perspective Analysis of Memorization in Large Language Models.**
- **Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective.**
- **Long Context is Not Long at All: A Prospector of Long-Dependency Data for Large Language Models.**
- **Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?.** _[…] Wei Chang, Kelvin Guu._ Arxiv 2024. [GitHub](https://github.com/google-deepmind/loft)
- **LongIns: A Challenging Long-context Instruction-based Exam for LLMs.**
- **Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA.**
- **Entity-Level Sentiment: More than the Sum of Its Parts.**
- **Evaluating Language Model Context Windows: A "Working Memory" Test and Inference-time Correction.**
- **Insights into LLM Long-Context Failures: When Transformers Know but Don't Tell.** [GitHub](https://github.com/TaiMingLu/know-dont-tell)
- **RAG vs. Long Context: Examining Frontier Large Language Models for Environmental Review Document Comprehension.**
- **Attribute or Abstain: Large Language Models as Long Document Assistants.** [GitHub](https://github.com/UKPLab/arxiv2024-attribute-or-abstain)
- **How Well Can a Long Sequence Model Model Long Sequences? Comparing Architechtural Inductive Biases on Long-Context Abilities.**
- **DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems.** [GitHub](https://github.com/Anni-Zou/DocBench)
- **USDC: A Dataset of User Stance and Dogmatism in Long Conversations.**
- **VERISCORE: Evaluating the factuality of verifiable claims in long-form text generation.** [GitHub](https://github.com/Yixiao-Song/VeriScore)
- **KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches.** _[…] Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu._ Arxiv 2024. [GitHub](https://github.com/henryzhongsc/longctx_bench)
- **ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models.**
- **Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP.**
- **Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems.** _[…] Sheng Wu._ Arxiv 2024. [GitHub](https://github.com/salesforce/summary-of-a-haystack)
- **Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack.** [GitHub](https://github.com/INK-USC/Lifelong-ICL)
- **NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?.** [GitHub](https://github.com/open-compass/opencompass)
- **LongLaMP: A Benchmark for Personalized Long-form Text Generation.** [Link](https://longlamp-benchmark.github.io/)
- **RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering.** [GitHub](https://github.com/awslabs/rag-qa-arena)
- **Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models.** _[…] S Dovonon, Jean Kaddour, Pasquale Minervini._ ICML 2024 TF2M workshop.
- **WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries.**
- **Long Input Benchmark for Russian Analysis.**
- **HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models.** [GitHub](https://github.com/Tintri/hello-bench)
- **Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach.**
- **Evaluating Long Range Dependency Handling in Code Generation Models using Multi-Step Key Retrieval.** [GitHub](https://github.com/apple/ml-key-retrieval-code-tasks)
- **Retrieval Or Holistic Understanding? Dolce: Differentiate Our Long Context Evaluation Tasks.**
- **A Controlled Study on Long Context Extension and Generalization in LLMs.**
- **RAD-Bench: Evaluating Large Language Models Capabilities in Retrieval Augmented Dialogues.** _[…] Lin Kuo, Feng-Ting Liao, Mu-Wei Hsieh, Fu-Chieh Chang, Po-Chun Hsu, Da-Shan Shiu._ Arxiv 2024. [GitHub](https://github.com/mtkresearch/RAD-Bench)
- **Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation.** [Hugging Face](https://huggingface.co/datasets/google/frames-benchmark)
- **Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries.** _[…] Baptiste Lespiau, Nithya Attaluri, Kate Olszewska._ Arxiv 2024.
- **DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels.**
- **LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA.**
- **Distance between Relevant Information Pieces Causes Bias in Long-Context LLMs.** [GitHub](https://github.com/Rachum-thu/LongPiBench)
- **MileBench: Benchmarking MLLMs in Long Context.**
- **MovieSum: An Abstractive Summarization Dataset for Movie Screenplays.**
- **SEED-Story: Multimodal Long Story Generation with Large Language Model.** [GitHub](https://github.com/TencentARC/SEED-Story)
- **Multilingual Needle in a Haystack: Investigating Long-Context Behavior of Multilingual Large Language Models.** [GitHub](https://github.com/AmeyHengle/multilingual-needle-in-a-haystack)
- **LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs.** _[…] Wei Lee._ Arxiv 2024. [GitHub](https://github.com/mozhu621/LongGenBench/)
- **What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices.**
- **Retrieval or Global Context Understanding? On Many-Shot In-Context Learning for Long-Context Evaluation.**
- **Many-Shot In-Context Learning in Multimodal Foundation Models.**
- **MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs.** _[…] Peng Lim, Caiming Xiong, Doyen Sahoo._ Arxiv 2024.
- **Hyper-multi-step: The Truth Behind Difficult Long-context Tasks.**
- **RepoQA: Evaluating Long Context Code Understanding.**
- **Short Film Dataset (SFD): A Benchmark for Story-Level Video Understanding.**
- **MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding.**
- **Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models.** [GitHub](https://github.com/Wang-ML-Lab/multimodal-needle-in-a-haystack)
- **Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts.** [Link](https://locovqa.github.io/)
- **InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output.** [GitHub](https://github.com/InternLM/InternLM-XComposer)
- **MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations.** _[…] Gang Jiang, Jiaqi Wang, Yixin Cao, Aixin Sun._ Arxiv 2024. [GitHub](https://github.com/mayubo2333/MMLongBench-Doc)
- **Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge.** _[…] Jun Lee, Dokyong Lee, Junyoung Youn, Kyeongjin Oh, Byungsoo Ko, Jonghwan Hyeon, Ho-Jin Choi._ Arxiv 2024. [GitHub](https://github.com/passing2961/Stark)
- **SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers.**
- **mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval.** [Hugging Face](https://huggingface.co/Alibaba-NLP/gte-multilingual-base)
- **LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding.**
- **Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows.** [Link](https://spider2-sql.github.io/)
- **A Benchmark for Long-Form Medical Question Answering.** [GitHub](https://github.com/lavita-ai/medical-eval-sphere)
- **LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos.**
- **M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework.** [Link](https://multimodal-documents.github.io/)
- **Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?.** [GitHub](https://github.com/jonathan-roberts1/needle-threading)
- **DebateBench: A Challenging Long Context Reasoning Benchmark For Large Language Models.**
- **ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario.**
- **Long Context vs. RAG for LLMs: An Evaluation and Revisits.**
- **LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations.** [GitHub](https://github.com/google-deepmind/lm_act)
- **Neptune: The Long Orbit to Benchmarking Long Video Understanding.** [GitHub](https://github.com/google-deepmind/neptune)
- **NeedleChain: Measuring Intact Long-Context Reasoning Capability of Large Language Models.**
- **CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning.** _[…] Ah Kim, Michael P Brenner, Viren Jain, Sameera Ponda, Subhashini Venugopalan._ Arxiv 2025. [GitHub](https://github.com/google/curie)
- **L2M: Mutual Information Scaling Law for Long-Context Language Modeling.**
- **LLMThinkBench: Towards Basic Math Reasoning and Overthinking in Large Language Models.** [GitHub](https://github.com/ctrl-gaurav/LLMThinkBench/)
- **SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving.**
- **DENIAHL: In-Context Features Influence LLM Needle-In-A-Haystack Abilities.**
- **An Empirical Study of Mamba-based Language Models.** [GitHub](https://github.com/NVIDIA/Megatron-LM/tree/ssm/examples/mamba)
- **ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists.** _[…] Dempsey, Tiffany Chiang, Lucy Kates, Nicholas David, Sihan Chen, Ruxin Yang, Yuqian Yang, Jasmine Gump, Tessa Bialek, Vivek Sankaran, Margo Schlanger, Lu Wang._ Arxiv 2025. [GitHub](https://github.com/launchnlp/ExpertLongBench)
- **Does quantization affect models' performance on long-context tasks?.**
- **MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly.**
- **Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval and haystacks.** [GitHub](https://github.com/AmeyHengle/multilingual-long-context-reasoning)
- **LiveLongBench: Tackling Long-Context Understanding for Spoken Texts from Live Streams.**
- **MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models.** [GitHub](https://github.com/MilkThink-Lab/MiniLongBench)
- **100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?.** [GitHub](https://github.com/uservan/100-LongBench)
- **CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification.**
- **MIR-Bench: Benchmarking LLM's Long-Context Intelligence via Many-Shot In-Context Inductive Reasoning.** _[…] Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen._ Arxiv 2025.
- **LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs - No Silver Bullet for LC or RAG Routing.**
- **Does RAG Really Perform Bad For Long-Context Processing?.**
- **LooGLE: Long Context Evaluation for Long-Context Language Models.** [GitHub](https://github.com/bigai-nlco/loogle)
- **OLAPH: Improving Factuality in Biomedical Long-form Question Answering.** [GitHub](https://github.com/dmis-lab/OLAPH)
- **Can LLMs Solve longer Math Word Problems Better?.** [GitHub](https://github.com/XinXU-USTC/CoLeG-Math)
- **MedOdyssey: A Medical Domain Benchmark for Long Context Evaluation Up to 200K Tokens.** [GitHub](https://github.com/JOHNNY-fans/MedOdyssey)
- **Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization.** _[…] Yu Hsieh, Yung-Sung Chuang, Chun-Liang Li, Zifeng Wang, Long T. Le, Abhishek Kumar, James Glass, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna, Tomas Pfister._ Arxiv 2024.
- **One Thousand and One Pairs: A "novel" challenge for long-context language models.**
- **ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding.** [GitHub](https://github.com/multimodal-art-projection/ScaleLong)
- **THINK-Bench: Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models.** [GitHub](https://github.com/ZhiyuanLi218/Think-Bench)
- **SCBench: A KV Cache-Centric Analysis of Long-Context Methods.** [Link](https://hqjiang.com/scbench.html)
- **VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation.**
- **LCFO: Long Context and Long Form Output Dataset and Benchmarking.** _[…] jussà, Pierre Andrews, Mariano Coria Meglioli, Joy Chen, Joe Chuang, David Dale, Christophe Ropers, Alexandre Mourachko, Eduardo Sánchez, Holger Schwenk, Tuan Tran, Arina Turkatenko, Carleigh Wood._ Arxiv 2024.
- **LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks.**
- **XRAG: eXamining the Core -- Benchmarking Foundational Components in Advanced Retrieval-Augmented Generation.**
- **RepoTransBench: A Real-World Benchmark for Repository-Level Code Translation.**
- **MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems.** _[…] Suk Lee, Lucian Popa, Vraj Shah, Huaiyu Zhu, Danish Contractor, Marina Danilevsky._ Arxiv 2025. [GitHub](https://github.com/ibm/mt-rag-benchmark)
- **Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation.**
- **LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating.** _[…] Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, Cheng-Lin Liu._ Arxiv 2024. [GitHub](https://github.com/google-deepmind/neptune)
- **HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding.**
- **SQLong: Enhanced NL2SQL for Longer Contexts with LLMs.** _[…] Fang Li, Long Duong._ Arxiv 2025.
- **LongSafety: Evaluating Long-Context Safety of Large Language Models.** [GitHub](https://github.com/thu-coai/LongSafety)
- **Compression Scaling Laws: Unifying Sparsity and Quantization.**
- **LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion.** _[…] Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen._ Arxiv 2025.
- **Explaining Context Length Scaling and Bounds for Language Models.** _[…] Neng Hwang, Serge Belongie, Lei Li._ Arxiv 2025. [GitHub](https://github.com/JingzheShi/NLPCtlScalingAndBounds)
- **Attention Sinks and Outlier Features: A 'Catch, Tag, and Release' Mechanism for Embeddings.** [Link](https://catch-tag-release.github.io/)
- **BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation.**
- **NoLiMa: Long-Context Evaluation Beyond Literal Matching.**
- **Technical Debt in In-Context Learning: Diminishing Efficiency in Long Context.**
- **EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges.** [Link](https://scale.com/leaderboard/enigma_eval)
- **RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?.** [Hugging Face](https://huggingface.co/RedStar-Reasoning)
- **MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents.** [Hugging Face](https://huggingface.co/MMDocIR)
- **Long Range Arena: A Benchmark for Efficient Transformers.** [GitHub](https://github.com/google-research/long-range-arena)
- **Base of RoPE Bounds Context Length.**
- **Many-shot In-Context Learning.** _[…] Reyes, Eric Chu, Feryal Behbahani, Aleksandra Faust, Hugo Larochelle._ Arxiv 2024.
- **CRAG -- Comprehensive RAG Benchmark.** _[…] tau Yih, Xin Luna Dong._ Arxiv 2024. [Link](https://www.aicrowd.com/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024)
- **CoverBench: A Challenging Benchmark for Complex Claim Verification.** _[…] David, Uri Shaham, Amir Feder, Mor Geva, Dror Marcus, Avi Caciularu._ Arxiv 2024. [Hugging Face](https://huggingface.co/datasets/google/coverbench)
- **Multilingual Evaluation of Long Context Retrieval and Reasoning.**
- **L-CiteEval: Do Long-Context Models Truly Leverage Context for Responding?** [GitHub](https://github.com/ZetangForward/L-CITEEVAL)
- **HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly.** [GitHub](https://github.com/princeton-nlp/HELMET)
- **Holistic Reasoning with Long-Context LMs: A Benchmark for Database Operations on Massive Textual Data.**
- **How much do contextualized representations encode long-range context?.** _[…] Ping Hsieh._ Arxiv 2024.
- **LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory.** _[…] Wei Chang, Dong Yu._ Arxiv 2024. [GitHub](https://github.com/xiaowu0162/LongMemEval)
- **When Attention Sink Emerges in Language Models: An Empirical View.** [GitHub](https://github.com/sail-sg/Attention-Sink)
- **ETHIC: Evaluating Large Language Models on Long-Context Tasks with High Information Coverage.** [GitHub](https://github.com/dmis-lab/ETHIC)
- **Long2RAG: Evaluating Long-Context & Long-Form Retrieval-Augmented Generation with Key Point Recall.**
- **LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios.**
- **NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables.**
- **DocPuzzle: A Process-Aware Benchmark for Evaluating Realistic Long-Context Reasoning Capabilities.**
- **Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision.** [GitHub](https://github.com/lemon-prog123/LongRePS)
- **Dissecting Long Reasoning Models: An Empirical Study.**
- **Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning.**
- **Too Long, Didn't Model: Decomposing LLM Long-Context Understanding With Novels.**
- **LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework.** [GitHub](https://github.com/LCM-Lab/LOOM-Scope) [Link](https://loomscope.github.io/)
- **PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts.** [Link](https://gorov.github.io/prelude/)
- **OptimalThinkingBench: Evaluating Over and Underthinking in LLMs.**
- **LVBench: An Extreme Long Video Understanding Benchmark.** [GitHub](https://github.com/zai-org/LVBench)
- **LongReasonArena: A Long Reasoning Benchmark for Large Language Models.**
- **A Controllable Examination for Long-Context Language Models.** [GitHub](https://github.com/Thomasyyj/LongBio-Benchmark)
- **Demystifying Long Chain-of-Thought Reasoning in LLMs.** [GitHub](https://github.com/Infini-AI-Lab/gsm)
- **Retrieval meets Long Context Large Language Models.**
- **MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models.**
-
12. Long Text Generation
- **Suri: Multi-constraint Instruction Following for Long-form Text Generation.**
- **Context-Preserving Gradient Modulation for Large Language Models: A Novel Approach to Semantic Consistency in Long-Form Text Generation.**
- **Think When You Need: Self-Adaptive Chain-of-Thought Learning.**
- **LLM×MapReduce-V2: Entropy-Driven Convolutional Test-Time Scaling for Generating Long-Form Articles from Extremely Long Resources.**
- **Integrating Planning into Single-Turn Long-Form Text Generation.**
- **LoGU: Long-form Generation with Uncertainty Expressions.**
- **Large Language Models Still Exhibit Bias in Long Text.**
- **Language Models can Self-Lengthen to Generate Long Texts.** [GitHub](https://github.com/QwenLM/Self-Lengthen)
- **LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation.** [GitHub](https://github.com/princeton-pli/LongProc)
- **Beyond Factual Accuracy: Evaluating Coverage of Diverse Factual Information in Long-form Text Generation.**
- **LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs.**
- **ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning.**
- **A Cognitive Writing Perspective for Constrained Long-Form Text Generation.**
- **CLIPPER: Compression enables long-context synthetic data generation.**
- **LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models.** _[…] Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li._ Arxiv 2025. [GitHub](https://github.com/THU-KEG/LongWriter-V)
- **Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key.**
- **DeFine: A Decomposed and Fine-Grained Annotated Dataset for Long-form Article Generation.** [GitHub](https://github.com/DeFine-LFAG/DeFine_Dataset)
- **Lost-in-the-Middle in Long-Text Generation: Synthetic Dataset, Evaluation Framework, and Mitigation.** [GitHub](https://github.com/OnlyAR/RAL-Writer)
- **Beyond Outlining: Heterogeneous Recursive Planning for Adaptive Long-form Writing with Language Models.** [GitHub](https://github.com/principia-ai/heterogeneous-recursive-planning)
- **LongDPO: Unlock Better Long-form Generation Abilities for LLMs via Critique-augmented Stepwise Information.**
- **Generating Long-form Story Using Dynamic Hierarchical Outlining with Memory-Enhancement.**
- **ExPerT: Effective and Explainable Evaluation of Personalized Long-Form Text Generation.**
- **LongEval: A Comprehensive Analysis of Long-Text Generation Through a Plan-based Paradigm.** _[…] Navarro, Chenghua Lin._ Arxiv 2025. [GitHub](https://github.com/Wusiwei0410/LongEval)
- **From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens.** [GitHub](https://github.com/bigai-nlco/TokenSwift)
- **RAPID: Efficient Retrieval-Augmented Long Text Generation with Writing Planning and Information Discovery.**
- **From General to Targeted Rewards: Surpassing GPT-4 in Open-Ended Long-Context Generation.**
- **StoryWriter: A Multi-Agent Framework for Long Story Generation.**
- **Writing-RL: Advancing Long-form Writing via Adaptive Curriculum Reinforcement Learning.** [GitHub](https://github.com/Tongyi-Zhiwen/Writing-RL)
- **SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models.** _[…] Wei Lee._ Arxiv 2025. [GitHub](https://github.com/mozhu621/SuperWriter)
- **LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning.** _[…] Wei Lee, Juanzi Li._ Arxiv 2025. [Hugging Face](https://huggingface.co/THU-KEG/LongWriter-Zero-32B)
- **LiteLong: Resource-Efficient Long-Context Data Synthesis for LLMs.**
- **LongGenBench: Long-context Generation Benchmark.**
- **The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input.** [Link](https://www.kaggle.com/facts-leaderboard)
-
13. Long CoT
- **Towards Widening The Distillation Bottleneck for Reasoning Models.**
- **What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret.**
- **START: Self-taught Reasoner with Tools.**
- **LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!.** [GitHub](https://github.com/NovaSky-AI/SkyThought)
- **Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning.**
- **L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning.**
- **CoT-Valve: Length-Compressible Chain-of-Thought Tuning.** [GitHub](https://github.com/horseee/CoT-Valve)
- **Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity.**
- **Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning.**
- **Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?.**
- **MA-LoT: Multi-Agent Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving.**
- **"Well, Keep Thinking": Enhancing LLM Reasoning with Adaptive Injection Decoding.**
- **Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond.** [GitHub](https://github.com/Qihoo360/Light-R1)
- **Unlocking General Long Chain-of-Thought Reasoning Capabilities of Large Language Models via Representation Engineering.**
- **TwT: Thinking without Tokens by Habitual Reasoning Distillation with Multi-Teachers' Guidance.**
- **SKIntern: Internalizing Symbolic Knowledge for Distilling Better CoT Capabilities into Small Language Models.**
- **Large Reasoning Models in Agent Scenarios: Exploring the Necessity of Reasoning Capabilities.**
- **Attention Reveals More Than Tokens: Training-Free Long-Context Reasoning with Attention-guided Retrieval.**
- **PENCIL: Long Thoughts with Short Memory.**
- **AALC: Large Language Model Efficient Reasoning via Adaptive Accuracy-Length Control.**
- **When to Continue Thinking: Adaptive Thinking Mode Switching for Efficient Reasoning.**
- **Adaptive Deep Reasoning: Triggering Deep Thinking When Needed.**
- **Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model.** [GitHub](https://github.com/Danield21/Dual-Policy-Preference-Optimization)
- **OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling.** [GitHub](https://github.com/GAIR-NLP/OctoThinker)
- **A\*-Thought: Efficient Reasoning via Bidirectional Compression for Low-Resource Settings.** [GitHub](https://github.com/AI9Stars/AStar-Thought)
- **Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning.**
- **Adaptive Termination for Multi-round Parallel Reasoning: An Universal Semantic Entropy-Guided Framework.**
- **ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute.** _[…] Qin Zhang, Yuanchun Li._ Arxiv 2025. [GitHub](https://github.com/MobileLLM/ParaThinker)
- **AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time.** [GitHub](https://github.com/ASTRAL-Group/AlphaOne)
- **AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models.** _[…] Neng Chuang, Guanchu Wang, Hoang Anh Duy Le, Shaochen Zhong, Hongyi Liu, Jiayi Yuan, Yang Sui, Vladimir Braverman, Vipin Chaudhary, Xia Hu._ Arxiv 2025.
- **ARM: Adaptive Reasoning Model.** [GitHub](https://github.com/TEAM-ARM/ARM)
- **Think or Not? Exploring Thinking Efficiency in Large Reasoning Models via an Information-Theoretic Lens.** [GitHub](https://github.com/chicosirius/think-or-not)
- **QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning.** [GitHub](https://github.com/Tongyi-Zhiwen/QwenLong-L1)
- **Amplify Adjacent Token Differences: Enhancing Long Chain-of-Thought Reasoning with Shift-FFN.** [Link](https://anonymous.4open.science/r/Shift-FFN)
- **Thinking Fast and Right: Balancing Accuracy and Reasoning Length with Adaptive Rewards.**
- **Route to Reason: Adaptive Routing for LLM and Reasoning Strategy Selection.** [GitHub](https://github.com/goodmanpzh/Route-To-Reason)
- **RM-R1: Reward Modeling as Reasoning.** [GitHub](https://github.com/RM-R1-UIUC/RM-R1)
- **Think on your Feet: Adaptive Thinking via Reinforcement Learning for Social Agents.**
- **DRP: Distilled Reasoning Pruning with Skill-aware Step Decomposition for Efficient Large Reasoning Models.**
- **Reinforcement Learning vs. Distillation: Understanding Accuracy and Capability in LLM Reasoning.**
- **ReTool: Reinforcement Learning for Strategic Tool Use in LLMs.** [GitHub](https://github.com/ReTool-RL/ReTool)
- **Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models.**
- **LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception.** _[…] Hong Liao, Sven Elflein, Liu He, Laura Leal-Taixé, Yejin Choi, Sanja Fidler, David Acuna._ Arxiv 2025. [GitHub](https://github.com/andrewliao11/LongPerceptualThoughts)
- **Process Reward Models That Think.**
- **ThinkLess: A Training-Free Inference-Efficient Method for Reducing Reasoning Redundancy.**
- **Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning.**
- **Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?.**
- **TokenSkip: Controllable Chain-of-Thought Compression in LLMs.**
- **SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities.** [Link](https://safe-chain.github.io/)
- **Leveraging Constrained Monte Carlo Tree Search to Generate Reliable Long Chain-of-Thought for Mathematical Reasoning.**
- **Dynamic Chain-of-Thought: Towards Adaptive Deep Reasoning.**
- **DRT: Deep Reasoning Translation via Long Chain-of-Thought.** [GitHub](https://github.com/krystalan/DRT-o1)
- **Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs.**
- **O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?.** [GitHub](https://github.com/GAIR-NLP/O1-Journey)
- **OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning.** [GitHub](https://github.com/ADaM-BJTU/OpenRFT)
- **When More is Less: Understanding Chain-of-Thought Length in LLMs.**
- **Monte Carlo Tree Diffusion for System 2 Planning.**
- **InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models.**
- **LightThinker: Thinking Step-by-Step Compression.**
- **SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild.** [GitHub](https://github.com/hkust-nlp/simpleRL-reason)
- **Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning.** - Long Sun, Zhun Sun, Houwen Peng, Han-Jia Ye._ Arxiv 2025. [GitHub](https://github.com/sun-hailong/TVC)
- **Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models.** - Yan Yeung, Arman Cohan._ ACL 2025. [GitHub](https://github.com/wujunjie1998/Ref-Long)
- **LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization.** [GitHub](https://github.com/zju-real/lapo)
- **Hierarchical Budget Policy Optimization for Adaptive Reasoning.** [GitHub](https://github.com/zju-real/hbpo)
- **Through the Valley: Path to Effective Long CoT Training for Small Language Models.**
- **Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency.**
- **TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression.** - Zhi Li, Xiao Liang, Zihao Tang, Lei Ji, Peijie Wang, Haotian Xu, Xing W, Haizhen Huang, Weiwei Deng, Ying Nian Wu, Yeyun Gong, Zhijiang Guo, Xiao Liu, Fei Yin, Cheng-Lin Liu._ Arxiv 2025. [GitHub](https://github.com/zzli2022/TLDR)
- **Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning.** [GitHub](https://github.com/ChnQ/MI-Peaks)
- **Long or short CoT? Investigating Instance-level Switch of Large Reasoning Models.**
- **Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models.**
- **Just Enough Thinking: Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning.**
- **Kinetics: Rethinking Test-Time Scaling Laws.** [GitHub](https://github.com/Infini-AI-Lab/Kinetics)
- **Prolonged Reasoning Is Not All You Need: Certainty-Based Adaptive Routing for Efficient LLM/MLLM Reasoning.**
- **Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space.** [GitHub](https://github.com/eric-ai-lab/Soft-Thinking)
- **Learn to Reason Efficiently with Adaptive Length-based Reward Shaping.** [GitHub](https://github.com/hkust-nlp/Laser)
- **AdaR1: From Long-CoT to Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization.**
- **Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and correctness in LLMs.**
- **ThinkSwitcher: When to Think Hard, When to Think Fast.**
- **AdapThink: Adaptive Thinking Preferences for Reasoning Language Model.**
- **SABER: Switchable and Balanced Training for Efficient LLM Reasoning.**
- **Pruning the Unsurprising: Efficient Code Reasoning via First-Token Surprisal.**
- **Train Long, Think Short: Curriculum Learning for Efficient Reasoning.** - Zeid, Marzyeh Ghassemi, Bernard Ghanem._ Arxiv 2025. [GitHub](https://github.com/hammoudhasan/curriculum_grpo)
- **Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning.**
- **Aware First, Think Less: Dynamic Boundary Self-Awareness Drives Extreme Reasoning Efficiency in Large Language Models.**
- **SSPO: Self-traced Step-wise Preference Optimization for Process Supervision and Reasoning Compression.**
- **Self-Route: Automatic Mode Switching via Capability Estimation for Efficient Reasoning.**
- **Stop Spinning Wheels: Mitigating LLM Overthinking via Mining Patterns for Early Reasoning Exit.**
- **DRQA: Dynamic Reasoning Quota Allocation for Controlling Overthinking in Reasoning Large Language Models.**
- **BudgetThinker: Empowering Budget-aware LLM Reasoning with Control Tokens.** - Qin Zhang, Yuanchun Li._ Arxiv 2025. [GitHub](https://github.com/MobileLLM/BudgetThinker)
- **Early Stopping Chain-of-thoughts in Large Language Models.**
- **SmartSwitch: Advancing LLM Reasoning by Overcoming Underthinking via Promoting Deeper Thought Exploration.** [GitHub](https://github.com/dvlab-research/SmartSwitch)
- **Fast Thinking for Large Language Models.**
- **ThinkDial: An Open Recipe for Controlling Reasoning Effort in Large Language Models.**
- **Long Is More Important Than Difficult for Training Reasoning Models.**
- **Dynamic Early Exit in Reasoning Models.**
-
3. Recurrent Transformers
- **EdgeInfinite: A Memory-Efficient Infinite-Context Transformer for Edge Devices.**
- **Memformer: The memory-augmented transformer.**
- **Compressive Transformers for Long-Range Sequence Modelling.** [GitHub](https://github.com/lucidrains/compressive-transformer-pytorch)
- **ERNIE-Doc: A Retrospective Long-Document Modeling Transformer.** ACL-IJCNLP 2021.
- **Memorizing Transformers.** [GitHub](https://github.com/lucidrains/memorizing-transformers-pytorch)
- **Recurrent Attention Networks for Long-text Modeling.**
- **TRAMS: Training-free Memory Selection for Long-range Language Modeling.**
- **Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence.** - Jie Zhu._ Arxiv 2024. [GitHub](https://github.com/RWKV/RWKV-LM)
- **Extensible Embedding: A Flexible Multipler For LLM's Context Length.**
- **Linearizing Large Language Models.** [GitHub](https://github.com/TRI-ML/linear_open_lm)
- **GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression.** [GitHub](https://github.com/recursal/GoldFinch-paper)
- **VisualRWKV: Exploring Recurrent Neural Networks for Visual Language Models.** [GitHub](https://github.com/howard-hou/VisualRWKV)
- **xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference.** [GitHub](https://github.com/NX-AI/xlstm)
- **Associative Recurrent Memory Transformer.** [GitHub](https://github.com/RodkinIvan/associative-recurrent-memory-transformer)
- **Analysis of Argument Structure Constructions in a Deep Recurrent Language Model.**
- **RecurrentGemma: Moving Past Transformers for Efficient Open Language Models.** - Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Sertan Girgin, Olivier Bachem, Alek Andreev, Kathleen Kenealy, Thomas Mesnard, Cassidy Hardin, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Armand Joulin, Noah Fiedel, Evan Senter, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, David Budden, Arnaud Doucet, Sharad Vikram, Adam Paszke, Trevor Gale, Sebastian Borgeaud, Charlie Chen, Andy Brock, Antonia Paterson, Jenny Brennan, Meg Risdal, Raj Gundluru, Nesh Devanathan, Paul Mooney, Nilay Chauhan, Phil Culliton, Luiz Gustavo Martins, Elisa Bandy, David Huntsperger, Glenn Cameron, Arthur Zucker, Tris Warkentin, Ludovic Peran, Minh Giang, Zoubin Ghahramani, Clément Farabet, Koray Kavukcuoglu, Demis Hassabis, Raia Hadsell, Yee Whye Teh, Nando de Freitas._ Arxiv 2024.
- **RWKV: Reinventing RNNs for the Transformer Era.** - Jie Zhu._ Arxiv 2023. [GitHub](https://github.com/BlinkDL/RWKV-LM)
- **Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models.** - Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando De Freitas, Caglar Gulcehre._ Arxiv 2024.
- **Artificial Hippocampus Networks for Efficient Long-Context Modeling.** [GitHub](https://github.com/ByteDance-Seed/AHN)
-
8. Agent
- **A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis.**
- **LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration.**
- **AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents.** [GitHub](https://github.com/UT-Austin-RPL/amago)
- **Chain of Agents: Large Language Models Collaborating on Long-Context Tasks.**
- **GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models.**
- **PEARL: Prompting Large Language Models to Plan and Execute Actions Over Long Documents.**
- **Synergistic Multi-Agent Framework with Trajectory Learning for Knowledge-Intensive Tasks.**
- **Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks.** [GitHub](https://github.com/JiuTian-VL/Optimus-1)
-
1. Survey Papers
- **Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey.** [GitHub](https://github.com/Strivin0311/long-llms-learning)
- **The What, Why, and How of Context Length Extension Techniques in Large Language Models -- A Detailed Survey.**
- **Length Extrapolation of Transformers: A Survey from the Perspective of Position Encoding.**
- **Efficient Transformers: A Survey.**
- **State Space Model for New-Generation Network Alternative to Transformers: A Survey.** [GitHub](https://github.com/Event-AHU/Mamba_State_Space_Model_Paper_List)
- **A Survey on Efficient Inference for Large Language Models.** - Ping Zhang, Yuhan Dong, Yu Wang._ Arxiv 2024.
- **A Survey on RAG Meets LLMs: Towards Retrieval-Augmented Large Language Models.** - Seng Chua, Qing Li._ Arxiv 2024.
- **Evaluation of Retrieval-Augmented Generation: A Survey.** [GitHub](https://github.com/YHPeter/Awesome-RAG-Evaluation)
- **The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving.**
- **Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption.** [GitHub](https://github.com/zcli-charlie/Awesome-KV-Cache)
- **Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely.**
- **Prompt Compression for Large Language Models: A Survey.**
- **A Survey on Mamba Architecture for Vision Applications.**
- **Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models.** - Fai Wong._ Arxiv 2025. [GitHub](https://github.com/DevoAllen/Awesome-Reasoning-Economy-Papers)
- **Efficient Inference for Large Reasoning Models: A Survey.** [GitHub](https://github.com/yueliu1999/Awesome-Efficient-Inference-for-LRMs)
- **Technologies on Effectiveness and Efficiency: A Survey of State Spaces Models.**
- **Key, Value, Compress: A Systematic Exploration of KV Cache Compression Techniques.**
- **A Survey on Transformer Context Extension: Approaches and Evaluation.**
- **A Survey on Knowledge-Oriented Retrieval-Augmented Generation.** [GitHub](https://github.com/USTCAGI/Awesome-Papers-Retrieval-Augmented-Generation)
- **Shifting AI Efficiency From Model-Centric to Data-Centric Compression.** [GitHub](https://github.com/xuyang-liu16/Awesome-Token-level-Model-Compression)
- **A Survey of RWKV.** [GitHub](https://github.com/MLGroupJLU/RWKV-Survey)
- **A Survey on Large Language Model Acceleration based on KV Cache Management.** [GitHub](https://github.com/TreeAI-Lab/Awesome-KV-Cache-Management)
- **Optimizing LLMs for Resource-Constrained Environments: A Survey of Model Compression Techniques.**
- **Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models.** - Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Hanjie Chen, Xia Hu._ Arxiv 2025. [GitHub](https://github.com/Eclipsess/Awesome-Efficient-Reasoning-LLMs)
- **Thus Spake Long-Context Large Language Model.** [GitHub](https://github.com/OpenMOSS/Thus-Spake-Long-Context-LLM)
- **A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond.** - Sheng Hua, Bowen Zhou, Yu Cheng._ Arxiv 2025. [GitHub](https://github.com/XiaoYee/Awesome_Efficient_LRM_Reasoning)
- **A Survey on Long Text Modeling with Transformers.**
- **Neural Natural Language Processing for Long Texts: A Survey of the State-of-the-Art.**
- **Contextual Compression in Retrieval-Augmented Generation for Large Language Models: A Survey.** [GitHub](https://github.com/SrGrace/Contextual-Compression)
- **Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models.** [GitHub](https://github.com/LightChen233/Awesome-Long-Chain-of-Thought-Reasoning)
- **A Survey on Structured State Space Sequence (S4) Models.**
- **A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency.** - Soo Kim, Jemin Lee._ Arxiv 2025. [GitHub](https://github.com/sihyeong/Awesome-LLM-Inference-Engine)
- **Speed Always Wins: A Survey on Efficient Architectures for Large Language Models.** [GitHub](https://github.com/weigao266/Awesome-Efficient-Arch)
- **Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models.** - Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu._ Arxiv 2025. [GitHub](https://github.com/yunlong10/Awesome-Video-LMM-Post-Training)
- **Explain Before You Answer: A Survey on Compositional Visual Reasoning.** [GitHub](https://github.com/pokerme7777/Compositional-Visual-Reasoning-Survey)
-
15. Technical Report
- **SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment.**
- **Gemma 3 Technical Report.** - bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Beyer, Xiaohai Zhai, Anton Tsitsulin, Robert Busa-Fekete, Alex Feng, Noveen Sachdeva, Benjamin Coleman, Yi Gao, Basil Mustafa, Iain Barr, Emilio Parisotto, David Tian, Matan Eyal, Colin Cherry, Jan-Thorsten Peter, Danila Sinopalnikov, Surya Bhupatiraju, Rishabh Agarwal, Mehran Kazemi, Dan Malkin, Ravin Kumar, David Vilar, Idan Brusilovsky, Jiaming Luo, Andreas Steiner, Abe Friesen, Abhanshu Sharma, Abheesht Sharma, Adi Mayrav Gilady, Adrian Goedeckemeyer, Alaa Saade, Alex Feng, Alexander Kolesnikov, Alexei Bendebury, Alvin Abdagic, Amit Vadi, András György, André Susano Pinto, Anil Das, Ankur Bapna, Antoine Miech, Antoine Yang, Antonia Paterson, Ashish Shenoy, Ayan Chakrabarti, Bilal Piot, Bo Wu, Bobak Shahriari, Bryce Petrini, Charlie Chen, Charline Le Lan, Christopher A. Choquette-Choo, CJ Carey, Cormac Brick, Daniel Deutsch, Danielle Eisenbud, Dee Cattle, Derek Cheng, Dimitris Paparas, Divyashree Shivakumar Sreepathihalli, Doug Reid, Dustin Tran, Dustin Zelle, Eric Noland, Erwin Huizenga, Eugene Kharitonov, Frederick Liu, Gagik Amirkhanyan, Glenn Cameron, Hadi Hashemi, Hanna Klimczak-Plucińska, Harman Singh, Harsh Mehta, Harshal Tushar Lehri, Hussein Hazimeh, Ian Ballantyne, Idan Szpektor, Ivan Nardini et al.._ Arxiv 2025.
- **EXAONE Deep: Reasoning Enhanced Language Models.**
- **MiniMax-01: Scaling Foundation Models with Lightning Attention.** [GitHub](https://github.com/MiniMax-AI/MiniMax-01)
- **Qwen2.5-1M Technical Report.**
- **Llama-Nemotron: Efficient Reasoning Models.** - Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, Ido Shahaf, Oren Tropp, Ehud Karpas, Ran Zilberstein, Jiaqi Zeng, Soumye Singhal, Alexander Bukharin, Yian Zhang, Tugrul Konuk, Gerald Shen, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Yoshi Suhara, Olivier Delalleau, Zijia Chen, Zhilin Wang, David Mosallanezhad, Adi Renduchintala, Haifeng Qian, Dima Rekesh, Fei Jia, Somshubra Majumdar, Vahid Noroozi, Wasi Uddin Ahmad, Sean Narenthiran, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Igor Gitman, Ivan Moshkov, Wei Du, Shubham Toshniwal, George Armstrong, Branislav Kisacanin, Matvei Novikov, Daria Gitman, Evelina Bakhturina, Jane Polak Scowcroft, John Kamalu, Dan Su, Kezhi Kong, Markus Kliegl, Rabeeh Karimi, Ying Lin, Sanjeev Satheesh, Jupinder Parmar, Pritam Gundecha, Brandon Norick, Joseph Jennings, Shrimai Prabhumoye, Syeda Nahida Akter, Mostofa Patwary, Abhinav Khattar, Deepak Narayanan, Roger Waleffe, Jimmy Zhang, Bor-Yiing Su, Guyue Huang, Terry Kong, Parth Chadha, Sahil Jain, Christine Harvey, Elad Segal, Jining Huang, Sergey Kashirsky, Robert McQueen, Izzy Putterman, George Lam, Arun Venkatesan, Sherry Wu, Vinh Nguyen, Manoj Kilaru, Andrew Wang, Anna Warno, Abhilash Somasamudramath, Sandip Bhaskar, Maka Dong, Nave Assaf, Shahar Mor, Omer Ullman Argov, Scot Junkin, Oleksandr Romanenko, Pedro Larroy, Monika Katariya, Marco Rovinelli, Viji Balas, Nicholas Edelman, Anahita Bhiwandiwalla, Muthu Subramaniam et al.._ Arxiv 2025.
- **Kimi K2: Open Agentic Intelligence.**
- **Phi-4-reasoning Technical Report.**
- **Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities.** - Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ilaï Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ramé, Sagar Waghmare, Helen Miller, Vaishakh Keshava, Ying Jian, Xiaofan Zhang, Raluca Ada Popa, Kedar Dhamdhere, Blaž Bratanič, Kyuyeun Kim, Terry Koo, Ferran Alet, Yi-ting Chen, Arsha Nagrani, Hannah Muckenhirn, Zhiyuan Zhang, Corbin Quick, Filip Pavetić, Duc Dung Nguyen, Joao Carreira, Michael Elabd, Haroon Qureshi, Fabian Mentzer, Yao-Yuan Yang, Danielle Eisenbud, Anmol Gulati, Ellie Talius, Eric Ni, Sahra Ghalebikesabi, Edouard Yvinec, Alaa Saade, Thatcher Ulrich, Lorenzo Blanco, Dan A. Calian, Muhuan Huang, Aäron van den Oord, Naman Goyal, Terry Chen, Praynaa Rawlani, Christian Schallhart, Swachhand Lokhande, Xianghong Luo, Jyn Shan, Ceslee Montgomery, Victoria Krakovna, Federico Piccinini, Omer Barak, Jingyu Cui, Yiling Jia, Mikhail Dektiarev, Alexey Kolganov, Shiyu Huang, Zhe Chen, Xingyu Wang, Jessica Austin, Peter de Boursac, Evgeny Sluzhaev, Frank Ding, Huijian Li, Surya Bhupatiraju et al._ Arxiv 2025.
- **Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding.**
- **DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.** - AI._ Arxiv 2024. [GitHub](https://github.com/deepseek-ai/DeepSeek-V2)
- **Skywork Open Reasoner 1 Technical Report.**
- **Qwen2.5 Technical Report.**
- **Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs.** - Chun Chen, Yi-ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, Junheng Hao, Amr Hendy, Yuxuan Hu, Xin Jin, Mahmoud Khademi, Dongwoo Kim, Young Jin Kim, Gina Lee, Jinyu Li, Yunsheng Li, Chen Liang, Xihui Lin, Zeqi Lin, Mengchen Liu, Yang Liu, Gilsinia Lopez, Chong Luo, Piyush Madan, Vadim Mazalov, Ali Mousavi, Anh Nguyen, Jing Pan, Daniel Perez-Becker, Jacob Platin, Thomas Portet, Kai Qiu, Bo Ren, Liliang Ren, Sambuddha Roy, Ning Shang, Yelong Shen, Saksham Singhal, Subhojit Som, Xia Song, Tetyana Sych, Praneetha Vaddamanu, Shuohang Wang, Yiming Wang, Zhenghao Wang, Haibin Wu, Haoran Xu, Weijian Xu, Yifan Yang, Ziyi Yang, Donghan Yu, Ishmam Zabir, Jianwen Zhang, Li Lyna Zhang, Yunan Zhang, Xiren Zhou._ Arxiv 2025.
- **MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention.** [GitHub](https://github.com/MiniMax-AI/MiniMax-M1)
- **MiniCPM4: Ultra-Efficient LLMs on End Devices.**
- **Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math.** - Chun Chen, Mei Gao, Young Jin Kim, Yunsheng Li, Liliang Ren, Yelong Shen, Shuohang Wang, Weijian Xu, Jianfeng Gao, Weizhu Chen._ Arxiv 2025.
- **GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning.** [GitHub](https://github.com/THUDM/GLM-4.1V-Thinking)
- **NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model.**
- **InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency.**
- **DeepSeek-V3 Technical Report.** - AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J.L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jiawei Wang, Jin Chen, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, Junxiao Song, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao, Litong Wang, Liyue Zhang, Meng Li, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Ning Tian, Panpan Huang, Peiyi Wang, Peng Zhang, Qiancheng Wang, Qihao Zhu, Qinyu Chen, Qiushi Du, R.J. Chen, R.L. Jin, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runxin Xu, Ruoyu Zhang, Ruyi Chen, S.S. Li, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu, Shengfeng Ye, Shengfeng Ye, Shirong Ma, Shiyu Wang, Shuang Zhou, Shuiping Yu, Shunfeng Zhou, Shuting Pan, T. Wang, Tao Yun, Tian Pei, Tianyu Sun, W.L. Xiao, Wangding Zeng et al. (100 additional authors not shown)._ Arxiv 2025. [](https://github.com/deepseek-ai/DeepSeek-V3)
- **LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training.**
-
14. Speculative Decoding
- **Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling.** - Ling Zhen, Zhiyuan Yang, Chen Chen, Xianzhi Yu, Mingxuan Yuan, Chen Ma._ Arxiv 2025.
- **Efficient Reasoning for LLMs through Speculative Chain-of-Thought.**
- **Mamba Drafters for Speculative Decoding.**
- **L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models.** - Kiong Ng, Tat-Seng Chua._ Arxiv 2025.
- **SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences.**
- **LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification.** [GitHub](https://github.com/sail-sg/LongSpec)
- **Long-Context Inference with Retrieval-Augmented Speculative Decoding.** [GitHub](https://github.com/John-AI-Lab/RAPID)
- **SwiftSpec: Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding.**
-
-
📢 News
-
Week Papers
- Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models
- VideoCompressa: Data-Efficient Video Understanding via Joint Temporal Compression and Spatial Reconstruction
- Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach
- E3-Pruner: Towards Efficient, Economical, and Effective Layer Pruning for Large Language Models
- TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
- VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference
- Context Cascade Compression: Exploring the Upper Limits of Text Compression
- LiveStar: Live Streaming Assistant for Real-World Online Video Understanding
- TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning
- StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression
- Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models
-
Month Papers
-
-
2. Efficient Attention
-
2.1 Sparse Attention
-
2.4 IO-Aware Attention
-
2.2 Linear Attention
-
2.3 Hierarchical Attention
-
-
5. Length Extrapolation
-
2.4 IO-Aware Attention
-
-
7. RAG and ICL
-
2.4 IO-Aware Attention
-
-
11. Benchmark and Evaluation
-
11.2 MLLM
-
11.1 LLM
-
-
12. Long Text Generation
-
11.2 MLLM
- …LFAG/DeFine_Dataset
- …Writer
- …ai/heterogeneous-recursive-planning
- …Lengthen
- …pli/LongProc
- …KEG/LongWriter-V
- …nlco/TokenSwift
-
-
16. Blogs
-
3. Recurrent Transformers
-
2.4 IO-Aware Attention
- …xl
- …transformer-pytorch
- …LM
- …transformers-pytorch
- …recurrent-transformer-pytorch
- …LM
- …ML/linear_open_lm
- …hou/VisualRWKV
- …linear-attention
- …recurrent-memory-transformer
- …paper
- …AI/xlstm
-
-
4. State Space Models
-
2.4 IO-Aware Attention
- …spaces/mamba
- …lab/attamba
- …Mamba
- …mamba
-
-
6. Long Term Memory
-
2.4 IO-Aware Attention
- …SiliconFriend
- …pytorch
- …ustc/MemoryLLM
- …ai/lm2
-
-
10. Long Video and Image
-
9.2 Model
- …PLUG/mPLUG-Owl
- …apps/EasyAnimate
- …Bench
- …vgen/slowfast-vgen
- …nlco/VideoLLaMB
- …FlexReduc
- …ReTaKe
- …yh/Owl
-
2.4 IO-Aware Attention
-
-
9. Compress
-
9.1 Prompt
- …for-Prompt-Compression
- …mllab/context-memory
- …TMG/ICL-State-Vector
- …v2
- …Pt/UltraGist
- …G/RCC_Transformer
- …Group/LoCoCo
- …COCO
- …xmu/UIO-LLMs
- …group/FaviComp
- …nlp/AutoCompressors
- …EIC/DiffRatio-MoD
- …to-stop
-
9.2 Model
- …Institute/LLM-Microscope
- …DASLab/QuEST
- …folding-universal
- …MLSys-Lab/SVD-LLM
- …Aware-Automated-Machine-Learning
- …Pruner
- …han-lab/llm-awq
- …ffs-compression/
- …compression
- …he/Compressed-Experts
-
-
8. Agent
-
2.4 IO-Aware Attention
- …Austin-RPL/amago
- …VL/Optimus-1
-
-
13. Blogs
-
11.2 MLLM
- **The Secret Sauce behind 100K context window in LLMs: all tricks in one place.**
-
-
Month Papers
-
1. Survey Papers
- …llms-learning
- …RAG-Evaluation
- …AHU/Mamba_State_Space_Model_Paper_List
- …Compression
- …charlie/Awesome-KV-Cache
- …Spake-Long-Context-LLM
- …Long-Chain-of-Thought-Reasoning
- **A Comprehensive Survey on Long Context Language Modeling.**
- …Horizon/A-Comprehensive-Survey-For-Long-Context-Language-Modeling
- …Papers-Retrieval-Augmented-Generation
- …Lab/Awesome-KV-Cache-Management
- …Survey
-
13. Long CoT
-
13.1 LLM
- …AI/SkyThought
- …Valve
- …R1
- …BJTU/OpenRFT
- …o1
- …NLP/O1-Journey
-
13.2 MLLM
- …hailong/TVC
-
-
15. Blogs
-
14. Blogs
-
11.2 MLLM
- **The Secret Sauce behind 100K context window in LLMs: all tricks in one place.**
-
-
Acknowledgements
-
Star History
- Star History Chart (…LLM-Long-Context-Modeling/stargazers)
-
-
15. Technical Report
-
13.2 MLLM
- …ai/DeepSeek-V2
- …ai/DeepSeek-V3
- …AI/MiniMax-01
-
-
14. Speculative Decoding
-
13.2 MLLM
- …sg/LongSpec
- …AI-Lab/RAPID
-