Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Awesome-Resource-Efficient-LLM-Papers
A curated list of high-quality papers on resource-efficient LLMs 🌱
https://github.com/tiingweii-shii/Awesome-Resource-Efficient-LLM-Papers
Last synced: 4 days ago
-
LLM Architecture Design
-
Efficient Transformer Architecture
- An Attention Free Transformer
- xFormers - Toolbox to Accelerate Research on Transformers
- FasterTransformer: A Faster Transformer Framework
- Simple linear attention language models balance the recall-throughput tradeoff
- MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
- LoMA: Lossless Compressed Memory Attention
- Two Stones Hit One Bird: Bilevel Positional Encoding for Better Length Extrapolation
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- KDEformer: Accelerating Transformers via Kernel Density Estimation
- Mega: Moving Average Equipped Gated Attention
- Efficient Attention: Attention with Linear Complexities
- Self-attention Does Not Need O(n^2) Memory
- LightSeq: A High Performance Inference Library for Transformers
- Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
- Reformer: The Efficient Transformer
- Cluster-wise Graph Transformer with Dual-granularity Kernelized Attention
- Local Attention Mechanism: Boosting the Transformer Architecture for Long-Sequence Time Series Forecasting
- SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
-
Non-transformer Architecture
- You Only Cache Once: Decoder-Decoder Architectures for Language Models
- Scalable MatMul-free Language Modeling
- RWKV: Reinventing RNNs for the Transformer Era
- Auto-Regressive Next-Token Predictors are Universal Learners
- Hyena Hierarchy: Towards Larger Convolutional Language Models
- Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
- Mixture-of-Experts with Expert Choice Routing
- Efficient Large Scale Language Modeling with Mixtures of Experts
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
-
-
System Design
-
Other Systems
-
Support Infrastructure
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
- Training Large-Vocabulary Neural Language Models by Private Federated Learning for Resource-Constrained Devices
- Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
- GPT-NeoX-20B: An Open-Source Autoregressive Language Model
- Large Language Models Empowered Autonomous Edge AI for Connected Intelligence
- EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation
- ProFormer: Towards On-Device LSH Projection-Based Transformers
- Generate More Features with Cheap Operations for BERT
- SqueezeBERT: What can computer vision teach NLP about efficient neural networks?
- Lite Transformer with Long-Short Range Attention
-
Deployment Optimization
-
-
LLM Inference
-
Model Compression
- ShortGPT: Layers in Large Language Models are More Redundant Than You Expect
- NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models
- SliceGPT: Compress Large Language Models by Deleting Rows and Columns
- Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs
- Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models
- One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models
- Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
- BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation
- SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot
- A Simple and Effective Pruning Approach for Large Language Models
- AccelTran: A Sparsity-Aware Accelerator for Dynamic Inference With Transformers
- LLM-Pruner: On the Structural Pruning of Large Language Models
- LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation
- Structured Pruning for Efficient Generative Pre-trained Language Models
- ZipLM: Inference-Aware Structured Pruning of Language Models
- Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
- FlexRound: Learnable Rounding Based on Element-wise Division for Post-Training Quantization
- Outlier Suppression+: Accurate Quantization of Large Language Models by Equivalent and Optimal Shifting and Scaling
- OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- Dynamic Stashing Quantization for Efficient Transformer Training
- Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding
- QLoRA: Efficient Finetuning of Quantized LLMs
- Stable and Low-Precision Training for Large-Scale Vision-Language Models
- PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models
- OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
- SqueezeLLM: Dense-and-Sparse Quantization
- LLM-QAT: Data-Free Quantization Aware Training for Large Language Models
- GACT: Activation Compressed Training for Generic Network Architectures
- Boost Vision Transformer with GPU-Friendly Sparsity and Quantization
- AC-GC: Lossy Activation Compression with Guaranteed Convergence
-
Dynamic Acceleration
- Learned Token Pruning for Transformers
- Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference
- PuMer: Pruning and Merging Tokens for Efficient Vision Language Models
- Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient Language Model
- SmartTrim: Adaptive Tokens and Parameters Pruning for Efficient Vision-Language Models
- Transkimmer: Transformer Learns to Layer-wise Skim
- TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference
- Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
-
-
LLM Pre-Training
-
Memory Efficiency
- FairScale: A general purpose modular PyTorch library for high performance and large scale training
- Mesh-TensorFlow: Deep Learning for Supercomputers
- ProTrain: Efficient LLM Training via Adaptive Memory Management
- MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
- FP8-LM: Training FP8 Large Language Models
- PaLM: Scaling Language Modeling with Pathways
- BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- PipeDream: Generalized Pipeline Parallelism for DNN Training
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Mixed Precision Training
-
Data Efficiency
- LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning
- How to Train Data-Efficient LLMs
- A Survey on Efficient Training of Transformers
- Data-Juicer: A One-Stop Data Processing System for Large Language Models
- INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models
- Machine Learning Force Fields with Data Cost Aware Training
- Beyond neural scaling laws: beating power law scaling via data pruning
- Deep Learning on a Data Diet: Finding Important Examples Early in Training
- Training Deep Models Faster with Robust, Approximate Importance Sampling
- Not All Samples Are Created Equal: Deep Learning with Importance Sampling
- MixGen: A New Multi-Modal Data Augmentation
- Augmentation-Aware Self-Supervision for Data-Efficient GAN Training
- Improving End-to-End Speech Processing by Efficient Text Data Utilization with Latent Synthesis
- FaMeSumm: Investigating and Improving Faithfulness of Medical Summarization
- Challenges and Applications of Large Language Models
- Efficient Data Learning for Open Information Extraction with Pre-trained Language Models
- Scaling Language-Image Pre-training via Masking
- Masked Autoencoders Are Scalable Vision Learners
- MASS: Masked Sequence to Sequence Pre-training for Language Generation
-
-
Resource-Efficiency Evaluation Metrics & Benchmarks
-
🧮 Computation Metrics
- End-to-end latency in seconds
- Minutes, days (0069/23-0069.pdf)
- [Inference time speed-up](https://github.com/NVIDIA/FasterTransformer)
-
⚡️ Energy Metrics
-
Benchmarks
- GLUE, [SQuAD](https://arxiv.org/pdf/1606.05250.pdf), etc. ([A Comprehensive Overview of Large Language Models](https://arxiv.org/pdf/2307.06435.pdf))
- Long Range Arena: A Benchmark for Efficient Transformers
- Towards Efficient NLP: A Standard Evaluation and A Strong Baseline
- MS MARCO: query latency and cost alongside accuracy, facilitating a comprehensive evaluation of IR systems ([Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking](https://arxiv.org/pdf/2212.01340.pdf))
- Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking
- NeurIPS 2020: efficient QA systems ([NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned](https://proceedings.mlr.press/v133/min21a/min21a.pdf))
- SustaiNLP 2020: efficient NLP models, assessed by their performance across eight NLU tasks using SuperGLUE metrics and by their energy consumption during inference ([Overview of the SustaiNLP 2020 Shared Task](https://aclanthology.org/2020.sustainlp-1.24.pdf))
- VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models
-
💾 Memory Metrics
-
📨 Network Communication Metric
-
💡 Other Metrics
-
-
LLM Fine-Tuning
-
Parameter-Efficient Fine-Tuning
- Fine-Tuning Pre-Trained Language Models Effectively by Optimizing Subnetworks Adaptively
- BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models
- Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning
- Unlearning Bias in Language Models by Partitioning Gradients
- SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization
-
Full-Parameter Fine-Tuning
- A Comparative Study between Full-Parameter and LoRA-based Fine-Tuning on Chinese Instruction Data for Instruction Following Large Language Model
- Comparison between parameter-efficient techniques and full fine-tuning: A case study on multilingual news article classification
- Full Parameter Fine-tuning for Large Language Models with Limited Resources
- Fine-Tuning Language Models with Just Forward Passes
- PMC-LLaMA: Towards Building Open-source Language Models for Medicine
- Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution
-