Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-lm-system
A summary of system papers, frameworks, code, and tools for training or serving large models
https://github.com/ModelTC/awesome-lm-system
Last synced: 3 days ago
Frameworks
Survey
Papers
Compression
- Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling
- A Simple and Effective Pruning Approach for Large Language Models
- On Architectural Compression of Text-to-Image Diffusion Models
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- OWQ: Lessons learned from activation outliers for weight quantization in large language models
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
- ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
- Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases
- QLoRA: Efficient Finetuning of Quantized LLMs
- ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers
- Extreme Compression for Pre-trained Transformers Made Simple and Efficient
- Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
- Understanding and Overcoming the Challenges of Efficient Transformer Quantization
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models
- Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge
- CBQ: Cross-Block Quantization for Large Language Models
- Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM
- Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
- RPTQ: Reorder-based Post-training Quantization for Large Language Models
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
- LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning
- QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
- LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
- AffineQuant: Affine Transformation Quantization for Large Language Models
- LLM-QAT: Data-Free Quantization Aware Training for Large Language Models
- QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models
- LLM-Pruner: On the Structural Pruning of Large Language Models
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
- SqueezeLLM: Dense-and-Sparse Quantization
- OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization
Inference
- Fast Inference in Denoising Diffusion Models via MMD Finetuning
- EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
Training
- Training Diffusion Models with Reinforcement Learning
- Extracting Training Data from Diffusion Models
- DySR: Adaptive Super-Resolution via Algorithm and System Co-design
- Scaling Vision-Language Models with Sparse Mixture of Experts
- MCR-DL: Mix-and-Match Communication Runtime for Deep Learning
- A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving
- On Optimizing the Communication of Model Parallelism
- Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models
- Perception Prioritized Training of Diffusion Models
- Reducing Activation Recomputation in Large Transformer Models
- 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed
- The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models
- Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam
- Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
- Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers
- DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
- A Frequency-aware Software Cache for Large Recommendation System Embeddings
- Parallel Training of Pre-Trained Models via Chunk-Based Dynamic Memory Management
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
- LoRA: Low-Rank Adaptation of Large Language Models
- 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed
- ZeRO-Offload: Democratizing Billion-Scale Model Training
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
- Memory-Efficient Pipeline-Parallel DNN Training
- An Efficient 2D Method for Training Super-Large Deep Learning Models
- Maximizing Parallelism in Distributed Training for Huge Neural Networks
- Sequence Parallelism: Long Sequence Training from System Perspective
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
- Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- torchgpipe: On-the-fly Pipeline Parallelism for Training Giant Models
- GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism
- PipeDream: Generalized Pipeline Parallelism for DNN Training
- DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
- DAPPLE: A Pipelined Data Parallel Approach for Training Large Models
- Tesseract: Parallelize the Tensor Parallelism Efficiently
- ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
Keywords
- deep-learning (4)
- pytorch (3)
- model-parallelism (3)
- pipeline-parallelism (3)
- data-parallelism (2)
- machine-learning (2)
- inference (2)
- distributed-computing (2)
- big-model (1)
- ai (1)
- llm (1)
- jax (1)
- high-performance-computing (1)
- distributed-training (1)
- compiler (1)
- auto-parallelization (1)
- alpa (1)
- transformer (1)
- gpt (1)
- bert (1)
- foundation-models (1)
- heterogeneous-training (1)
- hpc (1)
- large-scale (1)
- billion-parameters (1)
- compression (1)
- gpu (1)
- mixture-of-experts (1)
- trillion-parameters (1)
- zero (1)
- large-language-models (1)
- model-para (1)
- transformers (1)
- checkpointing (1)
- gpipe (1)
- parallelism (1)