awesome-ml-model-compression
Awesome machine learning model compression research papers, tools, and learning material.
https://github.com/cedrickchee/awesome-ml-model-compression
Papers
General
- A Survey of Model Compression and Acceleration for Deep Neural Networks
- Model compression as constrained optimization, with application to neural nets. Part I: general framework
- Model compression as constrained optimization, with application to neural nets. Part II: quantization
- Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better
- FP8 Formats for Deep Learning - FP8 delivers INT8-level performance with FP16-level accuracy. E4M3, one of the FP8 variants, offers the benefits of INT8 with none of the loss in accuracy or throughput. A short decoding example of the E4M3 layout follows below.
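As a rough illustration of what the E4M3 layout means in practice, the sketch below decodes an 8-bit pattern under the common E4M3 convention (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, NaN at exponent 0b1111 with mantissa 0b111). It is a generic reading of the format, not code from the paper.

```python
def decode_e4m3(byte: int) -> float:
    """Decode an 8-bit E4M3 pattern (1 sign, 4 exponent, 3 mantissa bits, bias 7).

    Assumes the common FP8 convention: no infinities, and NaN encoded as
    exponent=0b1111 with mantissa=0b111.
    """
    sign = -1.0 if (byte >> 7) & 0x1 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")
    if exp == 0:                                   # subnormal: 2^(1-7) * (man / 8)
        return sign * 2.0 ** -6 * (man / 8.0)
    return sign * 2.0 ** (exp - 7) * (1.0 + man / 8.0)  # normal

# Largest finite E4M3 value: 0.1111.110 -> 2^8 * 1.75 = 448
print(decode_e4m3(0b0_1111_110))  # 448.0
```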
Architecture
- MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
- MobileNetV2: Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation
- Xception: Deep Learning with Depthwise Separable Convolutions
- ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices
- SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size
- Fast YOLO: A Fast You Only Look Once System for Real-time Embedded Object Detection in Video
- AddressNet: Shift-based Primitives for Efficient Convolutional Neural Networks
- ResNeXt: Aggregated Residual Transformations for Deep Neural Networks
- ResBinNet: Residual Binary Neural Network
- Residual Attention Network for Image Classification
- SEP-Nets: Small and Effective Pattern Networks
- Dynamic Capacity Networks
- Learning Infinite Layer Networks Without the Kernel Trick
- Efficient Sparse-Winograd Convolutional Neural Networks
- DSD: Dense-Sparse-Dense Training for Deep Neural Networks
- Coordinating Filters for Faster Deep Neural Networks
- Deep Networks with Stochastic Depth
- SqueezeDet: Unified, small, low power fully convolutional neural networks
Quantization
- Quantized Convolutional Neural Networks for Mobile Devices
- Towards the Limit of Network Quantization
- Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations
- Compressing Deep Convolutional Networks using Vector Quantization
- Trained Ternary Quantization
- The ZipML Framework for Training Models with End-to-End Low Precision: The Cans, the Cannots, and a Little Bit of Deep Learning
- ShiftCNN: Generalized Low-Precision Architecture for Inference of Convolutional Neural Networks
- Deep Learning with Low Precision by Half-wave Gaussian Quantization
- Loss-aware Binarization of Deep Networks
- Quantize weights and activations in Recurrent Neural Networks
- Fixed-Point Performance Analysis of Recurrent Neural Networks
- And the bit goes down: Revisiting the quantization of neural networks
- 8-bit Optimizers via Block-wise Quantization
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale - [Code: bitsandbytes integration]
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models - [Code: [mit-han-lab/smoothquant](https://github.com/mit-han-lab/smoothquant)]
- ZeroQuant: Efficient and Affordable Post-training Quantization for Large Transformer-based Models
- nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models
- MKQ-BERT: Quantized BERT with 4-bits Weights and Activations
- Understanding and Overcoming the Challenges of Efficient Transformer Quantization - [Code: Qualcomm-AI-research/transformer-quantization]
- Mesa: A Memory-saving **Training** Framework for Transformers
- The case for 4-bit precision: k-bit Inference Scaling Laws - Overall, their findings show that **4-bit precision is almost universally optimal for total model bits and zero-shot accuracy**. A generic 4-bit quantization sketch follows at the end of this list.
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- htqin/awesome-model-quantization
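To make the "4-bit precision" trade-off above concrete, here is a minimal, generic sketch of blockwise absmax quantization to 4 bits; it is not the exact scheme of any single paper listed here. Each block of weights is scaled by its absolute maximum and rounded to one of 15 signed levels, so storage is roughly 4 bits per weight plus one scale per block.

```python
import numpy as np

def quantize_blockwise_4bit(w: np.ndarray, block_size: int = 64):
    """Generic blockwise absmax quantization to signed 4-bit codes in [-7, 7].

    Each block stores int4 codes plus one float scale, so the cost is
    roughly 4 bits/weight plus per-block overhead.
    """
    flat = w.reshape(-1)
    pad = (-len(flat)) % block_size
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12
    codes = np.clip(np.round(blocks / scales * 7), -7, 7).astype(np.int8)
    return codes, scales, w.shape, pad

def dequantize_blockwise_4bit(codes, scales, shape, pad):
    flat = (codes.astype(np.float32) / 7) * scales
    flat = flat.reshape(-1)
    if pad:
        flat = flat[:-pad]
    return flat.reshape(shape)

w = np.random.randn(3, 100).astype(np.float32)
codes, scales, shape, pad = quantize_blockwise_4bit(w)
w_hat = dequantize_blockwise_4bit(codes, scales, shape, pad)
print("max abs error:", np.abs(w - w_hat).max())
```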
Binarization
- Binarized Convolutional Neural Networks with Separable Filters for Efficient Hardware Acceleration
- Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1
- Local Binary Convolutional Neural Networks
- XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
- DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients
Pruning
- Faster CNNs with Direct Sparse Convolutions and Guided Pruning
- Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
- Pruning Convolutional Neural Networks for Resource Efficient Inference
- Pruning Filters for Efficient ConvNets
- Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning
- Learning to Prune: Exploring the Frontier of Fast and Accurate Parsing
- Fine-Pruning: Joint Fine-Tuning and Compression of a Convolutional Network with Bayesian Optimization
- Learning both Weights and Connections for Efficient Neural Networks
- ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression
- Data-Driven Sparse Structure Selection for Deep Neural Networks
- Soft Weight-Sharing for Neural Network Compression
- Dynamic Network Surgery for Efficient DNNs
- Channel pruning for accelerating very deep neural networks
- AMC: AutoML for model compression and acceleration on mobile devices
- ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA
- Massive Language Models Can Be Accurately Pruned in One-Shot (2023) - Pruning methods: post-training, layer-wise. Quantization methods: joint sparsification & post-training quantization.
- UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers
- A Simple and Effective Pruning Approach for Large Language Models - The popular approach known as magnitude pruning removes the smallest weights in a network, on the assumption that weights closest to 0 can be set to 0 with the least impact on performance. In LLMs, however, the magnitudes of a subset of outputs from an intermediate layer may be up to 20x larger than those of other outputs of the same layer. Removing the weights that are multiplied by these large outputs — even weights close to zero — could significantly degrade performance. Thus, a pruning technique that considers both weights and intermediate-layer outputs can accelerate a network with less impact on performance (a minimal sketch of this criterion follows below). Why it matters: the ability to compress models without affecting their performance is becoming more important as mobile devices and personal computers become powerful enough to run them. [Code: [Wanda](https://github.com/locuslab/wanda)]
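Below is a minimal sketch of the weight-times-activation pruning criterion described in the item above: each weight is scored by |W_ij| multiplied by the L2 norm of its input feature over a small calibration batch, and the lowest-scoring weights are zeroed within each output row. The function name, the 50% sparsity default, and the use of NumPy are illustrative choices, not the reference Wanda implementation.

```python
import numpy as np

def wanda_style_prune(weight: np.ndarray, activations: np.ndarray, sparsity: float = 0.5):
    """Prune `weight` (out_features x in_features) with the criterion above:
    score_ij = |W_ij| * ||X_j||_2, where X_j is the j-th input feature across a
    calibration batch `activations` (n_samples x in_features). The lowest-scoring
    weights are zeroed independently within each output row. Illustrative sketch.
    """
    feat_norm = np.linalg.norm(activations, axis=0)        # (in_features,)
    scores = np.abs(weight) * feat_norm                    # (out, in)
    k = int(weight.shape[1] * sparsity)                    # weights to drop per row
    pruned = weight.copy()
    if k > 0:
        idx = np.argsort(scores, axis=1)[:, :k]            # lowest-scoring columns per row
        np.put_along_axis(pruned, idx, 0.0, axis=1)
    return pruned

W = np.random.randn(8, 16)
X = np.random.randn(32, 16)                                # calibration activations
W_pruned = wanda_style_prune(W, X, sparsity=0.5)
print("zeros per row:", (W_pruned == 0).sum(axis=1))
```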
Distillation
- Distilling the Knowledge in a Neural Network
- Deep Model Compression: Distilling Knowledge from Noisy Teachers
- Learning Efficient Object Detection Models with Knowledge Distillation
- Data-Free Knowledge Distillation For Deep Neural Networks
- Knowledge Projection for Effective Design of Thinner and Faster Deep Neural Networks
- Moonshine: Distilling with Cheap Convolutions
- Model Distillation with Knowledge Transfer from Face Classification to Alignment and Verification
- Like What You Like: Knowledge Distill via Neuron Selectivity Transfer
- Sequence-Level Knowledge Distillation
- Learning Loss for Knowledge Distillation with Conditional Adversarial Networks
- Dark knowledge
- DarkRank: Accelerating Deep Metric Learning via Cross Sample Similarities Transfer
- FitNets: Hints for Thin Deep Nets
- MobileID: Face Model Compression by Distilling Knowledge from Neurons
- Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer
Low Rank Approximation
- Speeding up convolutional neural networks with low rank expansions
- Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications
- Convolutional neural networks with low-rank regularization
- Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation
- Accelerating Very Deep Convolutional Networks for Classification and Detection
- Efficient and Accurate Approximations of Nonlinear Convolutional Networks
- LoRA: Low-Rank Adaptation of Large Language Models - Low-rank adapters were proposed for GPT-like models by Hu et al.; a minimal sketch of the adapter math appears after this list.
- QLoRA: Efficient Finetuning of Quantized LLMs
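Below is a minimal sketch of the adapter math behind LoRA (referenced above): the pretrained weight W stays frozen while a rank-r update ΔW = (α/r)·B·A is trained, with B initialized to zero so the adapter starts as a no-op. Shapes and scaling follow the paper's description; the class itself is an illustrative NumPy sketch, not the reference implementation.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer with a trainable low-rank update:
    y = x W^T + (alpha/r) * x A^T B^T.

    Only A (r x in) and B (out x r) are trained; W stays frozen.  B starts at
    zero, so the adapter initially leaves the pretrained behaviour unchanged.
    """
    def __init__(self, W: np.ndarray, r: int = 8, alpha: int = 16):
        out_f, in_f = W.shape
        self.W = W                                   # frozen pretrained weight
        self.A = np.random.randn(r, in_f) * 0.01     # trainable
        self.B = np.zeros((out_f, r))                # trainable, zero-init
        self.scale = alpha / r

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

W = np.random.randn(64, 128)
layer = LoRALinear(W, r=8, alpha=16)
x = np.random.randn(4, 128)
print(layer(x).shape)        # (4, 64)
```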
Offloading
Parallelism
- Does compressing activations help model parallel training? (2023) - They present the first empirical study of how well compression algorithms (pruning-based, learning-based, and quantization-based, evaluated on a Transformer architecture) improve the communication speed of model parallelism. **Summary:** 1) activation compression is not the same as gradient compression; 2) training setups matter a lot; 3) don't compress early layers' activations. A generic sketch of activation compression at a stage boundary follows below.
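To illustrate what compressing activations at a model-parallel boundary can look like, here is a generic int8 absmax quantize/dequantize pair a stage could apply to the activations it sends to the next stage (and which, per finding 3 above, one would skip for early layers). This is a sketch of the general idea, not the authors' code.

```python
import numpy as np

def compress_activations(act: np.ndarray):
    """Quantize a stage's output activations to int8 (per-tensor absmax scaling)
    before sending them to the next model-parallel stage.  Sketch only."""
    scale = np.abs(act).max() / 127.0 + 1e-12
    return np.round(act / scale).astype(np.int8), scale

def decompress_activations(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate activations on the receiving stage."""
    return q.astype(np.float32) * scale

act = np.random.randn(16, 1024).astype(np.float32)   # activations at a stage boundary
q, scale = compress_activations(act)                  # ~4x smaller payload than fp32
act_hat = decompress_activations(q, scale)
print("mean abs error:", np.abs(act - act_hat).mean())
```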
Articles
Howtos
- How to Quantize Neural Networks with TensorFlow
- 🤗 PEFT: Parameter-Efficient Fine-Tuning of Billion-Scale Models on Low-Resource Hardware (2023) - The Hugging Face PEFT library enables using the most popular and performant models from Transformers, coupled with the simplicity and scalability of Accelerate. Currently supported PEFT methods: LoRA, prefix tuning, prompt tuning, and P-Tuning (which employs trainable continuous prompt embeddings). They'll be exploring more PEFT methods, such as (IA)³ and bottleneck adapters. Results: the number of parameters needed to fine-tune Flan-T5-XXL is now 9.4M, about 7x fewer than AlexNet (source: [Tweet](https://twitter.com/dmvaldman/status/1624143468003221504)). A minimal usage sketch follows below.
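A minimal usage sketch of the PEFT + Transformers flow described above, using LoRA. The base model name and hyperparameters are placeholders chosen to keep the example small (the post itself fine-tunes Flan-T5-XXL), and exact arguments may differ across peft versions.

```python
# pip install transformers peft accelerate
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

# Placeholder base model; the blog post fine-tunes Flan-T5-XXL.
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,  # seq2seq fine-tuning
    r=8,                              # rank of the low-rank update
    lora_alpha=32,
    lora_dropout=0.1,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()    # only the adapter weights are trainable
# ...then train as usual with Trainer/Accelerate; only the LoRA parameters update.
```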
Assorted
- Why the Future of Machine Learning is Tiny
- Deep Learning Model Compression for Image Analysis: Methods and Architectures
- A foolproof way to shrink deep learning models - A pruning algorithm: train to completion, globally prune the 20% of weights with the lowest magnitudes (the weakest connections), retrain with **learning rate rewinding** to the original (early-training) rate, and repeat iteratively until the desired sparsity is reached (the model is as tiny as you want). A schematic sketch of the loop follows below.
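Below is a schematic sketch of that loop (train, globally prune the lowest-magnitude 20%, rewind the learning rate, retrain, repeat). The toy model, the `train` placeholder, and the five iterations are illustrative assumptions based on the article's description, not released code.

```python
import numpy as np

def global_magnitude_prune(weights: dict, fraction: float = 0.2) -> dict:
    """Zero out the `fraction` of remaining weights with the smallest magnitude,
    chosen globally across all layers (as the article describes)."""
    all_vals = np.concatenate([np.abs(w[w != 0]) for w in weights.values()])
    threshold = np.quantile(all_vals, fraction)
    return {name: np.where(np.abs(w) <= threshold, 0.0, w) for name, w in weights.items()}

def train(weights: dict, lr_schedule) -> dict:
    """Placeholder for a real training loop; here it just nudges surviving weights."""
    return {name: w + 0.01 * np.random.randn(*w.shape) * (w != 0)
            for name, w in weights.items()}

# Toy "model": two weight matrices.
weights = {"fc1": np.random.randn(64, 32), "fc2": np.random.randn(32, 10)}
original_lr_schedule = [0.1, 0.01, 0.001]       # the early-training schedule to rewind to

weights = train(weights, original_lr_schedule)  # 1) train to completion
for _ in range(5):                              # iterate until the desired sparsity
    weights = global_magnitude_prune(weights, fraction=0.2)   # 2) prune weakest 20%
    weights = train(weights, original_lr_schedule)            # 3) retrain, LR rewound

sparsity = sum((w == 0).sum() for w in weights.values()) / sum(w.size for w in weights.values())
print(f"final sparsity: {sparsity:.2%}")
```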
Blogs
- TensorFlow Model Optimization Toolkit — Pruning API
- Compressing neural networks for image classification and detection - Facebook AI researchers have developed a new method for reducing the memory footprint of neural networks by quantizing their weights while maintaining a short inference time. They manage to get a 76.1% top-1 ResNet-50 that fits in 5 MB and also compress a Mask R-CNN to within 6 MB.
- All The Ways You Can Compress BERT - An overview of compression methods for large NLP models (BERT), organized by their characteristics and with a comparison of their results.
- Deep Learning Model Compression
- Do We Really Need Model Compression
- Breakdown of Nvidia H100s for Transformer Inferencing
- Which Quantization Method Is Best for You?: GGUF, GPTQ, or AWQ - A gentle introduction to three prominent quantization methods — GPTQ, AWQ, and GGUF.
- Comparison between quantization techniques and formats for LLMs - A detailed comparison between GGUF (llama.cpp), GPTQ, AWQ, EXL2, q4_K_M, q4_K_S, and load_in_4bit: perplexity, VRAM, speed, model size, and loading time.
Tools
Libraries
- NNCP - An experiment to build a practical lossless data compressor with neural networks. The latest version uses a Transformer model (slower but best ratio). LSTM (faster) is also available.