https://github.com/alipay/painlessinferenceacceleration
Accelerate inference without tears
- Host: GitHub
- URL: https://github.com/alipay/painlessinferenceacceleration
- Owner: alipay
- License: mit
- Created: 2023-12-19T13:11:38.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2025-03-14T07:18:17.000Z (10 months ago)
- Last Synced: 2025-04-19T13:41:43.764Z (9 months ago)
- Topics: llm-inference
- Language: Python
- Homepage:
- Size: 18.8 MB
- Stars: 312
- Watchers: 5
- Forks: 22
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Painless Inference Acceleration (PIA)
A toolkit to accelerate LLM inference without headache (🤯) and tears (😭).
## NOTE
[2025/03] We have transitioned our open-source license from `Creative Commons Attribution 4.0 International` to `MIT`. This change reflects our assessment that the MIT license is more appropriate for distributing and using source code.
## *News or Update* 🔥
- [2025/03] We upgraded our inference framework `LOOKAHEAD` to [`FLOOD`](./flood/README.md).
- [2024/05] We released the code of [`IPaD`](./ipad/README.md).
- [2024/01] We support all models of the Baichuan family (Baichuan-7B & 13B, Baichuan2-7B & 13B) for lookahead.
- [2024/01] We fully support the `repetition_penalty` parameter for lookahead.
- [2024/01] We support Mistral & Mixtral for lookahead.
- [2023/12] We released our latency-oriented inference framework [`LOOKAHEAD`](./lookahead/README.md).
## Introduction
Our repo, PIA (short for Painless Inference Acceleration), is designed to accelerate LLM inference and currently contains three key features:
- [`FLOOD`](./flood/README.md): It employs pure pipeline parallelism to improve inference throughput, avoiding the communication costs typically associated with tensor parallelism. `FLOOD` is designed as the successor to our previous framework, `LOOKAHEAD`, in order to achieve optimal performance across both small and large batch sizes; a minimal pipeline-parallel sketch appears right after this list.
- [`LOOKAHEAD`](./lookahead/README.md): It uses an on-the-fly trie-tree cache to prepare hierarchical multi-branch drafts, without requiring assistant models (as in speculative decoding) or additional head training (as in block decoding).
With this efficient hierarchical structure, we can look ahead across tens of branches at once and therefore significantly increase the number of tokens generated per forward pass; a sketch of the trie-based draft idea appears at the end of this section. Note that `LOOKAHEAD` is built entirely on `transformers`, which is inefficient for serving large models. Consequently, we have superseded `LOOKAHEAD` with `FLOOD` and will maintain only minimal support for `LOOKAHEAD`.
- [`IPAD`](./ipad/README.md): It applies iterative pruning and distillation techniques to reduce the model size.
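
To make the pipeline-parallel idea behind `FLOOD` more concrete, here is a minimal plain-PyTorch sketch. It is not `FLOOD`'s actual API; the class, function, and device names below are illustrative assumptions. The model's layers are split into sequential stages on different devices and micro-batches are streamed through them, so the only cross-device traffic is the activation hand-off between neighbouring stages rather than per-layer all-reduces as in tensor parallelism.

```python
# Hypothetical sketch of pipeline parallelism (not FLOOD's real API): layers are
# split into per-device stages and micro-batches are streamed through them.
import torch
import torch.nn as nn


class Stage(nn.Module):
    """One contiguous slice of the model's layers, pinned to a single device."""

    def __init__(self, layers, device):
        super().__init__()
        self.device = device
        self.layers = nn.Sequential(*layers).to(device)

    def forward(self, hidden):
        # The only communication: move the incoming activations to this stage.
        return self.layers(hidden.to(self.device))


def pipeline_forward(stages, batch, num_microbatches=4):
    """Run `batch` through the stages as a stream of micro-batches.

    In a real pipeline the micro-batches overlap in time across stages; here
    they run sequentially only to keep the sketch short.
    """
    outputs = []
    for micro_batch in batch.chunk(num_microbatches):
        hidden = micro_batch
        for stage in stages:
            hidden = stage(hidden)
        outputs.append(hidden.cpu())
    return torch.cat(outputs)


# Illustrative usage with a toy 8-layer model split over two stages.
layers = [nn.Linear(64, 64) for _ in range(8)]
devices = ["cuda:0", "cuda:1"] if torch.cuda.device_count() >= 2 else ["cpu", "cpu"]
stages = [Stage(layers[:4], devices[0]), Stage(layers[4:], devices[1])]
print(pipeline_forward(stages, torch.randn(32, 64)).shape)  # torch.Size([32, 64])
```

In a real serving system the stages execute concurrently on different GPUs, which is what keeps every device busy at large batch sizes.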
Other features, including quantization and KV cache sparsification, will be released soon.
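
Here is a minimal sketch of the trie-tree draft cache idea behind `LOOKAHEAD`. Again, the class and method names are illustrative assumptions rather than the library's real API: token sequences observed during generation are inserted into a trie on the fly, and at each decoding step the branches hanging under the current suffix are retrieved as multi-branch drafts to be verified in a single forward pass.

```python
# Hypothetical sketch of an on-the-fly trie draft cache (not LOOKAHEAD's real API).
from collections import defaultdict


class TrieDraftCache:
    def __init__(self):
        self.children = defaultdict(dict)  # node_id -> {token: child_node_id}
        self.next_id = 1                   # 0 is the root

    def insert(self, tokens):
        """Add an observed token sequence to the trie (built on the fly)."""
        node = 0
        for tok in tokens:
            if tok not in self.children[node]:
                self.children[node][tok] = self.next_id
                self.next_id += 1
            node = self.children[node][tok]

    def drafts(self, prefix, max_depth=4):
        """Return every draft branch that continues the given prefix."""
        node = 0
        for tok in prefix:
            node = self.children[node].get(tok)
            if node is None:
                return []
        branches = []

        def walk(n, path):
            # Emit a branch when we hit a leaf or the depth limit.
            if not self.children[n] or len(path) == max_depth:
                if path:
                    branches.append(path)
                return
            for tok, child in self.children[n].items():
                walk(child, path + [tok])

        walk(node, [])
        return branches


# Illustrative usage: cache two continuations of the bigram (5, 6), then draft.
cache = TrieDraftCache()
cache.insert([5, 6, 7, 8])
cache.insert([5, 6, 9])
print(cache.drafts([5, 6]))  # [[7, 8], [9]]
```

Because every drafted branch is verified by the model in the same forward pass before any token is accepted, several tokens can be emitted per step while generation stays lossless, as described in the KDD '24 paper cited below.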
## Citations
```
@inproceedings{10.1145/3637528.3671614,
author = {Zhao, Yao and Xie, Zhitian and Liang, Chen and Zhuang, Chenyi and Gu, Jinjie},
title = {Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy},
year = {2024},
isbn = {9798400704901},
publisher = {Association for Computing Machinery},
doi = {10.1145/3637528.3671614},
booktitle = {Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
pages = {6344–6355},
series = {KDD '24}
}
```
```
@inproceedings{10.1145/3589335.3648321,
author = {Wang, Maolin and Zhao, Yao and Liu, Jiajia and Chen, Jingdong and Zhuang, Chenyi and Gu, Jinjie and Guo, Ruocheng and Zhao, Xiangyu},
title = {Large Multimodal Model Compression via Iterative Efficient Pruning and Distillation},
year = {2024},
isbn = {9798400701726},
publisher = {Association for Computing Machinery},
doi = {10.1145/3589335.3648321},
booktitle = {Companion Proceedings of the ACM Web Conference 2024},
pages = {235–244},
series = {WWW '24}
}
```