Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/Infini-AI-Lab/TriForce

[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

Host: GitHub
URL: https://github.com/Infini-AI-Lab/TriForce
Owner: Infini-AI-Lab
Created: 2024-04-04T01:37:19.000Z (10 months ago)
Default Branch: main
Last Pushed: 2024-07-04T19:49:12.000Z (7 months ago)
Last Synced: 2024-08-05T08:07:56.669Z (6 months ago)
Topics: acceleration, efficiency, inference, llm, llm-inference, long-context, speculative-decoding
Language: Python
Homepage: https://infini-ai-lab.github.io/TriForce/
Size: 71.7 MB
Stars: 147
Watchers: 1
Forks: 12
Open Issues: 3
Metadata Files:
- Readme: README.md

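StarryDivineSky - Infini-AI-Lab/TriForce - 7B-128K, LWM-Text-Chat-128K, Llama2-13B-128K, etc.) are served on consumer GPUs, generating long sequences losslessly (16-bit precision, preserving the original output distribution) with 0.1-second latency. We demonstrate that TriForce can efficiently serve Llama2-13B with a 128K context on two RTX 4090s, reaching an average time between tokens (TBT) as low as 0.22 seconds, which is 7.8x faster than a highly optimized offloading system. With TriForce, Llama2-7B-128K can also be served on two RTX 4090s with a TBT of 0.11 seconds, only 0.5x slower than on a single A100. In addition, TriForce runs 4.86x as fast as DeepSpeed-Zero-Inference on a single RTX 4090 GPU. Beyond offloading, TriForce also provides an on-chip solution for data-center GPUs such as the A100. TriForce effectively addresses this challenge while provably preserving model quality by integrating retrieval-based drafting and hierarchical speculation. The approach uses the original model weights together with a small portion of the KV cache obtained by retrieval as a draft model, which is in turn speculated by a lightweight model with a StreamingLLM cache to reduce drafting latency. By mitigating the dual bottlenecks of the KV cache and the model weights, it significantly accelerates long-context LLM serving with offloading. (A01_Text Generation_Text Dialogue / Large Language Dialogue Models and Data)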
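The "lossless" claim in the description rests on the verification step of speculative decoding. The sketch below illustrates the generic speculative-sampling accept/reject rule on toy probability lists; it is not taken from the TriForce repository. The function names `verify_draft_token` and `speculative_sample` and the list-based distributions are assumptions made for illustration. In a hierarchical scheme like the one described, the same rule would presumably be applied at each level, with the lightweight model drafting for the retrieval-based draft, and that draft in turn drafting for the full model.

```python
import random


def verify_draft_token(p, q, token, rng):
    """Generic speculative-sampling check for one drafted token (illustrative).

    p: target-model probabilities over the vocabulary
    q: draft-model probabilities over the vocabulary
    token: index that was sampled from q (so q[token] > 0)
    Returns (accepted, output_token).
    """
    # Accept the drafted token with probability min(1, p/q).
    accept_prob = min(1.0, p[token] / q[token])
    if rng.random() < accept_prob:
        return True, token
    # On rejection, resample from the residual max(p - q, 0).
    # rng.choices normalizes the weights internally.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return False, rng.choices(range(len(p)), weights=residual)[0]


def speculative_sample(p, q, rng):
    """Draw a token from q, then verify it against p."""
    drafted = rng.choices(range(len(q)), weights=q)[0]
    return verify_draft_token(p, q, drafted, rng)[1]


if __name__ == "__main__":
    rng = random.Random(0)
    target = [0.5, 0.3, 0.2]   # hypothetical target distribution
    draft = [0.2, 0.2, 0.6]    # hypothetical draft distribution
    n = 20000
    counts = [0] * len(target)
    for _ in range(n):
        counts[speculative_sample(target, draft, rng)] += 1
    # Empirical frequencies should be close to the target distribution.
    print([round(c / n, 3) for c in counts])
```

Because rejected tokens are resampled from the residual distribution, the emitted tokens follow the target distribution regardless of how poor the draft is; a weaker draft only lowers the acceptance rate and therefore the speedup.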