https://github.com/jinhanlei/llm-stream-service
Streaming API and Web page for Large Language Models (Llama3) based on transformers+flask+gradio.
https://github.com/jinhanlei/llm-stream-service
flask gradio huggingface llama2 llama3 python transformers
Last synced: 6 months ago
JSON representation
Streaming API and Web page for Large Language Models (Llama3) based on transformers+flask+gradio.
- Host: GitHub
- URL: https://github.com/jinhanlei/llm-stream-service
- Owner: JinHanLei
- License: mit
- Created: 2024-05-18T12:23:36.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-05-19T03:10:49.000Z (about 2 years ago)
- Last Synced: 2025-03-16T08:12:14.415Z (over 1 year ago)
- Topics: flask, gradio, huggingface, llama2, llama3, python, transformers
- Language: Python
- Homepage:
- Size: 21.5 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# LLM Stream Service
[](README.md)[](README_zh.md)
**Streaming API** and **Web page** for Large Language Models based on Python.
This repository contains:
1. Transformers streaming generation: **REAL** streaming generation for all pre-trained models (based on transformers).
2. Flask API: streaming response interface.
3. Gradio APP: fast and easy LLM web page.
## Quick Start
Take Llama3 for example:
1. Follow [Llama3 download](https://github.com/meta-llama/llama3?tab=readme-ov-file#download) to download Meta-Llama-3-8B-Instruct model, or from [huggingface](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) / [modelscope](https://modelscope.cn/models/LLM-Research/Meta-Llama-3-8B-Instruct/summary).
2. Follow [Llama3 quick-start](https://github.com/meta-llama/llama3?tab=readme-ov-file#quick-start) to install dependencies for Llama3.
3. Clone this repository and install dependencies:
```bash
git clone https://github.com/JinHanLei/LLM-Stream-Service
pip install flask gradio transformers
```
4. Run Flask service:
```bash
python llama3_service.py --host 0.0.0.0 --port 8800 --ckpts /Meta-Llama-3-8B-Instruct
```
**Note**
- Replace `Meta-Llama-3-8B-Instruct/` with the path to your checkpoint directory.
5. Run Gradio service:
```bash
gradio llama3_app.py
```
**Note**
- Replace the `Address` variable in `llama3_app.py` with your service address.
# Journey and Challenges
1. The initial streaming output scheme adopted by the project was the **TextIteratorStreamer** that comes with the official transformers library. However, the generation speed was still very slow. After researching, I found that the TextIteratorStreamer actually converts print-ready text into a streaming structure, meaning that the LLM first needs to generate the entire text block before converting it, which is not what I wanted. I wanted the LLM to yield each token as it is generated.
2. Subsequently, I came across [LowinLi's project](https://github.com/LowinLi/transformers-stream-generator) that truly implemented streaming output for pretrained models. When I eagerly applied it to the Llama3 model, it threw an error. After debugging, I found that Llama3 has two **eos_tokens**, which caused the loop to generate negative ids. Thus, I made modifications based on this project, cleaned up redundancies, adapted it for Llama3, and made it easier to read and understand.
# Thanks π
- https://github.com/meta-llama/llama3
- https://github.com/TylunasLi/ChatGLM-web-stream-demo
- https://github.com/LowinLi/transformers-stream-generator