# chat-flame-backend

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Doc](https://img.shields.io/badge/Docs-online-green.svg)](https://blog.chriamue.de/chat-flame-backend/chat_flame_backend/)
[![codecov](https://codecov.io/gh/chriamue/chat-flame-backend/graph/badge.svg?token=MNHB75EJ2Z)](https://codecov.io/gh/chriamue/chat-flame-backend)

ChatFlameBackend is an innovative backend solution for chat applications, leveraging the power of the Candle AI framework with a focus on the Mistral model.

## Quickstart

### Installation

```bash
cargo build --release
```

### Running

Run the server:

```bash
cargo run --release
```

Or run one of the supported models directly with a prompt:

```bash
cargo run --release -- --model phi-v2 --prompt 'write me fibonacci in rust'
```

### Docker

```bash
docker-compose up --build
```

Visit http://localhost:8080/swagger-ui for the Swagger UI.

## Testing

### Test using the shell

```bash
cargo test
```

or with curl:

```bash
curl -X POST http://localhost:8080/generate \
-H "Content-Type: application/json" \
-d '{"inputs": "Your text prompt here"}'
```

or via the streaming endpoint:

```bash
curl -X POST -H "Content-Type: application/json" -d '{"inputs": "Your input text"}' http://localhost:8080/generate_stream
```

### Test using Python

Detailed documentation on how to use the Python client is available on [huggingface](https://huggingface.co/docs/text-generation-inference/basic_tutorials/consuming_tgi#inference-client).

```bash
virtualenv .venv
source .venv/bin/activate
pip install huggingface-hub
python test.py
```

## Architecture

The backend is written in Rust. Models are loaded with the [candle](https://github.com/huggingface/candle) framework, served over an HTTP endpoint with axum, and documented through a Swagger UI generated by utoipa.
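
The sketch below illustrates, at a high level, how these pieces might be wired together. It is a minimal illustration, not the actual code from this repository: the handler, request types, and route documentation are simplified placeholders (the handler just echoes the prompt instead of running a candle model), and the `/generate` route and `inputs` field simply mirror the curl examples above.

```rust
// Minimal illustrative wiring of axum + utoipa; not the actual backend code.
use axum::{routing::post, Json, Router};
use serde::{Deserialize, Serialize};
use utoipa::{OpenApi, ToSchema};
use utoipa_swagger_ui::SwaggerUi;

#[derive(Deserialize, ToSchema)]
struct GenerateRequest {
    /// Text prompt, matching the `inputs` field used in the curl examples.
    inputs: String,
}

#[derive(Serialize, ToSchema)]
struct GenerateResponse {
    generated_text: String,
}

/// Placeholder handler: the real backend would run a candle model here.
#[utoipa::path(
    post,
    path = "/generate",
    request_body = GenerateRequest,
    responses((status = 200, description = "Generated text", body = GenerateResponse))
)]
async fn generate(Json(req): Json<GenerateRequest>) -> Json<GenerateResponse> {
    Json(GenerateResponse {
        generated_text: format!("echo: {}", req.inputs),
    })
}

#[derive(OpenApi)]
#[openapi(paths(generate), components(schemas(GenerateRequest, GenerateResponse)))]
struct ApiDoc;

#[tokio::main]
async fn main() {
    let app = Router::new()
        .route("/generate", post(generate))
        // Serve the generated OpenAPI document through a Swagger UI.
        .merge(SwaggerUi::new("/swagger-ui").url("/api-doc/openapi.json", ApiDoc::openapi()));

    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```

The real service additionally exposes the `/generate_stream` endpoint shown above and generates text with the selected candle model instead of echoing the input.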

## Supported Models

- [x] [Mistral](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- [x] Zephyr
- [x] OpenChat
- [x] Starling
- [x] [Phi](https://huggingface.co/microsoft/phi-2) (Phi-1, Phi-1.5, Phi-2)
- [ ] GPT-Neo
- [ ] GPT-J
- [ ] Llama

### Mistral

["lmz/candle-mistral"](https://huggingface.co/lmz/candle-mistral)

### Phi

["microsoft/phi-2"](https://huggingface.co/microsoft/phi-2)

## Performance

The following table shows generation throughput for the models on different systems:

| Model | System | Tokens per Second |
| ---------------- | -------------------------- | ----------------- |
| 7b-open-chat-3.5 | AMD 7900X3D (12 Core) 64GB | 9.4 tokens/s |
| 7b-open-chat-3.5 | AMD 5600G (8 Core VM) 16GB | 2.8 tokens/s |
| 13b (llama2 13b) | AMD 7900X3D (12 Core) 64GB | 5.2 tokens/s |
| phi-2 | AMD 7900X3D (12 Core) 64GB | 20.6 tokens/s |
| phi-2 | AMD 5600G (8 Core VM) 16GB | 5.3 tokens/s |
| phi-2 | Apple M2 (10 Core) 16GB | 24.0 tokens/s |

### Hint

The performance of the models is highly dependent on the memory bandwidth of the system. On an AMD 7900X3D with 64 GB of DDR5-4800 memory, Phi-2 reaches 20.6 tokens/s; overclocking the memory to DDR5-5600 raises this to 21.8 tokens/s.

## Todo

- [x] implement the API of https://huggingface.github.io/text-generation-inference/#/
- [x] model configuration
- [x] generate stream
- [x] docker image and docker-compose
- [ ] add tests
- [ ] add documentation
- [ ] fix stop token