https://github.com/chuloai/andromeda-chain
Serving Hugging Face guidance behind a server
- Host: GitHub
- URL: https://github.com/chuloai/andromeda-chain
- Owner: ChuloAI
- License: MIT
- Created: 2023-06-02T07:38:35.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2023-06-14T20:23:08.000Z (about 2 years ago)
- Last Synced: 2025-04-13T02:17:05.579Z (3 months ago)
- Language: Python
- Size: 54.7 KB
- Stars: 13
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Andromeda-Chain - mastering all the chains
This repository contains both a server and a client package.
The server is (not very creatively) called `guidance_server`.
The client is called `AndromedaChain`.

## Why this package/library?
The [guidance](https://github.com/microsoft/guidance) library is awesome, but it has limited support in HTTP front ends such as the Oobabooga UI.
So I rolled my own HTTP server, which allows me to quickly prototype apps that use guidance templates.

I originally created Oasis with a variant of this idea: https://github.com/paolorechia/oasis
## Installation
### AndromedaChain
```bash
pip install andromeda-chain
```

### Guidance Server
Serving the guidance library with local models behind an HTTP server.

Supported methods:
1. Hugging Face (16bit, 8bit, 4bit)
2. GPTQ with or without CPU offload
3. Experimental LLaMA CPP support, based on the work of https://github.com/Maximilian-Winter

The server configuration is passed through environment variables, typically via the docker-compose file:
```yaml
GENERAL_BASE_IMAGE: GPU

# CPP Model Example:
# GENERAL_MODEL_PATH: /models/open-llama-7B-open-instruct.ggmlv3.q4_0.bin
# GENERAL_TOKENIZER_PATH: /models/VMware_open-llama-7b-open-instruct
# GENERAL_LOADING_METHOD: CPP

# GPTQ Model Example:
GENERAL_MODEL_PATH: /models/vicuna-7B-1.1-GPTQ-4bit-128g
GENERAL_LOADING_METHOD: GPTQ

# HF Model Example
# GENERAL_MODEL_PATH: /models/VMware_open-llama-7b-open-instruct
# GENERAL_LOADING_METHOD: HUGGING_FACE

# Guidance Settings
GUIDANCE_AFTER_ROLE: "|>"
GUIDANCE_BEFORE_ROLE: "<|"

# Tokenizer Settings
TK_BOOL_USE_FAST: false

# HuggingFace
HF_BOOL_USE_8_BIT: true
HF_BOOL_USE_4_BIT: false
HF_DEVICE_MAP: auto

# GPTQ
GPTQ_INT_WBITS: 4
GPTQ_INT_GROUP_SIZE: 128
GPTQ_INT_PRE_LOADED_LAYERS: 20
GPTQ_DEVICE: "cuda"
GPTQ_BOOL_CPU_OFFLOADING: false

# LLaMA CPP
CPP_INT_N_GPU_LAYERS: 300
CPP_INT_N_THREADS: 12
CPP_BOOL_CACHING: false
```
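The `_BOOL_`/`_INT_` tags embedded in the variable names suggest a typing convention. Below is a minimal sketch of how names following that convention could be coerced; this is illustrative only, not the server's actual parsing code:

```python
import os


def read_typed_env(name: str, default):
    """Coerce an environment variable based on the _BOOL_/_INT_ tag in its name.

    Hypothetical helper for illustration only.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    if "_BOOL_" in name:
        return raw.strip().lower() in ("1", "true", "yes")
    if "_INT_" in name:
        return int(raw)
    return raw


use_8bit = read_typed_env("HF_BOOL_USE_8_BIT", False)  # -> bool
wbits = read_typed_env("GPTQ_INT_WBITS", 4)            # -> int
```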
#### Docker Container
Requirements:
1. docker-engine
2. docker-compose v2

If using a GPU, also:
3. nvidia-docker: https://github.com/NVIDIA/nvidia-docker
##### Pulling the image
You can find the image tags on [Docker Hub](https://hub.docker.com/repository/docker/paolorechia/guidance_server/general).
The easiest way to pull it is to reuse the docker-compose file:

```bash
docker-compose -f docker-compose.gpu.yaml up
```

Or use the CPU version:
```bash
docker-compose -f docker-compose.cpu.yaml up
```

Note that you still need to set up the model (see the Usage section).
*LICENSE NOTE*: The GPU image is based on `nvidia/cuda:12.1.1-runtime-ubuntu22.04`, which is subject to the proprietary NVIDIA license.
While the software in the Andromeda repository is open source, some layers of the Docker container are not.

#### Building
Just use the appropriate bash script:
```bash
./build_gpu.sh
```

Or:
```bash
./build_cpu.sh
```

## Usage
1. Download an LLM you want to use from Hugging Face (see the download sketch after this list).
2. Create a `models` directory locally and save the model there.
3. Set the environment variable `MODEL_PATH` in `docker-compose.gpu.yaml` or `docker-compose.cpu.yaml`, depending on which one you want to use.
4. Start the server.
5. Use the Andromeda package to query the server, as shown below.
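As a sketch of steps 1-2, the `huggingface_hub` library can download a model snapshot into a local `models` directory; the repo id below is only an example:

```python
from huggingface_hub import snapshot_download

# Download an example model snapshot into the local "models" directory.
# Replace repo_id with the model you actually want to serve.
snapshot_download(
    repo_id="VMware/open-llama-7b-open-instruct",
    local_dir="models/VMware_open-llama-7b-open-instruct",
)
```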
### Using Andromeda Package
```python
from andromeda_chain import AndromedaChain, AndromedaPrompt, AndromedaResponse

chain = AndromedaChain()
prompt = AndromedaPrompt(
name="hello",
prompt_template="""Howdy: {{gen 'expert_names' temperature=0 max_tokens=300}}""",
input_vars=[],
output_vars=["expert_names"]
)

response: AndromedaResponse = chain.run_guidance_prompt(prompt)
# Use the response
print(response.expanded_generation)
print(response.result_vars)
```
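Prompts can also declare input variables. The sketch below assumes `run_guidance_prompt` accepts the variable values as a second argument; the exact call signature may differ:

```python
# Hypothetical sketch: a prompt that fills an input variable before generating.
# How the variable values are passed to run_guidance_prompt is an assumption.
prompt_with_inputs = AndromedaPrompt(
    name="capital",
    prompt_template="""The capital of {{country}} is: {{gen 'capital' temperature=0 max_tokens=10}}""",
    input_vars=["country"],
    output_vars=["capital"],
)

response = chain.run_guidance_prompt(prompt_with_inputs, {"country": "France"})
print(response.result_vars["capital"])
```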