# illama
**illama** is a fast, lightweight inference server for Llama and other ExLlamaV2-based large language models (LLMs).

## Features
- **Continuous batching** - Handles multiple requests simultaneously.
- **OpenAI-compatible server** - Works with the official OpenAI API clients (see the client sketch after this list).
- **Quantization support** - Loads any quantized ExLlamaV2-compatible model (GPTQ, EXL2, or SafeTensors).
- **GPU focused** - Distributes the model across any number of local GPUs.
- **FlashAttention 2** - Uses [FlashAttention 2](https://github.com/Dao-AILab/flash-attention) with paged attention by default.
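
Because the server speaks the OpenAI protocol, the official `openai` Python client can point at it directly. A minimal sketch, assuming the server is already running as described under [Running the Server](#running-the-server) on port 5000, exposes the usual `/v1` route prefix, and accepts a placeholder model name (the prefix and model name are assumptions, not confirmed by this README):

```python
from openai import OpenAI

# Point the official client at the local illama server instead of api.openai.com.
# The base URL, /v1 prefix, and API key handling are assumptions for illustration.
client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama",  # hypothetical model identifier; see `python server.py --help`
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```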

## Getting Started

To get started, clone the repo.

```bash
git clone https://github.com/nickpotafiy/illama.git
cd illama
```

### With Conda

Optionally, create a new conda environment.

```bash
conda create -n illama python=3.10
conda activate illama
```

## Install PyTorch

Install the [NVIDIA CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit-archive) and [PyTorch](https://pytorch.org/get-started/locally/). Ideally, the two CUDA versions should match to minimize incompatibilities; PyTorch built for CUDA `12.1` is recommended alongside NVIDIA CUDA Toolkit 12.1+.

### Install Torch w/ Pip

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

### Install Torch w/ Conda

```bash
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
```

Check Torch CUDA version with: `python -c "import torch; print(torch.version.cuda)"`
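
For a fuller check, the snippet below uses only standard PyTorch calls (nothing illama-specific) to confirm that CUDA is usable and to list the visible GPUs:

```python
import torch

# The CUDA version PyTorch was built against; None means a CPU-only build.
print("Torch CUDA version:", torch.version.cuda)
# Whether a CUDA-capable GPU is actually usable from this environment.
print("CUDA available:", torch.cuda.is_available())
# Enumerate visible GPUs; illama can split the model across all of them.
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))
```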

## Install illama

First, install the build dependencies:

```bash
pip install packaging ninja
```

Then, install the main package:

```bash
pip install .
```

If installation fails, you may need to limit parallel compilation jobs with `set MAX_JOBS=4` (Windows) or `export MAX_JOBS=4` (Linux/macOS), or a lower value, depending on available system memory. This is a known `flash-attn` build issue.

## Running the Server

To start the illama server, run this command, pointing `--model-path` at a local ExLlamaV2-compatible model directory:

```bash
python server.py --model-path "<path-to-model>" --batch-size 10 --host "0.0.0.0" --port 5000 --verbose
```

Run `python server.py --help` to get a list of all available options.
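
To see continuous batching in action, you can send several requests at once. A rough smoke test, under the same assumptions as the client sketch in [Features](#features) (local server on port 5000, `/v1` prefix, placeholder model name):

```python
import concurrent.futures

from openai import OpenAI

# Same assumptions as the earlier sketch; adjust host/port to your server flags.
client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed")

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="llama",  # hypothetical model identifier
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

topics = ["continuous batching", "paged attention", "quantization"]
questions = [f"In one sentence, what is {t}?" for t in topics]

# Concurrent submission lets the server batch the requests together
# (up to --batch-size requests are processed simultaneously).
with concurrent.futures.ThreadPoolExecutor() as pool:
    for answer in pool.map(ask, questions):
        print(answer)
```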

## Troubleshooting

If you get an error saying `OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root`, that typically means PyTorch was not installed correctly (on Linux, the CUDA install root is typically `/usr/local/cuda`). You can verify the PyTorch installation by activating your environment and launching `python`:

```python
import torch
print(torch.version.cuda)  # e.g. "12.1"; prints None on a CPU-only build
```

If this does not print your PyTorch CUDA version, PyTorch was not installed correctly; you may have installed a build without CUDA support (such as a CPU-only Preview build).