# InferyLLM
[![CircleCI](https://circleci.com/gh/Deci-AI/infery-llm/tree/master.svg?style=svg&circle-token=70fc750615364de727b3e72c362e02e18aa8b0d0)](https://dl.circleci.com/status-badge/redirect/gh/Deci-AI/infery-llm/tree/master)

## Table of Contents
1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Serving](#serving)
4. [Generation](#generation)
5. [Advanced Usage](#advanced-usage):
    * [CLI](#cli)
    * [Lowering loading time](#lowering-loading-time-the-prepare-command)
    * [Benchmarking models and serving configurations](#benchmarking)

## Introduction

InferyLLM is a high-performance engine and server for running LLM inference.

### InferyLLM is fast
- Optimized CUDA kernels for MQA, GQA and MHA
- Continuous batching using a paged KV cache and custom paged attention kernels
- Kernel autotuning capabilities with automatic selection of the optimal kernels and parameters for the given hardware
- Support for extremely efficient LLMs, designed to reach SOTA throughput

### InferyLLM is easy to use
- Containerized OR local entrypoint servers
- Simple, minimal-dependency Python client
- Seamless integration with 🤗 model hub

### Model support
* [deci/decilm-6b](https://huggingface.co/Deci/DeciLM-6b)
* [deci/decilm-7b](https://huggingface.co/Deci/DeciLM-7b)
* [deci/decicoder-1b](https://huggingface.co/Deci/DeciCoder-1b)
* [meta-llama/llama-2-7b-hf](https://huggingface.co/meta-llama/llama-2-7b-hf)
* [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1), [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
* All fine-tuned variants of the above (e.g. instruct variants like [Deci/DeciLM-7B-instruct](https://huggingface.co/Deci/DeciLM-7b-instruct)).
* More models coming soon... (Mixtral, Falcon, MPT etc)

### Supported GPUs
* Compute capability >= 8.0 (e.g. A100, A10, L4, ...). See full list [here](https://developer.nvidia.com/cuda-gpus)

* Memory requirements depend on the model size; a quick way to check your GPU is sketched below the list. For example:
  1. DeciLM-7B - at least 24 GB.
  2. DeciCoder-1B - 16 GB is more than enough.
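
Below is a minimal sketch for checking your GPU's compute capability and memory from Python. It assumes a CUDA-enabled PyTorch installation, which is not an InferyLLM requirement but simply a convenient way to query the device; `nvidia-smi` works just as well.

```python
# Sketch: verify compute capability (>= 8.0) and total GPU memory.
# Assumes a CUDA-enabled PyTorch install is available on the machine.
import torch

major, minor = torch.cuda.get_device_capability(0)
total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"Compute capability: {major}.{minor} (InferyLLM needs >= 8.0)")
print(f"GPU memory: {total_gb:.1f} GB (e.g. DeciLM-7B needs at least 24 GB)")
```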

## Installation
### Prerequisites
Before you begin, you will need two things from Deci.ai in order to use InferyLLM:
1. Artifactory credentials (referred to as ARTIFACTORY USER and ARTIFACTORY TOKEN below), used to *download* and later *update* the package.
2. An authentication token for running the server, referred to as INFERY_LLM_DECI_TOKEN below.

Then, ensure you have met the following system requirements (a quick check is sketched after this list):

- General requirements:
  - Python >= 3.11
  - [CUDA Toolkit >= 12.1](https://developer.nvidia.com/cuda-downloads)
- For local serving:
  - GLIBC >= 2.31
  - GCC, G++ >= 11.3
  - `gcc`, `g++` and `nvcc` in your `$PATH` at the time of installation
- For containerized serving:
  - [nvidia-container-runtime >= 1.13.4](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/release-notes.html)
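
The sketch below is one quick way to sanity-check the local-serving toolchain from Python; it is only an illustration, not an official InferyLLM check.

```python
# Sketch: sanity-check the toolchain required for a local (client+server) install.
import platform
import shutil
import subprocess
import sys

assert sys.version_info >= (3, 11), "Python >= 3.11 is required"
print("glibc:", platform.libc_ver())  # local serving needs GLIBC >= 2.31

for tool in ("gcc", "g++", "nvcc"):   # must be on $PATH at installation time
    print(f"{tool}: {shutil.which(tool) or 'NOT FOUND'}")

if shutil.which("nvcc"):              # CUDA Toolkit >= 12.1
    print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)
```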

### Installing locally

InferyLLM may be used with a lean (client-only) installation or a full (client+server) installation.

**Client Installation**
```bash
# Install InferyLLM (along with LLMClient)
pip install --extra-index-url=https://[ARTIFACTORY USER]:[ARTIFACTORY TOKEN]@deci.jfrog.io/artifactory/api/pypi/deciExternal/simple infery-llm
```
**Server Installation**
```bash
# Install InferyLLM Server
INFERY_LLM_DECI_TOKEN=[DECI TOKEN] infery-llm install
```

For a more thorough explanation, please refer to the [Advanced Usage](#advanced-usage) section and check out the `install` CLI command.

### Pulling the InferyLLM container

To pull an InferyLLM container from Deci's container registry:

```bash
# Log in to Deci's container registry
docker login --username [ARTIFACTORY USER] --password [ARTIFACTORY TOKEN] deci.jfrog.io

# Pull the container. You may specify a version instead of "latest" (e.g. 0.0.7)
docker pull deci.jfrog.io/deci-external-docker-local/infery-llm:latest
```

## Serving
**Note**: remember that you need an authentication token (INFERY_LLM_DECI_TOKEN) to run the server.

There are two ways to serve an LLM with InferyLLM:
1. Through a local entrypoint
2. By using the InferyLLM container

By default, InferyLLM serves at `0.0.0.0:8080`; this is configurable via the `--host` and `--port` flags.
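
Whichever address you serve on is the address your clients should target. For example, if the server is started with `--port 9000` on the local machine, the `LLMClient` shown in the [Generation](#generation) section is constructed like this:

```python
from infery_llm.client import LLMClient

# Point the client at the host/port the server was launched with
# (here: the default host 0.0.0.0, overridden to port 9000).
client = LLMClient("http://0.0.0.0:9000")
```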

**Serving with a container (recommended)**

Assuming you have pulled the container as shown in the [Installation](#pulling-the-inferyllm-container) section,
running the server is a simple one-liner. You can also use the container to query the serving CLI `help` for all
available serving flags and defaults:

**Note**: To mount models that have already been downloaded to the local machine, add a volume mount to the `docker run` command with `-v <host path>:<container path>`.

In the examples below, the `~/.cache/deci/` directory on the host is mounted to `/deci/` inside the container.
```bash
# Serve Deci/DeciLM-6b (from HF hub) on port 9000
docker run --runtime=nvidia -e INFERY_LLM_DECI_TOKEN=[DECI TOKEN] -v ~/.cache/deci/:/deci/ -p 9000:9000 deci.jfrog.io/deci-external-docker-local/infery-llm:[VERSION TAG] --model-name Deci/DeciLM-6b --port 9000

# Serve Mistral-7B-Instruct-v0.2 on port 9000. Set the max batch size to 16 and limit the maximum sequence length (input + generation) to 2048
docker run --runtime=nvidia -e INFERY_LLM_DECI_TOKEN=[DECI TOKEN] -v ~/.cache/deci/:/deci/ -p 9000:9000 deci.jfrog.io/deci-external-docker-local/infery-llm:[VERSION TAG] --model-name mistralai/Mistral-7B-Instruct-v0.2 --port 9000 --max-seq-length 2048 --max-batch-size 16

# See all serving CLI options and defaults
docker run --rm --runtime=nvidia -e INFERY_LLM_DECI_TOKEN=[DECI TOKEN] deci.jfrog.io/deci-external-docker-local/infery-llm:[VERSION TAG] --help
```

Note that a Hugging Face token may be passed as an environment variable (using the docker `-e` flag) or as a CLI parameter.

**Serving with a local entrypoint**

Assuming you have installed the `infery-llm` local serving requirements, you may use the InferyLLM CLI as a server entrypoint:
```bash
# Serve Deci/DeciLM-7b (from HF hub) on port 9000
infery-llm serve --model-name Deci/DeciLM-7b --port 9000

# See all serving options
infery-llm serve --help
```

## Generation

Assuming you have a running server listening at `0.0.0.0:9000`, you may submit generation requests in two ways:

1. Through InferyLLM's `LLMClient`:
```python
import asyncio
from infery_llm.client import LLMClient, GenerationParams

client = LLMClient("http://0.0.0.0:9000")

# Prepare GenerationParams (max_new_tokens, temperature, top_p, top_k, stop_tokens, ...)
gen_params = GenerationParams(max_new_tokens=50, top_p=0.95, top_k=100, temperature=0.1)

# Submit a single prompt and query the result (along with metadata in this case)
result = client.generate("Large language models are ", generation_params=gen_params, return_metadata=True)
print(f"Output: {result.output}.\nGenerated Tokens: {result.metadata[0]['generated_token_count']}")

# Submit a batch of prompts (a list of results will be returned this time)
prompts = ["A recipe for making spaghetti: ", "5 interesting facts about the President of France are: ", "Write a short story about a dog named Snoopy: "]
result = client.generate(prompts, generation_params=gen_params)
for output in result.outputs:
    print(f"Prompt: {output['prompt']}\nGeneration: {output['output']}")

# Use stop tokens
gen_params = GenerationParams(stop_str_tokens=[1524], stop_strs=["add tomatoes"], skip_special_tokens=True)
result = client.generate("A recipe for making spaghetti: ", generation_params=gen_params)

# Stream results
for text in client.generate("Will the real Slim Shady please ", generation_params=gen_params, stream=True):
    print(text, end="")

# Async generation is also supported from within async code:
async def example():
    result = await client.generate_async("AsyncIO is fun because ", generation_params=gen_params)
    print(result.output)

asyncio.run(example())
```

2. Through a `curl` command (assuming you have [cURL](https://curl.se/) installed):
```bash
curl -X POST http://0.0.0.0:9000/generate \
-H 'Content-Type: application/json' \
-d '{"prompts":["def factorial(n: int) -> int:"], "generation_params":{"max_new_tokens": 500, "temperature":0.5, "top_k":50, "top_p":0.8}, "stream":true}'
```
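
The `/generate` endpoint can of course be called from any HTTP client. Below is a minimal Python sketch using `requests` that mirrors the `curl` call above, but with `"stream": false` for simplicity; it prints the raw response body, since the response schema is not documented here and may differ between versions.

```python
# Sketch: call the /generate endpoint with the same payload as the curl example,
# but without streaming. Adapt the response handling to the server's actual schema.
import requests

payload = {
    "prompts": ["def factorial(n: int) -> int:"],
    "generation_params": {"max_new_tokens": 500, "temperature": 0.5, "top_k": 50, "top_p": 0.8},
    "stream": False,
}
resp = requests.post("http://0.0.0.0:9000/generate", json=payload, timeout=300)
resp.raise_for_status()
print(resp.text)
```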

## Advanced Usage

### CLI

InferyLLM and its CLI are rapidly accumulating features. For example, the `infery-llm` CLI already lets you `benchmark` models
with numerous configurations, `prepare` model artifacts ahead of serving in order to [cut down loading time](#lowering-loading-time-the-prepare-command),
and more. To see the available features, simply pass `--help` to the `infery-llm` CLI or any of its subcommands:

For container users:
```bash
# Query infery-llm CLI help menu
docker run --entrypoint infery-llm --runtime=nvidia deci.jfrog.io/deci-external-docker-local/infery-llm:latest --help

# Query the infery-llm CLI's `benchmark` subcommand help menu
docker run --entrypoint infery-llm --runtime=nvidia deci.jfrog.io/deci-external-docker-local/infery-llm:latest benchmark --help
```

For local installation users:
```bash
# Query infery-llm CLI help menu
infery-llm --help

# Query the infery-llm CLI's `benchmark` subcommand help menu
infery-llm benchmark --help
```

### Lowering loading time (the `prepare` command)

InferyLLM has its own internal format and per-model artifact requirements. While the required artifacts are automatically
generated by the `infery-llm serve` command, you can also generate them ahead of time with the `infery-llm prepare`
command, thus drastically cutting down server start-time.

For container users:
```bash
# Create artifacts for serving Deci/DeciCoder-1b and place the result in ~/infery_llm_model on the host machine
docker run --rm --entrypoint infery-llm -v ~/:/models --runtime=nvidia deci.jfrog.io/deci-external-docker-local/infery-llm:latest prepare --hf-model Deci/DeciCoder-1b --output-dir /models/infery_llm_model

# Now serve the created artifact (specifically here on port 9000)
docker run --runtime=nvidia -e INFERY_LLM_DECI_TOKEN=[DECI TOKEN] -p 9000:9000 -v ~/:/models deci.jfrog.io/deci-external-docker-local/infery-llm:latest --infery-model-dir /models/infery_llm_model --port 9000
```

For local installation users:
```bash
# Create artifacts for serving Deci/DeciCoder-1b and place the result in /models/infery_llm_model
infery-llm prepare --hf-model Deci/DeciCoder-1b --output-dir /models/infery_llm_model

# Now serve the created artifact (specifically here on port 9000)
infery-llm serve --infery-model-dir /models/infery_llm_model --port 9000
```

**Important note on caching:** Just like 🤗 caches downloaded model weights in `~/.cache/huggingface`, InferyLLM
caches the above artifacts in `~/.cache/deci` for every model it serves.
This means that relaunching a local server, or the same container (not just the same image), will
automatically lower the loading time.
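
A quick way to confirm that artifacts are being cached (and to see how much disk space they occupy) is sketched below; this is just a local inspection, not an InferyLLM command.

```python
# Sketch: inspect the InferyLLM artifact cache on the local machine.
from pathlib import Path

cache_dir = Path.home() / ".cache" / "deci"
if cache_dir.exists():
    total = sum(p.stat().st_size for p in cache_dir.rglob("*") if p.is_file())
    print(f"InferyLLM cache at {cache_dir}: {total / 1e9:.2f} GB")
else:
    print(f"No InferyLLM cache found at {cache_dir} yet")
```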

### Benchmarking
InferyLLM's `benchmark` CLI lets you test different combinations of serving parameters for your models:
batch sizes, sequence lengths, maximum generated tokens, and more. Performance also varies with the GPU model and its memory size.

Running `benchmark` when starting a new project should help you find the right combination for your model deployment.

The benchmark outputs the following metrics:
1. **E2E Time** - the time it took to process all of the requests that were sent
2. **E2E Throughput** - how many tokens per second were generated
3. **Mean Latency** - the mean latency of a single request

Below are some examples of benchmarking different models with several serving configurations.

**Benchmarking on an A10 GPU**

To use the `benchmark` CLI, you need a JSON file containing a list of prompts in the following format:

`[{"prompt": "Your prompt here..", "max_new_tokens": 512}, {"prompt": "Your second prompt..", "max_new_tokens": 256}, {"prompt": "Your 3rd prompt"}]`

In the next benchmarking example, let's compare Mistral-7B and DeciLM-7B with a batch size of 16 and sequence lengths of at most 1024 tokens (longer prompts are skipped rather than tested).
The prompts file is saved at `/prompts.json`, and the locally `prepare`d models at `/models/Mistral-7B` and `/models/DeciLM-7B`.

The test runs on an A10 GPU instance with 24GB of memory:

```bash
$ docker run --entrypoint infery-llm --runtime=nvidia -v /prompts.json:/prompts.json -v /models/:/models/ deci.jfrog.io/deci-external-docker-local/infery-llm:0.0.8 benchmark --infery-model-dir /models/DeciLM-7B --num-prompts 100 --prompts-json /prompts.json --verbose --max-batch-size 16 --input-seq-len=1024
INFO: 2024-04-11 08:21:46,272 - INFERY_LLM - Found autotune results (/models/DeciLM-7B/autotune_benchmarks.pkl)
Completed warmup requests: 100%|██████████| 1/1 [00:20<00:00, 20.51s/it]
Completed Requests: 100%|██████████| 100/100 [03:37<00:00, 2.18s/it]
INFO: 2024-04-11 08:25:45,044 - INFERY_LLM -
============================== BENCHMARK SUMMARY ==============================
PARAMETERS:
Num Prompts: 100
Input Tokens: 1025
Generated Tokens: 512
Max Batch Size: 16
Block Size: Model Config Default
RESULTS:
E2E Time: 163.709 [s]
E2E Throughput: 312.751 [Tokens/s]
Mean Latency: 1.837 [s]
===============================================================================
```
We can see that the mean latency of a single request is 1.837 seconds for this configuration.
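
As a sanity check, the reported E2E throughput is consistent with generating 512 tokens for each of the 100 prompts over the reported E2E time:

```python
# E2E throughput ≈ (num prompts * generated tokens per prompt) / E2E time
num_prompts, generated_tokens, e2e_time = 100, 512, 163.709
print(num_prompts * generated_tokens / e2e_time)  # ~312.75 tokens/s, matching the summary
```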

Now let's test Mistral-7B:

```bash
$ docker run --entrypoint infery-llm --runtime=nvidia -v /prompts.json:/prompts.json -v /models/:/models/ deci.jfrog.io/deci-external-docker-local/infery-llm:0.0.8 benchmark --infery-model-dir /models/Mistral-7B --num-prompts 100 --prompts-json /prompts.json --verbose --max-batch-size 16 --input-seq-len=1024
INFO: 2024-04-11 09:21:46,272 - INFERY_LLM - Found autotune results (/models/Mistral-7B/autotune_benchmarks.pkl)
Completed warmup requests: 100%|██████████| 1/1 [00:20<00:00, 20.51s/it]
Completed Requests: 100%|██████████| 100/100 [03:37<00:00, 2.18s/it]
INFO: 2024-04-11 09:25:45,044 - INFERY_LLM -
============================== BENCHMARK SUMMARY ==============================
PARAMETERS:
Num Prompts: 100
Input Tokens: 1025
Generated Tokens: 512
Max Batch Size: 16
Block Size: Model Config Default
RESULTS:
E2E Time: 217.893 [s]
E2E Throughput: 234.977 [Tokens/s]
Mean Latency: 2.706 [s]
===============================================================================
```
With a batch size of 16 and a sequence length of 1024, Mistral-7B reached roughly 235 tokens per second.

In the exact same setup, **DeciLM-7B is roughly 1.5x faster than Mistral-7B** in mean latency and delivers **~33% higher throughput**.
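
Both figures follow directly from the two benchmark summaries:

```python
# DeciLM-7B vs. Mistral-7B at the same max batch size (16) and input length (1024)
decilm_latency, mistral_latency = 1.837, 2.706  # mean latency [s]
decilm_tps, mistral_tps = 312.751, 234.977      # E2E throughput [tokens/s]

print(f"Latency speedup:     {mistral_latency / decilm_latency:.2f}x")      # ~1.47x
print(f"Throughput increase: {(decilm_tps / mistral_tps - 1) * 100:.0f}%")  # ~33%
```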

What happens to Mistral-7B if the sequences are much longer (2048 tokens) but we use a smaller batch size (8)?
```bash
$ docker run --entrypoint infery-llm --runtime=nvidia -v /prompts.json:/prompts.json -v /models/:/models/ deci.jfrog.io/deci-external-docker-local/infery-llm:0.0.8 benchmark --infery-model-dir /models/Mistral-7B --num-prompts 100 --prompts-json /prompts.json --verbose --max-batch-size 8 --input-seq-len=2048
INFO: 2024-04-11 09:16:46,272 - INFERY_LLM - Found autotune results (/models/Mistral-7B/autotune_benchmarks.pkl)
Completed warmup requests: 100%|██████████| 1/1 [00:20<00:00, 20.51s/it]
Completed Requests: 100%|██████████| 100/100 [03:37<00:00, 2.18s/it]
INFO: 2024-04-11 09:20:45,044 - INFERY_LLM -
============================== BENCHMARK SUMMARY ==============================
PARAMETERS:
Num Prompts: 100
Input Tokens: 1501
Generated Tokens: 512
Max Batch Size: 8
Block Size: Model Config Default
RESULTS:
E2E Time: 330.644 [s]
E2E Throughput: 154.849 [Tokens/s]
Mean Latency: 3.306 [s]
===============================================================================
```
We can see that the total time was longer than before, and that the mean latency increased by roughly 0.6 seconds.