https://github.com/replicate/cog-vllm

Run LLMs on Replicate with vLLM
https://github.com/replicate/cog-vllm

Last synced: 11 months ago
JSON representation

Run LLMs on Replicate with vLLM

Host: GitHub
URL: https://github.com/replicate/cog-vllm
Owner: replicate
Created: 2023-07-20T21:09:07.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-10-11T18:15:38.000Z (over 1 year ago)
Last Synced: 2025-05-01T04:46:38.417Z (11 months ago)
Language: Python
Homepage:
Size: 186 KB
Stars: 18
Watchers: 13
Forks: 4
Open Issues: 1
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# Cog-vLLM: Run vLLM on Replicate

[Cog](https://github.com/replicate/cog)
is an open-source tool that lets you package machine learning models
in a standard, production-ready container.
vLLM is a fast and easy-to-use library for LLM inference and serving.

You can deploy your packaged model to your own infrastructure,
or to [Replicate].

## Highlights

* 🚀 **Run vLLM in the cloud with an API**.
Deploy any [vLLM-supported language model] at scale on Replicate.

* 🏭 **Support multiple concurrent requests**.
Continuous batching works out of the box.

* 🐢 **Open Source, all the way down**.
Look inside, take it apart, make it do exactly what you need.

## Quickstart

Go to [replicate.com/replicate/vllm](https://replicate.com/replicate/vllm)
and create a new vLLM model from a [supported Hugging Face repo][vLLM-supported language model],
such as [google/gemma-2b](https://huggingface.co/google/gemma-2b)

> [!IMPORTANT]
> Gated models require a [Hugging Face API token](https://huggingface.co/settings/tokens),
> which you can set in the `hf_token` field of the model creation form.

Create a new vLLM model on Replicate

Replicate downloads the model files, packages them into a `.tar` archive,
and pushes a new version of your model that's ready to use.

Trained vLLM model on Replicate

From here, you can either use your model as-is,
or customize it and push up your changes.

## Local Development

If you're on a machine or VM with a GPU,
you can try out changes before pushing them to Replicate.

Start by [installing or upgrading Cog](https://cog.run/#install).
You'll need Cog [v0.10.0-alpha11](https://github.com/replicate/cog/releases/tag/v0.10.0-alpha11):

```console
$ sudo curl -o /usr/local/bin/cog -L "https://github.com/replicate/cog/releases/download/v0.10.0-alpha11/cog_$(uname -s)_$(uname -m)"
$ sudo chmod +x /usr/local/bin/cog
```

Then clone this repository:

```console
$ git clone https://github.com/replicate/cog-vllm
$ cd cog-vllm
```

Go to the [Replicate dashboard](https://replicate.com/trainings) and
navigate to the training for your vLLM model.
From that page, copy the weights URL from the Download weights button.

Copy weights URL from Replicate training

Set the `COG_WEIGHTS` environment variable with that copied value:

```console
$ export COG_WEIGHTS="..."
```

Now, make your first prediction against the model locally:

```console
$ cog predict -e "COG_WEIGHTS=$COG_WEIGHTS" \
-i prompt="Hello!"
```

The first time you run this command,
Cog downloads the model weights and save them to the `models` subdirectory.

To make multiple predictions,
start up the HTTP server and send it `POST /predictions` requests.

```console
# Start the HTTP server
$ cog run -p 5000 -e "COG_WEIGHTS=$COG_WEIGHTS" python -m cog.server.http

# In a different terminal session, send requests to the server
$ curl http://localhost:5000/predictions -X POST \
-H 'Content-Type: application/json' \
-d '{"input": {"prompt": "Hello!"}}'
```

When you're finished working,
you can push your changes to Replicate.

Grab your token from [replicate.com/account](https://replicate.com/account)
and set it as an environment variable:

```shell
export REPLICATE_API_TOKEN=
```

```console
$ echo $REPLICATE_API_TOKEN | cog login --token-stdin
$ cog push r8.im//
--> ...
--> Pushing image 'r8.im/...'
```

After you push your model, you can try running it on Replicate.

Install the [Replicate Python SDK][replicate-python]:

```console
$ pip install replicate
```

Create a prediction and stream its output:

```python
import replicate

model = replicate.models.get("/")
prediction = replicate.predictions.create(
version=model.latest_version,
input={ "prompt": "Hello" },
stream=True
)

for event in prediction.stream():
print(str(event), end="")
```

[Replicate]: https://replicate.com
[vLLM-supported language model]: https://docs.vllm.ai/en/latest/models/supported_models.html
[replicate-python]: https://github.com/replicate/replicate-python

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/replicate/cog-vllm

Awesome Lists containing this project

README