https://github.com/dwarvesf/llm-hosting
This repository is designed for deploying and managing server processes that handle embeddings using the Infinity Embedding model or Large Language Models with an OpenAI compatible vLLM server.
https://github.com/dwarvesf/llm-hosting
Last synced: 26 days ago
JSON representation
This repository is designed for deploying and managing server processes that handle embeddings using the Infinity Embedding model or Large Language Models with an OpenAI compatible vLLM server.
- Host: GitHub
- URL: https://github.com/dwarvesf/llm-hosting
- Owner: dwarvesf
- Created: 2024-04-23T00:19:53.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-05-08T10:31:36.000Z (about 2 years ago)
- Last Synced: 2024-05-09T04:31:00.183Z (about 2 years ago)
- Language: Python
- Size: 57.6 KB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-modal - llm-hosting - compatible vLLM serving with Modal | (LLMs/Multimodal Models)
README
## Overview
This repository is designed for deploying and managing server processes that handle embeddings using the Infinity Embedding model or Large Language Models with an OpenAI compatible vLLM server using Modal.
## Key Components
1. **vllm_llama_70b.py, vllm_deepseek_coder_33b.py, vllm_llama3-8b.py, vllm_seallm_7b_v2_5.py, vllm_sqlcoder_7b_2.py, vllm_duckdb_nsql_7b.py, vllm_codeqwen_110b_v1_5.py**
- These scripts contain the function `openai_compatible_server()` which initiates an OpenAI compatible vLLM server by running a command that instantiates an OpenAI compatible FastAPI server..
- The `BASE_MODEL` variable appears to define the model path for the embedding tool, which is not shown but can be inferred from the context.
2. **infinity_mxbai_embed_large_v1.py, infinity_mxbai_rerank_large_v1.py, infinity_snowflake_arctic_embed_l_335m.py**
- These scripts contain the function `infinity_embeddings_server()` which initiates the Infinity Embed server by running a command that utilizes the Infinity embedding tool with specified options (like CUDA device and Torch engine).
- The `BASE_MODEL` variable appears to define the model path for the embedding tool, which is not shown but can be inferred from the context.
3. **devbox.json**
- This configuration file specifies the programming environment for the repository, including versions of Python, Pip, and Node.js.
- It also defines shell initialization hooks like activating a Python virtual environment and installing necessary Python packages, among other administration scripts.
4. **.env.example**
- This file template shows environment variables that are likely necessary for the project to run (e.g., API keys for Infinity API and VLLM API).
## Prerequisites
Before diving into the project setup, make sure to:
- [Have Devbox installed](https://www.jetify.com/devbox/docs/installing_devbox/), as it manages the development and operation environment for this project.
- Set up necessary API keys by copying `.env.example` to `.env` and filling in the required values for `INFINITY_API_KEY` and `VLLM_API_KEY`.
## Environment Setup
1. **Initializing Development Environment with Devbox:**
- Enter the Devbox shell environment by running:
```bash
devbox shell
```
- This action will set up the environment according to the `init_hook` specified in `devbox.json`, which activates the Python virtual environment and installs the required packages.
## Deployment
The scripts available in the repository can be deployed using the [Modal](https://modal.com/docs/examples/hello_world) tool. Deploy a script by running the corresponding command:
```bash
modal deploy infinity_mxbai_embed_large_v1.py
modal deploy infinity_mxbai_rerank_large_v1.py
modal deploy infinity_snowflake_arctic_embed_l_335m.py
modal deploy vllm_llama3_70b.py
modal deploy vllm_deepseek_coder_33b.py
modal deploy vllm_llama3-8b.py
modal deploy vllm_seallm_7b_v2_5.py
modal deploy vllm_sqlcoder_7b_2.py
modal deploy vllm_duckdb_nsql_7b.py
modal deploy vllm_codeqwen_110b_v1_5.py
```
Each command will deploy the respective script, launching the Infinity embeddings server or an OpenAI compatible vLLM server configured per the script's specifications.
## Inference
Expect cold starts between 30s and 1 minute with Modal. Both the vLLM and Infinity servers take in an API key, specified in your `.env` file. You can use this to make requests for inference on these models:
**Querying LLMs**:
```bash
time curl \
-H "Content-Type: application/json" \
-H "Authorization: Bearer " \
-d '{
"model": "TheBloke/deepseek-coder-33B-instruct-AWQ",
"messages": [
{
"role": "user",
"content": "Write me a python snake game."
}
],
"temperature": 0,
"max_tokens": 1024
}'
```
**Querying Embeddings**:
```bash
time curl \
-H "Content-Type: application/json" \
-H "Authorization: Bearer " \
-d '{
"model": "Snowflake/snowflake-arctic-embed-l",
"input": ["The quick brown fox jumps over the lazy dog."]
}'
```
**Querying Rerankings**:
```bash
time curl -X 'POST' \
\
-H 'accept: application/json' \
-H "Authorization: Bearer " \
-H 'Content-Type: application/json' \
-d '{
"model": "mixedbread-ai/mxbai-rerank-large-v1",
"query": "What is the python package infinity_emb?",
"documents": [
"This is a document not related to the python package infinity_emb, hence...",
"Paris is in France!",
"infinity_emb is a package for sentence embeddings and rerankings using transformer models in Python!"
],
"return_documents": true
}'
```