{"id":25971271,"url":"https://github.com/ray-project/aviary","last_synced_at":"2026-01-12T02:49:15.431Z","repository":{"id":171404019,"uuid":"647848107","full_name":"ray-project/ray-llm","owner":"ray-project","description":"RayLLM - LLMs on Ray","archived":true,"fork":false,"pushed_at":"2024-05-28T19:31:51.000Z","size":2073,"stargazers_count":1230,"open_issues_count":65,"forks_count":94,"subscribers_count":20,"default_branch":"master","last_synced_at":"2024-10-29T15:39:50.523Z","etag":null,"topics":["distributed-systems","large-language-models","llm","llm-inference","llm-serving","llmops","ray","serving","transformers"],"latest_commit_sha":null,"homepage":"https://aviary.anyscale.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ray-project.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-31T16:42:19.000Z","updated_at":"2024-10-29T04:23:34.000Z","dependencies_parsed_at":"2023-10-10T20:16:43.460Z","dependency_job_id":"d73b20f9-a7d5-4ac2-b851-f0c9117ecd63","html_url":"https://github.com/ray-project/ray-llm","commit_stats":null,"previous_names":["ray-project/aviary","ray-project/ray-llm"],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ray-project%2Fray-llm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ray-project%2Fray-llm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ray-project%2Fray-llm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/ray-project%2Fray-llm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ray-project","download_url":"https://codeload.github.com/ray-project/ray-llm/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241940550,"owners_count":20045881,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["distributed-systems","large-language-models","llm","llm-inference","llm-serving","llmops","ray","serving","transformers"],"created_at":"2025-03-05T00:01:08.206Z","updated_at":"2025-03-05T00:01:40.208Z","avatar_url":"https://github.com/ray-project.png","language":"Python","readme":"============================\n# Archiving Ray LLM\n\nWe started RayLLM to simplify setting up and deploying LLMs on top of Ray Serve. In the past few months, vLLM has made significant improvements in ease of use. We are archiving the RayLLM project and instead adding some examples to our [Ray Serve docs](https://docs.ray.io/en/master/serve/tutorials/vllm-example.html) for deploying LLMs with Ray Serve and vLLM. This reduces the number of libraries the community has to learn and greatly simplifies the workflow for serving LLMs at scale.
We also recently launched [Hosted Anyscale](https://www.anyscale.com/), where you can serve LLMs with Ray Serve and get additional capabilities out of the box, such as multi-LoRA with Serve multiplexing, JSON mode, function calling, and further performance enhancements.\n\n\n============================\n# RayLLM - LLMs on Ray\n\nThe hosted Aviary Explorer is no longer available.\nVisit [Anyscale](https://endpoints.anyscale.com) to experience models served with RayLLM.\n\n[![Build status](https://badge.buildkite.com/d6d7af987d1db222827099a953410c4e212b32e8199ca513be.svg?branch=master)](https://buildkite.com/anyscale/aviary-docker)\n\nRayLLM (formerly known as Aviary) is an LLM serving solution that makes it easy to deploy and manage\na variety of open source LLMs, built on [Ray Serve](https://docs.ray.io/en/latest/serve/index.html). It does this by:\n\n- Providing an extensive suite of pre-configured open source LLMs, with defaults that work out of the box.\n- Supporting Transformer models hosted on [Hugging Face Hub](http://hf.co) or present on local disk.\n- Simplifying the deployment of multiple LLMs.\n- Simplifying the addition of new LLMs.\n- Offering unique autoscaling support, including scale-to-zero.\n- Fully supporting multi-GPU \u0026 multi-node model deployments.\n- Offering high-performance features like continuous batching, quantization, and streaming.\n- Providing a REST API similar to OpenAI's, making it easy to migrate and cross-test them.\n- Supporting multiple LLM backends out of the box, including [vLLM](https://github.com/vllm-project/vllm) and [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).\n\nIn addition to LLM serving, it also includes a CLI and a web frontend (Aviary Explorer) that you can use to compare the outputs of different models directly, rank them by quality, get a cost and latency estimate, and more.\n\nRayLLM supports continuous batching and quantization by integrating with [vLLM](https://github.com/vllm-project/vllm).
Continuous batching allows you to get much better throughput and latency than static batching. Quantization allows you to deploy compressed models with cheaper hardware requirements and lower inference costs. See the [quantization guide](models/continuous_batching/quantization/README.md) for more details on running quantized models on RayLLM.\n\nRayLLM leverages [Ray Serve](https://docs.ray.io/en/latest/serve/index.html), which has native support for autoscaling\nand multi-node deployments. RayLLM can scale to zero and create\nnew model replicas (each composed of multiple GPU workers) in response to demand.\n\n# Getting started\n\n## Deploying RayLLM\n\nThe guide below walks you through the steps required to deploy RayLLM on Ray Serve.\n\n### Locally\n\nWe highly recommend using the official `anyscale/ray-llm` Docker image to run RayLLM. Manually installing RayLLM is currently not a supported use case due to the specific dependencies required, some of which are not available on pip.\n\n```shell\ncache_dir=${XDG_CACHE_HOME:-$HOME/.cache}\n\ndocker run -it --gpus all --shm-size 1g -p 8000:8000 -e HF_HOME=~/data -v $cache_dir:~/data anyscale/ray-llm:latest bash\n# Inside docker container\nserve run ~/serve_configs/amazon--LightGPT.yaml\n```\n\n### On a Ray Cluster\n\nRayLLM uses Ray Serve, so it can be deployed on Ray Clusters.\n\nCurrently, we only have a guide and pre-configured YAML file for AWS deployments.\n**Make sure you have exported your AWS credentials locally.**\n\n```bash\nexport AWS_ACCESS_KEY_ID=...\nexport AWS_SECRET_ACCESS_KEY=...\nexport AWS_SESSION_TOKEN=...\n```\n\nStart by cloning this repo to your local machine.\n\nYou may need to specify your AWS private key in the `deploy/ray/rayllm-cluster.yaml` file.\nSee the [Ray on Cloud VMs](https://docs.ray.io/en/latest/cluster/vms/index.html) page in\nthe Ray documentation for more details.\n\n```shell\ngit clone https://github.com/ray-project/ray-llm.git\ncd ray-llm\n\n# Start a Ray Cluster (This will take a few
minutes to start up)\nray up deploy/ray/rayllm-cluster.yaml\n```\n\n#### Connect to your Cluster\n\n```shell\n# Connect to the Head node of your Ray Cluster (This will take several minutes to autoscale)\nray attach deploy/ray/rayllm-cluster.yaml\n\n# Deploy the LightGPT model.\nserve run serve_configs/amazon--LightGPT.yaml\n```\n\nYou can deploy any model in the `models` directory of this repo,\nor define your own model YAML file and run that instead.\n\n### On Kubernetes\n\nFor Kubernetes deployments, please see our documentation for [deploying on KubeRay](https://github.com/ray-project/ray-llm/tree/master/docs/kuberay).\n\n## Query your models\n\nOnce the models are deployed, you can install a client outside of the Docker container to query the backend.\n\n```shell\npip install \"rayllm @ git+https://github.com/ray-project/ray-llm.git\"\n```\n\nYou can query your RayLLM deployment in many ways.\n\nIn all cases, start by doing:\n\n```shell\nexport ENDPOINT_URL=\"http://localhost:8000/v1\"\n```\n\nThis is because your deployment is running locally, but you can also access remote deployments (in which case you would set `ENDPOINT_URL` to a remote URL).\n\n### Using curl\n\nYou can use curl at the command line to query your deployed LLM:\n\n```shell\ncurl $ENDPOINT_URL/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"meta-llama/Llama-2-7b-chat-hf\",\n    \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"Hello!\"}],\n    \"temperature\": 0.7\n  }'\n```\n\n```text\n{\n  \"id\":\"meta-llama/Llama-2-7b-chat-hf-308fc81f-746e-4682-af70-05d35b2ee17d\",\n  \"object\":\"text_completion\",\"created\":1694809775,\n  \"model\":\"meta-llama/Llama-2-7b-chat-hf\",\n  \"choices\":[\n    {\n      \"message\":\n        {\n          \"role\":\"assistant\",\n          \"content\":\"Hello there! *adjusts glasses* It's a pleasure to meet you!
Is there anything I can help you with today? Have you got a question or a task you'd like me to assist you with? Just let me know!\"\n        },\n      \"index\":0,\n      \"finish_reason\":\"stop\"\n    }\n  ],\n  \"usage\":{\"prompt_tokens\":30,\"completion_tokens\":53,\"total_tokens\":83}}\n```\n\n### Connecting directly over Python\n\nUse the `requests` library to connect with Python. Use this script to receive a streamed response, automatically parse the outputs, and print just the content.\n\n```python\nimport os\nimport json\nimport requests\n\ns = requests.Session()\n\napi_base = os.getenv(\"ENDPOINT_URL\")\nurl = f\"{api_base}/chat/completions\"\nbody = {\n  \"model\": \"meta-llama/Llama-2-7b-chat-hf\",\n  \"messages\": [\n    {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n    {\"role\": \"user\", \"content\": \"Tell me a long story with many words.\"}\n  ],\n  \"temperature\": 0.7,\n  \"stream\": True,\n}\n\nwith s.post(url, json=body, stream=True) as response:\n    for chunk in response.iter_lines(decode_unicode=True):\n        if chunk is not None:\n            try:\n                # Get data from response chunk\n                chunk_data = chunk.split(\"data: \")[1]\n\n                # Get message choices from data\n                choices = json.loads(chunk_data)[\"choices\"]\n\n                # Pick content from first choice\n                content = choices[0][\"delta\"][\"content\"]\n\n                print(content, end=\"\", flush=True)\n            except json.decoder.JSONDecodeError:\n                # Chunk was not formatted as expected\n                pass\n            except (KeyError, IndexError):\n                # No message was contained in the chunk\n                pass\n    print(\"\")\n```\n\n### Using the OpenAI SDK\n\nRayLLM uses an OpenAI-compatible API, allowing us to use the OpenAI\nSDK to access our deployments.
To do so, we need to set the `OPENAI_API_BASE` env var.\n\n```shell\nexport OPENAI_API_BASE=http://localhost:8000/v1\nexport OPENAI_API_KEY='not_a_real_key'\n```\n\n```python\nimport openai\n\n# List all models.\nmodels = openai.Model.list()\nprint(models)\n\n# Note: not all arguments are currently supported and will be ignored by the backend.\nchat_completion = openai.ChatCompletion.create(\n    model=\"meta-llama/Llama-2-7b-chat-hf\",\n    messages=[\n      {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n      {\"role\": \"user\", \"content\": \"Say 'test'.\"}\n    ],\n    temperature=0.7\n)\nprint(chat_completion)\n```\n\n# RayLLM Reference\n\n## Installing RayLLM\n\nTo install RayLLM and its dependencies, run the following command:\n\n```shell\npip install \"rayllm @ git+https://github.com/ray-project/ray-llm.git\"\n```\n\nRayLLM consists of a set of configurations and utilities for deploying LLMs on Ray Serve,\nin addition to a frontend (Aviary Explorer), both of which come with additional\ndependencies. To install the dependencies for the frontend, run the following command:\n\n```shell\npip install \"rayllm[frontend] @ git+https://github.com/ray-project/ray-llm.git\"\n```\n\nThe backend dependencies are heavyweight and quite large. We recommend using the official\n`anyscale/ray-llm` image.
Installing the backend manually is not a supported use case.\n\n### Usage stats collection\n\nRay collects basic, non-identifiable usage statistics to help us improve the project.\nFor more information on what is collected and how to opt out, see the\n[Usage Stats Collection](https://docs.ray.io/en/latest/cluster/usage-stats.html) page in\nthe Ray documentation.\n\n## Using RayLLM through the CLI\n\nRayLLM uses the Ray Serve CLI, which allows you to interact with deployed models.\n\n```shell\n# Start a new model in Ray Serve from provided configuration\nserve run serve_configs/\u003cmodel_config_path\u003e\n\n# Get the status of the running deployments\nserve status\n\n# Get the current config of live Serve applications\nserve config\n\n# Shutdown all Serve applications\nserve shutdown\n```\n\n## RayLLM Model Registry\n\nYou can easily add new models by adding two configuration files.\nTo learn more about how to customize or add new models,\nsee the [Model Registry](models/README.md).\n\n# Frequently Asked Questions\n\n## How do I add a new model?\n\nThe easiest way is to copy an existing model's YAML configuration file and modify it. See models/README.md for more details.\n\n## How do I deploy multiple models at once?\n\nRun multiple models at once by aggregating the Serve configs for different models into a single, unified config. For example, use this config to run the `LightGPT` and `Llama-2-7b-chat` models in a single Serve application:\n\n```yaml\n# File name: serve_configs/config.yaml\n\napplications:\n- name: router\n  import_path: rayllm.backend:router_application\n  route_prefix: /\n  args:\n    models:\n      - ./models/continuous_batching/amazon--LightGPT.yaml\n      - ./models/continuous_batching/meta-llama--Llama-2-7b-chat-hf.yaml\n```\n\nThe config includes both models in the `models` argument for the `router`. Additionally, the Serve configs for both model applications are included.
Save this unified config file to the `serve_configs/` folder.\n\nRun the config to deploy the models:\n\n```shell\nserve run serve_configs/\u003cconfig.yaml\u003e\n```\n\n## How do I deploy a model to multiple nodes?\n\nAll of our default model configurations require a model to be deployed on a single node for high performance. However, you can easily change this if you want to deploy a model across nodes for lower cost or GPU availability. To do that, go to the YAML file in the model registry and change `placement_strategy` to `PACK` instead of `STRICT_PACK`.\n\n## My deployment isn't starting/working correctly, how can I debug?\n\nThere can be several reasons for the deployment not starting or not working correctly. Here are some things to check:\n\n1. You might have specified an invalid model ID.\n2. Your model may require resources that are not available on the cluster. A common issue is that the model requires Ray custom resources (e.g. `accelerator_type_a10`) in order to be scheduled on the right node type, while your cluster is missing those custom resources. You can either modify the model configuration to remove those custom resources or, better yet, add them to the node configuration of your Ray cluster. You can debug this issue by looking at Ray Autoscaler logs ([monitor.log](https://docs.ray.io/en/latest/ray-observability/user-guides/configure-logging.html#system-component-logs)).\n3. Your model is a gated Hugging Face model (e.g. meta-llama). In that case, you need to set the `HUGGING_FACE_HUB_TOKEN` environment variable cluster-wide. You can do that either in the Ray cluster configuration or by setting it before running `serve run`.\n4. Your model may be running out of memory. You can usually spot this issue by looking for keywords related to \"CUDA\", \"memory\" and \"NCCL\" in the replica logs or `serve run` output. In that case, consider reducing the `max_batch_prefill_tokens` and `max_batch_total_tokens` (if applicable).
See models/README.md for more information on those parameters.\n\nIn general, the [Ray Dashboard](https://docs.ray.io/en/latest/serve/monitoring.html#ray-dashboard) is a useful debugging tool, letting you monitor your Ray Serve / LLM application and access Ray logs.\n\nA good sanity check is deploying the test model in tests/models/. If that works, you know you can deploy _a_ model.\n\n## How do I write a program that accesses both OpenAI and your hosted model at the same time?\n\nThe OpenAI `create()` commands allow you to specify the `API_KEY` and `API_BASE`, so you can do something like this:\n\n```python\n# Call your self-hosted model running on the local host:\nopenai.ChatCompletion.create(api_base=\"http://localhost:8000/v1\", api_key=\"\", ...)\n\n# Call OpenAI. Set OPENAI_API_KEY to your key and unset OPENAI_API_BASE\nopenai.ChatCompletion.create(api_key=\"OPENAI_API_KEY\", ...)\n```\n\n## Getting Help and Filing Bugs / Feature Requests\n\nWe are eager to help you get started with RayLLM. You can get help:\n\n- Via Slack -- fill in [this form](https://docs.google.com/forms/d/e/1FAIpQLSfAcoiLCHOguOm8e7Jnn-JJdZaCxPGjgVCvFijHB5PLaQLeig/viewform) to sign up.\n- Via [Discuss](https://discuss.ray.io/c/llms-generative-ai/27).\n\nFor bugs or feature requests, please submit them [here](https://github.com/ray-project/ray-llm/issues/new).\n\n## Contributions\n\nWe are also interested in accepting contributions.
These could be anything from a new evaluator to a YAML file integrating a new model, and more.\nFeel free to post an issue to get our feedback on a proposal first, or just file a PR and we commit to giving you prompt feedback.\n\nWe use `pre-commit` hooks to ensure that all code is formatted correctly.\nMake sure to `pip install pre-commit` and then run `pre-commit install`.\nYou can also run `./format` to run the hooks manually.\n","funding_links":[],"categories":["Datasets or Benchmarks","Models and Projects"],"sub_categories":["General","Ray + LLM"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fray-project%2Faviary","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fray-project%2Faviary","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fray-project%2Faviary/lists"}