{"id":13451143,"url":"https://github.com/predibase/lorax","last_synced_at":"2025-05-12T20:50:48.991Z","repository":{"id":207508756,"uuid":"707818217","full_name":"predibase/lorax","owner":"predibase","description":"Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs","archived":false,"fork":false,"pushed_at":"2025-05-08T20:39:43.000Z","size":7259,"stargazers_count":2972,"open_issues_count":171,"forks_count":211,"subscribers_count":34,"default_branch":"main","last_synced_at":"2025-05-11T01:35:18.395Z","etag":null,"topics":["fine-tuning","gpt","llama","llm","llm-inference","llm-serving","llmops","lora","model-serving","pytorch","transformers"],"latest_commit_sha":null,"homepage":"https://loraexchange.ai","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/predibase.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-10-20T18:19:49.000Z","updated_at":"2025-05-10T18:22:08.000Z","dependencies_parsed_at":"2023-11-24T23:24:47.447Z","dependency_job_id":"44a2a760-2bcf-43fb-ab0d-dddb9dcdc1b3","html_url":"https://github.com/predibase/lorax","commit_stats":{"total_commits":814,"total_committers":64,"mean_commits":12.71875,"dds":0.714987714987715,"last_synced_commit":"373c3e62204e5d050e69abe394e8083b1e8ca989"},"previous_names":["predibase/lorax"],"tags_count":23,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/predibase%2Florax","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/predibase%2Florax/tags"
,"releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/predibase%2Florax/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/predibase%2Florax/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/predibase","download_url":"https://codeload.github.com/predibase/lorax/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253672642,"owners_count":21945479,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fine-tuning","gpt","llama","llm","llm-inference","llm-serving","llmops","lora","model-serving","pytorch","transformers"],"created_at":"2024-07-31T07:00:48.865Z","updated_at":"2025-05-12T20:50:48.971Z","avatar_url":"https://github.com/predibase.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/predibase/lorax\"\u003e\n    \u003cimg src=\"docs/LoRAX_Main_Logo-Orange.png\" alt=\"LoRAX Logo\" style=\"width:200px;\" /\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n_LoRAX: Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs_\n\n[![](https://dcbadge.vercel.app/api/server/CBgdrGnZjy?style=flat\u0026theme=discord-inverted)](https://discord.gg/CBgdrGnZjy)\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/predibase/lorax/blob/master/LICENSE)\n[![Artifact 
Hub](https://img.shields.io/endpoint?url=https://artifacthub.io/badge/repository/lorax)](https://artifacthub.io/packages/search?repo=lorax)\n\n\u003c/div\u003e\n\nLoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency.\n\n## 📖 Table of contents\n\n- [📖 Table of contents](#-table-of-contents)\n- [🌳 Features](#-features)\n- [🏠 Models](#-models)\n- [🏃‍♂️ Getting Started](#️-getting-started)\n  - [Requirements](#requirements)\n  - [Launch LoRAX Server](#launch-lorax-server)\n  - [Prompt via REST API](#prompt-via-rest-api)\n  - [Prompt via Python Client](#prompt-via-python-client)\n  - [Chat via OpenAI API](#chat-via-openai-api)\n  - [Next steps](#next-steps)\n- [🙇 Acknowledgements](#-acknowledgements)\n- [🗺️ Roadmap](#️-roadmap)\n\n## 🌳 Features\n\n- 🚅 **Dynamic Adapter Loading:** include any fine-tuned LoRA adapter from [HuggingFace](https://predibase.github.io/lorax/models/adapters/#huggingface-hub), [Predibase](https://predibase.github.io/lorax/models/adapters/#predibase), or [any filesystem](https://predibase.github.io/lorax/models/adapters/#local) in your request, it will be loaded just-in-time without blocking concurrent requests. 
[Merge adapters](https://predibase.github.io/lorax/guides/merging_adapters/) per request to instantly create powerful ensembles.\n- 🏋️‍♀️ **Heterogeneous Continuous Batching:** packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.\n- 🧁 **Adapter Exchange Scheduling:** asynchronously prefetches and offloads adapters between GPU and CPU memory, and schedules request batching to optimize the aggregate throughput of the system.\n- 👬 **Optimized Inference:** high throughput and low latency optimizations including tensor parallelism, pre-compiled CUDA kernels ([flash-attention](https://arxiv.org/abs/2307.08691), [paged attention](https://arxiv.org/abs/2309.06180), [SGMV](https://arxiv.org/abs/2310.18547)), quantization, and token streaming.\n- 🚢 **Ready for Production:** prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with OpenTelemetry. OpenAI-compatible API supporting multi-turn chat conversations. Private adapters through per-request tenant isolation. [Structured Output](https://predibase.github.io/lorax/guides/structured_output) (JSON mode).\n- 🤯 **Free for Commercial Use:** Apache 2.0 License. 
Enough said 😎.\n\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/predibase/lorax/assets/29719151/f88aa16c-66de-45ad-ad40-01a7874ed8a9\" /\u003e\n\u003c/p\u003e\n\n\n## 🏠 Models\n\nServing a fine-tuned model with LoRAX consists of two components:\n\n- [Base Model](https://predibase.github.io/lorax/models/base_models): pretrained large model shared across all adapters.\n- [Adapter](https://predibase.github.io/lorax/models/adapters): task-specific adapter weights dynamically loaded per request.\n\nLoRAX supports a number of Large Language Models as the base model including [Llama](https://huggingface.co/meta-llama) (including [CodeLlama](https://huggingface.co/codellama)), [Mistral](https://huggingface.co/mistralai) (including [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta)), and [Qwen](https://huggingface.co/Qwen). See [Supported Architectures](https://predibase.github.io/lorax/models/base_models/#supported-architectures) for a complete list of supported base models. \n\nBase models can be loaded in fp16 or quantized with `bitsandbytes`, [GPT-Q](https://arxiv.org/abs/2210.17323), or [AWQ](https://arxiv.org/abs/2306.00978).\n\nSupported adapters include LoRA adapters trained using the [PEFT](https://github.com/huggingface/peft) and [Ludwig](https://ludwig.ai/) libraries. 
Any of the linear layers in the model can be adapted via LoRA and loaded in LoRAX.\n\n## 🏃‍♂️ Getting Started\n\nWe recommend starting with our pre-built Docker image to avoid compiling custom CUDA kernels and other dependencies.\n\n### Requirements\n\nThe minimum system requirements needed to run LoRAX include:\n\n- Nvidia GPU (Ampere generation or above)\n- Device drivers compatible with CUDA 11.8 or above\n- Linux OS\n- Docker (for this guide)\n\n### Launch LoRAX Server\n\n#### Prerequisites\nInstall [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html), then run:\n - `sudo systemctl daemon-reload`\n - `sudo systemctl restart docker`\n\n```shell\nmodel=mistralai/Mistral-7B-Instruct-v0.1\nvolume=$PWD/data\n\ndocker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \\\n    ghcr.io/predibase/lorax:main --model-id $model\n```\n\nFor a full tutorial including token streaming and the Python client, see [Getting Started - Docker](https://predibase.github.io/lorax/getting_started/docker).\n\n### Prompt via REST API\n\nPrompt base LLM:\n\n```shell\ncurl 127.0.0.1:8080/generate \\\n    -X POST \\\n    -d '{\n        \"inputs\": \"[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]\",\n        \"parameters\": {\n            \"max_new_tokens\": 64\n        }\n    }' \\\n    -H 'Content-Type: application/json'\n```\n\nPrompt a LoRA adapter:\n\n```shell\ncurl 127.0.0.1:8080/generate \\\n    -X POST \\\n    -d '{\n        \"inputs\": \"[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? 
[/INST]\",\n        \"parameters\": {\n            \"max_new_tokens\": 64,\n            \"adapter_id\": \"vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k\"\n        }\n    }' \\\n    -H 'Content-Type: application/json'\n```\n\nSee [Reference - REST API](https://predibase.github.io/lorax/reference/rest_api) for full details.\n\n### Prompt via Python Client\n\nInstall:\n\n```shell\npip install lorax-client\n```\n\nRun:\n\n```python\nfrom lorax import Client\n\nclient = Client(\"http://127.0.0.1:8080\")\n\n# Prompt the base LLM\nprompt = \"[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]\"\nprint(client.generate(prompt, max_new_tokens=64).generated_text)\n\n# Prompt a LoRA adapter\nadapter_id = \"vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k\"\nprint(client.generate(prompt, max_new_tokens=64, adapter_id=adapter_id).generated_text)\n```\n\nSee [Reference - Python Client](https://predibase.github.io/lorax/reference/python_client) for full details.\n\nFor other ways to run LoRAX, see [Getting Started - Kubernetes](https://predibase.github.io/lorax/getting_started/kubernetes), [Getting Started - SkyPilot](https://predibase.github.io/lorax/getting_started/skypilot), and [Getting Started - Local](https://predibase.github.io/lorax/getting_started/local).\n\n### Chat via OpenAI API\n\nLoRAX supports multi-turn chat conversations combined with dynamic adapter loading through an OpenAI compatible API. 
Just specify any adapter as the `model` parameter.\n\n```python\nfrom openai import OpenAI\n\nclient = OpenAI(\n    api_key=\"EMPTY\",\n    base_url=\"http://127.0.0.1:8080/v1\",\n)\n\nresp = client.chat.completions.create(\n    model=\"alignment-handbook/zephyr-7b-dpo-lora\",\n    messages=[\n        {\n            \"role\": \"system\",\n            \"content\": \"You are a friendly chatbot who always responds in the style of a pirate\",\n        },\n        {\"role\": \"user\", \"content\": \"How many helicopters can a human eat in one sitting?\"},\n    ],\n    max_tokens=100,\n)\nprint(\"Response:\", resp.choices[0].message.content)\n```\n\nSee [OpenAI Compatible API](https://predibase.github.io/lorax/reference/openai_api) for details.\n\n### Next steps\n\nHere are some other interesting Mistral-7B fine-tuned models to try out:\n\n- [alignment-handbook/zephyr-7b-dpo-lora](https://huggingface.co/alignment-handbook/zephyr-7b-dpo-lora): Mistral-7b fine-tuned on Zephyr-7B dataset with DPO.\n- [IlyaGusev/saiga_mistral_7b_lora](https://huggingface.co/IlyaGusev/saiga_mistral_7b_lora): Russian chatbot based on `Open-Orca/Mistral-7B-OpenOrca`.\n- [Undi95/Mistral-7B-roleplay_alpaca-lora](https://huggingface.co/Undi95/Mistral-7B-roleplay_alpaca-lora): Fine-tuned using role-play prompts.\n\nYou can find more LoRA adapters [here](https://huggingface.co/models?pipeline_tag=text-generation\u0026sort=trending\u0026search=-lora), or try fine-tuning your own with [PEFT](https://github.com/huggingface/peft) or [Ludwig](https://ludwig.ai).\n\n## 🙇 Acknowledgements\n\nLoRAX is built on top of HuggingFace's [text-generation-inference](https://github.com/huggingface/text-generation-inference), forked from v0.9.4 (Apache 2.0).\n\nWe'd also like to acknowledge [Punica](https://github.com/punica-ai/punica) for their work on the SGMV kernel, which is used to speed up multi-adapter inference under heavy load.\n\n## 🗺️ Roadmap\n\nOur roadmap is tracked 
[here](https://github.com/predibase/lorax/issues/57).\n","funding_links":[],"categories":["Python","HarmonyOS","openai compatible inference engines","llm","A01_文本生成_文本对话","pytorch","Inference Engine","Fine-tuning \u0026 Quantization (18)","Inference","Inference \u0026 Serving","🔧 Fine-Tuning Platforms"],"sub_categories":["Windows Manager","大语言对话模型及数据","Inference Engine","Inference Engines","Embedding APIs"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpredibase%2Florax","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpredibase%2Florax","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpredibase%2Florax/lists"}