{"id":13409210,"url":"https://github.com/bentoml/OpenLLM","last_synced_at":"2025-03-14T14:31:03.325Z","repository":{"id":174787366,"uuid":"629749002","full_name":"bentoml/OpenLLM","owner":"bentoml","description":"Run any open-source LLMs, such as Llama 3.1, Gemma, as OpenAI compatible API endpoint in the cloud.","archived":false,"fork":false,"pushed_at":"2024-10-29T06:06:24.000Z","size":42636,"stargazers_count":9941,"open_issues_count":21,"forks_count":633,"subscribers_count":55,"default_branch":"main","last_synced_at":"2024-10-29T09:27:30.591Z","etag":null,"topics":["bentoml","fine-tuning","llama","llama2","llama3-1","llama3-2","llama3-2-vision","llm","llm-inference","llm-ops","llm-serving","llmops","mistral","mlops","model-inference","open-source-llm","openllm","vicuna"],"latest_commit_sha":null,"homepage":"https://bentoml.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bentoml.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":".github/CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":".github/CODEOWNERS","security":".github/SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-19T00:27:52.000Z","updated_at":"2024-10-29T09:06:03.000Z","dependencies_parsed_at":null,"dependency_job_id":"be0b598c-37be-4054-977b-1948bec94a20","html_url":"https://github.com/bentoml/OpenLLM","commit_stats":{"total_commits":1808,"total_committers":35,"mean_commits":51.65714285714286,"dds":0.3224557522123894,"last_synced_commit":"2a06022bb5e361907cf2c09315b717e0fe7bbce7"},"previous_names":["bentoml/openllm"],"tags_count":180,"template":false,"template_full_name":null,"repository_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/bentoml%2FOpenLLM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bentoml%2FOpenLLM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bentoml%2FOpenLLM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bentoml%2FOpenLLM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bentoml","download_url":"https://codeload.github.com/bentoml/OpenLLM/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243593310,"owners_count":20316164,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bentoml","fine-tuning","llama","llama2","llama3-1","llama3-2","llama3-2-vision","llm","llm-inference","llm-ops","llm-serving","llmops","mistral","mlops","model-inference","open-source-llm","openllm","vicuna"],"created_at":"2024-07-30T20:00:58.861Z","updated_at":"2025-03-14T14:31:03.310Z","avatar_url":"https://github.com/bentoml.png","language":"Python","funding_links":[],"categories":["Models and Tools","🛠️ Popular Open-Source Libraries for LLM Development","Python","Tools for Self-Hosting","*Ops for AI","llm","INFERENCING FRAMEWORKS","Large Scale Deployment","ML Platforms","A01_文本生成_文本对话","LLM Deployment","大语言模型LLMs","HarmonyOS","Deployment and Serving","NLP","推理 Inference","Apps","🧠 AI Applications \u0026 Platforms","Large Language Models ##","开源工具","mlops","🔓 Open Source Inference Engines","其他LLM框架","LLM Inference","Repos","Model Serving Frameworks","Other LLM Frameworks","🚀 Model Serving 
\u0026 Deployment","Open-Source Local LLM Projects","ML / AI","Tooling Ecosystem","8. Inference Engines","Inference \u0026 Serving","3. Inference Engines \u0026 Serving","LLM Serving / Inference","Inference","🤖 AI \u0026 Machine Learning"],"sub_categories":["LLM Deployment","LLMs","Model Serving \u0026 Inference","ML Platforms","大语言对话模型及数据","Windows Manager","3. Pretraining","AI","Tools","LLM部署和serving","文章","LangManus","Videos Playlists","Desktop / Local","Inference Platforms","Inference Engine"],"readme":"\u003cdiv align=\"center\"\u003e\n\u003ch1\u003e🦾 OpenLLM: Self-Hosting LLMs Made Easy\u003c/h1\u003e\n\u003c/div\u003e\n\n[![License: Apache-2.0](https://img.shields.io/badge/License-Apache%202-green.svg)](https://github.com/bentoml/OpenLLM/blob/main/LICENSE)\n[![Releases](https://img.shields.io/pypi/v/openllm.svg?logo=pypi\u0026label=PyPI\u0026logoColor=gold)](https://pypi.org/project/openllm)\n[![CI](https://results.pre-commit.ci/badge/github/bentoml/OpenLLM/main.svg)](https://results.pre-commit.ci/latest/github/bentoml/OpenLLM/main)\n[![X](https://badgen.net/badge/icon/@bentomlai/000000?icon=twitter\u0026label=Follow)](https://twitter.com/bentomlai)\n[![Community](https://badgen.net/badge/icon/Community/562f5d?icon=slack\u0026label=Join)](https://l.bentoml.com/join-slack)\n\nOpenLLM allows developers to run **any open-source LLMs** (Llama 3.3, Qwen2.5, Phi3 and [more](#supported-models)) or **custom models** as **OpenAI-compatible APIs** with a single command. 
It features a [built-in chat UI](#chat-ui), state-of-the-art inference backends, and a simplified workflow for creating enterprise-grade cloud deployment with Docker, Kubernetes, and [BentoCloud](#deploy-to-bentocloud).\n\nUnderstand the [design philosophy of OpenLLM](https://www.bentoml.com/blog/from-ollama-to-openllm-running-llms-in-the-cloud).\n\n## Get Started\n\nRun the following commands to install OpenLLM and explore it interactively.\n\n```bash\npip install openllm  # or pip3 install openllm\nopenllm hello\n```\n\n![hello](https://github.com/user-attachments/assets/5af19f23-1b34-4c45-b1e0-a6798b4586d1)\n\n## Supported models\n\nOpenLLM supports a wide range of state-of-the-art open-source LLMs. You can also add a [model repository to run custom models](#set-up-a-custom-repository) with OpenLLM.\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth\u003eModel\u003c/th\u003e\n    \u003cth\u003eParameters\u003c/th\u003e\n    \u003cth\u003eRequired GPU\u003c/th\u003e\n    \u003cth\u003eStart a Server\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003edeepseek\u003c/td\u003e\n    \u003ctd\u003er1\u003c/td\u003e\n    \u003ctd\u003e80Gx16\u003c/td\u003e\n    \u003ctd\u003e\u003ccode\u003eopenllm serve deepseek:r1\u003c/code\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003egemma2\u003c/td\u003e\n    \u003ctd\u003e2b\u003c/td\u003e\n    \u003ctd\u003e12G\u003c/td\u003e\n    \u003ctd\u003e\u003ccode\u003eopenllm serve gemma2:2b\u003c/code\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003ehermes-3\u003c/td\u003e\n    \u003ctd\u003edeep-llama3-8b-3622\u003c/td\u003e\n    \u003ctd\u003e80G\u003c/td\u003e\n    \u003ctd\u003e\u003ccode\u003eopenllm serve hermes-3:deep-llama3-8b-3622\u003c/code\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003ejamba1.5\u003c/td\u003e\n    \u003ctd\u003emini-5e51\u003c/td\u003e\n    \u003ctd\u003e80Gx2\u003c/td\u003e\n    
\u003ctd\u003e\u003ccode\u003eopenllm serve jamba1.5:mini-5e51\u003c/code\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003ellama3.1\u003c/td\u003e\n    \u003ctd\u003e8b\u003c/td\u003e\n    \u003ctd\u003e24G\u003c/td\u003e\n    \u003ctd\u003e\u003ccode\u003eopenllm serve llama3.1:8b\u003c/code\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003ellama3.2\u003c/td\u003e\n    \u003ctd\u003e1b\u003c/td\u003e\n    \u003ctd\u003e24G\u003c/td\u003e\n    \u003ctd\u003e\u003ccode\u003eopenllm serve llama3.2:1b\u003c/code\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003ellama3.3\u003c/td\u003e\n    \u003ctd\u003e70b\u003c/td\u003e\n    \u003ctd\u003e80Gx2\u003c/td\u003e\n    \u003ctd\u003e\u003ccode\u003eopenllm serve llama3.3:70b\u003c/code\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003emistral\u003c/td\u003e\n    \u003ctd\u003e8b\u003c/td\u003e\n    \u003ctd\u003e24G\u003c/td\u003e\n    \u003ctd\u003e\u003ccode\u003eopenllm serve mistral:8b\u003c/code\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003emistral-large\u003c/td\u003e\n    \u003ctd\u003e123b\u003c/td\u003e\n    \u003ctd\u003e80Gx4\u003c/td\u003e\n    \u003ctd\u003e\u003ccode\u003eopenllm serve mistral-large:123b\u003c/code\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003ephi4\u003c/td\u003e\n    \u003ctd\u003e14b\u003c/td\u003e\n    \u003ctd\u003e80G\u003c/td\u003e\n    \u003ctd\u003e\u003ccode\u003eopenllm serve phi4:14b\u003c/code\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003epixtral\u003c/td\u003e\n    \u003ctd\u003e12b-2409\u003c/td\u003e\n    \u003ctd\u003e80G\u003c/td\u003e\n    \u003ctd\u003e\u003ccode\u003eopenllm serve pixtral:12b-2409\u003c/code\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eqwen2.5\u003c/td\u003e\n    \u003ctd\u003e7b\u003c/td\u003e\n    
\u003ctd\u003e24G\u003c/td\u003e\n    \u003ctd\u003e\u003ccode\u003eopenllm serve qwen2.5:7b\u003c/code\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eqwen2.5-coder\u003c/td\u003e\n    \u003ctd\u003e3b\u003c/td\u003e\n    \u003ctd\u003e24G\u003c/td\u003e\n    \u003ctd\u003e\u003ccode\u003eopenllm serve qwen2.5-coder:3b\u003c/code\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eqwq\u003c/td\u003e\n    \u003ctd\u003e32b\u003c/td\u003e\n    \u003ctd\u003e80G\u003c/td\u003e\n    \u003ctd\u003e\u003ccode\u003eopenllm serve qwq:32b\u003c/code\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n\nFor the full model list, see the [OpenLLM models repository](https://github.com/bentoml/openllm-models).\n\n## Start an LLM server\n\nTo start an LLM server locally, use the `openllm serve` command and specify the model version.\n\n\u003e [!NOTE]\n\u003e OpenLLM does not store model weights. A Hugging Face token (HF_TOKEN) is required for gated models.\n\u003e\n\u003e 1. Create your Hugging Face token [here](https://huggingface.co/settings/tokens).\n\u003e 2. Request access to the gated model, such as [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct).\n\u003e 3. Set your token as an environment variable by running:\n\u003e    ```bash\n\u003e    export HF_TOKEN=\u003cyour token\u003e\n\u003e    ```\n\n```bash\nopenllm serve llama3.2:1b\n```\n\nThe server will be accessible at [http://localhost:3000](http://localhost:3000/), providing OpenAI-compatible APIs for interaction. You can call the endpoints with different frameworks and tools that support OpenAI-compatible APIs. 
Typically, you may need to specify the following:\n\n- **The API host address**: By default, the LLM is hosted at [http://localhost:3000](http://localhost:3000/).\n- **The model name**: The name can differ depending on the tool you use.\n- **The API key**: An API key for client authentication. This is optional.\n\nHere are some examples:\n\n\u003cdetails\u003e\n\n\u003csummary\u003eOpenAI Python client\u003c/summary\u003e\n\n```python\nfrom openai import OpenAI\n\nclient = OpenAI(base_url='http://localhost:3000/v1', api_key='na')\n\n# Uncomment the following lines to list the available models\n# model_list = client.models.list()\n# print(model_list)\n\nchat_completion = client.chat.completions.create(\n    model=\"meta-llama/Llama-3.2-1B-Instruct\",\n    messages=[\n        {\n            \"role\": \"user\",\n            \"content\": \"Explain superconductors like I'm five years old\"\n        }\n    ],\n    stream=True,\n)\nfor chunk in chat_completion:\n    print(chunk.choices[0].delta.content or \"\", end=\"\")\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\n\u003csummary\u003eLlamaIndex\u003c/summary\u003e\n\n```python\nfrom llama_index.llms.openai import OpenAI\n\nllm = OpenAI(api_base=\"http://localhost:3000/v1\", model=\"meta-llama/Llama-3.2-1B-Instruct\", api_key=\"dummy\")\n...\n```\n\n\u003c/details\u003e\n\n## Chat UI\n\nOpenLLM provides a chat UI at the `/chat` endpoint for the launched LLM server at http://localhost:3000/chat.\n\n\u003cimg width=\"800\" alt=\"openllm_ui\" src=\"https://github.com/bentoml/OpenLLM/assets/5886138/8b426b2b-67da-4545-8b09-2dc96ff8a707\"\u003e\n\n## Chat with a model in the CLI\n\nTo start a chat conversation in the CLI, use the `openllm run` command and specify the model version.\n\n```bash\nopenllm run llama3:8b\n```\n\n## Model repository\n\nA model repository in OpenLLM represents a catalog of available LLMs that you can run. 
OpenLLM provides a default model repository that includes the latest open-source LLMs like Llama 3, Mistral, and Qwen2, hosted at [this GitHub repository](https://github.com/bentoml/openllm-models). To see all available models from the default and any added repository, use:\n\n```bash\nopenllm model list\n```\n\nTo ensure your local list of models is synchronized with the latest updates from all connected repositories, run:\n\n```bash\nopenllm repo update\n```\n\nTo review a model’s information, run:\n\n```bash\nopenllm model get llama3.2:1b\n```\n\n### Add a model to the default model repository\n\nYou can contribute to the default model repository by adding new models that others can use. This involves creating and submitting a Bento of the LLM. For more information, check out this [example pull request](https://github.com/bentoml/openllm-models/pull/1).\n\n### Set up a custom repository\n\nYou can add your own repository to OpenLLM with custom models. To do so, follow the format of the default OpenLLM model repository, which stores LLMs in a `bentos` directory.\n\nFirst, [build your custom models as Bentos with BentoML](https://docs.bentoml.com/en/latest/guides/build-options.html) and place them in the `bentos` directory of your model repository. 
Check out the [default model repository](https://github.com/bentoml/openllm-repo) for an example and read the [Developer Guide](https://github.com/bentoml/OpenLLM/blob/main/DEVELOPMENT.md) for details.\n\nThen, register your custom model repository with OpenLLM:\n\n```bash\nopenllm repo add \u003crepo-name\u003e \u003crepo-url\u003e\n```\n\n**Note**: Currently, OpenLLM only supports adding public repositories.\n\n## Deploy to BentoCloud\n\nOpenLLM supports LLM cloud deployment via BentoML, the unified model serving framework, and BentoCloud, an AI inference platform for enterprise AI teams. BentoCloud provides fully managed infrastructure optimized for LLM inference, with autoscaling, model orchestration, observability, and more, allowing you to run any AI model in the cloud.\n\n[Sign up for BentoCloud](https://www.bentoml.com/) for free and [log in](https://docs.bentoml.com/en/latest/bentocloud/how-tos/manage-access-token.html). Then, run `openllm deploy` to deploy a model to BentoCloud:\n\n```bash\nopenllm deploy llama3.2:1b\n```\n\n\u003e [!NOTE]\n\u003e If you are deploying a gated model, make sure to set HF_TOKEN as an environment variable.\n\nOnce the deployment is complete, you can run model inference on the BentoCloud console:\n\n\u003cimg width=\"800\" alt=\"bentocloud_ui\" src=\"https://github.com/bentoml/OpenLLM/assets/65327072/4f7819d9-73ea-488a-a66c-f724e5d063e6\"\u003e\n\n## Community\n\nOpenLLM is actively maintained by the BentoML team. Feel free to reach out and join us in our pursuit to make LLMs more accessible and easy to use 👉 [Join our Slack community!](https://l.bentoml.com/join-slack)\n\n## Contributing\n\nAs an open-source project, we welcome contributions of all kinds, such as new features, bug fixes, and documentation. 
Here are some of the ways to contribute:\n\n- Report a bug by [creating a GitHub issue](https://github.com/bentoml/OpenLLM/issues/new/choose).\n- [Submit a pull request](https://github.com/bentoml/OpenLLM/compare) or help review other developers’ [pull requests](https://github.com/bentoml/OpenLLM/pulls).\n- Add an LLM to the OpenLLM default model repository so that other users can run your model. See this [example pull request](https://github.com/bentoml/openllm-models/pull/1).\n- Check out the [Developer Guide](https://github.com/bentoml/OpenLLM/blob/main/DEVELOPMENT.md) to learn more.\n\n## Acknowledgements\n\nThis project uses the following open-source projects:\n\n- [bentoml/bentoml](https://github.com/bentoml/bentoml) for production-level model serving\n- [vllm-project/vllm](https://github.com/vllm-project/vllm) for a production-level LLM inference backend\n- [blrchen/chatgpt-lite](https://github.com/blrchen/chatgpt-lite) for a fancy web chat UI\n- [astral-sh/uv](https://github.com/astral-sh/uv) for blazing-fast installation of model requirements\n\nWe are grateful to the developers and contributors of these projects for their hard work and dedication.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbentoml%2FOpenLLM","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbentoml%2FOpenLLM","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbentoml%2FOpenLLM/lists"}