https://github.com/xorbitsai/inference
Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
- Host: GitHub
- URL: https://github.com/xorbitsai/inference
- Owner: xorbitsai
- License: apache-2.0
- Created: 2023-06-14T07:05:04.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2025-03-27T14:25:47.000Z (18 days ago)
- Last Synced: 2025-03-28T17:02:24.963Z (17 days ago)
- Topics: artificial-intelligence, chatglm, deployment, flan-t5, gemma, ggml, glm4, inference, llama, llama3, llamacpp, llm, machine-learning, mistral, openai-api, pytorch, qwen, vllm, whisper, wizardlm
- Language: Python
- Homepage: https://inference.readthedocs.io
- Size: 36.8 MB
- Stars: 7,274
- Watchers: 51
- Forks: 594
- Open Issues: 190
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-local-llms (Open-Source Local LLM Projects)
- stars (HarmonyOS / Windows Manager)
- alan_awesome_llm
- awesome-llmops (Serving / Frameworks/Servers for Serving)
- awesome-ChatGPT-repositories (NLP)
- StarryDivineSky
- awesome-homelab (Apps / AI)
- AiTreasureBox (Repos)
- awesome-llm-and-aigc
- awesomeLibrary (Language resource libraries / Python)
- awesome-LLM-resourses
- awesome-hacking-lists (Python)
README
# Xorbits Inference: Model Serving Made Easy 🤖
Xinference Cloud ·
Xinference Enterprise ·
Self-hosting ·
Documentation

[PyPI](https://pypi.org/project/xinference/)
[License](https://github.com/xorbitsai/inference/blob/main/LICENSE)
[Build Status](https://actions-badge.atrox.dev/xorbitsai/inference/goto?ref=main)
[Discord](https://discord.gg/Xw9tszSkr5)
[Twitter](https://twitter.com/xorbitsio)
Xorbits Inference (Xinference) is a powerful and versatile library designed to serve language, speech recognition, and multimodal models. With Xorbits Inference, you can effortlessly deploy and serve your own or state-of-the-art built-in models using just a single command. Whether you are a researcher, developer, or data scientist, Xorbits Inference empowers you to unleash the full potential of cutting-edge AI models.

## 🔥 Hot Topics
### Framework Enhancements
- [Xllamacpp](https://github.com/xorbitsai/xllamacpp): new llama.cpp Python binding maintained by the Xinference team; it supports continuous batching and is more production-ready: [#2997](https://github.com/xorbitsai/inference/pull/2997)
- Distributed inference: running models across workers: [#2877](https://github.com/xorbitsai/inference/pull/2877)
- vLLM enhancement: shared KV cache across multiple replicas: [#2732](https://github.com/xorbitsai/inference/pull/2732)
- Support continuous batching for the Transformers engine: [#1724](https://github.com/xorbitsai/inference/pull/1724)
- Support MLX backend for Apple Silicon chips: [#1765](https://github.com/xorbitsai/inference/pull/1765)
- Support specifying worker and GPU indexes for launching models: [#1195](https://github.com/xorbitsai/inference/pull/1195)
- Support SGLang backend: [#1161](https://github.com/xorbitsai/inference/pull/1161)
- Support LoRA for LLM and image models: [#1080](https://github.com/xorbitsai/inference/pull/1080)
### New Models
- Built-in support for [Gemma-3-it](https://blog.google/technology/developers/gemma-3/): [#3077](https://github.com/xorbitsai/inference/pull/3077)
- Built-in support for [QwQ-32B](https://qwenlm.github.io/blog/qwq-32b/): [#3005](https://github.com/xorbitsai/inference/pull/3005)
- Built-in support for [DeepSeek V3 and R1](https://github.com/deepseek-ai/DeepSeek-R1): [#2864](https://github.com/xorbitsai/inference/pull/2864)
- Built-in support for [InternVL2.5](https://internvl.github.io/blog/2024-12-05-InternVL-2.5/): [#2776](https://github.com/xorbitsai/inference/pull/2776)
- Built-in support for [DeepSeek-R1-Distill-Llama](https://github.com/deepseek-ai/DeepSeek-R1?tab=readme-ov-file#deepseek-r1-distill-models): [#2811](https://github.com/xorbitsai/inference/pull/2811)
- Built-in support for [DeepSeek-R1-Distill-Qwen](https://github.com/deepseek-ai/DeepSeek-R1?tab=readme-ov-file#deepseek-r1-distill-models): [#2781](https://github.com/xorbitsai/inference/pull/2781)
- Built-in support for [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M): [#2790](https://github.com/xorbitsai/inference/pull/2790)
- Built-in support for [qwen2.5-vl](https://github.com/QwenLM/Qwen2.5-VL): [#2788](https://github.com/xorbitsai/inference/pull/2788)
### Integrations
- [Dify](https://docs.dify.ai/advanced/model-configuration/xinference): an LLMOps platform that enables developers (and even non-developers) to quickly build useful applications based on large language models, ensuring they are visual, operable, and improvable.
- [FastGPT](https://github.com/labring/FastGPT): a knowledge-based platform built on LLMs that offers out-of-the-box data processing and model invocation capabilities, and allows workflow orchestration through Flow visualization.
- [RAGFlow](https://github.com/infiniflow/ragflow): an open-source RAG engine based on deep document understanding.
- [MaxKB](https://github.com/1Panel-dev/MaxKB): MaxKB (= Max Knowledge Base) is a chatbot based on Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG).
- [Chatbox](https://chatboxai.app/): a desktop client for multiple cutting-edge LLMs, available on Windows, Mac and Linux.

## Key Features
🌟 **Model Serving Made Easy**: Simplify the process of serving large language, speech recognition, and multimodal models. You can set up and deploy your models for experimentation and production with a single command.

⚡️ **State-of-the-Art Models**: Experiment with cutting-edge built-in models using a single command. Inference provides access to state-of-the-art open-source models!

🖥 **Heterogeneous Hardware Utilization**: Make the most of your hardware resources with [ggml](https://github.com/ggerganov/ggml). Xorbits Inference intelligently utilizes heterogeneous hardware, including GPUs and CPUs, to accelerate your model inference tasks.

⚙️ **Flexible API and Interfaces**: Offer multiple interfaces for interacting with your models, supporting an OpenAI-compatible RESTful API (including the Function Calling API), RPC, CLI, and WebUI for seamless model management and interaction.

🌐 **Distributed Deployment**: Excel in distributed deployment scenarios, allowing the seamless distribution of model inference across multiple devices or machines.

🔌 **Built-in Integration with Third-Party Libraries**: Xorbits Inference seamlessly integrates with popular third-party libraries including [LangChain](https://python.langchain.com/docs/integrations/providers/xinference), [LlamaIndex](https://gpt-index.readthedocs.io/en/stable/examples/llm/XinferenceLocalDeployment.html#i-run-pip-install-xinference-all-in-a-terminal-window), [Dify](https://docs.dify.ai/advanced/model-configuration/xinference), and [Chatbox](https://chatboxai.app/).
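Because the RESTful API is OpenAI-compatible, switching an existing app to Xinference can be as small as repointing the official `openai` Python client at your local endpoint. A minimal sketch, assuming a server on the default port 9997 and an already-launched chat model under the hypothetical UID `my-llm`:

```python
# pip install openai
from openai import OpenAI

# The one-line change from a stock OpenAI app: point the client
# at the local Xinference endpoint instead of api.openai.com.
client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-used")

response = client.chat.completions.create(
    model="my-llm",  # hypothetical model UID; use the UID of the model you launched
    messages=[{"role": "user", "content": "What is the largest animal?"}],
)
print(response.choices[0].message.content)
```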
## Why Xinference

| Feature | Xinference | FastChat | OpenLLM | RayLLM |
|------------------------------------------------|------------|----------|---------|--------|
| OpenAI-Compatible RESTful API | ✅ | ✅ | ✅ | ✅ |
| vLLM Integrations | ✅ | ✅ | ✅ | ✅ |
| More Inference Engines (GGML, TensorRT) | ✅ | ❌ | ✅ | ✅ |
| More Platforms (CPU, Metal) | ✅ | ✅ | ❌ | ❌ |
| Multi-node Cluster Deployment | ✅ | ❌ | ❌ | ✅ |
| Image Models (Text-to-Image) | ✅ | ✅ | ❌ | ❌ |
| Text Embedding Models | ✅ | ❌ | ❌ | ❌ |
| Multimodal Models | ✅ | ❌ | ❌ | ❌ |
| Audio Models | ✅ | ❌ | ❌ | ❌ |
| More OpenAI Functionalities (Function Calling) | ✅ | ❌ | ❌ | ❌ |

## Using Xinference
- **Cloud**: We host a [Xinference Cloud](https://inference.top) service for anyone to try with zero setup.
- **Self-hosting Xinference Community Edition**: Quickly get Xinference running in your environment with this [starter guide](#getting-started). Use our [documentation](https://inference.readthedocs.io/) for further references and more in-depth instructions.
- **Xinference for enterprise / organizations**: We provide additional enterprise-centric features. [Send us an email](mailto:[email protected]?subject=[GitHub]Business%20License%20Inquiry) to discuss enterprise needs.

## Staying Ahead
Star Xinference on GitHub and be instantly notified of new releases.

## Getting Started
* [Docs](https://inference.readthedocs.io/en/latest/index.html)
* [Built-in Models](https://inference.readthedocs.io/en/latest/models/builtin/index.html)
* [Custom Models](https://inference.readthedocs.io/en/latest/models/custom.html)
* [Deployment Docs](https://inference.readthedocs.io/en/latest/getting_started/using_xinference.html)
* [Examples and Tutorials](https://inference.readthedocs.io/en/latest/examples/index.html)

### Jupyter Notebook
The lightest way to experience Xinference is to try our [Jupyter Notebook on Google Colab](https://colab.research.google.com/github/xorbitsai/inference/blob/main/examples/Xinference_Quick_Start.ipynb).
### Docker
Nvidia GPU users can start Xinference server using [Xinference Docker Image](https://inference.readthedocs.io/en/latest/getting_started/using_docker_image.html). Prior to executing the installation command, ensure that both [Docker](https://docs.docker.com/get-docker/) and [CUDA](https://developer.nvidia.com/cuda-downloads) are set up on your system.
```bash
# Replace </path/on/host> with a local directory used to persist model data.
docker run --name xinference -d -p 9997:9997 -e XINFERENCE_HOME=/data -v </path/on/host>:/data --gpus all xprobe/xinference:latest xinference-local -H 0.0.0.0
```
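Once the container is up, you can sanity-check it by listing launched models through the standard OpenAI-style `/v1/models` endpoint. A minimal sketch, assuming the port mapping above and the third-party `requests` package:

```python
# pip install requests
import requests

# Query the OpenAI-compatible model-listing endpoint of the local server.
resp = requests.get("http://localhost:9997/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])  # UIDs of models currently launched on this server
```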
### K8s via helm
Ensure that you have GPU support in your Kubernetes cluster, then install as follows.
```
# add repo
helm repo add xinference https://xorbitsai.github.io/xinference-helm-charts

# update indexes and query xinference versions
helm repo update xinference
helm search repo xinference/xinference --devel --versions

# install xinference
helm install xinference xinference/xinference -n xinference --version 0.0.1-v
```

For more customized installation methods on K8s, please refer to the [documentation](https://inference.readthedocs.io/en/latest/getting_started/using_kubernetes.html).
### Quick Start
Install Xinference with pip as follows. (For more options, see the [Installation page](https://inference.readthedocs.io/en/latest/getting_started/installation.html).)
```bash
pip install "xinference[all]"
```

To start a local instance of Xinference, run the following command:
```bash
$ xinference-local
```

Once Xinference is running, there are multiple ways you can try it: via the web UI, via cURL, via the command line, or via Xinference's Python client. Check out our [docs](https://inference.readthedocs.io/en/latest/getting_started/using_xinference.html#run-xinference-locally) for the guide.
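For instance, the Python client can launch a built-in model and chat with it in a few lines. A minimal sketch, assuming the local server above; the model name and engine here are examples, and exact arguments vary by release (see the client docs):

```python
from xinference.client import Client

client = Client("http://localhost:9997")

# Launch a built-in chat model; name and engine are illustrative —
# check the built-in model list for what your version supports.
model_uid = client.launch_model(
    model_name="qwen2.5-instruct", model_engine="transformers"
)

model = client.get_model(model_uid)
response = model.chat(
    messages=[{"role": "user", "content": "What is the largest animal?"}]
)
print(response["choices"][0]["message"]["content"])
```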
## Getting involved
| Platform | Purpose |
|-------------------------------------------------------------------------------------------------|---------------------------------------------|
| [GitHub Issues](https://github.com/xorbitsai/inference/issues) | Reporting bugs and filing feature requests. |
| [Discord](https://discord.gg/Xw9tszSkr5) | Collaborating with other Xinference users. |
| [Twitter](https://twitter.com/xorbitsio) | Staying up-to-date on new features. |

## Citation
If this work is helpful, please cite it as:
```bibtex
@inproceedings{lu2024xinference,
title = "Xinference: Making Large Model Serving Easy",
author = "Lu, Weizheng and Xiong, Lingfeng and Zhang, Feng and Qin, Xuye and Chen, Yueguo",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-demo.30",
pages = "291--300",
}
```

## Contributors
## Star History
[Star History Chart](https://star-history.com/#xorbitsai/inference&Date)