https://github.com/xorbitsai/inference
Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
- Host: GitHub
- URL: https://github.com/xorbitsai/inference
- Owner: xorbitsai
- License: apache-2.0
- Created: 2023-06-14T07:05:04.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2025-03-27T14:25:47.000Z (18 days ago)
- Last Synced: 2025-03-28T17:02:24.963Z (17 days ago)
- Topics: artificial-intelligence, chatglm, deployment, flan-t5, gemma, ggml, glm4, inference, llama, llama3, llamacpp, llm, machine-learning, mistral, openai-api, pytorch, qwen, vllm, whisper, wizardlm
- Language: Python
- Homepage: https://inference.readthedocs.io
- Size: 36.8 MB
- Stars: 7,274
- Watchers: 51
- Forks: 594
- Open Issues: 190
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-local-llms (Open-Source Local LLM Projects)
- stars (HarmonyOS / Windows Manager)
- alan_awesome_llm
- awesome-llmops (Serving / Frameworks/Servers for Serving)
- awesome-ChatGPT-repositories (NLP)
- StarryDivineSky
- awesome-homelab (Apps / AI)
- AiTreasureBox (Repos)
- awesome-llm-and-aigc
- awesomeLibrary (Language resource libraries / Python)
- awesome-LLM-resourses
- awesome-hacking-lists (Python)
README
# Xorbits Inference: Model Serving Made Easy 🤖
Xinference Cloud ·
Xinference Enterprise ·
Self-hosting ·
Documentation

[PyPI](https://pypi.org/project/xinference/)
[License](https://github.com/xorbitsai/inference/blob/main/LICENSE)
[Build Status](https://actions-badge.atrox.dev/xorbitsai/inference/goto?ref=main)
[Discord](https://discord.gg/Xw9tszSkr5)
[Twitter](https://twitter.com/xorbitsio)
Xorbits Inference (Xinference) is a powerful and versatile library designed to serve language, speech recognition, and multimodal models. With Xorbits Inference, you can effortlessly deploy and serve your own or state-of-the-art built-in models using just a single command. Whether you are a researcher, developer, or data scientist, Xorbits Inference empowers you to unleash the full potential of cutting-edge AI models.

## 🔥 Hot Topics
### Framework Enhancements
- [Xllamacpp](https://github.com/xorbitsai/xllamacpp): new llama.cpp Python binding maintained by the Xinference team; it supports continuous batching and is more production-ready: [#2997](https://github.com/xorbitsai/inference/pull/2997)
- Distributed inference: running models across workers: [#2877](https://github.com/xorbitsai/inference/pull/2877)
- vLLM enhancement: shared KV cache across multiple replicas: [#2732](https://github.com/xorbitsai/inference/pull/2732)
- Support continuous batching for the Transformers engine: [#1724](https://github.com/xorbitsai/inference/pull/1724)
- Support MLX backend for Apple Silicon chips: [#1765](https://github.com/xorbitsai/inference/pull/1765)
- Support specifying worker and GPU indexes for launching models: [#1195](https://github.com/xorbitsai/inference/pull/1195)
- Support SGLang backend: [#1161](https://github.com/xorbitsai/inference/pull/1161)
- Support LoRA for LLM and image models: [#1080](https://github.com/xorbitsai/inference/pull/1080)
### New Models
- Built-in support for [Gemma-3-it](https://blog.google/technology/developers/gemma-3/): [#3077](https://github.com/xorbitsai/inference/pull/3077)
- Built-in support for [QwQ-32B](https://qwenlm.github.io/blog/qwq-32b/): [#3005](https://github.com/xorbitsai/inference/pull/3005)
- Built-in support for [DeepSeek V3 and R1](https://github.com/deepseek-ai/DeepSeek-R1): [#2864](https://github.com/xorbitsai/inference/pull/2864)
- Built-in support for [InternVL2.5](https://internvl.github.io/blog/2024-12-05-InternVL-2.5/): [#2776](https://github.com/xorbitsai/inference/pull/2776)
- Built-in support for [DeepSeek-R1-Distill-Llama](https://github.com/deepseek-ai/DeepSeek-R1?tab=readme-ov-file#deepseek-r1-distill-models): [#2811](https://github.com/xorbitsai/inference/pull/2811)
- Built-in support for [DeepSeek-R1-Distill-Qwen](https://github.com/deepseek-ai/DeepSeek-R1?tab=readme-ov-file#deepseek-r1-distill-models): [#2781](https://github.com/xorbitsai/inference/pull/2781)
- Built-in support for [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M): [#2790](https://github.com/xorbitsai/inference/pull/2790)
- Built-in support for [qwen2.5-vl](https://github.com/QwenLM/Qwen2.5-VL): [#2788](https://github.com/xorbitsai/inference/pull/2788)
### Integrations
- [Dify](https://docs.dify.ai/advanced/model-configuration/xinference): an LLMOps platform that enables developers (and even non-developers) to quickly build useful applications based on large language models, ensuring they are visual, operable, and improvable.
- [FastGPT](https://github.com/labring/FastGPT): a knowledge-based platform built on LLMs that offers out-of-the-box data processing and model invocation capabilities, and allows workflow orchestration through Flow visualization.
- [RAGFlow](https://github.com/infiniflow/ragflow): an open-source RAG engine based on deep document understanding.
- [MaxKB](https://github.com/1Panel-dev/MaxKB): MaxKB (= Max Knowledge Base) is a chatbot based on Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG).
- [Chatbox](https://chatboxai.app/): a desktop client for multiple cutting-edge LLMs, available on Windows, Mac and Linux.

## Key Features
🌟 **Model Serving Made Easy**: Simplify the process of serving large language, speech recognition, and multimodal models. You can set up and deploy your models for experimentation and production with a single command.

⚡️ **State-of-the-Art Models**: Experiment with cutting-edge built-in models using a single command. Inference provides access to state-of-the-art open-source models!

🖥 **Heterogeneous Hardware Utilization**: Make the most of your hardware resources with [ggml](https://github.com/ggerganov/ggml). Xorbits Inference intelligently utilizes heterogeneous hardware, including GPUs and CPUs, to accelerate your model inference tasks.

⚙️ **Flexible API and Interfaces**: Offer multiple interfaces for interacting with your models, supporting an OpenAI-compatible RESTful API (including the Function Calling API), RPC, CLI, and WebUI for seamless model management and interaction.

🌐 **Distributed Deployment**: Excel in distributed deployment scenarios, allowing the seamless distribution of model inference across multiple devices or machines.

🔌 **Built-in Integration with Third-Party Libraries**: Xorbits Inference seamlessly integrates with popular third-party libraries including [LangChain](https://python.langchain.com/docs/integrations/providers/xinference), [LlamaIndex](https://gpt-index.readthedocs.io/en/stable/examples/llm/XinferenceLocalDeployment.html#i-run-pip-install-xinference-all-in-a-terminal-window), [Dify](https://docs.dify.ai/advanced/model-configuration/xinference), and [Chatbox](https://chatboxai.app/).
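Because the RESTful API is OpenAI-compatible, switching an existing app to Xinference can be as small as repointing the official `openai` Python client at your local endpoint. A minimal sketch, assuming a server on the default port 9997 and an already-launched chat model under the hypothetical UID `my-llm`:

```python
# pip install openai
from openai import OpenAI

# The one-line change from a stock OpenAI app: point the client
# at the local Xinference endpoint instead of api.openai.com.
client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-used")

response = client.chat.completions.create(
    model="my-llm",  # hypothetical model UID; use the UID of the model you launched
    messages=[{"role": "user", "content": "What is the largest animal?"}],
)
print(response.choices[0].message.content)
```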
## Why Xinference

| Feature | Xinference | FastChat | OpenLLM | RayLLM |
|------------------------------------------------|------------|----------|---------|--------|
| OpenAI-Compatible RESTful API | ✅ | ✅ | ✅ | ✅ |
| vLLM Integrations | ✅ | ✅ | ✅ | ✅ |
| More Inference Engines (GGML, TensorRT) | ✅ | ❌ | ✅ | ✅ |
| More Platforms (CPU, Metal) | ✅ | ✅ | ❌ | ❌ |
| Multi-node Cluster Deployment | ✅ | ❌ | ❌ | ✅ |
| Image Models (Text-to-Image) | ✅ | ✅ | ❌ | ❌ |
| Text Embedding Models | ✅ | ❌ | ❌ | ❌ |
| Multimodal Models | ✅ | ❌ | ❌ | ❌ |
| Audio Models | ✅ | ❌ | ❌ | ❌ |
| More OpenAI Functionalities (Function Calling) | ✅ | ❌ | ❌ | ❌ |

## Using Xinference
- **Cloud**: We host a [Xinference Cloud](https://inference.top) service for anyone to try with zero setup.
- **Self-hosting Xinference Community Edition**: Quickly get Xinference running in your environment with this [starter guide](#getting-started). Use our [documentation](https://inference.readthedocs.io/) for further references and more in-depth instructions.
- **Xinference for enterprise / organizations**: We provide additional enterprise-centric features. [Send us an email](mailto:[email protected]?subject=[GitHub]Business%20License%20Inquiry) to discuss enterprise needs.

## Staying Ahead
Star Xinference on GitHub and be instantly notified of new releases.

## Getting Started
* [Docs](https://inference.readthedocs.io/en/latest/index.html)
* [Built-in Models](https://inference.readthedocs.io/en/latest/models/builtin/index.html)
* [Custom Models](https://inference.readthedocs.io/en/latest/models/custom.html)
* [Deployment Docs](https://inference.readthedocs.io/en/latest/getting_started/using_xinference.html)
* [Examples and Tutorials](https://inference.readthedocs.io/en/latest/examples/index.html)

### Jupyter Notebook
The lightest way to experience Xinference is to try our [Jupyter Notebook on Google Colab](https://colab.research.google.com/github/xorbitsai/inference/blob/main/examples/Xinference_Quick_Start.ipynb).
### Docker
Nvidia GPU users can start Xinference server using [Xinference Docker Image](https://inference.readthedocs.io/en/latest/getting_started/using_docker_image.html). Prior to executing the installation command, ensure that both [Docker](https://docs.docker.com/get-docker/) and [CUDA](https://developer.nvidia.com/cuda-downloads) are set up on your system.
```bash
# Replace </path/on/host> with a local directory used to persist model data.
docker run --name xinference -d -p 9997:9997 -e XINFERENCE_HOME=/data -v </path/on/host>:/data --gpus all xprobe/xinference:latest xinference-local -H 0.0.0.0
```
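Once the container is up, you can sanity-check it by listing launched models through the standard OpenAI-style `/v1/models` endpoint. A minimal sketch, assuming the port mapping above and the third-party `requests` package:

```python
# pip install requests
import requests

# Query the OpenAI-compatible model-listing endpoint of the local server.
resp = requests.get("http://localhost:9997/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])  # UIDs of models currently launched on this server
```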
### K8s via helm
Ensure that you have GPU support in your Kubernetes cluster, then install as follows.
```
# add repo
helm repo add xinference https://xorbitsai.github.io/xinference-helm-charts

# update indexes and query xinference versions
helm repo update xinference
helm search repo xinference/xinference --devel --versions

# install xinference
helm install xinference xinference/xinference -n xinference --version 0.0.1-v
```

For more customized installation methods on K8s, please refer to the [documentation](https://inference.readthedocs.io/en/latest/getting_started/using_kubernetes.html).
### Quick Start
Install Xinference with pip as follows. (For more options, see the [Installation page](https://inference.readthedocs.io/en/latest/getting_started/installation.html).)
```bash
pip install "xinference[all]"
```

To start a local instance of Xinference, run the following command:
```bash
$ xinference-local
```

Once Xinference is running, there are multiple ways you can try it: via the web UI, via cURL, via the command line, or via Xinference's Python client. Check out our [docs](https://inference.readthedocs.io/en/latest/getting_started/using_xinference.html#run-xinference-locally) for the guide.
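For instance, the Python client can launch a built-in model and chat with it in a few lines. A minimal sketch, assuming the local server above; the model name and engine here are examples, and exact arguments vary by release (see the client docs):

```python
from xinference.client import Client

client = Client("http://localhost:9997")

# Launch a built-in chat model; name and engine are illustrative —
# check the built-in model list for what your version supports.
model_uid = client.launch_model(
    model_name="qwen2.5-instruct", model_engine="transformers"
)

model = client.get_model(model_uid)
response = model.chat(
    messages=[{"role": "user", "content": "What is the largest animal?"}]
)
print(response["choices"][0]["message"]["content"])
```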
## Getting involved
| Platform | Purpose |
|-------------------------------------------------------------------------------------------------|---------------------------------------------|
| [GitHub Issues](https://github.com/xorbitsai/inference/issues) | Reporting bugs and filing feature requests. |
| [Discord](https://discord.gg/Xw9tszSkr5) | Collaborating with other Xinference users. |
| [Twitter](https://twitter.com/xorbitsio) | Staying up-to-date on new features. |

## Citation
If this work is helpful, please cite it as:
```bibtex
@inproceedings{lu2024xinference,
title = "Xinference: Making Large Model Serving Easy",
author = "Lu, Weizheng and Xiong, Lingfeng and Zhang, Feng and Qin, Xuye and Chen, Yueguo",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-demo.30",
pages = "291--300",
}
```

## Contributors
## Star History
[Star History Chart](https://star-history.com/#xorbitsai/inference&Date)