https://github.com/inftyai/llmaz
☸️ Easy, advanced inference platform for large language models on Kubernetes. 🌟 Star to support our work!
- Host: GitHub
- URL: https://github.com/inftyai/llmaz
- Owner: InftyAI
- License: apache-2.0
- Created: 2023-11-20T03:57:28.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-03-27T06:32:22.000Z (about 1 year ago)
- Last Synced: 2025-03-30T16:15:35.837Z (about 1 year ago)
- Topics: huggingface, inference, inference-platform, kubernetes, llamacpp, llm, modelscope, ollama, sglang, text-generation-inference, vllm
- Language: Go
- Homepage:
- Size: 6.06 MB
- Stars: 110
- Watchers: 6
- Forks: 18
- Open Issues: 38
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Support: docs/support-backends.md
README
Easy, advanced inference platform for large language models on Kubernetes
[Stability: Alpha](https://github.com/mkenney/software-guides/blob/master/STABILITY-BADGES.md#alpha)
[![GoReport Widget]][GoReport Status]
[Latest Release](https://github.com/inftyai/llmaz/releases/latest)
[GoReport Widget]: https://goreportcard.com/badge/github.com/inftyai/llmaz
[GoReport Status]: https://goreportcard.com/report/github.com/inftyai/llmaz
**llmaz** (pronounced `/lima:z/`) aims to provide a **Production-Ready** inference platform for large language models on Kubernetes. It integrates closely with state-of-the-art inference backends to bring leading-edge research to the cloud.
> 🌱 llmaz is currently in alpha, so the API may change before graduating to beta.
## Architecture
## Features Overview
- **Ease of Use**: Users can quickly deploy an LLM service with minimal configuration.
- **Broad Backend Support**: llmaz supports a wide range of advanced inference backends for different scenarios, such as [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), and [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./docs/support-backends.md).
- **Efficient Model Distribution (WIP)**: Out-of-the-box model cache system support with [Manta](https://github.com/InftyAI/Manta); still under development while its architecture is being reworked.
- **Accelerator Fungibility**: llmaz supports serving the same LLM with various accelerators to optimize cost and performance (see the configuration sketch after this list).
- **SOTA Inference**: llmaz supports the latest cutting-edge research, such as [Speculative Decoding](https://arxiv.org/abs/2211.17192) and [Splitwise](https://arxiv.org/abs/2311.18677) (WIP), on Kubernetes.
- **Various Model Providers**: llmaz supports a wide range of model providers, such as [HuggingFace](https://huggingface.co/), [ModelScope](https://www.modelscope.cn), and object stores. llmaz handles model loading automatically, requiring no effort from users.
- **Multi-Host Support**: llmaz supports both single-host and multi-host scenarios with [LWS](https://github.com/kubernetes-sigs/lws) from day 0.
- **Scaling Efficiency**: llmaz supports horizontal scaling with [HPA](./docs/examples/hpa/README.md) by default and will integrate with autoscaling components like [Cluster-Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) or [Karpenter](https://github.com/kubernetes-sigs/karpenter) for smart scaling across different clouds.
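As a concrete illustration of accelerator fungibility, a model can declare multiple `flavors` so the same LLM can be scheduled onto different accelerator types. A minimal sketch below extends the quick-start `OpenModel`; the `a100` and `t4` flavor names and the two-flavor layout are illustrative assumptions, not taken from the project docs:
```yaml
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: opt-125m
spec:
  familyName: opt
  source:
    modelHub:
      modelID: facebook/opt-125m
  inferenceConfig:
    flavors:
      - name: a100 # preferred accelerator (illustrative)
        limits:
          nvidia.com/gpu: 1
      - name: t4 # fallback accelerator (illustrative)
        limits:
          nvidia.com/gpu: 1
```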
## Quick Start
### Installation
Read the [Installation](./docs/installation.md) for guidance.
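For reference, a typical Helm-based installation might look like the sketch below; the chart repository URL, chart name, and namespace are assumptions for illustration, so follow the Installation doc for the authoritative commands:
```cmd
helm repo add inftyai https://inftyai.github.io/llmaz
helm repo update
helm install llmaz inftyai/llmaz --namespace llmaz-system --create-namespace
```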
### Deploy
Here's a toy example of deploying `facebook/opt-125m`; all you need to do is apply a `Model` and a `Playground`.
If you're running on CPUs, refer to [llama.cpp](/docs/examples/llamacpp/README.md), or browse more [examples](/docs/examples/README.md).
> Note: if your model requires a Hugging Face token for weight downloads, run `kubectl create secret generic modelhub-secret --from-literal=HF_TOKEN=<your-token>` beforehand.
#### Model
```yaml
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: opt-125m
spec:
  familyName: opt
  source:
    modelHub:
      modelID: facebook/opt-125m
  inferenceConfig:
    flavors:
      - name: default # Configure GPU type
        limits:
          nvidia.com/gpu: 1
```
#### Inference Playground
```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: opt-125m
spec:
  replicas: 1
  modelClaim:
    modelName: opt-125m
```
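With both manifests saved locally (the file names `model.yaml` and `playground.yaml` below are arbitrary), apply them and watch the pods come up:
```cmd
kubectl apply -f model.yaml -f playground.yaml
kubectl get pods --watch
```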
### Verify
#### Expose the service
By default, llmaz creates a ClusterIP service with a `-lb` suffix (here, `opt-125m-lb`) for load balancing.
```cmd
kubectl port-forward svc/opt-125m-lb 8080:8080
```
#### Get registered models
```cmd
curl http://localhost:8080/v1/models
```
#### Request a query
```cmd
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 10,
    "temperature": 0
  }'
```
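Since the supported backends expose an OpenAI-compatible API, the response should resemble the following; the `id`, token counts, and generated text here are illustrative only:
```json
{
  "id": "cmpl-...",
  "object": "text_completion",
  "model": "opt-125m",
  "choices": [
    {
      "index": 0,
      "text": " city that is known for its",
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 10,
    "total_tokens": 15
  }
}
```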
### Beyond the quick start
If you want to learn more about this project, please refer to [develop.md](./docs/develop.md).
## Roadmap
- Gateway support for traffic routing
- Metrics support
- Serverless support for cloud-agnostic users
- CLI tool support
- Model training and fine-tuning in the long term
## Community
Join us for more discussions:
- **Slack Channel**: [#llmaz](https://inftyai.slack.com/archives/C06D0BGEQ1G)
## Contributions
All kinds of contributions are welcome! Please follow [CONTRIBUTING.md](./CONTRIBUTING.md).
We also have an official fundraising venue through [OpenCollective](https://opencollective.com/inftyai/projects/llmaz). We'll use the fund transparently to support the development, maintenance, and adoption of our project.
## Star History
[Star History Chart](https://www.star-history.com/#inftyai/llmaz&Date)