{"id":25373927,"url":"https://github.com/inftyai/llmaz","last_synced_at":"2025-04-05T14:03:06.294Z","repository":{"id":208763164,"uuid":"720959702","full_name":"InftyAI/llmaz","owner":"InftyAI","description":"☸️ Easy, advanced inference platform for large language models on Kubernetes. 🌟 Star to support our work!","archived":false,"fork":false,"pushed_at":"2025-03-27T06:32:22.000Z","size":6350,"stargazers_count":110,"open_issues_count":38,"forks_count":18,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-03-30T16:15:35.837Z","etag":null,"topics":["huggingface","inference","inference-platform","kubernetes","llamacpp","llm","modelscope","ollama","sglang","text-generation-inference","vllm"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/InftyAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":"docs/support-backends.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-20T03:57:28.000Z","updated_at":"2025-03-29T08:12:05.000Z","dependencies_parsed_at":null,"dependency_job_id":"fb6036c6-a01b-4c37-80a3-5c46e660c840","html_url":"https://github.com/InftyAI/llmaz","commit_stats":null,"previous_names":["inftyai/llmaz-operator","inftyai/llmaz"],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InftyAI%2Fllmaz","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InftyAI%2Fllmaz/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InftyAI%2Fllmaz/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InftyAI%2Fllmaz/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/InftyAI","download_url":"https://codeload.github.com/InftyAI/llmaz/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247345849,"owners_count":20924102,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["huggingface","inference","inference-platform","kubernetes","llamacpp","llm","modelscope","ollama","sglang","text-generation-inference","vllm"],"created_at":"2025-02-15T03:19:54.063Z","updated_at":"2025-04-05T14:03:06.276Z","avatar_url":"https://github.com/InftyAI.png","language":"Go","readme":"\u003cp align=\"center\"\u003e\n  \u003cpicture\u003e\n    \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"https://raw.githubusercontent.com/inftyai/llmaz/main/docs/assets/logo.png\"\u003e\n    \u003cimg alt=\"llmaz\" src=\"https://raw.githubusercontent.com/inftyai/llmaz/main/docs/assets/logo.png\" width=55%\u003e\n  \u003c/picture\u003e\n\u003c/p\u003e\n\n\u003ch3 align=\"center\"\u003e\nEasy, advanced inference platform for large language models on Kubernetes\n\u003c/h3\u003e\n\n[![stability-alpha](https://img.shields.io/badge/stability-alpha-f4d03f.svg)](https://github.com/mkenney/software-guides/blob/master/STABILITY-BADGES.md#alpha)\n[![GoReport Widget]][GoReport Status]\n[![Latest Release](https://img.shields.io/github/v/release/inftyai/llmaz?include_prereleases)](https://github.com/inftyai/llmaz/releases/latest)\n\n[GoReport Widget]: https://goreportcard.com/badge/github.com/inftyai/llmaz\n[GoReport Status]: https://goreportcard.com/report/github.com/inftyai/llmaz\n\n**llmaz** (pronounced `/lima:z/`), aims to provide a **Production-Ready** inference platform for large language models on Kubernetes. It closely integrates with the state-of-the-art inference backends to bring the leading-edge researches to cloud.\n\n\u003e 🌱 llmaz is alpha now, so API may change before graduating to Beta.\n\n## Architecture\n\n\u003cp align=\"center\"\u003e\n  \u003cpicture\u003e\n    \u003cimg alt=\"architecture\" src=\"https://raw.githubusercontent.com/inftyai/llmaz/main/docs/assets/arch.png\" width=70%\u003e\n  \u003c/picture\u003e\n\u003c/p\u003e\n\n## Features Overview\n\n- **Easy of Use**: People can quick deploy a LLM service with minimal configurations.\n- **Broad Backends Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./docs/support-backends.md).\n- **Efficient Model Distribution (WIP)**: Out-of-the-box model cache system support with [Manta](https://github.com/InftyAI/Manta), still under development right now with architecture reframing.\n- **Accelerator Fungibility**: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.\n- **SOTA Inference**: llmaz supports the latest cutting-edge researches like [Speculative Decoding](https://arxiv.org/abs/2211.17192) or [Splitwise](https://arxiv.org/abs/2311.18677)(WIP) to run on Kubernetes.\n- **Various Model Providers**: llmaz supports a wide range of model providers, such as [HuggingFace](https://huggingface.co/), [ModelScope](https://www.modelscope.cn), ObjectStores. llmaz will automatically handle the model loading, requiring no effort from users.\n- **Multi-Host Support**: llmaz supports both single-host and multi-host scenarios with [LWS](https://github.com/kubernetes-sigs/lws) from day 0.\n- **Scaling Efficiency**: llmaz supports horizontal scaling with [HPA](./docs/examples/hpa/README.md) by default and will integrate with autoscaling components like [Cluster-Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) or [Karpenter](https://github.com/kubernetes-sigs/karpenter) for smart scaling across different clouds.\n\n## Quick Start\n\n### Installation\n\nRead the [Installation](./docs/installation.md) for guidance.\n\n### Deploy\n\nHere's a toy example for deploying `facebook/opt-125m`, all you need to do\nis to apply a `Model` and a `Playground`.\n\nIf you're running on CPUs, you can refer to [llama.cpp](/docs/examples/llamacpp/README.md), or more [examples](/docs/examples/README.md) here.\n\n\u003e Note: if your model needs Huggingface token for weight downloads, please run `kubectl create secret generic modelhub-secret --from-literal=HF_TOKEN=\u003cyour token\u003e` ahead.\n\n#### Model\n\n```yaml\napiVersion: llmaz.io/v1alpha1\nkind: OpenModel\nmetadata:\n  name: opt-125m\nspec:\n  familyName: opt\n  source:\n    modelHub:\n      modelID: facebook/opt-125m\n  inferenceConfig:\n    flavors:\n      - name: default # Configure GPU type\n        limits:\n          nvidia.com/gpu: 1\n```\n\n#### Inference Playground\n\n```yaml\napiVersion: inference.llmaz.io/v1alpha1\nkind: Playground\nmetadata:\n  name: opt-125m\nspec:\n  replicas: 1\n  modelClaim:\n    modelName: opt-125m\n```\n\n### Verify\n\n#### Expose the service\n\nBy default, llmaz will create a ClusterIP service named like `\u003cservice\u003e-lb` for load balancing.\n\n```cmd\nkubectl port-forward svc/opt-125m-lb 8080:8080\n```\n\n#### Get registered models\n\n```cmd\ncurl http://localhost:8080/v1/models\n```\n\n#### Request a query\n\n```cmd\ncurl http://localhost:8080/v1/completions \\\n-H \"Content-Type: application/json\" \\\n-d '{\n    \"model\": \"opt-125m\",\n    \"prompt\": \"San Francisco is a\",\n    \"max_tokens\": 10,\n    \"temperature\": 0\n}'\n```\n\n### More than quick-start\n\nIf you want to learn more about this project, please refer to [develop.md](./docs/develop.md).\n\n## Roadmap\n\n- Gateway support for traffic routing\n- Metrics support\n- Serverless support for cloud-agnostic users\n- CLI tool support\n- Model training, fine tuning in the long-term\n\n## Community\n\nJoin us for more discussions:\n\n- **Slack Channel**: [#llmaz](https://inftyai.slack.com/archives/C06D0BGEQ1G)\n\n## Contributions\n\nAll kinds of contributions are welcomed ! Please following [CONTRIBUTING.md](./CONTRIBUTING.md).\n\nWe also have an official fundraising venue through [OpenCollective](https://opencollective.com/inftyai/projects/llmaz). We'll use the fund transparently to support the development, maintenance, and adoption of our project.\n\n## Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=inftyai/llmaz\u0026type=Date)](https://www.star-history.com/#inftyai/llmaz\u0026Date)\n","funding_links":["https://opencollective.com/inftyai/projects/llmaz"],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finftyai%2Fllmaz","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Finftyai%2Fllmaz","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finftyai%2Fllmaz/lists"}