An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with llm-serving

A curated list of projects in awesome lists tagged with llm-serving.

https://github.com/vllm-project/vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

amd cuda deepseek gpt hpu inference inferentia llama llm llm-serving llmops mlops model-serving pytorch qwen rocm tpu trainium transformer xpu

Last synced: 29 Jan 2026

https://github.com/liguodongiot/llm-action

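This project aims to share the technical principles of large models and hands-on experience with them (large-model engineering and putting large-model applications into production).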

llm llm-inference llm-serving llm-training llmops

Last synced: 15 May 2025

https://github.com/bentoml/openllm

Run any open-source LLM, such as DeepSeek or Llama, as an OpenAI-compatible API endpoint in the cloud.

bentoml fine-tuning llama llama2 llama3-1 llama3-2 llama3-2-vision llm llm-inference llm-ops llm-serving llmops mistral mlops model-inference open-source-llm openllm vicuna

Last synced: 23 Oct 2025

https://github.com/bentoml/OpenLLM

Run any open-source LLM, such as Llama 3.1 or Gemma, as an OpenAI-compatible API endpoint in the cloud.

bentoml fine-tuning llama llama2 llama3-1 llama3-2 llama3-2-vision llm llm-inference llm-ops llm-serving llmops mistral mlops model-inference open-source-llm openllm vicuna

Last synced: 14 Mar 2025

https://github.com/skypilot-org/skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 16+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.

cloud-computing cloud-management cost-management cost-optimization data-science deep-learning distributed-training finops gpu hyperparameter-tuning job-queue job-scheduler llm-serving llm-training machine-learning ml-infrastructure ml-platform multicloud spot-instances tpu

Last synced: 02 Apr 2026

https://github.com/bentoml/bentoml

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

ai-inference deep-learning generative-ai inference-platform llm llm-inference llm-serving llmops machine-learning ml-engineering mlops model-inference-service model-serving multimodal python

Last synced: 06 Mar 2026

https://github.com/bentoml/BentoML

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and much more!

ai-inference deep-learning generative-ai inference-platform llm llm-inference llm-serving llmops machine-learning ml-engineering mlops model-inference-service model-serving multimodal python

Last synced: 12 Mar 2025

https://github.com/gpustack/gpustack

A GPU cluster manager that configures and orchestrates inference engines like vLLM and SGLang for high-performance AI model deployment.

ascend cuda deepseek distributed-inference genai high-performance-inference inference llama llm llm-inference llm-serving maas mindie openai qwen rocm sglang vllm

Last synced: 20 Apr 2026

https://github.com/predibase/lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

fine-tuning gpt llama llm llm-inference llm-serving llmops lora model-serving pytorch transformers

Last synced: 12 May 2025

https://github.com/moonshotai/moba

MoBA: Mixture of Block Attention for Long-Context LLMs

flash-attention llm llm-serving llm-training moe pytorch transformer

Last synced: 14 May 2025

https://github.com/MoonshotAI/MoBA

MoBA: Mixture of Block Attention for Long-Context LLMs

flash-attention llm llm-serving llm-training moe pytorch transformer

Last synced: 31 Mar 2025

https://github.com/zhihu/zhilight

A highly optimized LLM inference acceleration engine for Llama and its variants.

cuda deepseek-r1 gpt inference-engine llama llm llm-inference llm-serving model-serving pytorch

Last synced: 15 May 2025

https://github.com/alibaba/rtp-llm

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

gpt inference llama llm llm-serving llmops model-serving

Last synced: 14 Oct 2025

https://github.com/mosecorg/mosec

A high-performance ML model serving framework that offers dynamic batching and CPU/GPU pipelines to fully exploit your compute machine

cv deep-learning gpu hacktoberfest jax llm llm-serving machine-learning machine-learning-platform mlops model-serving mxnet nerual-network python pytorch rust tensorflow tts

Last synced: 14 May 2025

https://github.com/vllm-project/vllm-ascend

Community maintained hardware plugin for vLLM on Ascend

ascend inference llm llm-serving llmops mlops model-serving transformer vllm

Last synced: 27 Feb 2026

https://github.com/efeslab/Nanoflow

A throughput-oriented high-performance serving framework for LLMs

cuda inference llama2 llm llm-serving model-serving

Last synced: 21 Apr 2025

https://github.com/efeslab/nanoflow

A throughput-oriented high-performance serving framework for LLMs

cuda inference llama2 llm llm-serving model-serving

Last synced: 16 May 2025

https://github.com/helixml/helix

♾️ Private Agent Fleet with Spec Coding. Each agent gets their own GPU-accelerated desktop. Run Claude, Codex, Gemini and open models on a full private AI Stack ♾️

agents api genai glm golang helm k8s kimi llm llm-agent llm-serving openai openapi qwen rag self-hosted swagger swarm

Last synced: 11 Apr 2026

https://github.com/ray-project/ray-educational-materials

A suite of hands-on training materials that show how to scale CV, NLP, and time-series forecasting workloads with Ray.

deep-learning distributed-machine-learning generative-ai llm llm-inference llm-serving ray ray-data ray-distributed ray-serve ray-train ray-tune

Last synced: 08 May 2025

https://github.com/nexusgpu/tensor-fusion

Tensor Fusion is a state-of-the-art GPU virtualization and pooling solution designed to optimize GPU cluster utilization to its fullest potential.

ai amd-gpu autoscaling dynamic-resource-allocation gpu gpu-acceleration gpu-pooling gpu-scheduling gpu-usage gpu-virtualization inference karpenter kubernetes llm-serving nvidia pytorch rcuda remote-gpu vgpu

Last synced: 24 May 2026

https://github.com/chenhunghan/ialacol

🪶 Lightweight OpenAI drop-in replacement for Kubernetes

ai cloudnative cuda ggml gptq gpu helm kubernetes langchain llamacpp llm llm-inference llm-serving openai python

Last synced: 30 Sep 2025

https://github.com/powerserve-project/powerserve

High-speed and easy-to-use LLM serving framework for local deployment

llama llm llm-inference llm-serving npu qwen smallthinker smartphone

Last synced: 06 Apr 2025

https://github.com/em-geeklab/llmone

Enterprise-grade LLM automated deployment tool that makes AI servers truly "plug-and-play".

agent ai-server llm llm-inference llm-serving mindie ollama transformer vllm

Last synced: 07 Oct 2025

https://github.com/thu-pacman/chitu

High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.

deepseek gpu llm llm-serving model-serving pytorch

Last synced: 17 Mar 2025

https://github.com/alibaba/servegen

A framework for generating realistic LLM serving workloads

deepseek llm llm-serving model-serving qwen

Last synced: 14 Oct 2025

https://github.com/mani-kantap/llm-inference-solutions

A collection of all available inference solutions for LLMs

llm-inference llm-serving llmops

Last synced: 11 May 2025

https://github.com/france-travail/happy_vllm

A REST API for vLLM, production ready

api-rest llm llm-serving production vllm

Last synced: 28 Apr 2025

https://github.com/Jason-cs18/HetServe-Foundation

An Overview of Efficiently Serving Foundation Models across Edge Devices

diffusion-models foundation-models llm-serving survey

Last synced: 12 Jul 2025

https://github.com/zejia-lin/bulletserve

Boosting GPU utilization for LLM serving via dynamic spatial-temporal prefill & decode orchestration

gpu-sharing inference llm llm-serving sglang

Last synced: 22 Nov 2025

https://github.com/jason-cs18/hetserve-foundation

An Overview of Efficiently Serving Foundation Models across Edge Devices

diffusion-models foundation-models llm-serving survey

Last synced: 29 Jul 2025

https://github.com/hec-ovi/vllm-qwen

vLLM + Qwen3.6-27B (BF16) OpenAI-compatible inference server on AMD Strix Halo (Ryzen AI Max+ 395, gfx1151). Vision input, 256K context, /v1/responses with separated reasoning, via TheRock ROCm.

amd docker gfx1151 inference-server llm-serving local-llm multimodal-llm openai-compatible qwen qwen3 rocm ryzen-ai self-hosted strix-halo vllm

Last synced: 01 May 2026

https://github.com/teremterem/litellm-server-boilerplate

A lightweight LiteLLM server boilerplate pre-configured with uv and Docker for hosting your own OpenAI- and Anthropic-compatible endpoints. Includes LibreChat as an optional web UI.

agentic-ai ai ai-assistant ai-chatbot anthropic boilerplate chatgpt gemini genai generative-ai librechat litellm litellm-proxy llm llm-server llm-serving openai python setup webchat

Last synced: 16 Feb 2026

https://github.com/france-travail/benchmark_llm_serving

A library to benchmark LLMs via their API exposure

benchmark llm llm-serving vllm

Last synced: 06 Jul 2025

https://github.com/unaidedelf8777/faster-outlines

A lazy, high-throughput, and blazing-fast structured text generation backend.

ai llama llm llm-serving llmops model-serving performance transformer

Last synced: 27 Jun 2025

https://github.com/loopglitch26/hinglish-ai-mentor

Hinglish chatbot powered by Azure Cognitive Services, Google Translate, and OpenAI

azure google llm-serving nlp-machine-learning prompt-engineering

Last synced: 10 Jul 2025

https://github.com/george-mountain/web-app-builder--llm

Build static web applications using a large language model, turning hand-sketched documents, images, and screenshots into proper web pages.

ai llm llm-serving pypi pypi-package streamlit

Last synced: 07 Apr 2025

https://github.com/scale-snu/layered-prefill

Layered prefill changes the scheduling axis from tokens to layers and removes redundant MoE weight reloads while keeping decode stall free. The result is lower TTFT, lower end-to-end latency, and lower energy per token without hurting TBT stability.

inference llm llm-infernece llm-serving moe vllm

Last synced: 22 Nov 2025

https://github.com/zhihu/ZhiLight

A highly optimized inference acceleration engine for Llama and its variants.

cuda gpt inference-engine llama llm llm-serving pytorch

Last synced: 12 Aug 2025

https://github.com/CentML/llm-inference-bench

Lightweight and extensible LLM Inference serving benchmark tool written in Rust.

benchmarking llm-inference llm-serving

Last synced: 23 Apr 2025

https://github.com/pierreolivierbonin/canada-labour-research-assistant

The Canada Labour Research Assistant (CLaRA) is a privacy-first LLM-powered research assistant proposing Easily Verifiable Direct Quotations (EVDQ) to mitigate hallucinations in answering questions about Canadian labour laws, standards, and regulations. It works entirely offline and locally, guaranteeing the confidentiality of your conversations.

chatbot-application chatbot-framework labour labour-relations lcs-algorithm llm llm-inference llm-serving metadata ollama question-answering quotations rag-chatbot retrieval-augmented-generation sentence-transformers source-referencing streamlit string-matching-algorithms vector-database vllm

Last synced: 23 Jun 2025

https://github.com/neosun100/kimi-linear-vllm-docker-serve

Dockerized vLLM serving for Kimi-Linear-48B-A3B (AWQ-4bit), from 128K to 1M context.

awq docker kimi-linear llm-serving long-context vllm

Last synced: 31 Jan 2026

https://github.com/ivynya/illm

internet llm - access your ollama (or any other local llm) instance from across the internet

llm-serving ollama ollama-interface

Last synced: 09 May 2026

https://github.com/ajithvcoder/tsai-emlo-4.0

Contains solutions for assignments and learning notes from the Extensive Machine Learning Operations course of The School of AI

aws aws-cdk aws-ec2 aws-ecr aws-lambda ci-cd cml dvc-pipeline github-actions gpu-instancing gradio litserve llm-serving mlops mlops-workflow pytorch-lightning torch torchscript torchserve

Last synced: 10 Feb 2026

https://github.com/george-mountain/llm-local-streaming

Stream LLM responses in real time using FastAPI and Streamlit.

ai fastapi llm llm-serving llm-streaming streamlit

Last synced: 07 May 2026

https://github.com/jbchouinard/llmailbot

A service that enables chatting with LLMs via email.

email langchain langchain-python large-language-models llm llm-serving llms python

Last synced: 09 May 2026

https://github.com/okikorg/okik

Okik is a serving framework to deploy LLMs and much more.

deeplearning llm llm-inference llm-serving llmops machine-learning model-serving python

Last synced: 23 Jan 2026

https://github.com/defai-digital/ax-serving

Offline OpenAI-compatible serving and orchestration plane for AX Fabric on Apple Silicon, with runtime model lifecycle, routing, metrics, and multi-worker control.

apple-silicon automatosx ax-fabric control-plane enterprise-ai llm-serving model-lifecycle model-routing offline-ai openai-compatible orchestration rust

Last synced: 06 Apr 2026

https://github.com/kira94-hkz/powerserve

High-speed and easy-to-use LLM serving framework for local deployment

llama llm llm-inference llm-serving npu qwen smallthinker smartphone

Last synced: 16 Jun 2025

https://github.com/elinx/llm-mem-calculator

Interactive KV cache memory calculator for LLMs. Supports MLA, GQA, hybrid attention, sliding window, and linear attention architectures. Estimate GPU memory for serving any model at any context length.

calculator gpu-memory kv-cache llm llm-serving vllm

Last synced: 09 Jun 2026