An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with llm-serving

A curated list of projects in awesome lists tagged with llm-serving.

https://github.com/vllm-project/vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

amd cuda deepseek gpt hpu inference inferentia llama llm llm-serving llmops mlops model-serving pytorch qwen rocm tpu trainium transformer xpu

Last synced: 12 May 2025
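vLLM's memory efficiency comes from PagedAttention, which manages the KV cache in fixed-size blocks, much like virtual-memory pages, instead of preallocating one max-length buffer per sequence. A minimal pure-Python sketch of that idea (the class and method names here are illustrative, not vLLM's actual API):

```python
class BlockAllocator:
    """Toy paged KV-cache allocator: sequences claim fixed-size blocks on demand."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))      # pool of physical block ids
        self.block_tables: dict[int, list[int]] = {}    # seq_id -> logical->physical map

    def append_token(self, seq_id: int, pos: int) -> int:
        """Return the physical block holding token `pos`, allocating on block boundaries."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos % self.block_size == 0:                  # crossed into a new logical block
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt or swap a sequence")
            table.append(self.free_blocks.pop())
        return table[pos // self.block_size]

    def free(self, seq_id: int) -> None:
        """Sequence finished: return all of its blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```

Because blocks are claimed lazily as tokens arrive, per-sequence memory waste is bounded by one partially filled block, which is what lets a paged engine pack many more concurrent sequences onto the same GPU.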

https://github.com/liguodongiot/llm-action

This project shares the technical principles behind large language models along with hands-on experience (LLM engineering and LLM application deployment).

llm llm-inference llm-serving llm-training llmops

Last synced: 15 May 2025

https://github.com/sgl-project/sglang

SGLang is a fast serving framework for large language models and vision language models.

cuda deepseek deepseek-llm deepseek-r1 deepseek-r1-zero deepseek-v3 inference llama llama3 llama3-1 llava llm llm-serving moe pytorch transformer vlm

Last synced: 12 May 2025

https://github.com/bentoml/openllm

Run any open-source LLM, such as DeepSeek and Llama, as an OpenAI-compatible API endpoint in the cloud.

bentoml fine-tuning llama llama2 llama3-1 llama3-2 llama3-2-vision llm llm-inference llm-ops llm-serving llmops mistral mlops model-inference open-source-llm openllm vicuna

Last synced: 23 Oct 2025
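"OpenAI-compatible" means the server accepts the standard /v1/chat/completions request shape, so any OpenAI client works by pointing its base URL at the deployment. A hedged stdlib-only sketch of building such a request by hand (the base URL and model name are placeholders, not OpenLLM defaults):

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a standard /v1/chat/completions request for an OpenAI-compatible server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending it is one line once a server is running at base_url:
# with urllib.request.urlopen(build_chat_request("http://localhost:3000", "m", "Hi")) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```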

https://github.com/skypilot-org/skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 16+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.

cloud-computing cloud-management cost-management cost-optimization data-science deep-learning distributed-training finops gpu hyperparameter-tuning job-queue job-scheduler llm-serving llm-training machine-learning ml-infrastructure ml-platform multicloud spot-instances tpu

Last synced: 12 May 2025

https://github.com/bentoml/bentoml

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

ai-inference deep-learning generative-ai inference-platform llm llm-inference llm-serving llmops machine-learning ml-engineering mlops model-inference-service model-serving multimodal python

Last synced: 12 May 2025

https://github.com/predibase/lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

fine-tuning gpt llama llm llm-inference llm-serving llmops lora model-serving pytorch transformers

Last synced: 12 May 2025
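A LoRA adapter replaces a full fine-tuned weight matrix with a low-rank update, W' = W + (alpha/r)·B·A, which is why a server can keep one base model resident and swap thousands of small adapters. A pure-Python sketch of the merge arithmetic (illustrative only, not lorax's internals):

```python
def matmul(a, b):
    """Naive matrix multiply for small illustrative matrices (lists of rows)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def apply_lora(w, lora_a, lora_b, alpha: float):
    """Merge a low-rank update into w: w + (alpha / r) * B @ A, with r = rank of A."""
    r = len(lora_a)                    # lora_a is r x in_dim, lora_b is out_dim x r
    delta = matmul(lora_b, lora_a)     # out_dim x in_dim, same shape as w
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]
```

The key capacity win: for a d×d weight, the adapter stores only 2·d·r parameters instead of d², so at rank 8 a 4096×4096 layer's adapter is roughly 0.4% of the full matrix.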

https://github.com/moonshotai/moba

MoBA: Mixture of Block Attention for Long-Context LLMs

flash-attention llm llm-serving llm-training moe pytorch transformer

Last synced: 14 May 2025

https://github.com/zhihu/zhilight

A highly optimized LLM inference acceleration engine for Llama and its variants.

cuda deepseek-r1 gpt inference-engine llama llm llm-inference llm-serving model-serving pytorch

Last synced: 15 May 2025

https://github.com/alibaba/rtp-llm

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

gpt inference llama llm llm-serving llmops model-serving

Last synced: 14 Oct 2025

https://github.com/mosecorg/mosec

A high-performance ML model serving framework offering dynamic batching and CPU/GPU pipelines to fully exploit your compute resources.

cv deep-learning gpu hacktoberfest jax llm llm-serving machine-learning machine-learning-platform mlops model-serving mxnet nerual-network python pytorch rust tensorflow tts

Last synced: 14 May 2025
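Dynamic batching means the server groups requests that arrive within a short window and runs them through the model as one forward pass, trading a small queueing delay for much higher accelerator utilization. A minimal sketch of the collection policy (pure Python; not mosec's actual internals, which also handle pipelining and worker processes):

```python
import queue
import time

def collect_batch(q: "queue.Queue", max_batch: int, max_wait_s: float) -> list:
    """Drain up to max_batch items, waiting at most max_wait_s after the first arrival."""
    batch = [q.get()]                       # block until at least one request exists
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                           # window closed: ship a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break                           # nothing else arrived in time
    return batch
```

The two knobs pull against each other: a larger max_wait_s fills batches (throughput) while a smaller one bounds the queueing delay added to each request (latency).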

https://github.com/vllm-project/vllm-ascend

Community-maintained hardware plugin for vLLM on Ascend

ascend inference llm llm-serving llmops mlops model-serving transformer vllm

Last synced: 29 Jun 2025

https://github.com/efeslab/nanoflow

A throughput-oriented high-performance serving framework for LLMs

cuda inference llama2 llm llm-serving model-serving

Last synced: 16 May 2025

https://github.com/helixml/helix

♾️ Helix is a private GenAI stack for building AI agents with declarative pipelines, knowledge (RAG), API bindings, and first-class testing.

api finetuning function-calling golang gptscript helm k8s llama llm llm-agent llm-serving mistral mixtral openai openapi rag sdxl self-hosted stable-diffusion swagger

Last synced: 03 Nov 2025

https://github.com/ray-project/ray-educational-materials

A suite of hands-on training materials showing how to scale CV, NLP, and time-series forecasting workloads with Ray.

deep-learning distributed-machine-learning generative-ai llm llm-inference llm-serving ray ray-data ray-distributed ray-serve ray-train ray-tune

Last synced: 08 May 2025

https://github.com/chenhunghan/ialacol

🪶 Lightweight OpenAI drop-in replacement for Kubernetes

ai cloudnative cuda ggml gptq gpu helm kubernetes langchain llamacpp llm llm-inference llm-serving openai python

Last synced: 30 Sep 2025

https://github.com/powerserve-project/powerserve

A high-speed, easy-to-use LLM serving framework for local deployment

llama llm llm-inference llm-serving npu qwen smallthinker smartphone

Last synced: 06 Apr 2025

https://github.com/em-geeklab/llmone

An enterprise-grade automated LLM deployment tool that makes AI servers truly plug-and-play.

agent ai-server llm llm-inference llm-serving mindie ollama transformer vllm

Last synced: 07 Oct 2025

https://github.com/thu-pacman/chitu

High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.

deepseek gpu llm llm-serving model-serving pytorch

Last synced: 17 Mar 2025

https://github.com/alibaba/servegen

A framework for generating realistic LLM serving workloads

deepseek llm llm-serving model-serving qwen

Last synced: 14 Oct 2025

https://github.com/nexusgpu/tensor-fusion

Tensor Fusion is a state-of-the-art GPU virtualization and pooling solution that maximizes GPU cluster utilization.

ai amd-gpu autoscaling gpu gpu-pooling gpu-scheduling gpu-usage gpu-virtualization inference karpenter llm-serving nvidia pytorch rcuda remote-gpu vgpu

Last synced: 07 Jan 2026

https://github.com/mani-kantap/llm-inference-solutions

A collection of available inference solutions for LLMs

llm-inference llm-serving llmops

Last synced: 11 May 2025

https://github.com/france-travail/happy_vllm

A production-ready REST API for vLLM

api-rest llm llm-serving production vllm

Last synced: 28 Apr 2025

https://github.com/Jason-cs18/HetServe-Foundation

An Overview of Efficiently Serving Foundation Models across Edge Devices

diffusion-models foundation-models llm-serving survey

Last synced: 12 Jul 2025

https://github.com/zejia-lin/bulletserve

Boosting GPU utilization for LLM serving via dynamic spatial-temporal prefill & decode orchestration

gpu-sharing inference llm llm-serving sglang

Last synced: 22 Nov 2025

https://github.com/france-travail/benchmark_llm_serving

A library to benchmark LLMs via their exposed APIs

benchmark llm llm-serving vllm

Last synced: 06 Jul 2025
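Serving benchmarks conventionally report time-to-first-token (TTFT) percentiles plus aggregate token throughput, computed from per-request timestamps. A small sketch of that arithmetic (pure Python; the metric names are the conventional ones, not this library's API):

```python
def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile (p in [0, 100]) of a non-empty sample list."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[rank]

def summarize(ttft_s: list, tokens: int, wall_s: float) -> dict:
    """Headline serving metrics: median/p99 TTFT and aggregate token throughput."""
    return {
        "ttft_p50_s": percentile(ttft_s, 50),
        "ttft_p99_s": percentile(ttft_s, 99),
        "throughput_tok_per_s": tokens / wall_s,
    }
```

Tail percentiles matter more than means here: a serving stack can look fast on average while p99 TTFT blows past an SLO whenever batches fill up.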

https://github.com/unaidedelf8777/faster-outlines

A lazy, high-throughput, and blazing-fast structured text generation backend.

ai llama llm llm-serving llmops model-serving performance transformer

Last synced: 27 Jun 2025

https://github.com/loopglitch26/hinglish-ai-mentor

Hinglish chatbot powered by Azure Cognitive Services, Google Translate, and OpenAI

azure google llm-serving nlp-machine-learning prompt-engineering

Last synced: 10 Jul 2025

https://github.com/george-mountain/web-app-builder--llm

Build static web applications using a large language model, from hand-sketched documents, images, and screenshots to proper web pages.

ai llm llm-serving pypi pypi-package streamlit

Last synced: 07 Apr 2025

https://github.com/scale-snu/layered-prefill

Layered prefill changes the scheduling axis from tokens to layers, removing redundant MoE weight reloads while keeping decode stall-free. The result is lower TTFT, lower end-to-end latency, and lower energy per token without hurting TBT stability.

inference llm llm-infernece llm-serving moe vllm

Last synced: 22 Nov 2025

https://github.com/CentML/llm-inference-bench

Lightweight and extensible LLM Inference serving benchmark tool written in Rust.

benchmarking llm-inference llm-serving

Last synced: 23 Apr 2025

https://github.com/pierreolivierbonin/canada-labour-research-assistant

The Canada Labour Research Assistant (CLaRA) is a privacy-first LLM-powered research assistant proposing Easily Verifiable Direct Quotations (EVDQ) to mitigate hallucinations in answering questions about Canadian labour laws, standards, and regulations. It works entirely offline and locally, guaranteeing the confidentiality of your conversations.

chatbot-application chatbot-framework labour labour-relations lcs-algorithm llm llm-inference llm-serving metadata ollama question-answering quotations rag-chatbot retrieval-augmented-generation sentence-transformers source-referencing streamlit string-matching-algorithms vector-database vllm

Last synced: 23 Jun 2025
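The "Easily Verifiable Direct Quotations" idea hinges on checking that a quoted span actually appears verbatim in a source document, which a longest-common-substring match (the "lcs-algorithm" tag above) can decide. An illustrative dynamic-programming sketch, not CLaRA's actual code (the threshold value is an assumption):

```python
def longest_common_substring(quote: str, source: str) -> str:
    """Longest contiguous substring shared by quote and source (DP, O(len*len))."""
    best_len, best_end = 0, 0
    prev = [0] * (len(source) + 1)
    for i in range(1, len(quote) + 1):
        cur = [0] * (len(source) + 1)
        for j in range(1, len(source) + 1):
            if quote[i - 1] == source[j - 1]:
                cur[j] = prev[j - 1] + 1          # extend the diagonal match
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return quote[best_end - best_len:best_end]

def is_verbatim(quote: str, source: str, threshold: float = 0.95) -> bool:
    """Accept the quote only if nearly all of it matches the source contiguously."""
    if not quote:
        return False
    return len(longest_common_substring(quote, source)) / len(quote) >= threshold
```

A fabricated "quotation" shares only short fragments with the source, so its match ratio collapses, giving a cheap, fully local hallucination check.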

https://github.com/ivynya/illm

Internet LLM: access your Ollama (or any other local LLM) instance from across the internet

llm-serving ollama ollama-interface

Last synced: 21 Mar 2025

https://github.com/ajithvcoder/tsai-emlo-4.0

Contains solutions for assignments and learning notes from the Extensive Machine Learning Operations course of The School of AI

aws aws-cdk aws-ec2 aws-ecr aws-lambda ci-cd cml dvc-pipeline github-actions gpu-instancing gradio litserve llm-serving mlops mlops-workflow pytorch-lightning torch torchscript torchserve

Last synced: 02 Mar 2025

https://github.com/kira94-hkz/powerserve

High-speed and easy-use LLM serving framework for local deployment

llama llm llm-inference llm-serving npu qwen smallthinker smartphone

Last synced: 16 Jun 2025

https://github.com/george-mountain/llm-local-streaming

Stream LLM responses in real time using FastAPI and Streamlit.

ai fastapi llm llm-serving llm-streaming streamlit

Last synced: 30 Jul 2025

https://github.com/jbchouinard/llmailbot

A service that enables chatting with LLMs via email.

email langchain langchain-python large-language-models llm llm-serving llms python

Last synced: 22 Mar 2025