Projects in Awesome Lists tagged with llm-serving

https://github.com/vllm-project/vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

amd cuda deepseek gpt hpu inference inferentia llama llm llm-serving llmops mlops model-serving pytorch qwen rocm tpu trainium transformer xpu

Last synced: 12 May 2025

https://github.com/ray-project/ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

data-science deep-learning deployment distributed hyperparameter-optimization hyperparameter-search large-language-models llm llm-inference llm-serving machine-learning optimization parallel python pytorch ray reinforcement-learning rllib serving tensorflow

Last synced: 09 Sep 2025

https://github.com/liguodongiot/llm-action

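This project aims to share the technical principles and practical experience behind large models (large-model engineering and application deployment).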

llm llm-inference llm-serving llm-training llmops

Last synced: 15 May 2025

https://github.com/sgl-project/sglang

SGLang is a fast serving framework for large language models and vision language models.

cuda deepseek deepseek-llm deepseek-r1 deepseek-r1-zero deepseek-v3 inference llama llama3 llama3-1 llava llm llm-serving moe pytorch transformer vlm

Last synced: 12 May 2025

https://github.com/bentoml/openllm

Run any open-source LLM, such as DeepSeek or Llama, as an OpenAI-compatible API endpoint in the cloud.

bentoml fine-tuning llama llama2 llama3-1 llama3-2 llama3-2-vision llm llm-inference llm-ops llm-serving llmops mistral mlops model-inference open-source-llm openllm vicuna

Last synced: 23 Oct 2025

https://github.com/bentoml/OpenLLM

Run any open-source LLM, such as Llama 3.1 or Gemma, as an OpenAI-compatible API endpoint in the cloud.

bentoml fine-tuning llama llama2 llama3-1 llama3-2 llama3-2-vision llm llm-inference llm-ops llm-serving llmops mistral mlops model-inference open-source-llm openllm vicuna

Last synced: 14 Mar 2025

https://github.com/skypilot-org/skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 16+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.

cloud-computing cloud-management cost-management cost-optimization data-science deep-learning distributed-training finops gpu hyperparameter-tuning job-queue job-scheduler llm-serving llm-training machine-learning ml-infrastructure ml-platform multicloud spot-instances tpu

Last synced: 12 May 2025

https://github.com/bentoml/bentoml

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

ai-inference deep-learning generative-ai inference-platform llm llm-inference llm-serving llmops machine-learning ml-engineering mlops model-inference-service model-serving multimodal python

Last synced: 12 May 2025

https://github.com/bentoml/BentoML

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and much more!

ai-inference deep-learning generative-ai inference-platform llm llm-inference llm-serving llmops machine-learning ml-engineering mlops model-inference-service model-serving multimodal python

Last synced: 12 Mar 2025

https://github.com/superduper-io/superduper

Superduper: End-to-end framework for building custom AI applications and agents.

ai chatbot data database distributed-ml inference llm-inference llm-serving llmops ml mlops mongodb pretrained-models python pytorch rag semantic-search torch transformers vector-search

Last synced: 14 May 2025

https://github.com/predibase/lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

fine-tuning gpt llama llm llm-inference llm-serving llmops lora model-serving pytorch transformers

Last synced: 12 May 2025

https://github.com/gpustack/gpustack

Manage GPU clusters for running AI models

ascend cuda deepseek distributed distributed-inference genai ggml inference llama llamacpp llm llm-inference llm-serving maas metal mindie openai qwen rocm vllm

Last synced: 23 Oct 2025

https://github.com/microsoft/aici

AICI: Prompts as (Wasm) Programs

ai inference language-model llm llm-framework llm-inference llm-serving llmops model-serving rust transformer wasm wasmtime

Last synced: 14 May 2025

https://github.com/moonshotai/moba

MoBA: Mixture of Block Attention for Long-Context LLMs

flash-attention llm llm-serving llm-training moe pytorch transformer

Last synced: 14 May 2025

https://github.com/MoonshotAI/MoBA

MoBA: Mixture of Block Attention for Long-Context LLMs

flash-attention llm llm-serving llm-training moe pytorch transformer

Last synced: 31 Mar 2025

https://github.com/ray-project/ray-llm

RayLLM - LLMs on Ray

distributed-systems large-language-models llm llm-inference llm-serving llmops ray serving transformers

Last synced: 25 Feb 2025

https://github.com/ray-project/aviary

RayLLM - LLMs on Ray

distributed-systems large-language-models llm llm-inference llm-serving llmops ray serving transformers

Last synced: 05 Mar 2025

https://github.com/zhihu/zhilight

A highly optimized LLM inference acceleration engine for Llama and its variants.

cuda deepseek-r1 gpt inference-engine llama llm llm-inference llm-serving model-serving pytorch

Last synced: 15 May 2025

https://github.com/alibaba/rtp-llm

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

gpt inference llama llm llm-serving llmops model-serving

Last synced: 14 Oct 2025

https://github.com/mosecorg/mosec

A high-performance ML model serving framework that offers dynamic batching and CPU/GPU pipelines to fully exploit your compute machine

cv deep-learning gpu hacktoberfest jax llm llm-serving machine-learning machine-learning-platform mlops model-serving mxnet nerual-network python pytorch rust tensorflow tts

Last synced: 14 May 2025

https://github.com/vllm-project/vllm-ascend

Community maintained hardware plugin for vLLM on Ascend

ascend inference llm llm-serving llmops mlops model-serving transformer vllm

Last synced: 29 Jun 2025

https://github.com/efeslab/nanoflow

A throughput-oriented high-performance serving framework for LLMs

cuda inference llama2 llm llm-serving model-serving

Last synced: 16 May 2025

https://github.com/efeslab/Nanoflow

A throughput-oriented high-performance serving framework for LLMs

cuda inference llama2 llm llm-serving model-serving

Last synced: 21 Apr 2025

https://github.com/rohan-paul/LLM-FineTuning-Large-Language-Models

LLM (Large Language Model) FineTuning

gpt-3 gpt3-turbo large-language-models llama2 llm llm-finetuning llm-inference llm-serving llm-training mistral-7b open-source-llm pytorch

Last synced: 16 Oct 2025

https://github.com/helixml/helix

♾️ Helix is a private GenAI stack for building AI agents with declarative pipelines, knowledge (RAG), API bindings, and first-class testing.

api finetuning function-calling golang gptscript helm k8s llama llm llm-agent llm-serving mistral mixtral openai openapi rag sdxl self-hosted stable-diffusion swagger

Last synced: 03 Nov 2025

https://github.com/rohan-paul/llm-finetuning-large-language-models

LLM (Large Language Model) FineTuning

gpt-3 gpt3-turbo large-language-models llama2 llm llm-finetuning llm-inference llm-serving llm-training mistral-7b open-source-llm pytorch

Last synced: 04 Apr 2025

https://github.com/hpcaitech/swiftinfer

Efficient AI Inference & Serving

artificial-intelligence deep-learning gpt inference llama llama2 llm-inference llm-serving

Last synced: 05 Apr 2025

https://github.com/ray-project/ray-educational-materials

This is a suite of hands-on training materials that show how to scale CV, NLP, and time-series forecasting workloads with Ray.

deep-learning distributed-machine-learning generative-ai llm llm-inference llm-serving ray ray-data ray-distributed ray-serve ray-train ray-tune

Last synced: 08 May 2025

https://github.com/torchpipe/torchpipe

Serving Inside PyTorch

deployment inference llm-serving pipeline-parallelism pytorch ray serve serving tensorrt torch2trt triton-inference-server

Last synced: 18 Mar 2025

https://github.com/chenhunghan/ialacol

🪶 Lightweight OpenAI drop-in replacement for Kubernetes

ai cloudnative cuda ggml gptq gpu helm kubernetes langchain llamacpp llm llm-inference llm-serving openai python

Last synced: 30 Sep 2025

https://github.com/powerserve-project/powerserve

High-speed and easy-to-use LLM serving framework for local deployment

llama llm llm-inference llm-serving npu qwen smallthinker smartphone

Last synced: 06 Apr 2025

https://github.com/slai-labs/get-beam

Run GPU inference and training jobs on serverless infrastructure that scales with you.

artificial-intelligence cloud-computing cost-optimization data-science deep-learning distributed-computing gpu-acceleration gpu-computing hpc llm-serving llm-training machine-learning ml-infrastructure mlops python serverless serverless-architectures

Last synced: 18 Apr 2025

https://github.com/em-geeklab/llmone

Enterprise-grade LLM automated deployment tool that makes AI servers truly "plug-and-play".

agent ai-server llm llm-inference llm-serving mindie ollama transformer vllm

Last synced: 07 Oct 2025

https://github.com/thu-pacman/chitu

High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.

deepseek gpu llm llm-serving model-serving pytorch

Last synced: 17 Mar 2025

https://github.com/alibaba/servegen

A framework for generating realistic LLM serving workloads

deepseek llm llm-serving model-serving qwen

Last synced: 14 Oct 2025

https://github.com/nexusgpu/tensor-fusion

Tensor Fusion is a state-of-the-art GPU virtualization and pooling solution designed to optimize GPU cluster utilization to its fullest potential.

ai amd-gpu autoscaling gpu gpu-pooling gpu-scheduling gpu-usage gpu-virtualization inference karpenter llm-serving nvidia pytorch rcuda remote-gpu vgpu

Last synced: 07 Jan 2026

https://github.com/mani-kantap/llm-inference-solutions

A collection of all available inference solutions for LLMs

llm-inference llm-serving llmops

Last synced: 11 May 2025

https://github.com/friendliai/friendli-client

Friendli: the fastest serving engine for generative AI

ai generative-ai gpt gpt3 inference inference-engine inference-server llama2 llm llm-inference llm-ops llm-serving llmops llms mistral ml mlops serving stable-diffusion

Last synced: 05 Apr 2025

https://github.com/france-travail/happy_vllm

A REST API for vLLM, production ready

api-rest llm llm-serving production vllm

Last synced: 28 Apr 2025

https://github.com/Jason-cs18/HetServe-Foundation

An Overview of Efficiently Serving Foundation Models across Edge Devices

diffusion-models foundation-models llm-serving survey

Last synced: 12 Jul 2025

https://github.com/zejia-lin/bulletserve

Boosting GPU utilization for LLM serving via dynamic spatial-temporal prefill & decode orchestration

gpu-sharing inference llm llm-serving sglang

Last synced: 22 Nov 2025

https://github.com/jason-cs18/hetserve-foundation

An Overview of Efficiently Serving Foundation Models across Edge Devices

diffusion-models foundation-models llm-serving survey

Last synced: 29 Jul 2025

https://github.com/ivanlulyf/bunny-llm

Deno LLM API Service

chatgpt chatgpt-api cloudflare-workers-ai llm llm-serving

Last synced: 16 Apr 2025

https://github.com/france-travail/benchmark_llm_serving

A library to benchmark LLMs via their exposed APIs

benchmark llm llm-serving vllm

Last synced: 06 Jul 2025

https://github.com/fork123aniket/llm-rag-powered-qa-app

A Production-Ready, Scalable RAG-powered LLM-based Context-Aware QA App

context-aware-system eleutherai fine-tuning large-language-models llm-inference llm-serving llm-training llmops parameter-efficient-fine-tuning question-answering ray ray-serve retrieval-augmented-generation

Last synced: 31 Jul 2025

https://github.com/unaidedelf8777/faster-outlines

A lazy, high-throughput, and blazing-fast structured text generation backend.

ai llama llm llm-serving llmops model-serving performance transformer

Last synced: 27 Jun 2025

https://github.com/nuhmanpk/quick-llama

Run Ollama models anywhere easily

colab langchain-python llama llm llm-agents llm-serving ollama ollama-api ollama-client ollama-python open-ai pypi

Last synced: 07 Jul 2025

https://github.com/loopglitch26/hinglish-ai-mentor

Hinglish chatbot powered by Azure Cognitive Services, Google Translate, and OpenAI

azure google llm-serving nlp-machine-learning prompt-engineering

Last synced: 10 Jul 2025

https://github.com/george-mountain/web-app-builder--llm

Building static web applications using large language models, from hand-sketched documents, images, and screenshots to proper web pages.

ai llm llm-serving pypi pypi-package streamlit

Last synced: 07 Apr 2025

https://github.com/scale-snu/layered-prefill

Layered prefill changes the scheduling axis from tokens to layers and removes redundant MoE weight reloads while keeping decode stall free. The result is lower TTFT, lower end-to-end latency, and lower energy per token without hurting TBT stability.

inference llm llm-infernece llm-serving moe vllm

Last synced: 22 Nov 2025

https://github.com/zhihu/ZhiLight

A highly optimized inference acceleration engine for Llama and its variants.

cuda gpt inference-engine llama llm llm-serving pytorch

Last synced: 12 Aug 2025

https://github.com/CentML/llm-inference-bench

Lightweight and extensible LLM Inference serving benchmark tool written in Rust.

benchmarking llm-inference llm-serving

Last synced: 23 Apr 2025

https://github.com/biosfood/intel-llm-guide

A guide on how to run LLMs on Intel CPUs

guide intel llm llm-inference llm-serving machine-learning setup setup-development-environment tutorial

Last synced: 09 Apr 2025

https://github.com/pierreolivierbonin/canada-labour-research-assistant

The Canada Labour Research Assistant (CLaRA) is a privacy-first LLM-powered research assistant proposing Easily Verifiable Direct Quotations (EVDQ) to mitigate hallucinations in answering questions about Canadian labour laws, standards, and regulations. It works entirely offline and locally, guaranteeing the confidentiality of your conversations.

chatbot-application chatbot-framework labour labour-relations lcs-algorithm llm llm-inference llm-serving metadata ollama question-answering quotations rag-chatbot retrieval-augmented-generation sentence-transformers source-referencing streamlit string-matching-algorithms vector-database vllm