Projects in Awesome Lists tagged with llm-serving
A curated list of projects in awesome lists tagged with llm-serving.
https://github.com/vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
amd cuda deepseek gpt hpu inference inferentia llama llm llm-serving llmops mlops model-serving pytorch qwen rocm tpu trainium transformer xpu
Last synced: 12 May 2025
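For offline batch inference, a minimal vLLM sketch (the model name is only an example; any supported checkpoint you have access to works):

```python
from vllm import LLM, SamplingParams

# Load a small example model; swap in any supported checkpoint.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=64)

# generate() batches prompts and returns one RequestOutput per prompt.
for out in llm.generate(["The capital of France is"], params):
    print(out.outputs[0].text)
```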
https://github.com/ray-project/ray
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
data-science deep-learning deployment distributed hyperparameter-optimization hyperparameter-search large-language-models llm llm-inference llm-serving machine-learning optimization parallel python pytorch ray reinforcement-learning rllib serving tensorflow
Last synced: 09 Sep 2025
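A minimal sketch of Ray's core task API, which underpins its serving and training libraries:

```python
import ray

ray.init()  # start a local Ray runtime

@ray.remote
def square(x: int) -> int:
    return x * x

# Remote tasks execute in parallel; ray.get gathers the results.
print(ray.get([square.remote(i) for i in range(4)]))  # [0, 1, 4, 9]
```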
https://github.com/liguodongiot/llm-action
This project shares the technical principles behind large language models along with hands-on experience (LLM engineering and productionizing LLM applications).
llm llm-inference llm-serving llm-training llmops
Last synced: 15 May 2025
https://github.com/sgl-project/sglang
SGLang is a fast serving framework for large language models and vision language models.
cuda deepseek deepseek-llm deepseek-r1 deepseek-r1-zero deepseek-v3 inference llama llama3 llama3-1 llava llm llm-serving moe pytorch transformer vlm
Last synced: 12 May 2025
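SGLang exposes an OpenAI-compatible endpoint once launched (e.g. with `python -m sglang.launch_server --model-path <model>`). A client sketch, assuming a server on the default port 30000; the model name is an assumption:

```python
from openai import OpenAI

# Base URL and model name are assumptions; match your server's settings.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```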
https://github.com/bentoml/openllm
Run any open-source LLM, such as DeepSeek or Llama, as an OpenAI-compatible API endpoint in the cloud.
bentoml fine-tuning llama llama2 llama3-1 llama3-2 llama3-2-vision llm llm-inference llm-ops llm-serving llmops mistral mlops model-inference open-source-llm openllm vicuna
Last synced: 23 Oct 2025
https://github.com/skypilot-org/skypilot
SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 16+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
cloud-computing cloud-management cost-management cost-optimization data-science deep-learning distributed-training finops gpu hyperparameter-tuning job-queue job-scheduler llm-serving llm-training machine-learning ml-infrastructure ml-platform multicloud spot-instances tpu
Last synced: 12 May 2025
https://github.com/bentoml/bentoml
The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!
ai-inference deep-learning generative-ai inference-platform llm llm-inference llm-serving llmops machine-learning ml-engineering mlops model-inference-service model-serving multimodal python
Last synced: 12 May 2025
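A minimal BentoML service sketch using the 1.2+ class-based API (the service name and logic are placeholders); saved as service.py, it can be served with `bentoml serve service:Echo`:

```python
import bentoml

@bentoml.service
class Echo:
    # Each @bentoml.api method becomes an HTTP endpoint with typed I/O.
    @bentoml.api
    def echo(self, text: str) -> str:
        return text
```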
https://github.com/superduper-io/superduper
Superduper: End-to-end framework for building custom AI applications and agents.
ai chatbot data database distributed-ml inference llm-inference llm-serving llmops ml mlops mongodb pretrained-models python pytorch rag semantic-search torch transformers vector-search
Last synced: 14 May 2025
https://github.com/predibase/lorax
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
fine-tuning gpt llama llm llm-inference llm-serving llmops lora model-serving pytorch transformers
Last synced: 12 May 2025
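LoRAX keeps one base model resident and attaches a LoRA adapter per request. A hedged request sketch against a running server; the port, route, and adapter id are assumptions based on its TGI-style REST API:

```python
import requests

# adapter_id selects which fine-tuned LoRA to apply for this request;
# the value here is a placeholder.
resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "What is low-rank adaptation?",
        "parameters": {"max_new_tokens": 64, "adapter_id": "some-org/some-lora"},
    },
    timeout=60,
)
print(resp.json()["generated_text"])
```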
https://github.com/gpustack/gpustack
Manage GPU clusters for running AI models
ascend cuda deepseek distributed distributed-inference genai ggml inference llama llamacpp llm llm-inference llm-serving maas metal mindie openai qwen rocm vllm
Last synced: 23 Oct 2025
https://github.com/microsoft/aici
AICI: Prompts as (Wasm) Programs
ai inference language-model llm llm-framework llm-inference llm-serving llmops model-serving rust transformer wasm wasmtime
Last synced: 14 May 2025
https://github.com/moonshotai/moba
MoBA: Mixture of Block Attention for Long-Context LLMs
flash-attention llm llm-serving llm-training moe pytorch transformer
Last synced: 14 May 2025
https://github.com/ray-project/ray-llm
RayLLM - LLMs on Ray
distributed-systems large-language-models llm llm-inference llm-serving llmops ray serving transformers
Last synced: 25 Feb 2025
https://github.com/zhihu/zhilight
A highly optimized LLM inference acceleration engine for Llama and its variants.
cuda deepseek-r1 gpt inference-engine llama llm llm-inference llm-serving model-serving pytorch
Last synced: 15 May 2025
https://github.com/alibaba/rtp-llm
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
gpt inference llama llm llm-serving llmops model-serving
Last synced: 14 Oct 2025
https://github.com/mosecorg/mosec
A high-performance ML model serving framework that offers dynamic batching and CPU/GPU pipelines to fully exploit your compute resources
cv deep-learning gpu hacktoberfest jax llm llm-serving machine-learning machine-learning-platform mlops model-serving mxnet nerual-network python pytorch rust tensorflow tts
Last synced: 14 May 2025
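A minimal mosec worker sketch showing the dynamic batching the description mentions (the batch size is arbitrary):

```python
from mosec import Server, Worker

class Echo(Worker):
    # With max_batch_size > 1, mosec aggregates concurrent requests into a
    # list and expects a response list of the same length.
    def forward(self, data: list) -> list:
        return [{"echo": d} for d in data]

if __name__ == "__main__":
    server = Server()
    server.append_worker(Echo, max_batch_size=8)
    server.run()
```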
https://github.com/vllm-project/vllm-ascend
Community maintained hardware plugin for vLLM on Ascend
ascend inference llm llm-serving llmops mlops model-serving transformer vllm
Last synced: 29 Jun 2025
https://github.com/efeslab/nanoflow
A throughput-oriented high-performance serving framework for LLMs
cuda inference llama2 llm llm-serving model-serving
Last synced: 16 May 2025
https://github.com/rohan-paul/LLM-FineTuning-Large-Language-Models
LLM (Large Language Model) FineTuning
gpt-3 gpt3-turbo large-language-models llama2 llm llm-finetuning llm-inference llm-serving llm-training mistral-7b open-source-llm pytorch
Last synced: 16 Oct 2025
https://github.com/helixml/helix
♾️ Helix is a private GenAI stack for building AI agents with declarative pipelines, knowledge (RAG), API bindings, and first-class testing.
api finetuning function-calling golang gptscript helm k8s llama llm llm-agent llm-serving mistral mixtral openai openapi rag sdxl self-hosted stable-diffusion swagger
Last synced: 03 Nov 2025
https://github.com/hpcaitech/swiftinfer
Efficient AI Inference & Serving
artificial-intelligence deep-learning gpt inference llama llama2 llm-inference llm-serving
Last synced: 05 Apr 2025
https://github.com/ray-project/ray-educational-materials
A suite of hands-on training materials showing how to scale CV, NLP, and time-series forecasting workloads with Ray.
deep-learning distributed-machine-learning generative-ai llm llm-inference llm-serving ray ray-data ray-distributed ray-serve ray-train ray-tune
Last synced: 08 May 2025
https://github.com/torchpipe/torchpipe
Serving inside PyTorch
deployment inference llm-serving pipeline-parallelism pytorch ray serve serving tensorrt torch2trt triton-inference-server
Last synced: 18 Mar 2025
https://github.com/chenhunghan/ialacol
🪶 Lightweight OpenAI drop-in replacement for Kubernetes
ai cloudnative cuda ggml gptq gpu helm kubernetes langchain llamacpp llm llm-inference llm-serving openai python
Last synced: 30 Sep 2025
https://github.com/powerserve-project/powerserve
A high-speed and easy-to-use LLM serving framework for local deployment
llama llm llm-inference llm-serving npu qwen smallthinker smartphone
Last synced: 06 Apr 2025
https://github.com/slai-labs/get-beam
Run GPU inference and training jobs on serverless infrastructure that scales with you.
artificial-intelligence cloud-computing cost-optimization data-science deep-learning distributed-computing gpu-acceleration gpu-computing hpc llm-serving llm-training machine-learning ml-infrastructure mlops python serverless serverless-architectures
Last synced: 18 Apr 2025
https://github.com/em-geeklab/llmone
An enterprise-grade automated LLM deployment tool that makes AI servers truly "plug-and-play".
agent ai-server llm llm-inference llm-serving mindie ollama transformer vllm
Last synced: 07 Oct 2025
https://github.com/thu-pacman/chitu
High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.
deepseek gpu llm llm-serving model-serving pytorch
Last synced: 17 Mar 2025
https://github.com/alibaba/servegen
A framework for generating realistic LLM serving workloads
deepseek llm llm-serving model-serving qwen
Last synced: 14 Oct 2025
https://github.com/nexusgpu/tensor-fusion
Tensor Fusion is a state-of-the-art GPU virtualization and pooling solution designed to optimize GPU cluster utilization to its fullest potential.
ai amd-gpu autoscaling gpu gpu-pooling gpu-scheduling gpu-usage gpu-virtualization inference karpenter llm-serving nvidia pytorch rcuda remote-gpu vgpu
Last synced: 07 Jan 2026
https://github.com/mani-kantap/llm-inference-solutions
A collection of available inference solutions for LLMs
llm-inference llm-serving llmops
Last synced: 11 May 2025
https://github.com/friendliai/friendli-client
Friendli: the fastest serving engine for generative AI
ai generative-ai gpt gpt3 inference inference-engine inference-server llama2 llm llm-inference llm-ops llm-serving llmops llms mistral ml mlops serving stable-diffusion
Last synced: 05 Apr 2025
https://github.com/france-travail/happy_vllm
A production-ready REST API for vLLM
api-rest llm llm-serving production vllm
Last synced: 28 Apr 2025
https://github.com/Jason-cs18/HetServe-Foundation
An Overview of Efficiently Serving Foundation Models across Edge Devices
diffusion-models foundation-models llm-serving survey
Last synced: 12 Jul 2025
https://github.com/zejia-lin/bulletserve
Boosting GPU utilization for LLM serving via dynamic spatial-temporal prefill & decode orchestration
gpu-sharing inference llm llm-serving sglang
Last synced: 22 Nov 2025
https://github.com/ivanlulyf/bunny-llm
Deno LLM API Service
chatgpt chatgpt-api cloudflare-workers-ai llm llm-serving
Last synced: 16 Apr 2025
https://github.com/france-travail/benchmark_llm_serving
A library for benchmarking LLMs through their exposed APIs
benchmark llm llm-serving vllm
Last synced: 06 Jul 2025
https://github.com/fork123aniket/llm-rag-powered-qa-app
A Production-Ready, Scalable RAG-powered LLM-based Context-Aware QA App
context-aware-system eleutherai fine-tuning large-language-models llm-inference llm-serving llm-training llmops parameter-efficient-fine-tuning question-answering ray ray-serve retrieval-augmented-generation
Last synced: 31 Jul 2025
https://github.com/unaidedelf8777/faster-outlines
A lazy, high-throughput, and blazing-fast structured text generation backend.
ai llama llm llm-serving llmops model-serving performance transformer
Last synced: 27 Jun 2025
https://github.com/nuhmanpk/quick-llama
Run Ollama models anywhere easily
colab langchain-python llama llm llm-agents llm-serving ollama ollama-api ollama-client ollama-python open-ai pypi
Last synced: 07 Jul 2025
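quick-llama wraps Ollama so it can run in notebook environments; once an Ollama server is reachable, the official ollama Python client can talk to it. A sketch under that assumption (the model name is a placeholder):

```python
import ollama  # pip install ollama; assumes a running Ollama server

# Pull the model first (e.g. `ollama pull llama3.2`); the name is a placeholder.
reply = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(reply["message"]["content"])
```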
https://github.com/loopglitch26/hinglish-ai-mentor
Hinglish chatbot powered by Azure Cognitive Services, Google Translate, and OpenAI
azure google llm-serving nlp-machine-learning prompt-engineering
Last synced: 10 Jul 2025
https://github.com/george-mountain/web-app-builder--llm
Build static web applications using large language models: turn hand-sketched documents, images, and screenshots into proper web pages.
ai llm llm-serving pypi pypi-package streamlit
Last synced: 07 Apr 2025
https://github.com/scale-snu/layered-prefill
Layered prefill changes the scheduling axis from tokens to layers and removes redundant MoE weight reloads while keeping decode stall-free. The result is lower TTFT (time to first token), lower end-to-end latency, and lower energy per token without hurting TBT (time between tokens) stability.
inference llm llm-infernece llm-serving moe vllm
Last synced: 22 Nov 2025
https://github.com/CentML/llm-inference-bench
A lightweight and extensible LLM inference serving benchmark tool written in Rust.
benchmarking llm-inference llm-serving
Last synced: 23 Apr 2025
https://github.com/biosfood/intel-llm-guide
A guide on how to run LLMs on Intel CPUs
guide intel llm llm-inference llm-serving machine-learning setup setup-development-environment tutorial
Last synced: 09 Apr 2025
https://github.com/pierreolivierbonin/canada-labour-research-assistant
The Canada Labour Research Assistant (CLaRA) is a privacy-first LLM-powered research assistant proposing Easily Verifiable Direct Quotations (EVDQ) to mitigate hallucinations in answering questions about Canadian labour laws, standards, and regulations. It works entirely offline and locally, guaranteeing the confidentiality of your conversations.
chatbot-application chatbot-framework labour labour-relations lcs-algorithm llm llm-inference llm-serving metadata ollama question-answering quotations rag-chatbot retrieval-augmented-generation sentence-transformers source-referencing streamlit string-matching-algorithms vector-database vllm
Last synced: 23 Jun 2025
https://github.com/johnclaw/llama-3.2-1b.vb
One-file Llama 3.2 1B FP16 CPU inference in pure VB.NET
basic-programming cpu-inference fp16 inference inference-engine llama llama3 llama3-2 llm llm-inference llm-serving llms vb-net vbnet visual-basic-dot-net visual-basic-net
Last synced: 14 Jun 2025
https://github.com/johnclaw/gemma-2-2b-it.cs
Gemma-2-2B-it INT8 CPU inference in a single file of pure C#
cpu-inference csharp gemma gemma2 gemma2-2b-it inference inference-engine int8 int8-inference int8-quantization llm llm-inference llm-serving llms quantization
Last synced: 22 Jul 2025
https://github.com/ivynya/illm
Internet LLM: access your Ollama (or any other local LLM) instance from across the internet
llm-serving ollama ollama-interface
Last synced: 21 Mar 2025
https://github.com/ajithvcoder/tsai-emlo-4.0
Contains solutions for assignments and learning notes from the Extensive Machine Learning Operations course of The School of AI
aws aws-cdk aws-ec2 aws-ecr aws-lambda ci-cd cml dvc-pipeline github-actions gpu-instancing gradio litserve llm-serving mlops mlops-workflow pytorch-lightning torch torchscript torchserve
Last synced: 02 Mar 2025
https://github.com/george-mountain/llm-local-streaming
Real-time streaming of LLM responses using FastAPI and Streamlit.
ai fastapi llm llm-serving llm-streaming streamlit
Last synced: 30 Jul 2025
https://github.com/jbchouinard/llmailbot
A service that enables chatting with LLMs via email.
email langchain langchain-python large-language-models llm llm-serving llms python
Last synced: 22 Mar 2025