An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with llm-inference

A curated list of projects in awesome lists tagged with llm-inference.

https://github.com/nomic-ai/gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.

ai-chat llm-inference

Last synced: 24 Sep 2025

https://github.com/liguodongiot/llm-action

This project aims to share the technical principles behind large language models along with hands-on experience (LLM engineering and putting LLM applications into production).

llm llm-inference llm-serving llm-training llmops

Last synced: 15 May 2025

https://github.com/lightning-ai/litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.

ai artificial-intelligence deep-learning large-language-models llm llm-inference llms

Last synced: 12 May 2025

https://github.com/bentoml/openllm

Run any open-source LLM, such as DeepSeek and Llama, as an OpenAI-compatible API endpoint in the cloud.

bentoml fine-tuning llama llama2 llama3-1 llama3-2 llama3-2-vision llm llm-inference llm-ops llm-serving llmops mistral mlops model-inference open-source-llm openllm vicuna

Last synced: 23 Oct 2025
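
Servers like OpenLLM speak the standard OpenAI chat-completions wire format, so any HTTP client can drive them. The sketch below only builds the request body; the model name and endpoint are placeholders, not values OpenLLM guarantees:

```python
import json

def chat_completion_payload(model, messages, temperature=0.7):
    """Build the JSON body for an OpenAI-compatible /v1/chat/completions call."""
    return {"model": model, "messages": messages, "temperature": temperature}

# Placeholder model name -- substitute whatever your server actually loads,
# then POST `body` to http://<host>:<port>/v1/chat/completions.
payload = chat_completion_payload(
    "llama-3.1-8b-instruct",
    [{"role": "user", "content": "Hello!"}],
)
body = json.dumps(payload)
```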

https://github.com/mistralai/mistral-inference

Official inference library for Mistral models

llm llm-inference mistralai

Last synced: 12 May 2025

https://github.com/sjtu-ipads/powerinfer

High-speed Large Language Model Serving for Local Deployment

large-language-models llama llm llm-inference local-inference

Last synced: 12 May 2025

https://github.com/bentoml/bentoml

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

ai-inference deep-learning generative-ai inference-platform llm llm-inference llm-serving llmops machine-learning ml-engineering mlops model-inference-service model-serving multimodal python

Last synced: 12 May 2025

https://github.com/internlm/lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

codellama cuda-kernels deepspeed fastertransformer internlm llama llama2 llama3 llm llm-inference turbomind

Last synced: 24 Dec 2025

https://github.com/deftruth/awesome-llm-inference

📖A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc. 🎉🎉

awesome-llm deepseek deepseek-r1 deepseek-v3 flash-attention flash-attention-3 flash-mla llm-inference minimax-01 mla paged-attention tensorrt-llm vllm

Last synced: 04 Apr 2025

https://github.com/nvidia/generativeaiexamples

Generative AI reference workflows optimized for accelerated infrastructure and microservice architecture.

gpu-acceleration large-language-models llm llm-inference microservice nemo rag retrieval-augmented-generation tensorrt triton-inference-server

Last synced: 13 May 2025

https://github.com/predibase/lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

fine-tuning gpt llama llm llm-inference llm-serving llmops lora model-serving pytorch transformers

Last synced: 12 May 2025
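
At serving time, a LoRA adapter is just a low-rank update applied on top of a shared base weight, which is what lets one server hot-swap thousands of adapters. A toy, pure-Python sketch of the math (shapes and names are illustrative, not LoRAX's API):

```python
def matmul(A, B):
    """Plain-Python matrix multiply over lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def apply_lora(W, A, B, alpha):
    """Return W + (alpha / r) * (B @ A), the effective weight of one adapter.

    W: (d_out x d_in) base weight; B: (d_out x r); A: (r x d_in); r is the rank.
    """
    r = len(A)
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

# Tiny example: d_out = d_in = 2, rank r = 1. The base weight stays untouched,
# so many adapters can share it.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # r x d_in
B = [[1.0], [0.0]]    # d_out x r
W_eff = apply_lora(W, A, B, alpha=1.0)
```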

https://github.com/flashinfer-ai/flashinfer

FlashInfer: Kernel Library for LLM Serving

attention cuda gpu jit large-large-models llm-inference nvidia pytorch

Last synced: 05 Jan 2026

https://github.com/databricks/dbrx

Code examples and resources for DBRX, a large language model developed by Databricks

databricks gen-ai generative-ai llm llm-inference llm-training mosaic-ai

Last synced: 25 Oct 2025

https://github.com/fasterdecoding/medusa

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads

llm llm-inference

Last synced: 14 May 2025
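
Medusa's extra decoding heads draft several future tokens that the base model then verifies in one pass. The toy loop below shows only the generic draft-then-verify acceptance rule, with deterministic stand-in "models", not Medusa's actual tree attention:

```python
def speculative_step(prefix, draft_fn, target_fn, k=4):
    """One draft-then-verify step over integer 'tokens'.

    draft_fn cheaply proposes the next token; target_fn is the expensive
    model's (deterministic, in this toy) choice. Draft k tokens, accept the
    longest prefix the target agrees with, then append one target token, so
    every step emits at least one correct token.
    """
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_fn(ctx)
        drafted.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in drafted:
        if target_fn(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)
    accepted.append(target_fn(ctx))  # the target's own next token
    return accepted

# Toy models: the target emits the context length; the draft agrees only
# while the context is shorter than 3 tokens, then starts guessing wrong.
target = lambda ctx: len(ctx)
draft = lambda ctx: len(ctx) if len(ctx) < 3 else 99
tokens = speculative_step([0], draft, target, k=4)  # accepts 2 drafts + 1 target token
```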

https://github.com/intel/intel-extension-for-transformers

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡

4-bits autoround chatbot chatpdf gaudi3 habana intel-optimized-llamacpp large-language-model llm-cpu llm-inference neural-chat neural-chat-7b rag retrieval speculative-decoding streamingllm

Last synced: 24 Feb 2025

https://github.com/b4rtaz/distributed-llama

Connect home devices into a powerful cluster to accelerate LLM inference. More devices mean faster inference.

distributed-computing distributed-llm llama2 llama3 llm llm-inference llms neural-network open-llm

Last synced: 13 Apr 2025

https://github.com/liltom-eth/llama2-webui

Run any Llama 2 model locally with a Gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Use `llama2-wrapper` as your local llama2 backend for Generative Agents/Apps.

llama-2 llama2 llm llm-inference

Last synced: 14 May 2025

https://github.com/sauravpanda/browserai

Run local LLMs like llama, deepseek-distill, kokoro and more inside your browser

agents ai llama llm llm-inference local localllm tts webgpu

Last synced: 14 May 2025

https://github.com/PySpur-Dev/PySpur

Graph-Based Editor for LLM Workflows

agent agents ai javascript llm llm-inference openai python react workflow

Last synced: 14 Sep 2025

https://github.com/lean-dojo/LeanCopilot

LLMs as Copilots for Theorem Proving in Lean

formal-mathematics lean lean4 llm-inference machine-learning theorem-proving

Last synced: 09 Jul 2025

https://github.com/zhihu/zhilight

A highly optimized LLM inference acceleration engine for Llama and its variants.

cuda deepseek-r1 gpt inference-engine llama llm llm-inference llm-serving model-serving pytorch

Last synced: 15 May 2025

https://github.com/harleyszhang/llm_note

LLM notes, including model inference, transformer model structure, and llm framework code analysis notes.

cuda-programming kv-cache llm llm-inference transformer-models triton-kernels vllm

Last synced: 23 Aug 2025

https://github.com/SafeAILab/EAGLE

Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24)

large-language-models llm-inference speculative-decoding

Last synced: 20 Mar 2025

https://github.com/katanemo/archgw

Arch is an intelligent prompt gateway. Engineered with (fast) LLMs for the secure handling, robust observability, and seamless integration of prompts with your APIs - outside business logic. Built by the core contributors of Envoy proxy, on Envoy.

ai-gateway envoy envoyproxy gateway generative-ai llm-gateway llm-inference llm-routing llmops llms openai prompt proxy proxy-server routing

Last synced: 21 Oct 2025

https://github.com/inspector-apm/neuron-ai

The PHP Agent Development Kit - powered by Inspector.dev

agent agentic-ai agentic-framework agents ai llm llm-inference llms php vector-database

Last synced: 21 Jun 2025

https://github.com/eastriverlee/llm.swift

LLM.swift is a simple and readable library that allows you to interact with large language models locally with ease for macOS, iOS, watchOS, tvOS, and visionOS.

gguf ios llm llm-inference macos swift tvos visionos watchos

Last synced: 15 May 2025

https://github.com/foldl/chatllm.cpp

Pure C++ implementation of several models for real-time chatting on your computer (CPU & GPU)

llm llm-inference

Last synced: 15 May 2025

https://github.com/zeux/calm

CUDA/Metal accelerated language model inference

cuda llm-inference ml

Last synced: 10 Apr 2025

https://github.com/nano-collective/nanocoder

A beautiful local-first coding agent running in your terminal - built by the community for the community ⚒

ai ai-agents ai-coding coding-agents llm llm-inference ollama openai openrouter

Last synced: 05 Jan 2026

https://github.com/stanford-mast/blast

Browser-LLM Auto-Scaling Technology

ai-agents browser-automation llm-inference python

Last synced: 11 May 2025

https://github.com/michael-a-kuykendall/shimmy

⚡ Python-free Rust inference server — OpenAI-API compatible. GGUF + SafeTensors, hot model swap, auto-discovery, single binary. FREE now, FREE forever.

api-server command-line-tool developer-tools gguf huggingface huggingface-models huggingface-transformers inference-server llama llamacpp llm-inference local-ai lora machine-learning ollama-api openai-compatible rust rust-crate transformers

Last synced: 13 Sep 2025

https://github.com/feifeibear/long-context-attention

USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference

attention-is-all-you-need deepspeed-ulysses llm-inference llm-training pytorch ring-attention

Last synced: 14 May 2025

https://github.com/flagai-open/aquila2

The official repo of Aquila2 series proposed by BAAI, including pretrained & chat large language models.

llm llm-inference llm-training

Last synced: 15 May 2025

https://github.com/vectorch-ai/ScaleLLM

A high-performance inference system for large language models, designed for production environments.

cuda efficiency gpu inference llama llama3 llm llm-inference model performance production serving speculative transformer

Last synced: 09 May 2025

https://github.com/rizerphe/local-llm-function-calling

A tool for generating function arguments and choosing what function to call with local LLMs

chatgpt-functions huggingface-transformers json-schema llm llm-inference openai-function-call openai-functions

Last synced: 26 Oct 2025
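
Tools in this space constrain or check a model's output against a JSON schema before dispatching a function call. Below is a minimal, hypothetical validator for a tiny schema subset -- it is not this library's API, just an illustration of the idea:

```python
def validate_args(schema, args):
    """Check generated arguments against a tiny subset of JSON Schema.

    Supports only {"type": "object", "properties": {...}, "required": [...]}
    with string/number leaf types -- enough to show how constrained-generation
    tools reject malformed function-call arguments.
    """
    type_map = {"string": str, "number": (int, float)}
    for name in schema.get("required", []):
        if name not in args:
            return False
    for name, value in args.items():
        prop = schema["properties"].get(name)
        if prop is None:
            return False
        if not isinstance(value, type_map[prop["type"]]):
            return False
    return True

# Hypothetical function schema for illustration.
weather_schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "days": {"type": "number"}},
    "required": ["city"],
}
ok = validate_args(weather_schema, {"city": "Oslo", "days": 3})
bad = validate_args(weather_schema, {"days": 3})  # missing required "city"
```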

https://github.com/preternatural-explore/mlx-swift-chat

A multi-platform SwiftUI frontend for running local LLMs with Apple's MLX framework.

ios llm-inference macos mlx mlx-swift swiftui

Last synced: 10 Apr 2025

https://github.com/jax-ml/scaling-book

Home for "How To Scale Your Model", a short blog-style textbook about scaling LLMs on TPUs

jax llm-inference llms roofline tpus

Last synced: 18 Jun 2025

https://github.com/felladrin/minisearch

Minimalist web-searching platform with an AI assistant that runs directly from your browser. Uses WebLLM, Wllama and SearXNG. Demo: https://felladrin-minisearch.hf.space

ai ai-search-engine artificial-intelligence generative-ai gpu-accelerated information-retrieval llm llm-inference metasearch metasearch-engine perplexity perplexity-ai question-answering rag retrieval-augmented-generation searxng web-llm web-search webapp wllama

Last synced: 12 Apr 2025

https://github.com/nvidia/star-attention

Efficient LLM Inference over Long Sequences

attention-mechanism large-language-models llm-inference

Last synced: 16 May 2025

https://github.com/microsoft/sarathi-serve

A low-latency & high-throughput serving engine for LLMs

llama llm-inference pytorch transformer

Last synced: 16 May 2025

https://github.com/intel/neural-speed

An innovative library for efficient LLM inference via low-bit quantization

cpu fp4 fp8 gaudi2 gpu int1 int2 int3 int4 int5 int6 int7 int8 llamacpp llm-fine-tuning llm-inference low-bit mxformat nf4 sparsity

Last synced: 25 Oct 2025
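
The int4/int8 formats such libraries rely on boil down to scaling floats into a small integer range and back. A toy symmetric int4 round trip (illustrative only; not neural-speed's actual storage layout):

```python
def quantize_int4(values):
    """Symmetric per-tensor int4 quantization: map floats into [-8, 7].

    Returns (quantized ints, scale). Dequantizing gives each value back to
    within about scale/2 of the original.
    """
    scale = max(abs(v) for v in values) / 7 or 1.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.9, -0.35, 0.07, -0.7]
q, scale = quantize_int4(weights)
approx = dequantize(q, scale)  # close to `weights`, stored in 4 bits each
```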

https://github.com/zjhellofss/KuiperLLama

A great project for campus recruiting, fall/spring hiring, and internships: build an LLM inference framework from scratch that supports Llama 2/3 and Qwen2.5.

cpp cuda inference-engine llama2 llama3 llm llm-inference qwen qwen2

Last synced: 08 Sep 2025

https://github.com/devflowinc/uzi

CLI for running large numbers of coding agents in parallel with git worktrees

agentic-ai ai codegen go golang llm llm-inference parallelization

Last synced: 24 Jun 2025

https://github.com/ai-hypercomputer/jetstream

JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs welcome).

gemma gpt gpu inference jax large-language-models llama llama2 llm llm-inference llmops mlops model-serving pytorch tpu transformer

Last synced: 23 Oct 2025

https://github.com/ugorsahin/TalkingHeads

A library to communicate with ChatGPT, Claude, Copilot, Gemini, HuggingChat, and Pi

browser-automation chatgpt chatgpt-api claude copilot free gemini huggingchat llm-inference python selenium undetected-chromedriver

Last synced: 11 Apr 2025

https://github.com/alipay/painlessinferenceacceleration

Accelerate inference without tears

llm-inference

Last synced: 16 May 2025

https://github.com/unifyai/unify

Notion for AI Observability 📊

ai claude gpt gpt-4 llama2 llm llm-inference llms mixtral openai python

Last synced: 15 May 2025

https://github.com/morpheuslord/hackbot

AI-powered cybersecurity chatbot designed to provide helpful and accurate answers to your cybersecurity-related queries and also do code analysis and scan analysis.

ai automation chatbot cli-chat-app cybersecurity cybersecurity-education cybersecurity-tools llama-api llama2 llama2-7b llamacpp llm-inference runpod

Last synced: 08 Oct 2025

https://github.com/ray-project/ray-educational-materials

This is suite of the hands-on training materials that shows how to scale CV, NLP, time-series forecasting workloads with Ray.

deep-learning distributed-machine-learning generative-ai llm llm-inference llm-serving ray ray-data ray-distributed ray-serve ray-train ray-tune

Last synced: 08 May 2025

https://github.com/andrewkchan/deepseek.cpp

CPU inference for the DeepSeek family of large language models in pure C++

cpp deepseek llama llm llm-inference machine-learning transformers

Last synced: 16 May 2025

https://github.com/andrewkchan/yalm

Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O

cpp cuda inference-engine llama llamacpp llm llm-inference machine-learning mistral

Last synced: 12 Apr 2025

https://github.com/armbues/SiLLM

SiLLM simplifies the process of training and running Large Language Models (LLMs) on Apple Silicon by leveraging the MLX framework.

apple-silicon dpo large-language-models llm llm-inference llm-training lora mlx

Last synced: 18 Jul 2025

https://github.com/structuredllm/syncode

Efficient and general syntactical decoding for Large Language Models

grammar large-language-models llm llm-inference parser

Last synced: 11 May 2025

https://github.com/modelscope/dash-infer

DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including CUDA, x86 and ARMv9.

cpu cuda guided-decoding llm llm-inference native-engine

Last synced: 12 Apr 2025

https://github.com/inferflow/inferflow

Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).

baichuan2 bloom deepseek falcon gemma internlm llama2 llamacpp llm-inference m2m100 minicpm mistral mixtral mixture-of-experts model-quantization moe multi-gpu-inference phi-2 qwen

Last synced: 07 Apr 2025

https://github.com/expectedparrot/edsl

Design, conduct and analyze results of AI-powered surveys and experiments. Simulate social science and market research with large numbers of AI agents and LLMs.

anthropic data-labeling deepinfra domain-specific-language experiments llama2 llm llm-agent llm-framework llm-inference market-research mixtral open-source openai python social-science surveys synthetic-data

Last synced: 15 May 2025