An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with inference

A curated list of projects in awesome lists tagged with inference.

https://github.com/vllm-project/vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

amd cuda deepseek gpt hpu inference inferentia llama llm llm-serving llmops mlops model-serving pytorch qwen rocm tpu trainium transformer xpu

Last synced: 29 Jan 2026

https://github.com/ggml-org/whisper.cpp

Port of OpenAI's Whisper model in C/C++

inference openai speech-recognition speech-to-text transformer whisper

Last synced: 16 Jan 2026

https://github.com/deepspeedai/deepspeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

billion-parameters compression data-parallelism deep-learning gpu inference machine-learning mixture-of-experts model-parallelism pipeline-parallelism pytorch trillion-parameters zero

Last synced: 15 Jan 2026

https://github.com/microsoft/DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

billion-parameters compression data-parallelism deep-learning gpu inference machine-learning mixture-of-experts model-parallelism pipeline-parallelism pytorch trillion-parameters zero

Last synced: 02 Apr 2025

https://github.com/deepspeedai/DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

billion-parameters compression data-parallelism deep-learning gpu inference machine-learning mixture-of-experts model-parallelism pipeline-parallelism pytorch trillion-parameters zero

Last synced: 19 Oct 2025

https://github.com/sgl-project/sglang

SGLang is a high-performance serving framework for large language models and multimodal models.

attention blackwell cuda deepseek diffusion glm gpt-oss inference llama llm minimax moe qwen qwen-image reinforcement-learning transformer vlm wan

Last synced: 16 May 2026

https://github.com/tencent/ncnn

ncnn is a high-performance neural network inference framework optimized for the mobile platform

android arm-neon artificial-intelligence caffe darknet deep-learning high-preformance inference ios keras mlir mxnet ncnn neural-network onnx pytorch riscv simd tensorflow vulkan

Last synced: 09 Sep 2025

https://github.com/Tencent/ncnn

ncnn is a high-performance neural network inference framework optimized for the mobile platform

android arm-neon artificial-intelligence caffe darknet deep-learning high-preformance inference ios keras mlir mxnet ncnn neural-network onnx pytorch riscv simd tensorflow vulkan

Last synced: 14 Mar 2025

https://github.com/gvergnaud/ts-pattern

🎨 The exhaustive Pattern Matching library for TypeScript, with smart type inference.

branching conditions exhaustive inference javascript matching pattern pattern-matching ts type-inference typescript

Last synced: 14 May 2025

https://github.com/nvidia/tensorrt

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.

deep-learning gpu-acceleration inference nvidia tensorrt

Last synced: 09 Sep 2025

https://github.com/aws/amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.

aws data-science deep-learning examples inference jupyter-notebook machine-learning mlops reinforcement-learning sagemaker training

Last synced: 13 May 2025

https://github.com/NVIDIA/TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.

deep-learning gpu-acceleration inference nvidia tensorrt

Last synced: 20 Mar 2025

https://github.com/huggingface/text-generation-inference

Large Language Model Text Generation Inference

bloom deep-learning falcon gpt inference nlp pytorch starcoder transformer

Last synced: 13 May 2025

https://github.com/xorbitsai/inference

Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source, speech, and multimodal models on cloud, on-prem, or your laptop — all through one unified, production-ready inference API.

artificial-intelligence chatglm deployment flan-t5 gemma ggml glm4 inference llama llama3 llamacpp llm machine-learning mistral openai-api pytorch qwen vllm whisper wizardlm

Last synced: 25 Apr 2026

https://github.com/triton-inference-server/server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.

cloud datacenter deep-learning edge gpu inference machine-learning

Last synced: 24 Dec 2025

https://github.com/oumi-ai/oumi

Easily fine-tune, evaluate and deploy Qwen3, DeepSeek-R1, Llama 4 or any open source LLM / VLM!

dpo evaluation fine-tuning inference llama llms sft vlms

Last synced: 29 Jan 2026

https://github.com/lmcache/lmcache

Supercharge Your LLM with the Fastest KV Cache Layer

amd cuda fast inference kv-cache llm pytorch rocm speed vllm

Last synced: 03 Jul 2026

https://github.com/linzaer/ultra-light-fast-generic-face-detector-1mb

💎 1MB lightweight face detection model

arm face-detection inference mnn ncnn

Last synced: 14 May 2025

https://github.com/Linzaer/Ultra-Light-Fast-Generic-Face-Detector-1MB

💎 1MB lightweight face detection model

arm face-detection inference mnn ncnn

Last synced: 14 Mar 2025

https://github.com/gcanti/io-ts

Runtime type system for IO decoding/encoding

inference runtime types typescript validation

Last synced: 14 May 2025

https://gcanti.github.io/io-ts/

Runtime type system for IO decoding/encoding

inference runtime types typescript validation

Last synced: 08 Apr 2025

https://github.com/trusted-ai/adversarial-robustness-toolbox

Adversarial Robustness Toolbox (ART) - Python Library for Machine Learning Security - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams

adversarial-attacks adversarial-examples adversarial-machine-learning ai artificial-intelligence attack blue-team evasion extraction inference machine-learning poisoning privacy python red-team trusted-ai trustworthy-ai

Last synced: 13 May 2025

https://github.com/Trusted-AI/adversarial-robustness-toolbox

Adversarial Robustness Toolbox (ART) - Python Library for Machine Learning Security - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams

adversarial-attacks adversarial-examples adversarial-machine-learning ai artificial-intelligence attack blue-team evasion extraction inference machine-learning poisoning privacy python red-team trusted-ai trustworthy-ai

Last synced: 23 Mar 2025

https://github.com/gpustack/gpustack

A GPU cluster manager that configures and orchestrates inference engines like vLLM and SGLang for high-performance AI model deployment.

ascend cuda deepseek distributed-inference genai high-performance-inference inference llama llm llm-inference llm-serving maas mindie openai qwen rocm sglang vllm

Last synced: 20 Apr 2026

https://github.com/autogptq/autogptq

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.

deep-learning inference large-language-models llms nlp pytorch quantization transformer transformers

Last synced: 08 Apr 2025

https://github.com/AutoGPTQ/AutoGPTQ

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.

deep-learning inference large-language-models llms nlp pytorch quantization transformer transformers

Last synced: 14 Mar 2025

https://github.com/nvidia-ai-iot/torch2trt

An easy to use PyTorch to TensorRT converter

classification inference jetson-nano jetson-tx2 jetson-xavier pytorch tensorrt

Last synced: 14 May 2025

https://github.com/NVIDIA-AI-IOT/torch2trt

An easy to use PyTorch to TensorRT converter

classification inference jetson-nano jetson-tx2 jetson-xavier pytorch tensorrt

Last synced: 20 Mar 2025

https://github.com/argmaxinc/whisperkit

On-device Speech Recognition for Apple Silicon

inference ios macos speech-recognition swift transformers visionos watchos whisper

Last synced: 13 May 2025

https://github.com/tencent/tnn

TNN: developed by Tencent Youtu Lab and Guangying Lab, a uniform deep learning inference framework for mobile, desktop, and server. TNN is distinguished by several outstanding features, including its cross-platform capability, high performance, model compression, and code pruning. Based on ncnn and Rapidnet, TNN further strengthens the support and performance optimization for mobile devices, and also draws on the good extensibility and high performance of existing open source efforts. TNN has been deployed in multiple apps from Tencent, such as Mobile QQ, Weishi, and Pitu. Contributions are welcome: collaborate with us to make TNN a better framework.

coreml deep-learning face-detection hairsegmentaion inference mnn ncnn ocr openvino pytorch tengine tensorflow tensorrt

Last synced: 13 May 2025

https://github.com/Tencent/TNN

TNN: developed by Tencent Youtu Lab and Guangying Lab, a uniform deep learning inference framework for mobile, desktop, and server. TNN is distinguished by several outstanding features, including its cross-platform capability, high performance, model compression, and code pruning. Based on ncnn and Rapidnet, TNN further strengthens the support and performance optimization for mobile devices, and also draws on the good extensibility and high performance of existing open source efforts. TNN has been deployed in multiple apps from Tencent, such as Mobile QQ, Weishi, and Pitu. Contributions are welcome: collaborate with us to make TNN a better framework.

coreml deep-learning face-detection hairsegmentaion inference mnn ncnn ocr openvino pytorch tengine tensorflow tensorrt

Last synced: 20 Mar 2025

https://github.com/argmaxinc/WhisperKit

On-device Speech Recognition for Apple Silicon

inference ios macos speech-recognition swift transformers visionos watchos whisper

Last synced: 28 Mar 2025

https://github.com/opencsgs/csghub

CSGHub is a brand-new open-source platform for managing LLMs, developed by the OpenCSG team. It offers both open-source and on-premise/SaaS solutions, with features comparable to Hugging Face. Gain full control over the lifecycle of LLMs, datasets, and agents, with Python SDK compatibility with Hugging Face. Join us! ⭐️

ai asset-management dataset deepseek deploy finetune git huggingface inference llm management-system model platform prompt ray space

Last synced: 16 Jun 2026

https://github.com/tencentmusic/cube-studio

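Cube Studio is an open-source, cloud-native, one-stop AI platform for machine learning, deep learning, and large models. It supports SSO login, multi-tenancy, big data platform integration, online notebook development, drag-and-drop task-flow pipeline orchestration, multi-node multi-GPU distributed training, hyperparameter search, VGPU inference serving, edge computing, serverless, a labeling platform, automated labeling, dataset management, large-model fine-tuning, vLLM large-model inference, LLMOps, private knowledge bases, and an AI model app store. It also supports one-click model development, inference, and fine-tuning, Chinese domestic CPU/GPU/NPU chips, RDMA, and distributed pytorch/tf/mxnet/deepspeed/paddle/colossalai/horovod/spark/ray/volcano.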

ai aihub argo automl gpt inference kubeflow kubernetes llmops mlops notebook pipeline pytorch spark vgpu workflow

Last synced: 06 Feb 2026

https://github.com/bytedance/lightseq

LightSeq: A High Performance Library for Sequence Processing and Generation

accelerate bart beam-search bert cuda diverse-decoding gpt inference multilingual-nmt sampling training transformer

Last synced: 14 May 2025

https://github.com/huggingface/optimum

🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy to use hardware optimization tools

graphcore habana inference intel onnx onnxruntime optimization pytorch quantization tflite training transformers

Last synced: 14 May 2025

https://github.com/openvinotoolkit/openvino_notebooks

📚 Jupyter notebook tutorials for OpenVINO™

computer-vision deep-learning inference machine-learning openvino

Last synced: 01 Jul 2025

https://github.com/zjhellofss/kuiperinfer

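A good project for campus, autumn, and spring recruitment and for internships! It walks you through building a high-performance deep learning inference library from scratch, with inference support for models such as llama2 (a large model), Unet, Yolov5, and Resnet. Implement a high-performance deep learning inference library step by step.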

caffe convolution deep-learning deep-neural-networks diy graph-algorithms inference inference-engine maxpooling ncnn pnnx pytorch relu resnet sigmoid yolo yolov5

Last synced: 14 May 2025

https://github.com/zjhellofss/KuiperInfer

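A good project for campus, autumn, and spring recruitment and for internships! It walks you through building a high-performance deep learning inference library from scratch, with inference support for models such as llama2 (a large model), Unet, Yolov5, and Resnet. Implement a high-performance deep learning inference library step by step.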

caffe convolution deep-learning deep-neural-networks diy graph-algorithms inference inference-engine maxpooling ncnn pnnx pytorch relu resnet sigmoid yolo yolov5

Last synced: 19 Mar 2025

https://github.com/raullenchai/rapid-mlx

The fastest local AI engine for Apple Silicon. 4.2x faster than Ollama, 0.08s cached TTFT, 100% tool calling. 17 tool parsers, prompt cache, reasoning separation, cloud routing. Drop-in OpenAI replacement. Works with Claude Code, Cursor, Aider.

apple-silicon claude-code cursor deepseek fastapi hacktoberfest inference llm local-llm m1 m2 m3 macos mlx ollama-alternative openai-api python qwen tool-calling

Last synced: 12 Jun 2026

https://github.com/Andyyyy64/whichllm

Find the local LLM that actually runs and performs best on your hardware. Ranked by real, recency-aware benchmarks, not parameter count. One command, run it instantly.

ai apple-silicon benchmarks cli command-line-tool gguf gpu huggingface inference llm local-llm ollama python vram

Last synced: 09 Jun 2026

https://github.com/huggingface/huggingface.js

Use Hugging Face with JavaScript

api-client hub huggingface inference machine-learning

Last synced: 10 Jun 2026

https://github.com/zml/zml

Any model. Any hardware. Zero compromise. Built with @ziglang / @openxla / MLIR / @bazelbuild

ai bazel hpc inference xla zig

Last synced: 12 Apr 2025

https://github.com/pytorch/ao

PyTorch native quantization and sparsity for training and inference

brrr cuda dtypes float8 inference llama mx offloading optimizer pytorch quantization sparsity training transformer

Last synced: 12 May 2025

https://github.com/deepspeedai/deepspeed-mii

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

deep-learning inference pytorch

Last synced: 29 Apr 2025

https://github.com/microsoft/DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

deep-learning inference pytorch

Last synced: 05 Apr 2025

https://github.com/deepspeedai/DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

deep-learning inference pytorch

Last synced: 14 Mar 2025

https://github.com/dstackai/dstack

dstack is an open-source alternative to Kubernetes and Slurm, designed to simplify GPU allocation and AI workload orchestration for ML teams across top clouds, on-prem clusters, and accelerators.

amd aws azure cloud docker fine-tuning gcp gpu inference k8s kubernetes llms machine-learning nvidia orchestration python slurm training

Last synced: 21 Jan 2026

https://github.com/els-rd/transformer-deploy

Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀

deep-learning deployment inference machine-learning natural-language-processing server

Last synced: 14 May 2025

https://github.com/ELS-RD/transformer-deploy

Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀

deep-learning deployment inference machine-learning natural-language-processing server

Last synced: 03 Apr 2025

https://github.com/trymirai/uzu

A high-performance inference engine for AI models

ai high-performance inference llm metal rust tts

Last synced: 08 Jun 2026

https://github.com/Delta-ML/delta

DELTA is a deep learning based natural language and speech processing platform. LF AI & DATA Projects: https://lfaidata.foundation/projects/delta/

asr custom-ops deep-learning emotion-recognition front-end inference nlp nlu ops seq2seq sequence-to-sequence serving speaker-verification speech speech-recognition tensorflow tensorflow-lite tensorflow-serving text-classification text-generation

Last synced: 07 Apr 2025

https://github.com/tencent/turbotransformers

A fast and user-friendly runtime for transformer inference (BERT, ALBERT, GPT2, decoders, etc.) on CPU and GPU.

albert bert decoder gpt2 gpu huggingface-transformers inference machine-translation nlp pytorch roberta transformer

Last synced: 15 May 2025

https://github.com/Tencent/TurboTransformers

A fast and user-friendly runtime for transformer inference (BERT, ALBERT, GPT2, decoders, etc.) on CPU and GPU.

albert bert decoder gpt2 gpu huggingface-transformers inference machine-translation nlp pytorch roberta transformer

Last synced: 19 Mar 2025

https://github.com/ebhy/budgetml

Deploy an ML inference service on a budget in fewer than 10 lines of code.

api data-science deployment fastapi inference machine-learning mlops

Last synced: 15 May 2025

https://github.com/pykeio/ort

Fast ML inference & training for ONNX models in Rust

ai ai-training fine-tuning inference machine-learning onnx onnxruntime rust

Last synced: 19 Oct 2025

https://github.com/0xCrunchyy/10x

Optimized inference and fine-tuning framework for diffusion (image & video) models. Up to 3x faster & 80% less VRAM.

artificial-inteligence diffusion diffusion-models fine-tuning flux gpt inference lora pytorch sdxl

Last synced: 09 Jan 2026

https://github.com/fentechsolutions/causaldiscoverytoolbox

Package for causal inference in graphs and in the pairwise settings. Tools for graph structure recovery and dependencies are included.

algorithm causal-discovery causal-inference causal-models causality graph graph-structure-recovery inference machine-learning python toolbox

Last synced: 15 May 2025

https://github.com/FenTechSolutions/CausalDiscoveryToolbox

Package for causal inference in graphs and in the pairwise settings. Tools for graph structure recovery and dependencies are included.

algorithm causal-discovery causal-inference causal-models causality graph graph-structure-recovery inference machine-learning python toolbox

Last synced: 26 Mar 2025

https://github.com/RightNow-AI/picolm

Run a 1-billion parameter LLM on a $10 board with 256MB RAM

arm embedded inference llm openclaw picoclaw quantization raspberry-pi risc-v

Last synced: 19 Jun 2026

https://github.com/awslabs/multi-model-server

Multi Model Server is a tool for serving neural net models for inference

ai deep-learning inference mxnet neural-network onnx server

Last synced: 14 Jan 2026

https://github.com/huawei-noah/bolt

Bolt is a deep learning library with high performance and heterogeneous flexibility.

android arm bolt caffe cnn cv deep-learning high-performance huawei inference ios mali mobile nlp noah onnx rnn tensorflow x86

Last synced: 16 May 2025

https://github.com/uber/neuropod

A uniform interface to run deep learning models from multiple frameworks

deep-learning deeplearning incubation inference keras machine-learning machinelearning pytorch tensorflow

Last synced: 11 Jun 2025

https://github.com/openintrostat/ims

📚 Introduction to Modern Statistics - A college-level open-source textbook with a modern approach highlighting multivariable relationships and simulation-based inference.

bootstrap-confidence-intervals inference modern-statistics openintro rstats simulation statistics

Last synced: 26 Jan 2026

https://github.com/alibaba/rtp-llm

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

gpt inference llama llm llm-serving llmops model-serving

Last synced: 14 Oct 2025

https://github.com/OpenIntroStat/ims

📚 Introduction to Modern Statistics - A college-level open-source textbook with a modern approach highlighting multivariable relationships and simulation-based inference. For v1, see https://openintro-ims.netlify.app.

bootstrap-confidence-intervals inference modern-statistics openintro rstats simulation statistics

Last synced: 17 Apr 2025

https://github.com/serizba/cppflow

Run TensorFlow models in C++ without installation and without Bazel

c cpp inference model neural-networks tensorflow tensorflow-cpp tensorflow-examples tensorflow-models

Last synced: 16 May 2025

https://github.com/triton-inference-server/pytriton

PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments.

deep-learning gpu inference

Last synced: 15 May 2025

https://github.com/vllm-project/vllm-ascend

Community maintained hardware plugin for vLLM on Ascend

ascend inference llm llm-serving llmops mlops model-serving transformer vllm

Last synced: 27 Feb 2026

https://github.com/efeslab/nanoflow

A throughput-oriented high-performance serving framework for LLMs

cuda inference llama2 llm llm-serving model-serving

Last synced: 16 May 2025

https://github.com/efeslab/Nanoflow

A throughput-oriented high-performance serving framework for LLMs

cuda inference llama2 llm llm-serving model-serving

Last synced: 21 Apr 2025

https://github.com/theodo-group/genossgpt

One API for all LLMs, private or public (Anthropic, Llama V2, GPT 3.5/4, Vertex, GPT4ALL, HuggingFace ...) 🌈🐂 Replace OpenAI GPT with any LLM in your app with one line.

api gpt gpt4all huggingface inference llama llm openai private public

Last synced: 04 Apr 2025