Projects in Awesome Lists tagged with speculative-decoding

https://github.com/Luce-Org/lucebox-hub

Lucebox: LLM inference server built for speed on specific consumer hardware.

cuda cuda-kernels dflash kernel llama-cpp local-ai luce lucebox megakernel nvidia-cuda pflash qwen rtx3090 speculative-decoding speculative-prefill

Last synced: 23 May 2026

https://github.com/intel/intel-extension-for-transformers

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡

4-bits autoround chatbot chatpdf gaudi3 habana intel-optimized-llamacpp large-language-model llm-cpu llm-inference neural-chat neural-chat-7b rag retrieval speculative-decoding streamingllm

Last synced: 03 Jul 2026

https://github.com/dphnai/aphrodite-engine

Large-scale LLM inference engine

api-rest cuda inference-engine inferentia intel lora machine-learning rocm speculative-decoding tpu

Last synced: 04 Jul 2026

https://github.com/aphrodite-engine/aphrodite-engine

Large-scale LLM inference engine

api-rest cuda inference-engine inferentia intel lora machine-learning rocm speculative-decoding tpu

Last synced: 14 May 2025

https://github.com/SafeAILab/EAGLE

Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24)

large-language-models llm-inference speculative-decoding

Last synced: 20 Mar 2025

https://github.com/Infini-AI-Lab/Sequoia

Scalable and robust tree-based speculative decoding algorithm

efficiency inference llm speculative-decoding

Last synced: 16 Oct 2025

https://github.com/facebookresearch/layerskip

Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024

early-exit layer-drop llm optimization speculative-decoding transformers

Last synced: 12 Apr 2025

https://github.com/Infini-AI-Lab/TriForce

[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

acceleration efficiency inference llm llm-inference long-context speculative-decoding

Last synced: 16 May 2025

https://github.com/fasterdecoding/rest

REST: Retrieval-Based Speculative Decoding, NAACL 2024

llm-inference retrieval speculative-decoding

Last synced: 16 May 2025

https://github.com/kssteven418/biglittledecoder

[NeurIPS'23] Speculative Decoding with Big Little Decoder

decoding efficient-inference fast-inference llm speculative-decoding speculative-execution

Last synced: 31 Jul 2025

https://github.com/AR6420/Hail_Hydra

🐉 Hail Hydra — Multi-headed speculative execution framework for Claude Code. 10 AI agents, 3x faster, ~70% cheaper. Inspired by speculative decoding.

ai-agent ai-agents-framework ai-framwork anthropic claude claude-code claude-code-agents context-engineering haiku hail-hydra hydra meta-prompting multi-agent multi-agent-collaboration opus sonnet speculative-decoding speculative-execution

Last synced: 01 Jul 2026

https://github.com/autonomicperfectionist/pipeinfer

PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation

inference llamacpp llm speculative-decoding

Last synced: 05 Oct 2025

https://github.com/mscheong01/speculative_decoding.c

Minimal C implementation of speculative decoding based on llama2.c

artificial-intelligence c llama2 llm speculative-decoding

Last synced: 23 Jun 2025

https://github.com/hsj576/griffin

Official Implementation of "GRIFFIN: Effective Token Alignment for Faster Speculative Decoding"

large-language-models llm-inference speculative-decoding

Last synced: 13 May 2025

https://github.com/hec-ovi/vllm-awq4-qwen

vLLM Qwen 3.6-27B (AWQ-INT4) + DFlash speculative decoding on AMD Strix Halo (gfx1151 iGPU, 128 GB UMA, ROCm 7.13). 24.8 t/s single-stream, vision, tool calling, 256K context, OpenAI-compatible, Docker. Matches DGX Spark FP8+DFlash+MTP at a third of the cost. No CUDA.

27b amd-strix-halo awq dflash docker gfx1151 llm-inference multimodal-llm openai-api qwen3 rdna35 rocm ryzen-ai-max speculative-decoding vllm

Last synced: 21 Jun 2026

https://github.com/croll83/llama.cpp-dgx

llama.cpp fork optimized for NVIDIA DGX Spark / GB10 (Blackwell, SM 12.1) — TurboQuant weights + KV, NVFP4, DFlash MTP

blackwell dflash gb10 llama-cpp nvfp4 speculative-decoding turboquant

Last synced: 02 Jun 2026

https://github.com/redhillsmediafl/caix

caix — native Apple Core AI inference server for Apple silicon (beta): OpenAI/Anthropic API, dashboard, streaming chat with tools/skills/MCP, MTP speculative decoding.

anthropic-api apple-silicon coreml inference-server llm llm-inference local-llm macos mlx neural-engine ollama-alternative on-device-ai openai-api speculative-decoding swift

Last synced: 04 Jul 2026

https://github.com/llmsresearch/specstream

Fast LLM inference with 2.8x speedup using speculative decoding

inference largelanguagemodel llms speculative-decoding

Last synced: 14 Jan 2026

https://github.com/manishklach/gpu-resident-inference-lab

Research lab for GPU-resident LLM inference loops: persistent kernels, sparse KV selection, tiered residency, speculative decode, and trace-driven scheduling.

cuda gpu-systems kv-cache llm-inference mega-kernel model-systems persistent-kernel runtime speculative-decoding

Last synced: 19 Jun 2026

https://github.com/wtlow003/speculative-sampling

Implementation of Speculative Sampling in "Accelerating Large Language Model Decoding with Speculative Sampling"

deepmind llm-inference speculative-decoding speculative-sampling

Last synced: 24 Sep 2025

https://github.com/geralt-targaryen/awesome-speculative-decoding

Reading notes on Speculative Decoding papers

awesome llm nlp papers speculative-decoding

Last synced: 14 Mar 2025

https://github.com/jcartu/qwen36-27b-blackwell-inference-study

Systematic 24-hour benchmark study of Qwen3.6-27B inference on dual NVIDIA RTX PRO 6000 Blackwell SM120 (TP=2). 8 experiments comparing repne/vllm fork vs upstream vLLM across FP8/BF16/NVFP4/Q8_0 quants and MTP/DFlash speculative decoding. Peak: 2,083 tok/s at c=32. Quality: KLD vs BF16 = 0.0018 (noise floor).

benchmark bf16 blackwell fp8 inference nvfp4 qwen qwen3 rtx-pro-6000 speculative-decoding vllm

Last synced: 12 Jun 2026

https://github.com/wtlow003/ngram-decoding

(Re)-implementation of "Prompt Lookup Decoding" by Apoorv Saxena, with extended ideas from LLMA Decoding.

llm-inference n-gram ngram-decoding prompt-lookup-decoding speculative-decoding

Last synced: 21 Apr 2026

https://github.com/smpanaro/token-recycling

Unofficial implementation of Token Recycling self-speculative decoding method.

large-language-model llm-inference speculative-decoding

Last synced: 11 Mar 2026

https://github.com/jcartu/qwen-bench-2026-05-11-v2-followup

Study #4: FP8+MTP{3,5} speed on repne/vllm:v2 + max_tokens=8192 quality re-runs for BF16+DFlash n=8 and FP8+MTP=3. Follow-up to studies #2 and #3.

benchmark blackwell dflash humaneval inference mbpp mtp qwen-bench qwen3 speculative-decoding vllm

Last synced: 12 Jun 2026

https://github.com/eps-ai-solutions/claudecli

HYDRA 10.0 - Advanced AI System with Self-Correction, Few-Shot Learning, Speculative Decoding, Load Balancing & Semantic RAG

ai automation claude few-shot-learning llm mcp ollama powershell self-correction speculative-decoding

Last synced: 16 Jan 2026

https://github.com/ifuryst/nanollmserve

🌱 A tiny, readable LLM serving engine with vLLM/SGLang-style features.

ai-infra continuous-batching fastapi inference kv-cache llm llm-engine llm-inference llm-infra llm-serving lora openai-compatible paged-attention pytorch quantization sglang speculative-decoding teaching transformers vllm

Last synced: 05 Jun 2026

https://github.com/theogravity/dual-rtx-6000-blackwell-qwen3.6-27b-fp8

Optimized vLLM setup for Qwen3.6-27B-FP8 on dual RTX PRO 6000 Blackwell (192 GB GDDR7, no NVLink); config, benchmark sweep results, and custom chat template with thinking mode off by default.

benchmark blackwell fp8 llm-inference local-llm multi-token-prediction qwen3 rtx-pro-6000 speculative-decoding vllm

Last synced: 11 May 2026

https://github.com/theogravity/dual-rtx-6000-blackwell-gemma-4-31b-it-nvfp4

Optimized vLLM setup for Gemma 4 31B NVFP4 with MTP on dual RTX PRO 6000 Blackwell using vLLM and Docker: native FP4 Tensor Cores, Multi-Token Prediction (96.5% acceptance rate), and prefix caching. Includes benchmark results and replication scripts.

am5 amd blackwell cuda docker fp4 gemma gemma4 llm-inference multi-token-prediction nvfp4 prefix-caching rtx-6000 speculative-decoding tensor-parallel vllm

Last synced: 11 May 2026

https://github.com/jcartu/qwen36-27b-fp8-repne-vs-upstream

Same FP8+MTP=3 config on Repne fork vs upstream vLLM v0.20.1, dual RTX PRO 6000 Blackwell. Repne fork wins at short-context multi-stream.

benchmark blackwell fp8 mtp qwen speculative-decoding vllm

Last synced: 12 Jun 2026

https://github.com/jcartu/qwen-bench

Hub for ongoing Qwen inference benchmarks on NVIDIA Blackwell. Indexes all studies, hosts the rolling SOTA leaderboard, points to the toolchain.

benchmark blackwell hub inference leaderboard qwen qwen3 rtx-pro-6000 speculative-decoding vllm

Last synced: 12 Jun 2026

https://github.com/jcartu/llm-stress-harness

Diagnostic toolkit for self-hosted LLM inference: failure-taxonomic stress harness + 4-phase orchestrator + parametric vLLM launchers

benchmarking humaneval inference llm mbpp python sglang speculative-decoding stress-testing vllm

Last synced: 12 Jun 2026