An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with kv-cache

A curated list of projects in awesome lists tagged with kv-cache.

https://github.com/lmcache/lmcache

Supercharge Your LLM with the Fastest KV Cache Layer

amd cuda fast inference kv-cache llm pytorch rocm speed vllm

Last synced: 30 May 2026

https://github.com/hdt3213/godis

A Redis server and distributed cluster implemented in Go.

cluster go godis golang kv-cache redis redis-cluster redis-server

Last synced: 13 May 2025

https://github.com/HDT3213/godis

A Redis server and distributed cluster implemented in Go.

cluster go godis golang kv-cache redis redis-cluster redis-server

Last synced: 27 Mar 2025

https://github.com/harleyszhang/llm_note

Notes on LLMs, covering model inference, transformer model structure, and LLM framework code analysis.

cuda-programming kv-cache llm llm-inference transformer-models triton-kernels vllm

Last synced: 23 Aug 2025

https://github.com/fminference/h2o

[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.

gpt-3 heavy-hitters high-throughput kv-cache large-language-models sparsity

Last synced: 05 Apr 2025

https://github.com/FMInference/H2O

[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.

gpt-3 heavy-hitters high-throughput kv-cache large-language-models sparsity

Last synced: 09 May 2025

https://github.com/quantumaikr/quant.cpp

LLM inference with 7x longer context. Pure C, zero dependencies. Lossless KV cache compression + single-header library.

delta-compression embeddable gguf kv-cache llm llm-inference pure-c quantization transformer turboquant

Last synced: 08 Apr 2026

https://github.com/arozanov/turboquant-mlx

TurboQuant KV cache compression for MLX with fused Metal kernels. 4.6x compression at 98% FP16 speed.

apple-silicon kv-cache llm metal mlx quantization turboquant

Last synced: 05 May 2026

https://github.com/kddubey/cappr

Completion After Prompt Probability. Make your LLM make a choice.

huggingface kv-cache llamacpp llm-inference probability prompt-engineering text-classification zero-shot

Last synced: 05 Apr 2025

https://github.com/DRSY/EasyKV

Easy control for Key-Value Constrained Generative LLM Inference (https://arxiv.org/abs/2402.06262)

cache-eviction cache-management kv-cache llm

Last synced: 16 May 2025

https://github.com/mindtro/semafold

Vector compression with TurboQuant codecs for embeddings, retrieval, and KV-cache. 10x compression, pure NumPy core — optional GPU acceleration via PyTorch (CUDA/MPS) or MLX (Metal).

embedding-compression kv-cache llm-inference qjl quantization retrieval semafold turboquant vector-compression vector-database

Last synced: 07 Apr 2026

https://github.com/vectorarc/avp-python

Python SDK for Agent Vector Protocol – transfer KV-cache between LLM agents instead of text

ai-agents inference kv-cache llm machine-learning multi-agent protocol python transformers vllm

Last synced: 26 Apr 2026

https://github.com/jagmarques/nexusquant

Training-free KV cache compression for LLMs. 10-33x compression via E8 lattice quantization + attention-aware token eviction. One line of code.

attention compression e8-lattice inference kv-cache llama llm long-context memory-efficient mistral pytorch quantization token-eviction transformers vector-quantization

Last synced: 01 May 2026

https://github.com/codepawl/turboquant-torch

Unofficial PyTorch implementation of TurboQuant (Google Research, ICLR 2026). Near-optimal vector quantization for KV cache compression and vector search. 3-bit with zero accuracy loss.

compression inference kv-cache llm pytorch quantization

Last synced: 05 Apr 2026

https://github.com/back2matching/turboquant

First open-source TurboQuant KV cache compression for LLM inference. Drop-in for HuggingFace. pip install turboquant.

compression gpu huggingface inference kv-cache llm machine-learning pytorch quantization transformers turboquant vram

Last synced: 30 Apr 2026

https://github.com/manishklach/kv-cpu-driver

Reference Linux control plane, RTL, and FPGA emulation scaffold for KV-CPU semantic KV-cache orchestration. Patent pending in India (App No. 202641056309).

device-driver fpga kv-cache linux-kernel llm-inference memory-tiering pcie rtl systemverilog vllm

Last synced: 11 May 2026

https://github.com/jaameypr/keyvalue-caching

Java-based caching solution designed to temporarily store key-value pairs with a specified time-to-live (TTL) duration.

caching caching-strategies java java-17 java-cache java-caching java-caching-strategy java-keyvalue-cache java-kv-cache keyvalue keyvalue-cache keyvalue-store keyvaluestore kv-cache maven

Last synced: 15 Feb 2026

https://github.com/cklxx/arle

Rust-native inference runtime for Qwen3 / Qwen3.5 — OpenAI-compatible serving + integrated agent, train, and self-evolution workflows. CUDA + Metal, no PyTorch on the hot path.

agent cuda flashinfer gspo inference infra kv-cache llm metal mlx openai-compatible qwen3 qwen35 rl rust

Last synced: 02 May 2026

https://github.com/manishklach/intent-attention-kernel

Intent-aware attention research prototype that treats long-context inference as structured semantic blocks instead of a flat token stream, proving CPU-first correctness and analytical KV/FLOP savings before GPU kernel implementation.

agentic-ai ai-infrastructure attention block-attention cost-model cuda gpu-kernels inference kernel-research kv-cache llm-inference long-context python pytorch research semantic-attention sparse-attention systems transformers triton

Last synced: 28 May 2026

https://github.com/alepot55/flash-reasoning

Tree-Aware Attention for System 2 Reasoning. Reduces KV-Cache VRAM by 96% and exceeds physical HBM bandwidth (1.33x) via Fused GQA Triton kernels.

attention-mechanism fused-gqa kv-cache reasoning triton

Last synced: 28 Jan 2026

https://github.com/eldriss-studio/tardigrade-db

Experimental LLM-native memory-kernel prototype for persistent agent memory in quantized KV-cache latent space.

agent-memory ai-infrastructure huggingface kv-cache llm llm-memory maturin pyo3 python quantization rag rust transformers vector-database vllm

Last synced: 26 May 2026

https://github.com/sshoecraft/shepherd

An interactive multi-backend LLM runtime with intelligent cache eviction and persistent retrieval-augmented memory.

anthropic cli cpp cuda gemini grok inference kv-cache llama-cpp llm mcp ollama openai openai-server rag smart-evictions tensorrt tool-calling ulimited-context

Last synced: 10 Apr 2026

https://github.com/lamaparbat/express_redis_caching_rate_limit

Express REST API caching + rate limiting + KV store

express ioredis kv-cache rate-limiting redis redis-stack restapi

Last synced: 04 May 2026

https://github.com/mcp-tool-shop/context-window-manager

MCP server for lossless LLM context restoration via KV cache persistence

ai claude context-management kv-cache llm lmcache machine-learning mcp python vllm

Last synced: 30 Jan 2026