Projects in Awesome Lists tagged with kv-cache
A curated list of projects in awesome lists tagged with kv-cache.
https://github.com/hdt3213/godis
A Redis server and distributed cluster implemented in Go.
cluster go godis golang kv-cache redis redis-cluster redis-server
Last synced: 13 May 2025
https://github.com/HDT3213/godis
A Redis server and distributed cluster implemented in Go.
cluster go godis golang kv-cache redis redis-cluster redis-server
Last synced: 27 Mar 2025
https://github.com/harleyszhang/llm_note
LLM notes covering model inference, transformer model structure, and LLM framework code analysis.
cuda-programming kv-cache llm llm-inference transformer-models triton-kernels vllm
Last synced: 23 Aug 2025
https://github.com/nvidia/kvpress
LLM KV cache compression made easy
inference kv-cache kv-cache-compression large-language-models llm long-context python pytorch transformers
Last synced: 09 Apr 2026
https://github.com/fminference/h2o
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
gpt-3 heavy-hitters high-throughput kv-cache large-language-models sparsity
Last synced: 05 Apr 2025
https://github.com/FMInference/H2O
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
gpt-3 heavy-hitters high-throughput kv-cache large-language-models sparsity
Last synced: 09 May 2025
https://github.com/quantumaikr/quant.cpp
LLM inference with 7x longer context. Pure C, zero dependencies. Lossless KV cache compression + single-header library.
delta-compression embeddable gguf kv-cache llm llm-inference pure-c quantization transformer turboquant
Last synced: 08 Apr 2026
https://github.com/arozanov/turboquant-mlx
TurboQuant KV cache compression for MLX with fused Metal kernels. 4.6x compression at 98% FP16 speed.
apple-silicon kv-cache llm metal mlx quantization turboquant
Last synced: 05 May 2026
https://github.com/kddubey/cappr
Completion After Prompt Probability. Make your LLM make a choice.
huggingface kv-cache llamacpp llm-inference probability prompt-engineering text-classification zero-shot
Last synced: 05 Apr 2025
https://github.com/hkproj/pytorch-llama-notes
Notes about the LLaMA 2 model
attention-is-all-you-need kv-cache llama2 rmsprop rotary-position-encoding study-notes
Last synced: 06 May 2025
https://github.com/DRSY/EasyKV
Easy control for Key-Value Constrained Generative LLM Inference (https://arxiv.org/abs/2402.06262)
cache-eviction cache-management kv-cache llm
Last synced: 16 May 2025
https://github.com/vectifyai/condb
ConDB: The KV-Cache Native Context Database
agents ai context-database kv-cache llm long-context rag reasoning retrieval tree-search
Last synced: 04 Jun 2026
https://github.com/mindtro/semafold
Vector compression with TurboQuant codecs for embeddings, retrieval, and KV-cache. 10x compression, pure NumPy core — optional GPU acceleration via PyTorch (CUDA/MPS) or MLX (Metal).
embedding-compression kv-cache llm-inference qjl quantization retrieval semafold turboquant vector-compression vector-database
Last synced: 07 Apr 2026
https://github.com/vectorarc/avp-python
Python SDK for Agent Vector Protocol – transfer KV-cache between LLM agents instead of text
ai-agents inference kv-cache llm machine-learning multi-agent protocol python transformers vllm
Last synced: 26 Apr 2026
https://github.com/jagmarques/nexusquant
Training-free KV cache compression for LLMs. 10-33x compression via E8 lattice quantization + attention-aware token eviction. One line of code.
attention compression e8-lattice inference kv-cache llama llm long-context memory-efficient mistral pytorch quantization token-eviction transformers vector-quantization
Last synced: 01 May 2026
https://github.com/codepawl/turboquant-torch
Unofficial PyTorch implementation of TurboQuant (Google Research, ICLR 2026). Near-optimal vector quantization for KV cache compression and vector search. 3-bit with zero accuracy loss.
compression inference kv-cache llm pytorch quantization
Last synced: 05 Apr 2026
https://github.com/mehdihosseinimoghadam/ava-mistral-7b
Fine-tuned Mistral 7B Persian large language model (LLM) / Persian Mistral 7B
ava ava-mistral ava-mistral-7b deep-learning kv-cache large-language-models llm mistral mistral-7b nlp persian-mistral persian-mistral-7b
Last synced: 11 Apr 2025
https://github.com/reshalfahsi/image-captioning-mobilenet-llama3
Image Captioning With MobileNet-LLaMA 3
cnn flickr8k-dataset grouped-query-attention image-captioning image-text kv-cache llama3 mobilenetv3 nlp pytorch pytorch-lightning rms-norm rotary-position-embedding transformer
Last synced: 12 Apr 2025
https://github.com/back2matching/turboquant
First open-source TurboQuant KV cache compression for LLM inference. Drop-in for HuggingFace. pip install turboquant.
compression gpu huggingface inference kv-cache llm machine-learning pytorch quantization transformers turboquant vram
Last synced: 30 Apr 2026
https://github.com/geronimo-iia/qjl-sketch
QJL sign-based vector compression and scoring in Rust — near-optimal distortion rate, append-only persistence, CPU-only, no LLM
attention error-handling input-validation johnson-lindenstrauss kv-cache mmap mmap-persistence quantization rust search sign-hashing vector-compression
Last synced: 07 Jun 2026
https://github.com/manishklach/kv-cpu-driver
Reference Linux control plane, RTL, and FPGA emulation scaffold for KV-CPU semantic KV-cache orchestration. Patent pending in India (App No. 202641056309).
device-driver fpga kv-cache linux-kernel llm-inference memory-tiering pcie rtl systemverilog vllm
Last synced: 11 May 2026
https://github.com/jaameypr/keyvalue-caching
Java-based caching solution designed to temporarily store key-value pairs with a specified time-to-live (TTL) duration.
caching caching-strategies java java-17 java-cache java-caching java-caching-strategy java-keyvalue-cache java-kv-cache keyvalue keyvalue-cache keyvalue-store keyvaluestore kv-cache maven
Last synced: 15 Feb 2026
https://github.com/cklxx/arle
Rust-native inference runtime for Qwen3 / Qwen3.5 — OpenAI-compatible serving + integrated agent, train, and self-evolution workflows. CUDA + Metal, no PyTorch on the hot path.
agent cuda flashinfer gspo inference infra kv-cache llm metal mlx openai-compatible qwen3 qwen35 rl rust
Last synced: 02 May 2026
https://github.com/manishklach/ghostkv-lab
Research harness for evaluating query-time bounded elimination of reconstructable KV-cache witnesses in long-context transformer inference workloads. Related provisional filing: IN 202641062451.
ai-infrastructure attention-optimization cxl flashattention gpu-memory kv-cache llm-inference long-context long-context-inference memory-systems systems-research transformer transformer-memory transformer-optimization
Last synced: 09 Jun 2026
https://github.com/manishklach/semantic-kv-control-plane
A systems research platform for semantic KV-cache orchestration, topology-aware memory placement, distributed prefix reuse, and rack-scale inference memory simulation.
ai-infrastructure ai-systems cxl distributed-cache distributed-systems gpu hbm inference kv-cache llm-inference memory-orchestration memory-systems memory-tiering prefetching rack-scale runtime-systems semantic-caching simulation systems-research topology-aware
Last synced: 09 Jun 2026
https://github.com/manishklach/intent-attention-kernel
Intent-aware attention research prototype that treats long-context inference as structured semantic blocks instead of a flat token stream, proving CPU-first correctness and analytical KV/FLOP savings before GPU kernel implementation.
agentic-ai ai-infrastructure attention block-attention cost-model cuda gpu-kernels inference kernel-research kv-cache llm-inference long-context python pytorch research semantic-attention sparse-attention systems transformers triton
Last synced: 28 May 2026
https://github.com/hinanohart/hybridserve-state
hybrid-models kv-cache linear-attention llm-inference mamba serialization ssm
Last synced: 15 Jun 2026
https://github.com/sshoecraft/shepherd
An interactive multi-backend LLM runtime with intelligent cache eviction and persistent retrieval-augmented memory.
anthropic cli cpp cuda gemini grok inference kv-cache llama-cpp llm mcp ollama openai openai-server rag smart-evictions tensorrt tool-calling ulimited-context
Last synced: 10 Apr 2026
https://github.com/ionden/mlx-quant-fidelity
Measure MLX quantization quality loss — KL divergence, perplexity, top-token agreement for KV cache and weights
apple-silicon diagnostics kl-divergence kv-cache kv-cache-quantization llm llm-eval llm-inference machine-learning metal mlx mlx-lm model-quantization perplexity python quantization quantization-quality
Last synced: 14 Jun 2026
https://github.com/andrewhsugithub/min-llama
My Llama 3 implementation
grouped-query-attention kv-cache llama3 llm nlp rope swiglu transformers
Last synced: 18 May 2026
https://github.com/alepot55/flash-reasoning
Tree-Aware Attention for System 2 Reasoning. Reduces KV-Cache VRAM by 96% and exceeds physical HBM bandwidth (1.33x) via Fused GQA Triton kernels.
attention-mechanism fused-gqa kv-cache reasoning triton
Last synced: 28 Jan 2026
https://github.com/mcp-tool-shop/context-window-manager
MCP server for lossless LLM context restoration via KV cache persistence
ai claude context-management kv-cache llm lmcache machine-learning mcp python vllm
Last synced: 30 Jan 2026
https://github.com/mcp-tool-shop-org/context-window-manager
MCP server for lossless LLM context restoration via KV cache persistence
ai claude context-management context-window kv-cache llm llm-inference lmcache machine-learning mcp mcp-server memory model-context-protocol python rag session-management token-management vllm
Last synced: 27 Feb 2026
https://github.com/ifuryst/nanollmserve
🌱 A tiny, readable LLM serving engine with vLLM/SGLang-style features.
ai-infra continuous-batching fastapi inference kv-cache llm llm-engine llm-inference llm-infra llm-serving lora openai-compatible paged-attention pytorch quantization sglang speculative-decoding teaching transformers vllm
Last synced: 05 Jun 2026
https://github.com/eldriss-studio/tardigrade-db
Experimental LLM-native memory-kernel prototype for persistent agent memory in quantized KV-cache latent space.
agent-memory ai-infrastructure huggingface kv-cache llm llm-memory maturin pyo3 python quantization rag rust transformers vector-database vllm
Last synced: 26 May 2026
https://github.com/lamaparbat/express_redis_caching_rate_limit
Express REST API caching + rate limiting + KV store
express ioredis kv-cache rate-limiting redis redis-stack restapi
Last synced: 04 May 2026
https://github.com/elinx/llm-mem-calculator
Interactive KV cache memory calculator for LLMs — supports MLA, GQA, hybrid attention, sliding window, and linear attention architectures. Estimate GPU memory for serving any model at any context length.
calculator gpu-memory kv-cache llm llm-serving vllm
Last synced: 09 Jun 2026
https://github.com/synapt-dev/vorn-mat
Vorn: Residual Direction, Familial Eviction, and the Granularity Rescue Spectrum — paper + reproducible code, data, and figures (Zenodo DOI: 10.5281/zenodo.20519215)
attention eviction kv-cache language-models long-context machine-learning nlp reproducible-research
Last synced: 10 Jun 2026