Projects in Awesome Lists tagged with kv-cache
A curated list of projects in awesome lists tagged with kv-cache.
https://github.com/hdt3213/godis
A Redis server and distributed cluster implemented in Go.
cluster go godis golang kv-cache redis redis-cluster redis-server
Last synced: 13 May 2025
https://github.com/HDT3213/godis
A Redis server and distributed cluster implemented in Go.
cluster go godis golang kv-cache redis redis-cluster redis-server
Last synced: 27 Mar 2025
https://github.com/harleyszhang/llm_note
LLM notes, including model inference, transformer model structure, and llm framework code analysis notes.
cuda-programming kv-cache llm llm-inference transformer-models triton-kernels vllm
Last synced: 23 Aug 2025
https://github.com/nvidia/kvpress
LLM KV cache compression made easy
inference kv-cache kv-cache-compression large-language-models llm long-context python pytorch transformers
Last synced: 09 Apr 2026
https://github.com/fminference/h2o
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
gpt-3 heavy-hitters high-throughput kv-cache large-language-models sparsity
Last synced: 05 Apr 2025
https://github.com/FMInference/H2O
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
gpt-3 heavy-hitters high-throughput kv-cache large-language-models sparsity
Last synced: 09 May 2025
https://github.com/quantumaikr/quant.cpp
LLM inference with 7x longer context. Pure C, zero dependencies. Lossless KV cache compression + single-header library.
delta-compression embeddable gguf kv-cache llm llm-inference pure-c quantization transformer turboquant
Last synced: 08 Apr 2026
https://github.com/arozanov/turboquant-mlx
TurboQuant KV cache compression for MLX with fused Metal kernels. 4.6x compression at 98% FP16 speed.
apple-silicon kv-cache llm metal mlx quantization turboquant
Last synced: 05 May 2026
https://github.com/kddubey/cappr
Completion After Prompt Probability. Make your LLM make a choice
huggingface kv-cache llamacpp llm-inference probability prompt-engineering text-classification zero-shot
Last synced: 05 Apr 2025
https://github.com/hkproj/pytorch-llama-notes
Notes about LLaMA 2 model
attention-is-all-you-need kv-cache llama2 rmsprop rotary-position-encoding study-notes
Last synced: 06 May 2025
https://github.com/DRSY/EasyKV
Easy control for Key-Value Constrained Generative LLM Inference (https://arxiv.org/abs/2402.06262)
cache-eviction cache-management kv-cache llm
Last synced: 16 May 2025
https://github.com/mindtro/semafold
Vector compression with TurboQuant codecs for embeddings, retrieval, and KV-cache. 10x compression, pure NumPy core — optional GPU acceleration via PyTorch (CUDA/MPS) or MLX (Metal).
embedding-compression kv-cache llm-inference qjl quantization retrieval semafold turboquant vector-compression vector-database
Last synced: 07 Apr 2026
https://github.com/vectorarc/avp-python
Python SDK for Agent Vector Protocol – transfer KV-cache between LLM agents instead of text
ai-agents inference kv-cache llm machine-learning multi-agent protocol python transformers vllm
Last synced: 26 Apr 2026
https://github.com/jagmarques/nexusquant
Training-free KV cache compression for LLMs. 10-33x compression via E8 lattice quantization + attention-aware token eviction. One line of code.
attention compression e8-lattice inference kv-cache llama llm long-context memory-efficient mistral pytorch quantization token-eviction transformers vector-quantization
Last synced: 01 May 2026
https://github.com/codepawl/turboquant-torch
Unofficial PyTorch implementation of TurboQuant (Google Research, ICLR 2026). Near-optimal vector quantization for KV cache compression and vector search. 3-bit with zero accuracy loss.
compression inference kv-cache llm pytorch quantization
Last synced: 05 Apr 2026
https://github.com/mehdihosseinimoghadam/ava-mistral-7b
Fine-Tuned Mistral 7B Persian Large Language Model (LLM) / Persian Mistral 7B
ava ava-mistral ava-mistral-7b deep-learning kv-cache large-language-models llm mistral mistral-7b nlp persian-mistral persian-mistral-7b
Last synced: 11 Apr 2025
https://github.com/reshalfahsi/image-captioning-mobilenet-llama3
Image Captioning With MobileNet-LLaMA 3
cnn flickr8k-dataset grouped-query-attention image-captioning image-text kv-cache llama3 mobilenetv3 nlp pytorch pytorch-lightning rms-norm rotary-position-embedding transformer
Last synced: 12 Apr 2025
https://github.com/back2matching/turboquant
First open-source TurboQuant KV cache compression for LLM inference. Drop-in for HuggingFace. pip install turboquant.
compression gpu huggingface inference kv-cache llm machine-learning pytorch quantization transformers turboquant vram
Last synced: 30 Apr 2026
https://github.com/manishklach/kv-cpu-driver
Reference Linux control plane, RTL, and FPGA emulation scaffold for KV-CPU semantic KV-cache orchestration. Patent pending in India (App No. 202641056309).
device-driver fpga kv-cache linux-kernel llm-inference memory-tiering pcie rtl systemverilog vllm
Last synced: 11 May 2026
https://github.com/jaameypr/keyvalue-caching
Java-based caching solution designed to temporarily store key-value pairs with a specified time-to-live (TTL) duration.
caching caching-strategies java java-17 java-cache java-caching java-caching-strategy java-keyvalue-cache java-kv-cache keyvalue keyvalue-cache keyvalue-store keyvaluestore kv-cache maven
Last synced: 15 Feb 2026
https://github.com/cklxx/arle
Rust-native inference runtime for Qwen3 / Qwen3.5 — OpenAI-compatible serving + integrated agent, train, and self-evolution workflows. CUDA + Metal, no PyTorch on the hot path.
agent cuda flashinfer gspo inference infra kv-cache llm metal mlx openai-compatible qwen3 qwen35 rl rust
Last synced: 02 May 2026
https://github.com/manishklach/intent-attention-kernel
Intent-aware attention research prototype that treats long-context inference as structured semantic blocks instead of a flat token stream, proving CPU-first correctness and analytical KV/FLOP savings before GPU kernel implementation.
agentic-ai ai-infrastructure attention block-attention cost-model cuda gpu-kernels inference kernel-research kv-cache llm-inference long-context python pytorch research semantic-attention sparse-attention systems transformers triton
Last synced: 28 May 2026
https://github.com/alepot55/flash-reasoning
Tree-Aware Attention for System 2 Reasoning. Reduces KV-Cache VRAM by 96% and exceeds physical HBM bandwidth (1.33x) via Fused GQA Triton kernels.
attention-mechanism fused-gqa kv-cache reasoning triton
Last synced: 28 Jan 2026
https://github.com/andrewhsugithub/min-llama
My llama3 implementation
grouped-query-attention kv-cache llama3 llm nlp rope swiglu transformers
Last synced: 18 May 2026
https://github.com/eldriss-studio/tardigrade-db
Experimental LLM-native memory-kernel prototype for persistent agent memory in quantized KV-cache latent space.
agent-memory ai-infrastructure huggingface kv-cache llm llm-memory maturin pyo3 python quantization rag rust transformers vector-database vllm
Last synced: 26 May 2026
https://github.com/sshoecraft/shepherd
An interactive multi-backend LLM runtime with intelligent cache eviction and persistent retrieval-augmented memory.
anthropic cli cpp cuda gemini grok inference kv-cache llama-cpp llm mcp ollama openai openai-server rag smart-evictions tensorrt tool-calling ulimited-context
Last synced: 10 Apr 2026
https://github.com/lamaparbat/express_redis_caching_rate_limit
EXPRESS REST API CACHING + RATE LIMITING + KV-STORE
express ioredis kv-cache rate-limiting redis redis-stack restapi
Last synced: 04 May 2026
https://github.com/mcp-tool-shop/context-window-manager
MCP server for lossless LLM context restoration via KV cache persistence
ai claude context-management kv-cache llm lmcache machine-learning mcp python vllm
Last synced: 30 Jan 2026
https://github.com/mcp-tool-shop-org/context-window-manager
MCP server for lossless LLM context restoration via KV cache persistence
ai claude context-management context-window kv-cache llm llm-inference lmcache machine-learning mcp mcp-server memory model-context-protocol python rag session-management token-management vllm
Last synced: 27 Feb 2026