Projects in Awesome Lists tagged with kv-cache

https://github.com/lmcache/lmcache

Supercharge Your LLM with the Fastest KV Cache Layer

amd cuda fast inference kv-cache llm pytorch rocm speed vllm

Last synced: 13 Jun 2026

https://github.com/hdt3213/godis

A Redis server and distributed cluster implemented in Go.

cluster go godis golang kv-cache redis redis-cluster redis-server

Last synced: 13 May 2025

https://github.com/HDT3213/godis

A Redis server and distributed cluster implemented in Go.

cluster go godis golang kv-cache redis redis-cluster redis-server

Last synced: 27 Mar 2025

https://github.com/harleyszhang/llm_note

LLM notes covering model inference, transformer model structure, and LLM framework code analysis.

cuda-programming kv-cache llm llm-inference transformer-models triton-kernels vllm

Last synced: 23 Aug 2025

https://github.com/nvidia/kvpress

LLM KV cache compression made easy

inference kv-cache kv-cache-compression large-language-models llm long-context python pytorch transformers

Last synced: 09 Apr 2026

https://github.com/fminference/h2o

[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.

gpt-3 heavy-hitters high-throughput kv-cache large-language-models sparsity

Last synced: 05 Apr 2025

https://github.com/FMInference/H2O

[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.

gpt-3 heavy-hitters high-throughput kv-cache large-language-models sparsity

Last synced: 09 May 2025

https://github.com/quantumaikr/quant.cpp

LLM inference with 7x longer context. Pure C, zero dependencies. Lossless KV cache compression + single-header library.

delta-compression embeddable gguf kv-cache llm llm-inference pure-c quantization transformer turboquant

Last synced: 08 Apr 2026

https://github.com/arozanov/turboquant-mlx

TurboQuant KV cache compression for MLX with fused Metal kernels. 4.6x compression at 98% FP16 speed.

apple-silicon kv-cache llm metal mlx quantization turboquant

Last synced: 05 May 2026

https://github.com/kddubey/cappr

Completion After Prompt Probability. Make your LLM make a choice.

huggingface kv-cache llamacpp llm-inference probability prompt-engineering text-classification zero-shot

Last synced: 05 Apr 2025

https://github.com/hkproj/pytorch-llama-notes

Notes about the LLaMA 2 model

attention-is-all-you-need kv-cache llama2 rmsprop rotary-position-encoding study-notes

Last synced: 06 May 2025

https://github.com/DRSY/EasyKV

Easy control for Key-Value Constrained Generative LLM Inference (https://arxiv.org/abs/2402.06262)

cache-eviction cache-management kv-cache llm

Last synced: 16 May 2025

https://github.com/vectifyai/condb

ConDB: The KV-Cache Native Context Database

agents ai context-database kv-cache llm long-context rag reasoning retrieval tree-search

Last synced: 04 Jun 2026

https://github.com/mindtro/semafold

Vector compression with TurboQuant codecs for embeddings, retrieval, and KV-cache. 10x compression, pure NumPy core — optional GPU acceleration via PyTorch (CUDA/MPS) or MLX (Metal).

embedding-compression kv-cache llm-inference qjl quantization retrieval semafold turboquant vector-compression vector-database

Last synced: 07 Apr 2026

https://github.com/vectorarc/avp-python

Python SDK for Agent Vector Protocol – transfer KV-cache between LLM agents instead of text

ai-agents inference kv-cache llm machine-learning multi-agent protocol python transformers vllm

Last synced: 26 Apr 2026

https://github.com/jagmarques/nexusquant

Training-free KV cache compression for LLMs. 10-33x compression via E8 lattice quantization + attention-aware token eviction. One line of code.

attention compression e8-lattice inference kv-cache llama llm long-context memory-efficient mistral pytorch quantization token-eviction transformers vector-quantization

Last synced: 01 May 2026

https://github.com/codepawl/turboquant-torch

Unofficial PyTorch implementation of TurboQuant (Google Research, ICLR 2026). Near-optimal vector quantization for KV cache compression and vector search. 3-bit with zero accuracy loss.

compression inference kv-cache llm pytorch quantization

Last synced: 05 Apr 2026

https://github.com/mehdihosseinimoghadam/ava-mistral-7b

Fine-tuned Mistral 7B Persian large language model (LLM) / Persian Mistral 7B

ava ava-mistral ava-mistral-7b deep-learning kv-cache large-language-models llm mistral mistral-7b nlp persian-mistral persian-mistral-7b

Last synced: 11 Apr 2025

https://github.com/reshalfahsi/image-captioning-mobilenet-llama3

Image Captioning With MobileNet-LLaMA 3

cnn flickr8k-dataset grouped-query-attention image-captioning image-text kv-cache llama3 mobilenetv3 nlp pytorch pytorch-lightning rms-norm rotary-position-embedding transformer

Last synced: 12 Apr 2025

https://github.com/back2matching/turboquant

First open-source TurboQuant KV cache compression for LLM inference. Drop-in for HuggingFace. pip install turboquant.

compression gpu huggingface inference kv-cache llm machine-learning pytorch quantization transformers turboquant vram

Last synced: 30 Apr 2026

https://github.com/geronimo-iia/qjl-sketch

QJL sign-based vector compression and scoring in Rust — near-optimal distortion rate, append-only persistence, CPU-only, no LLM

attention error-handling input-validation johnson-lindenstrauss kv-cache mmap mmap-persistence quantization rust search sign-hashing vector-compression

Last synced: 07 Jun 2026

https://github.com/manishklach/kv-cpu-driver

Reference Linux control plane, RTL, and FPGA emulation scaffold for KV-CPU semantic KV-cache orchestration. Patent pending in India (App No. 202641056309).

device-driver fpga kv-cache linux-kernel llm-inference memory-tiering pcie rtl systemverilog vllm

Last synced: 11 May 2026

https://github.com/jaameypr/keyvalue-caching

Java-based caching solution designed to temporarily store key-value pairs with a specified time-to-live (TTL) duration.

caching caching-strategies java java-17 java-cache java-caching java-caching-strategy java-keyvalue-cache java-kv-cache keyvalue keyvalue-cache keyvalue-store keyvaluestore kv-cache maven

Last synced: 15 Feb 2026

https://github.com/cklxx/arle

Rust-native inference runtime for Qwen3 / Qwen3.5 — OpenAI-compatible serving + integrated agent, train, and self-evolution workflows. CUDA + Metal, no PyTorch on the hot path.

agent cuda flashinfer gspo inference infra kv-cache llm metal mlx openai-compatible qwen3 qwen35 rl rust

Last synced: 02 May 2026

https://github.com/manishklach/ghostkv-lab

Research harness for evaluating query-time bounded elimination of reconstructable KV-cache witnesses in long-context transformer inference workloads. Related provisional filing: IN 202641062451.

ai-infrastructure attention-optimization cxl flashattention gpu-memory kv-cache llm-inference long-context long-context-inference memory-systems systems-research transformer transformer-memory transformer-optimization

Last synced: 09 Jun 2026

https://github.com/manishklach/semantic-kv-control-plane

A systems research platform for semantic KV-cache orchestration, topology-aware memory placement, distributed prefix reuse, and rack-scale inference memory simulation.

ai-infrastructure ai-systems cxl distributed-cache distributed-systems gpu hbm inference kv-cache llm-inference memory-orchestration memory-systems memory-tiering prefetching rack-scale runtime-systems semantic-caching simulation systems-research topology-aware

Last synced: 09 Jun 2026

https://github.com/manishklach/intent-attention-kernel

Intent-aware attention research prototype that treats long-context inference as structured semantic blocks instead of a flat token stream, proving CPU-first correctness and analytical KV/FLOP savings before GPU kernel implementation.

agentic-ai ai-infrastructure attention block-attention cost-model cuda gpu-kernels inference kernel-research kv-cache llm-inference long-context python pytorch research semantic-attention sparse-attention systems transformers triton

Last synced: 28 May 2026

https://github.com/hinanohart/hybridserve-state

hybrid-models kv-cache linear-attention llm-inference mamba serialization ssm

Last synced: 15 Jun 2026

https://github.com/sshoecraft/shepherd

An interactive multi-backend LLM runtime with intelligent cache eviction and persistent retrieval-augmented memory.

anthropic cli cpp cuda gemini grok inference kv-cache llama-cpp llm mcp ollama openai openai-server rag smart-evictions tensorrt tool-calling ulimited-context

Last synced: 10 Apr 2026

https://github.com/ionden/mlx-quant-fidelity

Measure MLX quantization quality loss — KL divergence, perplexity, top-token agreement for KV cache and weights

apple-silicon diagnostics kl-divergence kv-cache kv-cache-quantization llm llm-eval llm-inference machine-learning metal mlx mlx-lm model-quantization perplexity python quantization quantization-quality

Last synced: 14 Jun 2026

https://github.com/andrewhsugithub/min-llama

My Llama 3 implementation

grouped-query-attention kv-cache llama3 llm nlp rope swiglu transformers

Last synced: 18 May 2026

https://github.com/prajeshshrestha/llama-2.0-architecture-and-inference-from-scratch-with-pytorch

grouped-query-attention kv-cache llama2 pytorch pytorch-implementation rotary-positional-embedding

Last synced: 16 Mar 2026

https://github.com/alepot55/flash-reasoning

Tree-Aware Attention for System 2 Reasoning. Reduces KV-Cache VRAM by 96% and exceeds physical HBM bandwidth (1.33x) via Fused GQA Triton kernels.

attention-mechanism fused-gqa kv-cache reasoning triton

Last synced: 28 Jan 2026

https://github.com/mcp-tool-shop/context-window-manager

MCP server for lossless LLM context restoration via KV cache persistence

ai claude context-management kv-cache llm lmcache machine-learning mcp python vllm

Last synced: 30 Jan 2026

https://github.com/mcp-tool-shop-org/context-window-manager

MCP server for lossless LLM context restoration via KV cache persistence

ai claude context-management context-window kv-cache llm llm-inference lmcache machine-learning mcp mcp-server memory model-context-protocol python rag session-management token-management vllm

Last synced: 27 Feb 2026

https://github.com/ifuryst/nanollmserve

🌱 A tiny, readable LLM serving engine with vLLM/SGLang-style features.

ai-infra continuous-batching fastapi inference kv-cache llm llm-engine llm-inference llm-infra llm-serving lora openai-compatible paged-attention pytorch quantization sglang speculative-decoding teaching transformers vllm

Last synced: 05 Jun 2026

https://github.com/back2matching/kvcache-bench

Benchmark every KV cache compression method on your GPU. One command, real numbers. Supports Ollama + llama.cpp.

benchmark gpu inference kv-cache llama-cpp llm local-llm ollama quantization vram

Last synced: 08 Jun 2026

https://github.com/eldriss-studio/tardigrade-db

Experimental LLM-native memory-kernel prototype for persistent agent memory in quantized KV-cache latent space.

agent-memory ai-infrastructure huggingface kv-cache llm llm-memory maturin pyo3 python quantization rag rust transformers vector-database vllm

Last synced: 26 May 2026

https://github.com/lamaparbat/express_redis_caching_rate_limit

Express REST API caching + rate limiting + KV store

express ioredis kv-cache rate-limiting redis redis-stack restapi

Last synced: 04 May 2026

https://github.com/barrel-platform/barrel_inference

OTP-native LLM inference runtime with token-exact tiered KV cache, plus an OpenAI/Anthropic/Ollama-compatible HTTP daemon. Erlang/OTP + llama.cpp.

erlang inference kv-cache llama-cpp llm otp

Last synced: 09 Jun 2026

https://github.com/elinx/llm-mem-calculator

Interactive KV cache memory calculator for LLMs — supports MLA, GQA, hybrid attention, sliding window, and linear attention architectures. Estimate GPU memory for serving any model at any context length.

calculator gpu-memory kv-cache llm llm-serving vllm

Last synced: 09 Jun 2026
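
As a rough illustration of the kind of estimate a KV cache memory calculator produces, the sketch below applies the standard sizing arithmetic for multi-head or grouped-query attention: two tensors (keys and values) per layer, each holding `num_kv_heads * head_dim` elements per token. It is not taken from the project listed above; the function name, parameters, and model shape are hypothetical, and MLA, sliding-window, and linear-attention architectures would need different formulas.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    """Estimate KV cache size in bytes for standard MHA/GQA attention.

    Hypothetical helper for illustration; bytes_per_element=2 assumes
    FP16/BF16 cache entries.
    """
    # Factor of 2: one key tensor and one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)
print(f"{size / 2**30:.2f} GiB")  # prints "1.00 GiB"
```

The estimate scales linearly with context length and batch size, which is why long-context serving tends to be limited by cache memory rather than by model weights.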

https://github.com/synapt-dev/vorn-mat

Vorn: Residual Direction, Familial Eviction, and the Granularity Rescue Spectrum — paper + reproducible code, data, and figures (Zenodo DOI: 10.5281/zenodo.20519215)

attention eviction kv-cache language-models long-context machine-learning nlp reproducible-research

Last synced: 10 Jun 2026