https://github.com/manishklach/ghostkv-lab

Research harness for evaluating query-time bounded elimination of reconstructable KV-cache witnesses in long-context transformer inference workloads. Related provisional filing: IN 202641062451.
https://github.com/manishklach/ghostkv-lab

ai-infrastructure attention-optimization cxl flashattention gpu-memory kv-cache llm-inference long-context long-context-inference memory-systems systems-research transformer transformer-memory transformer-optimization

Last synced: about 1 month ago
JSON representation

Research harness for evaluating query-time bounded elimination of reconstructable KV-cache witnesses in long-context transformer inference workloads. Related provisional filing: IN 202641062451.

Host: GitHub
URL: https://github.com/manishklach/ghostkv-lab
Owner: manishklach
License: mit
Created: 2026-05-17T16:40:41.000Z (about 2 months ago)
Default Branch: main
Last Pushed: 2026-05-17T18:23:03.000Z (about 2 months ago)
Last Synced: 2026-05-17T18:51:31.103Z (about 2 months ago)
Topics: ai-infrastructure, attention-optimization, cxl, flashattention, gpu-memory, kv-cache, llm-inference, long-context, long-context-inference, memory-systems, systems-research, transformer, transformer-memory, transformer-optimization
Language: Python
Size: 881 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Citation: CITATION.cff
- Roadmap: docs/roadmap.md

Awesome Lists containing this project

README

# GhostKV Lab

[![CI](https://github.com/manishklach/ghostkv-lab/actions/workflows/ci.yml/badge.svg)](https://github.com/manishklach/ghostkv-lab/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
![Python 3.10+](https://img.shields.io/badge/Python-3.10%2B-blue)
![Status: Research Prototype](https://img.shields.io/badge/Status-Research%20Prototype-6c757d)

“A research simulator for query-time bounded elimination of reconstructable KV-cache witnesses in long-context transformer inference.”

GhostKV Lab is a lightweight Python repository for studying whether sketch-based bounded elimination can reduce KV-cache memory movement while preserving attention quality in long-context decode workloads. It is built as a synthetic evaluation harness first: no heavyweight model downloads, no kernel claims, and no fabricated benchmark results.

The current empirical emphasis is on failure analysis as much as success cases: the most important result in the repository today is that the current GPT-2 frontier sweep did **not** find safe-ish operating points with `false_elimination_rate <= 5%` and `elimination_rate >= 30%`.

## Patent Notice

This repository is associated with Indian provisional patent application `202641062451`, titled:

“GHOSTKV: A SYSTEM AND METHOD FOR QUERY-TIME BOUNDED ELIMINATION OF RECONSTRUCTABLE KEY-VALUE WITNESSES IN TRANSFORMER ATTENTION MECHANISMS”

Filed on `2026-05-17`.

The repository is intended as a research and evaluation harness for exploring the underlying systems concepts. A concise note is available in [docs/patent_notice.md](docs/patent_notice.md).

## Current Status

Current status:

- Synthetic GhostKV simulator: working
- GPT-2 real attention validation: working
- False-elimination frontier analysis: working
- Hierarchical elimination experiments: working
- Synthetic result generation pipeline: working
- Modern Llama/Mistral validation: pending
- GPU kernel integration: pending
- Production inference integration: not implemented

## Current Research Focus

The current focus is false-elimination frontier analysis on real transformer attention tensors.

The key question:

Can GhostKV eliminate meaningful amounts of cold KV state while keeping false elimination acceptably low?

Current experiments focus on:

- attention sketch preservation
- bounded elimination behavior
- layer/head sensitivity
- hierarchical elimination
- synthetic memory-traffic modeling

Latency reduction and production inference integration remain future work.

## Current Headline Result

Under the current GPT-2 real-attention frontier sweep, GhostKV Lab did **not** find safe-ish operating points meeting:

- `false_elimination_rate <= 5%`
- `elimination_rate >= 30%`

This is currently the most important result in the repository because it shows that coarse ranking preservation alone is not enough. High rank correlation can coexist with weak extreme-rank preservation and unacceptable elimination tradeoffs.

See:

- [RESULTS.md](RESULTS.md)
- [results/frontier/FRONTIER.md](results/frontier/FRONTIER.md)

Run:

```bash
make frontier
```

Results are written to:

`results/frontier/`

## What GhostKV Is

GhostKV is a systems-oriented hypothesis for KV-cache handling during decode:

- Cold KV-cache entries are converted into compact ghost records.
- Each ghost record stores an attention sketch vector, a semantic anchor identifier, and a residual uncertainty term.
- At query time, the simulator computes a conservative attention upper bound for each ghost record:

`AttnUB(Q, G_i) = sketch_sim(Q, G_i.sketch) + epsilon_res_i + sigma_anchor_i`

- Ghost tokens with an upper bound below `theta_elim` are eliminated.
- Surviving ghost records are resurrected and included in exact attention.

The key property in this repository is exactness over survivors: approximation is confined to the elimination stage. Once candidates survive elimination, the simulator treats attention over `hot + resurrected` tokens as exact.

## What GhostKV Is Not

- Not a production LLM runtime
- Not a CUDA kernel implementation
- Not a proof of speedup
- Not a substitute for real-model validation

This repository uses synthetic tensors first and now includes GPT-2 attention-tensor validation. Broader modern-model validation remains future work.

## Architecture

```text
KV Cache
|
+--> Hot / Warm / Ghost / Archive
|
Query --> Sketch --> Bound --> Eliminate or Resurrect --> Exact Attend
```

The working intuition is simple: eliminate before moving, but only if the elimination bound remains conservative enough to avoid unacceptable false elimination.

## Repository Layout

```text
ghostkv-lab/
docs/
src/ghostkv/
experiments/
tests/
results/
data/
```

## Quickstart

### PowerShell

These commands work from the repo root in Windows PowerShell:

```bash
python -m venv .venv
.venv\Scripts\activate
python -m pip install -e ".[dev]"
python -m pytest
python experiments/sketch_quality_audit.py
python experiments/elimination_tradeoff.py
python experiments/bandwidth_model_demo.py
python experiments/synthetic_decode_simulation.py
python experiments/generate_results.py
python experiments/real_attention_validation.py
python experiments/hierarchical_elimination.py
python experiments/false_elimination_frontier.py
```

### WSL / Linux / macOS

WSL is recommended for reproducible experiment workflows, especially for the heavier plotting and HuggingFace-based validation scripts.

```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install -e ".[dev]"
pytest
make results
make frontier
python experiments/real_attention_validation.py
```

From Windows, the same workflow can be invoked explicitly through WSL:

```bash
wsl -e bash -c "pytest"
wsl -e bash -c "make results"
wsl -e bash -c "make frontier"
wsl -e bash -c "python experiments/real_attention_validation.py"
```

If you prefer not to create a virtual environment, the same install and run commands work with the active Python environment as long as it is Python 3.10+.

## Core Idea

Ghost records are compact witnesses for cold KV entries:

1. `attention sketch vector`
2. `semantic anchor id`
3. `residual uncertainty value`

At each decode step:

1. Project the query into sketch space.
2. Compute conservative upper bounds for ghost records.
3. Eliminate records with bounds below `theta_elim`.
4. Resurrect survivors.
5. Run exact attention over hot tokens plus resurrected tokens.

## Why This Repo Exists

Long-context inference can become bottlenecked by KV-cache movement rather than only by arithmetic throughput. This repository exists to evaluate whether bounded elimination can reduce the amount of KV state that must be moved or re-read on each decode step without aggressively approximating the final attention calculation.

## Experiments

- `experiments/sketch_quality_audit.py`: compares exact scores and sketch-space scores across sketch dimensions
- `experiments/elimination_tradeoff.py`: sweeps elimination thresholds and sketch dimensions
- `experiments/bandwidth_model_demo.py`: compares illustrative memory footprints for full KV, quantized KV, and GhostKV
- `experiments/synthetic_decode_simulation.py`: runs a multi-step decode simulation and summarizes aggregate metrics
- `experiments/generate_results.py`: regenerates synthetic CSV outputs, PNG plots, and `RESULTS.md`
- `experiments/real_attention_validation.py`: captures GPT-2 Q/K tensors and evaluates ranking preservation on real attention states
- `experiments/hierarchical_elimination.py`: compares flat and hierarchical elimination on real attention tensors
- `experiments/false_elimination_frontier.py`: sweeps `theta_elim` on real attention tensors to map elimination versus false-elimination frontiers by layer and head

Synthetic and real-attention experiments are both intended to inform feasibility, not to claim production benefit.

## Known Findings So Far

- Random projections preserve global similarity structure more effectively than exact top-attention ranking.
- Real transformer tensors behave differently from synthetic Gaussian tensors.
- False elimination remains the primary technical challenge.
- Some attention heads and layers appear substantially more sketch-preserving than others.
- Hierarchical elimination may improve elimination behavior in principle, but the current naive clustering baseline does not yet outperform flat elimination consistently.
- The current GPT-2 frontier sweep did not find safe-ish operating points with false elimination below 5% and elimination above 30%.

## Generate Results

```bash
make demo
```

This runs the test suite and then generates synthetic CSV outputs, PNG plots, and a refreshed [RESULTS.md](RESULTS.md) summary. If you only want to regenerate artifacts, use `make results`.

Additional targets:

- `make real-validation`
- `make hierarchical`
- `make frontier`
- `make all-results`

If `make` is not available in your shell, the equivalent commands are:

```bash
python -m pytest
python experiments/generate_results.py
```

For reproducible experiment workflows on Windows, using WSL is recommended:

```bash
wsl -e bash -c "pytest"
wsl -e bash -c "make results"
wsl -e bash -c "make frontier"
```

## Current State Of The Project

What currently works:

- synthetic sketch-quality sweeps
- elimination-threshold experiments
- GPT-2 attention tensor capture on CPU
- per-layer and per-head real attention metrics
- flat versus hierarchical elimination comparisons
- decode-step simulation with exact attention on surviving candidates
- illustrative bandwidth and resurrection modeling
- CSV, plot, and markdown result generation

What is currently simulated:

- anchor and residual uncertainty terms
- resurrection cost estimates
- memory-traffic comparisons

What remains hypothetical or unvalidated:

- quality retention on benchmark tasks
- runtime overlap between resurrection and decode compute
- end-to-end latency benefit in a production inference stack
- generalization from GPT-2 to larger modern models such as Llama, Mistral, and GQA-based decoders

What is future work:

- broader real-model Q/K capture
- LongBench and retrieval-style validation
- FlashAttention-compatible survivor paths
- GPU and memory-tier experiments

## Roadmap

### Phase 1 — Synthetic Validation

- synthetic sketch quality
- elimination sweeps
- bandwidth modeling

### Phase 2 — Real Attention Validation

- GPT-2 Q/K capture
- layer/head frontier analysis
- false elimination measurement

### Phase 3 — Modern Model Validation

- TinyLlama
- Mistral
- Llama-3 style architectures
- grouped-query attention behavior

### Phase 4 — Runtime Integration

- FlashAttention-compatible survivor path
- decode-side resurrection overlap
- GPU kernel hooks
- memory movement instrumentation

### Phase 5 — Memory-System Exploration

- hierarchical ghost indexes
- learned sketch functions
- CXL / near-memory filtering
- memory-side elimination experiments

Additional detail is in [docs/roadmap.md](docs/roadmap.md).

## Development Notes

- Python 3.10+
- Main dependencies: `numpy`, `matplotlib`, `torch`, `transformers`
- Test runner: `pytest`
- Editable install supported via `pip install -e ".[dev]"`

## License Clarification

The source code in this repository is available under the MIT License. That copyright license applies to the code itself; it does not by itself waive any separate patent rights that may be associated with related patent filings.

## License

MIT. See [LICENSE](LICENSE).

## Limitations

- GPT-2 is not representative of all modern LLMs.
- The repository does not include a production decode kernel.
- No real memory movement reduction is measured yet.
- The resurrection pipeline is still simulated.
- There is no FlashAttention integration.
- There is no end-to-end throughput benchmark.
- There is no proof of quality preservation on downstream tasks.

This repository currently explores feasibility and methodology, not production deployment.

## Disclaimer

GhostKV Lab is an experimental research repository exploring systems concepts related to KV-cache memory movement and bounded elimination in transformer inference workloads.

Current experiments are synthetic or small-model analytical studies intended for methodology exploration. The repository does not currently implement a production transformer runtime.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/manishklach/ghostkv-lab

Awesome Lists containing this project

README