{"id":50708333,"url":"https://github.com/manishklach/ghostkv-lab","last_synced_at":"2026-06-09T13:30:40.441Z","repository":{"id":358501429,"uuid":"1241647029","full_name":"manishklach/ghostkv-lab","owner":"manishklach","description":"Research harness for evaluating query-time bounded elimination of reconstructable KV-cache witnesses in long-context transformer inference workloads. Related provisional filing: IN 202641062451.","archived":false,"fork":false,"pushed_at":"2026-05-17T18:23:03.000Z","size":902,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-17T18:51:31.103Z","etag":null,"topics":["ai-infrastructure","attention-optimization","cxl","flashattention","gpu-memory","kv-cache","llm-inference","long-context","long-context-inference","memory-systems","systems-research","transformer","transformer-memory","transformer-optimization"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/manishklach.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"docs/roadmap.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-17T16:40:41.000Z","updated_at":"2026-05-17T18:23:06.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/manishklach/ghostkv-lab","commit_stats":null,"previous_names":["manishklach/ghostkv-lab"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/manishklach/ghostkv-lab","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fghostkv-lab","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fghostkv-lab/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fghostkv-lab/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fghostkv-lab/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/manishklach","download_url":"https://codeload.github.com/manishklach/ghostkv-lab/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fghostkv-lab/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34110009,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-09T02:00:06.510Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-infrastructure","attention-optimization","cxl","flashattention","gpu-memory","kv-cache","llm-inference","long-context","long-context-inference","memory-systems","systems-research","transformer","transformer-memory","transformer-optimization"],"created_at":"2026-06-09T13:30:39.326Z","updated_at":"2026-06-09T13:30:40.434Z","avatar_url":"https://github.com/manishklach.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GhostKV Lab\n\n[![CI](https://github.com/manishklach/ghostkv-lab/actions/workflows/ci.yml/badge.svg)](https://github.com/manishklach/ghostkv-lab/actions/workflows/ci.yml)\n[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)\n![Python 3.10+](https://img.shields.io/badge/Python-3.10%2B-blue)\n![Status: Research Prototype](https://img.shields.io/badge/Status-Research%20Prototype-6c757d)\n\n“A research simulator for query-time bounded elimination of reconstructable KV-cache witnesses in long-context transformer inference.”\n\nGhostKV Lab is a lightweight Python repository for studying whether sketch-based bounded elimination can reduce KV-cache memory movement while preserving attention quality in long-context decode workloads. It is built as a synthetic evaluation harness first: no heavyweight model downloads, no kernel claims, and no fabricated benchmark results.\n\nThe current empirical emphasis is on failure analysis as much as success cases: the most important result in the repository today is that the current GPT-2 frontier sweep did **not** find safe-ish operating points with `false_elimination_rate \u003c= 5%` and `elimination_rate \u003e= 30%`.\n\n## Patent Notice\n\nThis repository is associated with Indian provisional patent application `202641062451`, titled:\n\n“GHOSTKV: A SYSTEM AND METHOD FOR QUERY-TIME BOUNDED ELIMINATION OF RECONSTRUCTABLE KEY-VALUE WITNESSES IN TRANSFORMER ATTENTION MECHANISMS”\n\nFiled on `2026-05-17`.\n\nThe repository is intended as a research and evaluation harness for exploring the underlying systems concepts. A concise note is available in [docs/patent_notice.md](docs/patent_notice.md).\n\n## Current Status\n\nCurrent status:\n\n- Synthetic GhostKV simulator: working\n- GPT-2 real attention validation: working\n- False-elimination frontier analysis: working\n- Hierarchical elimination experiments: working\n- Synthetic result generation pipeline: working\n- Modern Llama/Mistral validation: pending\n- GPU kernel integration: pending\n- Production inference integration: not implemented\n\n## Current Research Focus\n\nThe current focus is false-elimination frontier analysis on real transformer attention tensors.\n\nThe key question:\n\nCan GhostKV eliminate meaningful amounts of cold KV state while keeping false elimination acceptably low?\n\nCurrent experiments focus on:\n\n- attention sketch preservation\n- bounded elimination behavior\n- layer/head sensitivity\n- hierarchical elimination\n- synthetic memory-traffic modeling\n\nLatency reduction and production inference integration remain future work.\n\n## Current Headline Result\n\nUnder the current GPT-2 real-attention frontier sweep, GhostKV Lab did **not** find safe-ish operating points meeting:\n\n- `false_elimination_rate \u003c= 5%`\n- `elimination_rate \u003e= 30%`\n\nThis is currently the most important result in the repository because it shows that coarse ranking preservation alone is not enough. High rank correlation can coexist with weak extreme-rank preservation and unacceptable elimination tradeoffs.\n\nSee:\n\n- [RESULTS.md](RESULTS.md)\n- [results/frontier/FRONTIER.md](results/frontier/FRONTIER.md)\n\nRun:\n\n```bash\nmake frontier\n```\n\nResults are written to:\n\n`results/frontier/`\n\n## What GhostKV Is\n\nGhostKV is a systems-oriented hypothesis for KV-cache handling during decode:\n\n- Cold KV-cache entries are converted into compact ghost records.\n- Each ghost record stores an attention sketch vector, a semantic anchor identifier, and a residual uncertainty term.\n- At query time, the simulator computes a conservative attention upper bound for each ghost record:\n\n`AttnUB(Q, G_i) = sketch_sim(Q, G_i.sketch) + epsilon_res_i + sigma_anchor_i`\n\n- Ghost tokens with an upper bound below `theta_elim` are eliminated.\n- Surviving ghost records are resurrected and included in exact attention.\n\nThe key property in this repository is exactness over survivors: approximation is confined to the elimination stage. Once candidates survive elimination, the simulator treats attention over `hot + resurrected` tokens as exact.\n\n## What GhostKV Is Not\n\n- Not a production LLM runtime\n- Not a CUDA kernel implementation\n- Not a proof of speedup\n- Not a substitute for real-model validation\n\nThis repository uses synthetic tensors first and now includes GPT-2 attention-tensor validation. Broader modern-model validation remains future work.\n\n## Architecture\n\n```text\nKV Cache\n  |\n  +--\u003e Hot / Warm / Ghost / Archive\n                    |\nQuery --\u003e Sketch --\u003e Bound --\u003e Eliminate or Resurrect --\u003e Exact Attend\n```\n\nThe working intuition is simple: eliminate before moving, but only if the elimination bound remains conservative enough to avoid unacceptable false elimination.\n\n## Repository Layout\n\n```text\nghostkv-lab/\n  docs/\n  src/ghostkv/\n  experiments/\n  tests/\n  results/\n  data/\n```\n\n## Quickstart\n\n### PowerShell\n\nThese commands work from the repo root in Windows PowerShell:\n\n```bash\npython -m venv .venv\n.venv\\Scripts\\activate\npython -m pip install -e \".[dev]\"\npython -m pytest\npython experiments/sketch_quality_audit.py\npython experiments/elimination_tradeoff.py\npython experiments/bandwidth_model_demo.py\npython experiments/synthetic_decode_simulation.py\npython experiments/generate_results.py\npython experiments/real_attention_validation.py\npython experiments/hierarchical_elimination.py\npython experiments/false_elimination_frontier.py\n```\n\n### WSL / Linux / macOS\n\nWSL is recommended for reproducible experiment workflows, especially for the heavier plotting and HuggingFace-based validation scripts.\n\n```bash\npython -m venv .venv\nsource .venv/bin/activate\npython -m pip install -e \".[dev]\"\npytest\nmake results\nmake frontier\npython experiments/real_attention_validation.py\n```\n\nFrom Windows, the same workflow can be invoked explicitly through WSL:\n\n```bash\nwsl -e bash -c \"pytest\"\nwsl -e bash -c \"make results\"\nwsl -e bash -c \"make frontier\"\nwsl -e bash -c \"python experiments/real_attention_validation.py\"\n```\n\nIf you prefer not to create a virtual environment, the same install and run commands work with the active Python environment as long as it is Python 3.10+.\n\n## Core Idea\n\nGhost records are compact witnesses for cold KV entries:\n\n1. `attention sketch vector`\n2. `semantic anchor id`\n3. `residual uncertainty value`\n\nAt each decode step:\n\n1. Project the query into sketch space.\n2. Compute conservative upper bounds for ghost records.\n3. Eliminate records with bounds below `theta_elim`.\n4. Resurrect survivors.\n5. Run exact attention over hot tokens plus resurrected tokens.\n\n## Why This Repo Exists\n\nLong-context inference can become bottlenecked by KV-cache movement rather than only by arithmetic throughput. This repository exists to evaluate whether bounded elimination can reduce the amount of KV state that must be moved or re-read on each decode step without aggressively approximating the final attention calculation.\n\n## Experiments\n\n- `experiments/sketch_quality_audit.py`: compares exact scores and sketch-space scores across sketch dimensions\n- `experiments/elimination_tradeoff.py`: sweeps elimination thresholds and sketch dimensions\n- `experiments/bandwidth_model_demo.py`: compares illustrative memory footprints for full KV, quantized KV, and GhostKV\n- `experiments/synthetic_decode_simulation.py`: runs a multi-step decode simulation and summarizes aggregate metrics\n- `experiments/generate_results.py`: regenerates synthetic CSV outputs, PNG plots, and `RESULTS.md`\n- `experiments/real_attention_validation.py`: captures GPT-2 Q/K tensors and evaluates ranking preservation on real attention states\n- `experiments/hierarchical_elimination.py`: compares flat and hierarchical elimination on real attention tensors\n- `experiments/false_elimination_frontier.py`: sweeps `theta_elim` on real attention tensors to map elimination versus false-elimination frontiers by layer and head\n\nSynthetic and real-attention experiments are both intended to inform feasibility, not to claim production benefit.\n\n## Known Findings So Far\n\n- Random projections preserve global similarity structure more effectively than exact top-attention ranking.\n- Real transformer tensors behave differently from synthetic Gaussian tensors.\n- False elimination remains the primary technical challenge.\n- Some attention heads and layers appear substantially more sketch-preserving than others.\n- Hierarchical elimination may improve elimination behavior in principle, but the current naive clustering baseline does not yet outperform flat elimination consistently.\n- The current GPT-2 frontier sweep did not find safe-ish operating points with false elimination below 5% and elimination above 30%.\n\n## Generate Results\n\n```bash\nmake demo\n```\n\nThis runs the test suite and then generates synthetic CSV outputs, PNG plots, and a refreshed [RESULTS.md](RESULTS.md) summary. If you only want to regenerate artifacts, use `make results`.\n\nAdditional targets:\n\n- `make real-validation`\n- `make hierarchical`\n- `make frontier`\n- `make all-results`\n\nIf `make` is not available in your shell, the equivalent commands are:\n\n```bash\npython -m pytest\npython experiments/generate_results.py\n```\n\nFor reproducible experiment workflows on Windows, using WSL is recommended:\n\n```bash\nwsl -e bash -c \"pytest\"\nwsl -e bash -c \"make results\"\nwsl -e bash -c \"make frontier\"\n```\n\n## Current State Of The Project\n\nWhat currently works:\n\n- synthetic sketch-quality sweeps\n- elimination-threshold experiments\n- GPT-2 attention tensor capture on CPU\n- per-layer and per-head real attention metrics\n- flat versus hierarchical elimination comparisons\n- decode-step simulation with exact attention on surviving candidates\n- illustrative bandwidth and resurrection modeling\n- CSV, plot, and markdown result generation\n\nWhat is currently simulated:\n\n- anchor and residual uncertainty terms\n- resurrection cost estimates\n- memory-traffic comparisons\n\nWhat remains hypothetical or unvalidated:\n\n- quality retention on benchmark tasks\n- runtime overlap between resurrection and decode compute\n- end-to-end latency benefit in a production inference stack\n- generalization from GPT-2 to larger modern models such as Llama, Mistral, and GQA-based decoders\n\nWhat is future work:\n\n- broader real-model Q/K capture\n- LongBench and retrieval-style validation\n- FlashAttention-compatible survivor paths\n- GPU and memory-tier experiments\n\n## Roadmap\n\n### Phase 1 — Synthetic Validation\n\n- synthetic sketch quality\n- elimination sweeps\n- bandwidth modeling\n\n### Phase 2 — Real Attention Validation\n\n- GPT-2 Q/K capture\n- layer/head frontier analysis\n- false elimination measurement\n\n### Phase 3 — Modern Model Validation\n\n- TinyLlama\n- Mistral\n- Llama-3 style architectures\n- grouped-query attention behavior\n\n### Phase 4 — Runtime Integration\n\n- FlashAttention-compatible survivor path\n- decode-side resurrection overlap\n- GPU kernel hooks\n- memory movement instrumentation\n\n### Phase 5 — Memory-System Exploration\n\n- hierarchical ghost indexes\n- learned sketch functions\n- CXL / near-memory filtering\n- memory-side elimination experiments\n\nAdditional detail is in [docs/roadmap.md](docs/roadmap.md).\n\n## Development Notes\n\n- Python 3.10+\n- Main dependencies: `numpy`, `matplotlib`, `torch`, `transformers`\n- Test runner: `pytest`\n- Editable install supported via `pip install -e \".[dev]\"`\n\n## License Clarification\n\nThe source code in this repository is available under the MIT License. That copyright license applies to the code itself; it does not by itself waive any separate patent rights that may be associated with related patent filings.\n\n## License\n\nMIT. See [LICENSE](LICENSE).\n\n## Limitations\n\n- GPT-2 is not representative of all modern LLMs.\n- The repository does not include a production decode kernel.\n- No real memory movement reduction is measured yet.\n- The resurrection pipeline is still simulated.\n- There is no FlashAttention integration.\n- There is no end-to-end throughput benchmark.\n- There is no proof of quality preservation on downstream tasks.\n\nThis repository currently explores feasibility and methodology, not production deployment.\n\n## Disclaimer\n\nGhostKV Lab is an experimental research repository exploring systems concepts related to KV-cache memory movement and bounded elimination in transformer inference workloads.\n\nCurrent experiments are synthetic or small-model analytical studies intended for methodology exploration. The repository does not currently implement a production transformer runtime.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmanishklach%2Fghostkv-lab","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmanishklach%2Fghostkv-lab","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmanishklach%2Fghostkv-lab/lists"}