{"id":46371131,"url":"https://github.com/efficientcontext/contextpilot","last_synced_at":"2026-03-05T04:01:08.471Z","repository":{"id":340589019,"uuid":"1131257882","full_name":"EfficientContext/ContextPilot","owner":"EfficientContext","description":"Accelerating Long Context LLM Inference with Accuracy-Preserving Context Optimization in SGLang, vLLM, llama.cpp, RAG, and Agentic AI.","archived":false,"fork":false,"pushed_at":"2026-03-04T21:01:34.000Z","size":17923,"stargazers_count":56,"open_issues_count":4,"forks_count":2,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-04T21:24:23.397Z","etag":null,"topics":["ai-agents","context-api","context-engineering","inference-optimization","prompt-engineering"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EfficientContext.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-09T18:00:28.000Z","updated_at":"2026-03-04T14:22:06.000Z","dependencies_parsed_at":null,"dependency_job_id":"bb09a150-4337-4fa4-89b6-fd715c350443","html_url":"https://github.com/EfficientContext/ContextPilot","commit_stats":null,"previous_names":["efficientcontext/contextpilot"],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/EfficientContext/ContextPilot","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EfficientContext%2FContextPilot","tags_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/repositories/EfficientContext%2FContextPilot/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EfficientContext%2FContextPilot/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EfficientContext%2FContextPilot/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/EfficientContext","download_url":"https://codeload.github.com/EfficientContext/ContextPilot/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EfficientContext%2FContextPilot/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30109075,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-05T03:40:26.266Z","status":"ssl_error","status_checked_at":"2026-03-05T03:39:15.902Z","response_time":93,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agents","context-api","context-engineering","inference-optimization","prompt-engineering"],"created_at":"2026-03-05T04:00:27.257Z","updated_at":"2026-03-05T04:01:08.433Z","avatar_url":"https://github.com/EfficientContext.png","language":"Python","readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"assets/about.png\" alt=\"ContextPilot Logo\" width=\"600\"/\u003e\n\n  \u003ch2\u003e\u003cstrong\u003eContextPilot: Fast 
Long-Context Inference via Context Reuse\u003c/strong\u003e\u003c/h2\u003e\n\n  [![Python](https://img.shields.io/badge/python-≥3.10-blue)](https://www.python.org/)\n  [![PyPI](https://img.shields.io/pypi/v/contextpilot)](https://pypi.org/project/contextpilot/)\n  [![License](https://img.shields.io/badge/license-Apache%202.0-green)](LICENSE)\n\n  \u003cp\u003e\u003cstrong\u003e4–12× cache hits | 1.5–3× faster prefill | ~36% token savings\u003c/strong\u003e across vLLM, SGLang, RAG, AI Agents, and more.\u003c/p\u003e\n\n\u003c/div\u003e\n\n--------------------------------------------------------------------------------\n\n| [**Documentation**](https://efficientcontext.github.io/contextpilot-docs/) | [**Examples**](examples/) | [**Benchmarks**](https://efficientcontext.github.io/contextpilot-docs/reference/benchmarks) | [**Docker**](https://efficientcontext.github.io/contextpilot-docs/getting_started/docker) | [**Paper**](https://arxiv.org/abs/2511.03475) |\n\n## News\n\n- [2026/03] ContextPilot can now run on **macOS / Apple Silicon** via [llama.cpp](docs/guides/mac_llama_cpp.md).\n- [2026/02] ContextPilot v0.3.2 released, supporting [PageIndex](https://github.com/VectifyAI/PageIndex) and [Mem0](https://github.com/mem0ai/mem0).\n- [2026/01] ContextPilot has been accepted to MLSys 2026 🎉! See you in Bellevue, WA, USA.\n\n## About\n\nLong-context workloads (RAG, memory chat, tool-augmented agents) prepend many context blocks. Across requests, these blocks often overlap but get reordered or duplicated, changing token prefixes and triggering cache misses and redundant KV recomputation. Common examples include (1) Trending Topic QA, (2) Closed-Domain Long-Context QA, (3) Batched Long-Context Inference, and (4) multi-turn conversations with long-term memory, among others.\n\nContextPilot sits between context assembly and inference to maximize prefix reuse and remove duplicates:\n\n1. 
**Higher throughput \u0026 cache hits** — boosts prefill throughput and prefix cache hit ratio via context reuse.  \n2. **Drop-in solutions** — works with [PageIndex](https://github.com/VectifyAI/PageIndex), [Mem0](https://github.com/mem0ai/mem0), [LMCache](https://github.com/LMCache/LMCache), and backends like [vLLM](https://github.com/vllm-project/vllm) / [SGLang](https://github.com/sgl-project/sglang) / [llama.cpp](docs/guides/mac_llama_cpp.md).\n3. **No compromise in reasoning quality** — can even improve with extremely long contexts.\n4. **Widely tested** — validated across diverse RAG and agentic workloads.\n\nIt maintains a **Context Index** of cached content, then per request applies **Reorder** (align shared blocks into a common prefix) and/or **Deduplicate** (replace repeats with reference hints), plus **cache-aware scheduling** to maximize prefix sharing. The optimized prompt is sent via the OpenAI-compatible API; `POST /evict` keeps the index synced when KV cache is reclaimed. See its design overview below.\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"assets/system_description.jpg\" alt=\"ContextPilot Architecture\" width=\"600\"/\u003e\n\u003c/div\u003e\n\n\u003e For more design details, see [Paper](https://arxiv.org/abs/2511.03475) and [Documentation](https://efficientcontext.github.io/contextpilot-docs/).\n\n## Performance at a Glance\n\nContextPilot is validated across three representative settings: single-node academic RAG, multi-node production MoE inference, and multi-turn memory-augmented chat. 
In every case it delivers significant speedups with comparable answer quality.\n\n**Qwen3-32B on 4×A6000** — single-node academic RAG with a 32B model on consumer GPUs.\n\n| Benchmark | Method | Prefill TP (tok/s) | Cache Hit | F1 (%) |\n|-----------|--------|--------------------|-----------|--------|\n| MultihopRAG | SGLang | 7,290 | 4.64% | 60.42 |\n|              | **SGLang + ContextPilot** | **14,214** | **33.97%** | **64.39** |\n| NarrativeQA | SGLang | 7,921 | 5.91% | 28.41 |\n|              | **SGLang + ContextPilot** | **12,117** | **20.82%** | **29.64** |\n\n**DeepSeek-R1-671B on 16×H20** — production-scale 671B MoE inference on a multi-node GPU cluster.\n\n| Benchmark | Method | Prefill TP (tok/s) | Cache Hit | F1 (%) |\n|-----------|--------|--------------------|-----------|--------|\n| MultihopRAG | SGLang | 9,636 | 5.12% | 64.15 |\n|            | **SGLang + ContextPilot** | **17,498** | **60.37%** | **64.68** |\n| NarrativeQA | SGLang | 8,687 | 6.08% | 40.20 |\n|            | **SGLang + ContextPilot** | **13,201** | **38.24%** | **41.08** |\n\n**Qwen3-4B on 1×A6000** — multi-turn memory chat with [Mem0](https://github.com/mem0ai/mem0) on the [LoCoMo](https://github.com/snap-research/locomo) benchmark.\n\n| Context Size | Method | TTFT (s) | LLM Judge |\n|--------------|--------|----------|-----------|\n| 100 memories | SGLang | 0.1012 | 0.437 |\n|            | **SGLang + ContextPilot** | **0.0554** | 0.420 |\n\n\u003eThe ContextPilot results in the Mem0 table are without context annotation — an optional feature that adds the original importance ranking to reordered context blocks, which can further improve answer quality (see [Paper](https://arxiv.org/abs/2511.03475)).\n\n**Llama-3.2-1B on Apple M3 (MacBook Air, 16 GB)** — MultihopRAG on Apple Silicon with llama.cpp, no GPU server required.\n\n| Method | Avg Latency (ms) |\n|--------|-----------------|\n| llama.cpp | 3,315 |\n| **llama.cpp + ContextPilot** | **1,378** |\n\nSettings: 
`Llama-3.2-1B-Instruct-Q4_K_M.gguf`, Metal offload (`-ngl 99`), `--cache-reuse 256`, `--parallel 4`, context 32768 tokens. See the [Mac + llama.cpp guide](docs/guides/mac_llama_cpp.md).\n\n## Installation\n\n**Requirements:** Python \u003e= 3.10\n\n---\n\n### vLLM / SGLang\n\nContextPilot works with both CPU and GPU backends for building the context index. The `[gpu]` extra enables GPU-accelerated distance computation (via `cupy-cuda12x`) and is faster for large batches; without it, ContextPilot falls back to the CPU backend automatically.\n\n**From PyPI** — the vLLM and SGLang hooks are installed automatically:\n```bash\npip install contextpilot          # CPU index computation\npip install \"contextpilot[gpu]\"   # GPU index computation (CUDA 12.x)\n```\n\n**From source** — run `install_hook` manually after install, since editable installs do not copy the `.pth` file to site-packages:\n```bash\ngit clone https://github.com/EfficientContext/ContextPilot.git\ncd ContextPilot\npip install -e .                  # CPU\npip install -e \".[gpu]\"           # GPU (CUDA 12.x)\npython -m contextpilot.install_hook   # one-time: enables automatic vLLM / SGLang integration\n```\n\nThe `install_hook` step writes a `.pth` file into your site-packages so the vLLM and SGLang hooks load automatically at Python startup — no code changes required. To uninstall: `python -m contextpilot.install_hook --remove`.\n\n---\n\n### Mac / Apple Silicon — llama.cpp\n\n**From PyPI:**\n```bash\npip install contextpilot\nxcode-select --install    # one-time: provides clang++ to compile the native hook\n```\n\n**From source:**\n```bash\ngit clone https://github.com/EfficientContext/ContextPilot.git\ncd ContextPilot\npip install -e .\nxcode-select --install    # one-time: provides clang++ to compile the native hook\n```\n\n\u003e **Why `xcode-select`?** The llama.cpp integration uses a small C++ shared library injected into `llama-server` via `DYLD_INSERT_LIBRARIES`. 
It is compiled automatically on first use and requires `clang++` from Xcode Command Line Tools.\n\n---\n\nMore [detailed installation instructions](https://efficientcontext.github.io/contextpilot-docs/getting_started/installation) are available in the docs.\n\nDocker images are also available for both all-in-one and standalone deployment. See the [Docker guide](https://efficientcontext.github.io/contextpilot-docs/getting_started/docker).\n\n## Getting Started\n\n### Quick Start with Context Ordering\n\nAdd **one call** (`cp_instance.optimize()`) before inference to rearrange context blocks so that shared content aligns into a common prefix, enabling cache reuse. An importance ranking in the prompt preserves accuracy.\n\n| Mode | When to Use | How It Works |\n|------|-------------|--------------|\n| **Online** | Multi-turn (e.g., chatbot + [Mem0](https://github.com/mem0ai/mem0)) | Tracks previously cached blocks; moves overlapping ones to the prefix each turn |\n| **Offline** | Batch / single-shot | Globally reorders and schedules all requests for maximum prefix sharing |\n\nBoth modes work with any OpenAI-compatible endpoint (vLLM, SGLang, etc.) — no changes to your inference deployment. They support both direct API calls (shown below) and HTTP server deployment (see the [online usage guide](https://efficientcontext.github.io/contextpilot-docs/guides/online_usage)).\n\n---\n\n#### Accelerating Online Inference\n\nMulti-turn chatbot with Mem0 or RAG where each turn's context blocks partially overlap. 
`cp_instance.optimize()` moves shared blocks to the prefix so the engine reuses cached KV states.\n\n```python\nfrom openai import OpenAI\n# Step 1: Import ContextPilot\nimport contextpilot as cp\n\nclient = OpenAI(base_url=\"http://localhost:30000/v1\", api_key=\"EMPTY\")\n# Step 2: Create a ContextPilot instance\ncp_instance = cp.ContextPilot(use_gpu=False)\n\nfor query in queries:\n    contexts = get_contexts(query)                         # Mem0, Retriever, ...\n    # Step 3: Optimize context ordering and build ready-to-use messages\n    messages = cp_instance.optimize(contexts, query)\n\n    response = client.chat.completions.create(\n        model=\"Qwen/Qwen3-4B\",\n        messages=messages,\n    )\n    print(f\"Q: {query}\\nA: {response.choices[0].message.content}\\n\")\n```\n\n\u003e **Note:** When the engine evicts KV-cache entries under memory pressure, ContextPilot's index can go stale. Set `CONTEXTPILOT_INDEX_URL` when launching [SGLang or vLLM](https://efficientcontext.github.io/contextpilot-docs/guides/online_usage#inference-engine-integration) to enable automatic eviction sync. For distributed setups, see [Distributed Setup](https://efficientcontext.github.io/contextpilot-docs/getting_started/installation#distributed-setup).\n\n---\n\n#### Accelerating Offline Inference\n\nBatch of requests with overlapping context blocks. `cp_instance.optimize_batch()` globally reorders blocks and schedules execution order so queries with similar contexts run consecutively, maximizing cache reuse. See the [offline usage guide](https://efficientcontext.github.io/contextpilot-docs/guides/offline_usage) for details. 
Offline mode can also be deployed as an HTTP server without eviction sync — see [Stateless Mode](https://efficientcontext.github.io/contextpilot-docs/guides/online_usage#stateless-mode).\n\n```python\nimport asyncio\nimport openai\n# Step 1: Import ContextPilot\nimport contextpilot as cp\n\nBASE_URL = \"http://localhost:30000/v1\"\n# Step 2: Create a ContextPilot instance\ncp_instance = cp.ContextPilot(use_gpu=False)\n\nall_contexts = [get_contexts(q) for q in queries]          # Mem0, Retriever, ...\n# Step 3: Optimize — reorder, schedule, and build prompts in one call\nmessages_batch, order = cp_instance.optimize_batch(all_contexts, queries)\n\n# Send all requests concurrently\nasync def generate_all():\n    ac = openai.AsyncOpenAI(base_url=BASE_URL, api_key=\"EMPTY\")\n    return await asyncio.gather(*[ac.chat.completions.create(\n        model=\"Qwen/Qwen3-4B\", messages=m\n    ) for m in messages_batch])\n\nfor resp, idx in zip(asyncio.run(generate_all()), order):\n    print(f\"Q: {queries[idx]}\\nA: {resp.choices[0].message.content}\\n\")\n```\n\nFor a detailed walkthrough with concrete examples, see the [Quick Start Guide](https://efficientcontext.github.io/contextpilot-docs/getting_started/quickstart). 
For more fine-grained control, you can also use `cp_instance.reorder()` and `cp_instance.deduplicate()` directly — see the [API reference](https://efficientcontext.github.io/contextpilot-docs/reference/api) and [multi-turn deduplication guide](https://efficientcontext.github.io/contextpilot-docs/guides/multi_turn).\n\n### Adoption Examples\n\nSee the adoption examples: [Mem0 integration](https://efficientcontext.github.io/contextpilot-docs/guides/mem0), [PageIndex RAG](https://efficientcontext.github.io/contextpilot-docs/guides/pageindex), [offline batch scheduling](https://efficientcontext.github.io/contextpilot-docs/guides/offline_usage), and [multi-turn deduplication](https://efficientcontext.github.io/contextpilot-docs/guides/multi_turn).\n\n## Citation\n```bibtex\n@inproceedings{contextpilot2026,\n  title     = {ContextPilot: Fast Long-Context Inference via Context Reuse},\n  author    = {Jiang, Yinsicheng and Huang, Yeqi and Cheng, Liang and Deng, Cheng and Sun, Xuan and Mai, Luo},\n  booktitle = {Proceedings of the 9th Conference on Machine Learning and Systems (MLSys 2026)},\n  year      = {2026},\n  url       = {https://arxiv.org/abs/2511.03475}\n}\n```\n\n## Contributing\n\nWe welcome and value all contributions! Please feel free to submit issues and pull requests.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fefficientcontext%2Fcontextpilot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fefficientcontext%2Fcontextpilot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fefficientcontext%2Fcontextpilot/lists"}