{"id":40391795,"url":"https://github.com/siddhant-k-code/distill","last_synced_at":"2026-05-02T17:01:30.189Z","repository":{"id":331326068,"uuid":"1125934989","full_name":"Siddhant-K-code/distill","owner":"Siddhant-K-code","description":"Reliable LLM outputs start with clean context. Deterministic deduplication, compression, and caching for RAG pipelines.","archived":false,"fork":false,"pushed_at":"2026-05-02T14:54:03.000Z","size":359,"stargazers_count":147,"open_issues_count":8,"forks_count":14,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-05-02T15:05:15.261Z","etag":null,"topics":["ai-agents","compression","context-optimization","deduplication","deterministic","developer-tools","go","golang","llamaindex","llm","pinecone","qdrant","rag","retrieval-augmented-generation","vector-database"],"latest_commit_sha":null,"homepage":"https://distill.siddhantkhare.com/","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Siddhant-K-code.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"Siddhant-K-code","patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"lfx_crowdfunding":null,"polar":null,"buy_me_a_coffee":"siddhantkhare","thanks_dev":null,"custom":null}},"created_at":"2025-12-31T17:16:14.000Z","updated_at":"2026-05-02T14:54:05.000Z","dependencies_parsed_at":null,"dependency_job_id":"6091a827-8d1b-4559-bd89-9c55aa41f331","html_url":"https://github.com/Siddhant-K-code/distill","commit_stats":null,"previous_names":["siddhant-k-code/distill"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/Siddhant-K-code/distill","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Siddhant-K-code%2Fdistill","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Siddhant-K-code%2Fdistill/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Siddhant-K-code%2Fdistill/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Siddhant-K-code%2Fdistill/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Siddhant-K-code","download_url":"https://codeload.github.com/Siddhant-K-code/distill/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Siddhant-K-code%2Fdistill/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32542201,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-02T12:25:33.646Z","status":"ssl_error","status_checked_at":"2026-05-02T12:24:51.733Z","response_time":132,"last_error":"SSL_read: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agents","compression","context-optimization","deduplication","deterministic","developer-tools","go","golang","llamaindex","llm","pinecone","qdrant","rag","retrieval-augmented-generation","vector-database"],"created_at":"2026-01-20T12:37:18.329Z","updated_at":"2026-05-02T17:01:30.169Z","avatar_url":"https://github.com/Siddhant-K-code.png","language":"Go","funding_links":["https://github.com/sponsors/Siddhant-K-code","https://buymeacoffee.com/siddhantkhare"],"categories":[],"sub_categories":[],"readme":"# Distill\n\n[![CI](https://github.com/Siddhant-K-code/distill/actions/workflows/ci.yml/badge.svg)](https://github.com/Siddhant-K-code/distill/actions/workflows/ci.yml)\n[![Release](https://img.shields.io/github/v/release/Siddhant-K-code/distill)](https://github.com/Siddhant-K-code/distill/releases/latest)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Go Report Card](https://goreportcard.com/badge/github.com/Siddhant-K-code/distill)](https://goreportcard.com/report/github.com/Siddhant-K-code/distill)\n\n\n[![Build with Ona](https://ona.com/build-with-ona.svg)](https://app.ona.com/#https://github.com/siddhant-k-code/distill)\n\n**Open-source context preprocessing for LLM applications.**\n\nDistill sits between your application and any LLM. It cleans up context before it's sent: deduplicating semantically redundant chunks, compressing conversation history as it ages, and placing cache markers on stable content so Anthropic's prompt cache actually fires.\n\nThe result: fewer tokens sent, lower cost per request, and context windows that don't fill up with noise.\n\n**[Learn more →](https://distill.siddhantkhare.com)**\n\n\u003e 📖 Distill implements the 4-layer context engineering stack described in **[The Agentic Engineering Guide](https://agents.siddhantkhare.com/05-context-engineering-stack/)**, a free open book on AI agent infrastructure.\n\n```\nRAG / tools / memory / docs\n          ↓\n        Distill\n  (dedupe · compress · cache)\n          ↓\n         LLM\n```\n\n## The Problem\n\n30-40% of context assembled from multiple sources is semantically redundant. The same information arrives from docs, code, memory, and tool outputs, all competing for attention in the same prompt.\n\nThis causes non-deterministic outputs, confused reasoning, and failures that only show up at scale. Better prompts don't fix it. The context going in needs to be clean.\n\n## How It Works\n\nNo LLM calls. Fully deterministic. 
~12ms overhead.\n\n| Stage | What it does |\n|-------|-------------|\n| **Deduplicate** | Cluster semantically similar chunks, keep one representative per cluster |\n| **Compress** | Extractive compression to remove noise and preserve signal |\n| **Summarize** | Progressively condense conversation history as turns age |\n| **Cache** | Annotate stable prefixes with `cache_control`, track TTL per prefix |\n\nAll four stages chain together via `POST /v1/pipeline` or `distill pipeline` CLI.\n\n### Dedup pipeline\n\n```\nQuery → Over-fetch (50) → Cluster → Select → MMR Re-rank (8) → LLM\n```\n\n1. **Over-fetch** - retrieve 3-5x more chunks than needed\n2. **Cluster** - group semantically similar chunks (agglomerative clustering)\n3. **Select** - pick the best representative from each cluster\n4. **MMR Re-rank** - balance relevance and diversity\n\n**Result:** Deterministic, diverse context. No LLM calls. Fully auditable.\n\n## Installation\n\n### Binary (Recommended)\n\nDownload from [GitHub Releases](https://github.com/Siddhant-K-code/distill/releases):\n\n```bash\n# macOS (Apple Silicon)\ncurl -sL $(curl -s https://api.github.com/repos/Siddhant-K-code/distill/releases/latest | grep \"browser_download_url.*darwin_arm64.tar.gz\" | cut -d '\"' -f 4) | tar xz\n\n# macOS (Intel)\ncurl -sL $(curl -s https://api.github.com/repos/Siddhant-K-code/distill/releases/latest | grep \"browser_download_url.*darwin_amd64.tar.gz\" | cut -d '\"' -f 4) | tar xz\n\n# Linux (amd64)\ncurl -sL $(curl -s https://api.github.com/repos/Siddhant-K-code/distill/releases/latest | grep \"browser_download_url.*linux_amd64.tar.gz\" | cut -d '\"' -f 4) | tar xz\n\n# Linux (arm64)\ncurl -sL $(curl -s https://api.github.com/repos/Siddhant-K-code/distill/releases/latest | grep \"browser_download_url.*linux_arm64.tar.gz\" | cut -d '\"' -f 4) | tar xz\n\n# Move to PATH\nsudo mv distill /usr/local/bin/\n```\n\nOr download directly from the [releases page](https://github.com/Siddhant-K-code/distill/releases/latest).\n\n### Go Install\n\n```bash\ngo install github.com/Siddhant-K-code/distill@latest\n```\n\n### Docker\n\n```bash\ndocker pull ghcr.io/siddhant-k-code/distill:latest\ndocker run -p 8080:8080 -e OPENAI_API_KEY=your-key ghcr.io/siddhant-k-code/distill\n```\n\n### Build from Source\n\n```bash\ngit clone https://github.com/Siddhant-K-code/distill.git\ncd distill\ngo build -o distill .\n```\n\n## Development\n\n```bash\nmake build        # compile ./distill\nmake test         # go test ./...\nmake check        # fmt + vet + test\nmake test-cover   # test with coverage report\nmake bench        # run benchmarks\nmake lint         # golangci-lint (requires golangci-lint in PATH)\nmake docker-build # build Docker image\nmake help         # list all targets\n```\n\n## Quick Start\n\n### 1. 
Standalone API (No Vector DB Required)\n\nStart the API server and send chunks directly:\n\n```bash\nexport OPENAI_API_KEY=\"your-key\"  # For embeddings\ndistill api --port 8080\n```\n\nDeduplicate chunks:\n\n```bash\ncurl -X POST http://localhost:8080/v1/dedupe \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"chunks\": [\n      {\"id\": \"1\", \"text\": \"React is a JavaScript library for building UIs.\"},\n      {\"id\": \"2\", \"text\": \"React.js is a JS library for building user interfaces.\"},\n      {\"id\": \"3\", \"text\": \"Vue is a progressive framework for building UIs.\"}\n    ]\n  }'\n```\n\nResponse:\n\n```json\n{\n  \"chunks\": [\n    {\"id\": \"1\", \"text\": \"React is a JavaScript library for building UIs.\", \"cluster_id\": 0},\n    {\"id\": \"3\", \"text\": \"Vue is a progressive framework for building UIs.\", \"cluster_id\": 1}\n  ],\n  \"stats\": {\n    \"input_count\": 3,\n    \"output_count\": 2,\n    \"reduction_pct\": 33,\n    \"latency_ms\": 12\n  }\n}\n```\n\n**With pre-computed embeddings (no OpenAI key needed):**\n\n```bash\ncurl -X POST http://localhost:8080/v1/dedupe \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"chunks\": [\n      {\"id\": \"1\", \"text\": \"React is...\", \"embedding\": [0.1, 0.2, ...]},\n      {\"id\": \"2\", \"text\": \"React.js is...\", \"embedding\": [0.11, 0.21, ...]},\n      {\"id\": \"3\", \"text\": \"Vue is...\", \"embedding\": [0.9, 0.8, ...]}\n    ]\n  }'\n```\n\n### 2. With Vector Database\n\nConnect to Pinecone or Qdrant for retrieval + deduplication:\n\n```bash\nexport PINECONE_API_KEY=\"your-key\"\nexport OPENAI_API_KEY=\"your-key\"\n\ndistill serve --index my-index --port 8080\n```\n\nQuery with automatic deduplication:\n\n```bash\ncurl -X POST http://localhost:8080/v1/retrieve \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"query\": \"how do I reset my password?\"}'\n```\n\n### 3. MCP Integration (AI Assistants)\n\nWorks with Claude, Cursor, Amp, and other MCP-compatible assistants:\n\n```bash\n# Dedup only\ndistill mcp\n\n# With memory and sessions\ndistill mcp --memory --session\n```\n\nAdd to Claude Desktop (`~/Library/Application Support/Claude/claude_desktop_config.json`):\n\n```json\n{\n  \"mcpServers\": {\n    \"distill\": {\n      \"command\": \"/path/to/distill\",\n      \"args\": [\"mcp\", \"--memory\", \"--session\"],\n      \"env\": {\n        \"OPENAI_API_KEY\": \"your-key\"\n      }\n    }\n  }\n}\n```\n\nSee [mcp/README.md](mcp/README.md) for more configuration options.\n\n## Context Memory\n\nPersistent memory that accumulates knowledge across agent sessions. 
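On recall, stored memories are scored by blending semantic similarity with recency. A minimal sketch of that blend (illustrative only - `recallScore` and `w` are hypothetical names, not the `pkg/memory` API):\n\n```go\n// recallScore implements the (1-w)*similarity + w*recency ranking\n// described below: w = 0 ranks purely by semantic similarity,\n// w = 1 purely by how recently the memory was touched.\nfunc recallScore(similarity, recency, w float64) float64 {\n    return (1-w)*similarity + w*recency\n}\n```\n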
Memories are deduplicated on write, ranked by relevance + recency on recall, and compressed over time through hierarchical decay.\n\nEnable with the `--memory` flag on `api` or `mcp` commands.\n\n### CLI\n\n```bash\n# Store a memory\ndistill memory store --text \"Auth uses JWT with RS256 signing\" --tags auth --source docs\n\n# Recall relevant memories\ndistill memory recall --query \"How does authentication work?\" --max-results 5\n\n# Remove outdated memories\ndistill memory forget --tags deprecated\n\n# View statistics\ndistill memory stats\n```\n\n### API\n\n```bash\n# Start API with memory enabled\ndistill api --port 8080 --memory\n\n# Store\ncurl -X POST http://localhost:8080/v1/memory/store \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"session_id\": \"session-1\",\n    \"entries\": [{\"text\": \"Auth uses JWT with RS256\", \"tags\": [\"auth\"], \"source\": \"docs\"}]\n  }'\n\n# Recall\ncurl -X POST http://localhost:8080/v1/memory/recall \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"query\": \"How does auth work?\", \"max_results\": 5}'\n```\n\n### MCP\n\nMemory tools are available in Claude Desktop, Cursor, and other MCP clients when `--memory` is enabled:\n\n```bash\ndistill mcp --memory\n```\n\nTools exposed: `store_memory`, `recall_memory`, `forget_memory`, `memory_stats`.\n\n### How Decay Works\n\nMemories compress over time based on access patterns:\n\n```\nFull text → Summary (~20%) → Keywords (~5%) → Evicted\n  (24h)        (7 days)         (30 days)\n```\n\nAccessing a memory resets its decay clock. Configure ages via `distill.yaml`:\n\n```yaml\nmemory:\n  db_path: distill-memory.db\n  dedup_threshold: 0.15\n```\n\n## Session Management\n\nToken-budgeted context windows for long-running agent sessions. Push context incrementally - Distill deduplicates, compresses aging entries, and evicts when the budget is exceeded.\n\nEnable with the `--session` flag on `api` or `mcp` commands.\n\n### CLI\n\n```bash\n# Create a session with 128K token budget\ndistill session create --session-id task-42 --max-tokens 128000\n\n# Push context as the agent works\ndistill session push --session-id task-42 --role user --content \"Fix the JWT validation bug\"\ndistill session push --session-id task-42 --role tool --content \"$(cat auth/jwt.go)\" --source file_read --importance 0.8\n\n# Read the current context window\ndistill session context --session-id task-42\n\n# Clean up when done\ndistill session delete --session-id task-42\n```\n\n### API\n\n```bash\n# Start API with sessions enabled\ndistill api --port 8080 --session\n\n# Create session\ncurl -X POST http://localhost:8080/v1/session/create \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"session_id\": \"task-42\", \"max_tokens\": 128000}'\n\n# Push entries\ncurl -X POST http://localhost:8080/v1/session/push \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"session_id\": \"task-42\",\n    \"entries\": [\n      {\"role\": \"tool\", \"content\": \"file contents...\", \"source\": \"file_read\", \"importance\": 0.8}\n    ]\n  }'\n\n# Read context window\ncurl -X POST http://localhost:8080/v1/session/context \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"session_id\": \"task-42\"}'\n```\n\n### MCP\n\nSession tools are available when `--session` is enabled:\n\n```bash\ndistill mcp --session\n```\n\nTools exposed: `create_session`, `push_session`, `session_context`, `delete_session`.\n\n### How Budget Enforcement Works\n\nWhen a push exceeds the token budget:\n\n1. 
**Compress** oldest entries (outside the `preserve_recent` window) through levels:\n   - Full text → Summary (~20%) → Single sentence (~5%) → Keywords (~1%)\n2. **Evict** entries that are already at keyword level\n3. Lowest-importance entries are compressed/evicted first\n\nThe `preserve_recent` setting (default: 10) keeps the most recent entries at full fidelity.\n\n## CLI Commands\n\n```bash\ndistill api        # Start standalone API server\ndistill serve      # Start server with vector DB connection\ndistill pipeline   # Run full optimisation pipeline (dedup → compress → summarize)\ndistill mcp        # Start MCP server for AI assistants\ndistill memory     # Store, recall, and manage persistent context memories\ndistill session    # Manage token-budgeted context windows for agent sessions\ndistill analyze    # Analyze a file for duplicates\ndistill sync       # Upload vectors to Pinecone with dedup\ndistill query      # Test a query from command line\ndistill config     # Manage configuration files\ndistill completion # Generate shell completion scripts (bash/zsh/fish/powershell)\n```\n\n### Pipeline command\n\n```bash\n# Run full pipeline on a JSON chunk array\necho '[{\"id\":\"1\",\"text\":\"...\"}]' | distill pipeline\n\n# From file, with stats\ndistill pipeline --input chunks.json --output optimised.json --stats\n\n# Tune individual stages\ndistill pipeline --dedup-threshold 0.2 --compress-ratio 0.4 --summarize --summarize-max-tokens 2000\n\n# Disable a stage\ndistill pipeline --no-compress\n```\n\n### Shell completions\n\n```bash\n# Bash (one-time)\ndistill completion bash \u003e /etc/bash_completion.d/distill\n\n# Zsh\ndistill completion zsh \u003e \"${fpath[1]}/_distill\"\n\n# Fish\ndistill completion fish \u003e ~/.config/fish/completions/distill.fish\n\n# PowerShell\ndistill completion powershell | Out-String | Invoke-Expression\n```\n\n## API Endpoints\n\n| Method | Path | Description |\n|--------|------|-------------|\n| POST | `/v1/dedupe` | Deduplicate chunks |\n| POST | `/v1/dedupe/stream` | SSE streaming dedup with per-stage progress |\n| POST | `/v1/pipeline` | Full optimisation pipeline (dedup → compress → summarize) |\n| POST | `/v1/batch` | Submit async batch job |\n| GET | `/v1/batch/{id}` | Poll batch job status and progress |\n| GET | `/v1/batch/{id}/results` | Retrieve completed batch results |\n| POST | `/v1/retrieve` | Query vector DB with dedup (requires backend) |\n| POST | `/v1/memory/store` | Store memories with write-time dedup (requires `--memory`) |\n| POST | `/v1/memory/recall` | Recall memories by relevance + recency (requires `--memory`) |\n| POST | `/v1/memory/forget` | Remove memories by ID, tag, or age (requires `--memory`) |\n| GET | `/v1/memory/stats` | Memory store statistics (requires `--memory`) |\n| POST | `/v1/session/create` | Create a session with token budget (requires `--session`) |\n| POST | `/v1/session/push` | Push entries with dedup + budget enforcement (requires `--session`) |\n| POST | `/v1/session/context` | Read current context window (requires `--session`) |\n| POST | `/v1/session/delete` | Delete a session (requires `--session`) |\n| GET | `/v1/session/get` | Get session metadata (requires `--session`) |\n| GET | `/health` | Health check |\n| GET | `/metrics` | Prometheus metrics |\n\n### Pipeline API\n\n```json\nPOST /v1/pipeline\n{\n  \"chunks\": [{\"id\": \"1\", \"text\": \"...\"}],\n  \"options\": {\n    \"dedup\":     {\"enabled\": true, \"threshold\": 0.15},\n    \"compress\":  {\"enabled\": true, \"target_reduction\": 
0.5},\n    \"summarize\": {\"enabled\": false, \"max_tokens\": 4000}\n  }\n}\n```\n\nResponse includes per-stage token counts, reduction ratios, and latency.\n\n### Batch API\n\n```bash\n# Submit\ncurl -X POST /v1/batch -d '{\"chunks\":[...],\"options\":{...}}'\n# → {\"job_id\":\"batch_1234\",\"status\":\"queued\"}\n\n# Poll\ncurl /v1/batch/batch_1234\n# → {\"status\":\"processing\",\"progress\":0.45}\n\n# Results (when completed)\ncurl /v1/batch/batch_1234/results\n# → {\"chunks\":[...],\"stats\":{...}}\n```\n\n## Logging\n\nDistill uses structured `log/slog` logging. Default output is JSON to stderr.\n\n```go\nimport \"github.com/Siddhant-K-code/distill/pkg/logging\"\n\n// JSON logger (production default)\nlogger := logging.New(logging.Config{Level: \"info\", Format: logging.FormatJSON})\n\n// Text logger for local development\nlogger := logging.NewDebug()\n\n// Attach request context\nlogger = logging.WithRequestID(logger, requestID)\nlogger = logging.WithTraceID(logger, traceID)\n```\n\nLog levels: `debug`, `info` (default), `warn`, `error`.\n\n## Configuration\n\n### Config File\n\nDistill supports a `distill.yaml` configuration file for persistent settings. Generate a template:\n\n```bash\ndistill config init              # Creates distill.yaml in current directory\ndistill config init --stdout     # Print template to stdout\ndistill config validate          # Validate existing config file\n```\n\nConfig file search order: `./distill.yaml`, `$HOME/distill.yaml`.\n\n**Priority:** CLI flags \u003e environment variables \u003e config file \u003e defaults.\n\nExample `distill.yaml`:\n\n```yaml\nserver:\n  port: 8080\n  host: 0.0.0.0\n  read_timeout: 30s\n  write_timeout: 60s\n\nembedding:\n  provider: openai\n  model: text-embedding-3-small\n  batch_size: 100\n\ndedup:\n  threshold: 0.15\n  method: agglomerative\n  linkage: average\n  lambda: 0.5\n  enable_mmr: true\n\nretriever:\n  backend: pinecone    # pinecone or qdrant\n  index: my-index\n  host: \"\"             # required for qdrant\n  namespace: \"\"\n  top_k: 50\n  target_k: 8\n\nauth:\n  api_keys:\n    - ${DISTILL_API_KEY}\n\nmemory:\n  db_path: distill-memory.db\n  dedup_threshold: 0.15\n\nsession:\n  db_path: distill-sessions.db\n  dedup_threshold: 0.15\n  max_tokens: 128000\n```\n\nEnvironment variables can be referenced using `${VAR}` or `${VAR:-default}` syntax.\n\n### Environment Variables\n\n```bash\nOPENAI_API_KEY      # For text → embedding conversion (see note below)\nPINECONE_API_KEY    # For Pinecone backend\nQDRANT_URL          # For Qdrant backend (default: localhost:6334)\nDISTILL_API_KEYS    # Optional: protect your self-hosted instance (see below)\n```\n\n### Protecting Your Self-Hosted Instance\n\nIf you're exposing Distill publicly, set `DISTILL_API_KEYS` to require authentication:\n\n```bash\n# Generate a random API key\nexport DISTILL_API_KEYS=\"sk-$(openssl rand -hex 32)\"\n\n# Or multiple keys (comma-separated)\nexport DISTILL_API_KEYS=\"sk-key1,sk-key2,sk-key3\"\n```\n\nThen include the key in requests:\n\n```bash\ncurl -X POST http://your-server:8080/v1/dedupe \\\n  -H \"Authorization: Bearer sk-your-key\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"chunks\": [...]}'\n```\n\nIf `DISTILL_API_KEYS` is not set, the API is open (suitable for local/internal use).\n\n### About OpenAI API Key\n\n**When you need it:**\n- Sending text chunks without pre-computed embeddings\n- Using text queries with vector database retrieval\n- Using the MCP server with text-based tools\n\n**When you DON'T need 
it:**\n- Sending chunks with pre-computed embeddings (include `\"embedding\": [...]` in your request)\n- Using Distill purely for clustering/deduplication on existing vectors\n\n**What it's used for:**\n- Converts text to embeddings using `text-embedding-3-small` model\n- ~$0.00002 per 1K tokens (very cheap)\n- Embeddings are used only for similarity comparison, never stored\n\n**Alternatives:**\n- Bring your own embeddings - include `\"embedding\"` field in chunks\n- Self-host an embedding model - set `EMBEDDING_API_URL` to your endpoint\n\n### Parameters\n\n| Parameter | Description | Default |\n|-----------|-------------|---------|\n| `--threshold` | Clustering distance (lower = stricter) | 0.15 |\n| `--lambda` | MMR balance: 1.0 = relevance, 0.0 = diversity | 0.5 |\n| `--over-fetch-k` | Chunks to retrieve initially | 50 |\n| `--target-k` | Chunks to return after dedup | 8 |\n\n## Self-Hosting\n\n### Docker (Recommended)\n\nUse the pre-built image from GitHub Container Registry:\n\n```bash\n# Pull and run\ndocker run -p 8080:8080 -e OPENAI_API_KEY=your-key ghcr.io/siddhant-k-code/distill:latest\n\n# Or with a specific version\ndocker run -p 8080:8080 -e OPENAI_API_KEY=your-key ghcr.io/siddhant-k-code/distill:v0.1.0\n```\n\n### Docker Compose\n\n```bash\n# Start Distill + Qdrant (local vector DB)\ndocker-compose up\n```\n\n### Build from Source\n\n```bash\ndocker build -t distill .\ndocker run -p 8080:8080 -e OPENAI_API_KEY=your-key distill api\n```\n\n### Fly.io\n\n```bash\nfly launch\nfly secrets set OPENAI_API_KEY=your-key\nfly deploy\n```\n\n### Render\n\n[![Deploy to Render](https://render.com/images/deploy-to-render-button.svg)](https://render.com/deploy?repo=https://github.com/Siddhant-K-code/distill)\n\nOr manually:\n1. Connect your GitHub repo\n2. Set environment variables (`OPENAI_API_KEY`)\n3. Deploy\n\n### Railway\n\nConnect your repo and set `OPENAI_API_KEY` in environment variables.\n\n## Monitoring\n\nDistill exposes a Prometheus-compatible `/metrics` endpoint on both `api` and `serve` commands.\n\n### Metrics\n\n**Pipeline metrics**\n\n| Metric | Type | Description |\n|--------|------|-------------|\n| `distill_requests_total` | Counter | Total requests by endpoint and status code |\n| `distill_request_duration_seconds` | Histogram | Request latency distribution |\n| `distill_chunks_processed_total` | Counter | Chunks processed (input/output) |\n| `distill_reduction_ratio` | Histogram | Chunk reduction ratio per request |\n| `distill_active_requests` | Gauge | Currently processing requests |\n| `distill_clusters_formed_total` | Counter | Clusters formed during deduplication |\n\n**Cache cost metrics**\n\nRecord Anthropic API usage with `metrics.RecordCacheUsage(UsageRecord{...})` after each API call to track prompt cache efficiency:\n\n| Metric | Type | Description |\n|--------|------|-------------|\n| `distill_cache_creation_tokens_total` | Counter | Tokens written to Anthropic cache (charged at 1.25× input price) |\n| `distill_cache_read_tokens_total` | Counter | Tokens read from Anthropic cache (charged at 0.10× input price) |\n| `distill_uncached_input_tokens_total` | Counter | Uncached input tokens (charged at 1.00×) |\n| `distill_cache_hit_rate` | Gauge | Rolling hit rate: `cache_read / (cache_read + cache_creation + input)` |\n| `distill_cache_write_efficiency` | Gauge | Reads/writes ratio. 
Values below 1.0 mean cache writes that expire before being read |\n\n**Per-call-site hit rate tracking**\n\n`CallSiteTracker` records Anthropic API usage per call site and surfaces the worst performers first:\n\n```go\ntracker := metrics.NewCallSiteTracker()\n\n// After each Anthropic API call:\ntracker.Record(\"agent/planner.go:84\", metrics.UsageRecord{\n    CacheCreationInputTokens: resp.Usage.CacheCreationInputTokens,\n    CacheReadInputTokens:     resp.Usage.CacheReadInputTokens,\n    InputTokens:              resp.Usage.InputTokens,\n})\n\n// Inspect\ns := tracker.Stats(\"agent/planner.go:84\")\nfmt.Printf(\"hit rate: %.0f%%  efficiency: %.1fx\\n\", s.HitRate()*100, s.WriteEfficiency())\n\n// All call sites, worst hit rate first\nfor _, s := range tracker.AllStats() {\n    fmt.Printf(\"%-40s %.0f%%\\n\", s.CallSite, s.HitRate()*100)\n}\n```\n\n**Cache boundary metrics** (populated by the session boundary manager)\n\n| Metric | Type | Description |\n|--------|------|-------------|\n| `distill_cache_boundary_position_tokens` | Gauge | Current boundary position in tokens per session |\n| `distill_cache_boundary_advances_total` | Counter | Times the boundary moved forward (more content became stable) |\n| `distill_cache_boundary_retreats_total` | Counter | Times the boundary retreated (content changed or was evicted) |\n| `distill_cache_estimated_savings_tokens_total` | Counter | Estimated tokens saved by prompt caching |\n\n### Prometheus Scrape Config\n\n```yaml\nscrape_configs:\n  - job_name: distill\n    static_configs:\n      - targets: ['localhost:8080']\n```\n\n### Grafana Dashboard\n\nImport the included dashboard from `grafana/dashboard.json` or use dashboard UID `distill-overview`.\n\n### OpenTelemetry Tracing\n\nDistill supports distributed tracing via OpenTelemetry. Each pipeline stage (embedding, clustering, selection, MMR) is instrumented as a separate span.\n\nEnable via `distill.yaml`:\n\n```yaml\ntelemetry:\n  tracing:\n    enabled: true\n    exporter: otlp         # otlp, stdout, or none\n    endpoint: localhost:4317\n    sample_rate: 1.0\n    insecure: true\n```\n\nOr via environment variables:\n\n```bash\nexport DISTILL_TELEMETRY_TRACING_ENABLED=true\nexport DISTILL_TELEMETRY_TRACING_ENDPOINT=localhost:4317\n```\n\nSpans emitted per request:\n\n| Span | Attributes |\n|------|------------|\n| `distill.request` | endpoint |\n| `distill.embedding` | chunk_count |\n| `distill.clustering` | input_count, threshold |\n| `distill.selection` | cluster_count |\n| `distill.mmr` | input_count, lambda |\n| `distill.retrieval` | top_k, backend |\n\nResult attributes (`distill.result.*`) are added to the root span: input_count, output_count, cluster_count, latency_ms, reduction_ratio.\n\nW3C Trace Context propagation is enabled by default for cross-service tracing.\n\n## Pipeline Modules\n\n### Compression (`pkg/compress`)\n\nReduces token count while preserving meaning. Three strategies:\n\n- **Extractive** - Scores sentences by position, keyword density, and length; keeps the most salient spans\n- **Placeholder** - Replaces verbose JSON, XML, and table outputs with compact structural summaries\n- **Pruner** - Strips filler phrases, redundant qualifiers, and boilerplate patterns\n\nStrategies can be chained via `compress.Pipeline`. Configure with target reduction ratio (e.g., 0.3 = keep 30% of original).\n\n### Memory (`pkg/memory`)\n\nPersistent context memory across agent sessions. SQLite-backed with write-time deduplication via cosine similarity. 
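The write-time dedup check reduces to a cosine comparison against existing entries. A minimal sketch (illustrative only - `cosineSim` is a hypothetical helper, not the `pkg/memory` API):\n\n```go\nimport \"math\"\n\n// cosineSim returns the cosine similarity of two equal-length embedding vectors.\nfunc cosineSim(a, b []float64) float64 {\n    var dot, na, nb float64\n    for i := range a {\n        dot += a[i] * b[i]\n        na += a[i] * a[i]\n        nb += b[i] * b[i]\n    }\n    return dot / (math.Sqrt(na) * math.Sqrt(nb))\n}\n\n// A new entry is treated as a duplicate of an existing memory when their\n// cosine distance (1 - cosineSim) falls below dedup_threshold (default 0.15).\n```\n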
Memories decay over time: full text → summary → keywords → evicted. Recall ranked by `(1-w)*similarity + w*recency`. Enable with `--memory` flag.\n\n#### Lifecycle events\n\nThe `DecayWorker` emits typed events on every state transition so that cache boundary managers and other subscribers can stay in sync:\n\n| Event | When | Cache boundary action |\n|-------|------|-----------------------|\n| `EventCompressed` | Entry compressed to summary or keywords | Retreat boundary: cached prefix is now stale |\n| `EventEvicted` | Entry removed from store | Retreat boundary: entry no longer exists |\n| `EventStabilized` | Entry promoted to stable | Advance boundary to include entry |\n\nRegister a handler on any `Store`:\n\n```go\nstore.OnLifecycleEvent(func(e memory.MemoryEvent) {\n    // e.Type, e.EntryID, e.TokensBefore, e.TokensAfter, e.CompressionLevel\n})\n```\n\nMultiple handlers can be registered; they are called in registration order. Handlers must be non-blocking.\n\n#### Cache boundary hint on recall\n\n`RecallResult` now includes a `CacheHint` field. Entries with recall relevance ≥ 0.7 are listed as stable candidates, giving the boundary manager early signal without waiting for the normal stability promotion cycle:\n\n```go\nresult, _ := store.Recall(ctx, req)\nif result.CacheHint != nil {\n    // result.CacheHint.StableEntryIDs - IDs likely stable this turn\n    // result.CacheHint.ConfidenceScore - mean relevance of returned entries\n}\n```\n\n### Session (`pkg/session`)\n\nToken-budgeted context windows for long-running tasks. Entries are deduplicated on push, compressed through hierarchical levels when the budget is exceeded, and evicted by importance. The `preserve_recent` setting keeps the N most recent entries at full fidelity. Enable with `--session` flag.\n\n#### Session-aware cache boundary manager\n\nAfter each push, Distill automatically evaluates the optimal `cache_control` placement for the next request. Entries that have been present for `min_stable_turns` (default: 2) consecutive pushes without modification are considered stable and included in the cached prefix.\n\n`PushResult` now includes a `cache_boundary` field:\n\n```json\n{\n  \"session_id\": \"task-42\",\n  \"accepted\": 2,\n  \"current_tokens\": 4200,\n  \"budget_remaining\": 123800,\n  \"cache_boundary\": {\n    \"markers\": [\n      {\"entry_id\": \"abc123\", \"tokens_up_to_here\": 3800, \"stable_since_turn\": 1}\n    ],\n    \"total_stable_tokens\": 3800,\n    \"advanced\": true,\n    \"retreated\": false\n  }\n}\n```\n\nConfigure via `distill.yaml`:\n\n```yaml\nsession:\n  cache_boundary:\n    enabled: true\n    min_stable_turns: 2     # pushes before an entry is considered stable\n    min_prefix_tokens: 1024 # Anthropic's minimum cacheable prefix size\n    max_markers: 4          # Anthropic allows up to 4 simultaneous markers\n```\n\n### Cache (`pkg/cache`)\n\nKV cache for repeated context patterns (system prompts, tool definitions, boilerplate). Sub-millisecond retrieval for cache hits.\n\n- **MemoryCache** - In-memory LRU with TTL, configurable size limits (entries and bytes), background cleanup\n- **PatternDetector** - Identifies cacheable content and emits `CacheAnnotation` per chunk. Use `AnnotateChunksForCache` to get a `CacheControlPlan` with up to 4 `cache_control` markers (Anthropic's limit) placed at the highest-token-count stable chunks. 
Auto-placement is skipped when the caller has already set markers manually.\n- **PrefixPartition** - Splits a chunk slice into a frozen cache prefix and a dedup-eligible suffix. Used by the `preserve_cache_prefix` dedup option to prevent Distill from reordering chunks that appear before a `cache_control` breakpoint.\n- **StabilityValidator** - Tracks prefix hashes across requests and detects dynamic content bleeding into cached prefixes. Reports instability with a likely cause and supports static text analysis for pre-flight checks.\n- **RedisCache** - Interface for distributed deployments (requires external Redis)\n\n#### Cache-aware dedup (`preserve_cache_prefix`)\n\nDistill's dedup pipeline can reorder chunks to improve context quality. When prompt caching is active, reordering chunks before the `cache_control` breakpoint changes the prefix hash and causes a cache miss. Use `preserve_cache_prefix` to freeze the prefix:\n\n```json\nPOST /v1/dedupe\n{\n  \"chunks\": [\n    {\"id\": \"sys\", \"text\": \"You are a helpful assistant.\", \"cache_control\": \"ephemeral\"},\n    {\"id\": \"tool1\", \"text\": \"Tool schema JSON...\", \"cache_control\": \"ephemeral\"},\n    {\"id\": \"msg1\", \"text\": \"What is the capital of France?\"},\n    {\"id\": \"msg2\", \"text\": \"What is the capital of Germany?\"}\n  ],\n  \"options\": {\"preserve_cache_prefix\": true}\n}\n```\n\nResponse stats when prefix is frozen:\n\n```json\n{\n  \"stats\": {\n    \"input_count\": 4, \"output_count\": 3,\n    \"cache_prefix_frozen\": true,\n    \"cache_prefix_tokens\": 320,\n    \"cache_prefix_hash\": \"a3f2c1d4e5b6\",\n    \"suffix_input_count\": 2,\n    \"suffix_output_count\": 1\n  }\n}\n```\n\n#### TTL-aware cache tracker\n\n`TTLTracker` monitors Anthropic's 5-minute prompt cache TTL per prefix hash. 
Use it to detect cold-start penalties and schedule batch requests before the cache expires:\n\n```go\ntracker := cache.NewTTLTracker(0) // 0 = use AnthropicCacheTTL (5 min)\n\n// After each request that carries a cache_control marker:\nwasAlive := tracker.Touch(plan.PrefixHash)\nif !wasAlive {\n    log.Warn(\"cache cold start: first request or TTL expired\")\n}\n\n// For batch workloads: latest safe time to send next request\ndeadline := tracker.ScheduleDeadline(plan.PrefixHash, 30*time.Second)\ntime.Sleep(time.Until(deadline))\n\n// Inspect expiry state\nentry := tracker.Entry(plan.PrefixHash)\nfmt.Printf(\"hits: %d  misses: %d  alive: %v\\n\", entry.HitCount, entry.MissCount, entry.IsAlive())\n```\n\n#### Prefix stability validator\n\nDetects dynamic content (timestamps, request IDs, UUIDs) bleeding into cached prefixes, which is the most common cause of 0% cache hit rates:\n\n```go\nvalidator := cache.NewStabilityValidator(cache.DefaultStabilityConfig())\n\n// Runtime check, call on every request\nissues := validator.Check(\"agent/planner.go:84\", chunks)\nfor _, issue := range issues {\n    log.Warnf(\"%s\", issue) // \"cache-prefix-unstable: stability=12%, likely dynamic interpolation: request id\"\n}\n\n// Static pre-flight check\nfound := validator.ValidateText(systemPromptText)\n// found = [\"request id\", \"timestamp\"] if dynamic patterns detected\n```\n\n#### Automatic cache_control placement\n\n```go\ndetector := cache.NewPatternDetector()\nplan := detector.AnnotateChunksForCache(chunks)\n// plan.Markers lists which chunk indices should receive cache_control markers\n// plan.ManualMarkersPresent is true if the caller already placed markers\n```\n\nPattern → annotation mapping:\n\n| Pattern | Recommended | Condition |\n|---------|-------------|-----------|\n| `system_prompt` | Yes | Always |\n| `tool_definition` | Yes | Always |\n| `code_block` | Conditional | Token count ≥ 512 |\n| `document` | Yes | Always |\n| `user_message` | No | Dynamic per turn |\n\n## Architecture\n\n```\n┌──────────────────────────────────────────────────────────────────────┐\n│                         Your App / Agent                             │\n└──────────────────────────────────────────────────────────────────────┘\n                                  │\n                                  ▼\n┌──────────────────────────────────────────────────────────────────────┐\n│                             Distill                                  │\n│                                                                      │\n│  Dedup Pipeline (shipped)                                            │\n│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌──────────┐  ┌─────────┐  │\n│  │  Cache  │→ │ Cluster │→ │ Select  │→ │ Compress │→ │  MMR    │  │\n│  │  check  │  │  dedup  │  │  best   │  │  prune   │  │ re-rank │  │\n│  └─────────┘  └─────────┘  └─────────┘  └──────────┘  └─────────┘  │\n│     \u003c1ms          6ms         \u003c1ms          2ms           3ms        │\n│                                                                      │\n│  Context Intelligence                                                │\n│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐   │\n│  │ Memory Store │  │ Impact Graph │  │ Session Context Windows  │   │\n│  │  (shipped)   │  │  (#30)       │  │  (shipped)               │   │\n│  └──────────────┘  └──────────────┘  └──────────────────────────┘   │\n│                                                                      │\n│  
┌──────────────────────────────────────────────────────────────┐    │\n│  │  /metrics (Prometheus)  ·  OTEL tracing  ·  MCP server      │    │\n│  └──────────────────────────────────────────────────────────────┘    │\n└──────────────────────────────────────────────────────────────────────┘\n                                  │\n                                  ▼\n┌──────────────────────────────────────────────────────────────────────┐\n│                              LLM                                     │\n└──────────────────────────────────────────────────────────────────────┘\n```\n\n## Supported Backends\n\n- **Pinecone** - Fully supported\n- **Qdrant** - Fully supported\n- **Weaviate** - Coming soon\n\n## Use Cases\n\n- **Code Assistants** - Dedupe context from multiple files/repos\n- **RAG Pipelines** - Remove redundant chunks before LLM\n- **Agent Workflows** - Clean up tool outputs + memory + docs\n- **Incident Triage** - Find similar past changes that caused outages\n- **Code Review** - Blast radius analysis for PRs\n- **Enterprise** - Deterministic outputs with source attribution\n\n## Embedding Providers\n\nDistill supports multiple embedding backends via a unified factory. Import the provider package to register it, then call `embedding.NewProvider`:\n\n```go\nimport (\n    \"github.com/Siddhant-K-code/distill/pkg/embedding\"\n    _ \"github.com/Siddhant-K-code/distill/pkg/embedding/openai\"  // register OpenAI\n    _ \"github.com/Siddhant-K-code/distill/pkg/embedding/ollama\"  // register Ollama\n    _ \"github.com/Siddhant-K-code/distill/pkg/embedding/cohere\"  // register Cohere\n)\n\nprovider, err := embedding.NewProvider(embedding.ProviderConfig{\n    Type:      embedding.ProviderOllama,   // \"openai\" | \"ollama\" | \"cohere\"\n    BaseURL:   \"http://localhost:11434\",   // optional override\n    Model:     \"nomic-embed-text\",         // optional override\n    CacheSize: 10000,                      // 0 = default (10k), -1 = disabled\n})\n```\n\n| Provider | Type string | Default model | Notes |\n|----------|-------------|---------------|-------|\n| OpenAI | `openai` | `text-embedding-3-small` | Requires `OPENAI_API_KEY` |\n| Ollama | `ollama` | `nomic-embed-text` | Local server, no API key |\n| Cohere | `cohere` | `embed-english-v3.0` | Requires `COHERE_API_KEY` |\n\nCustom providers can be registered at startup:\n\n```go\nembedding.RegisterFactory(\"my-provider\", func(cfg embedding.ProviderConfig) (embedding.Provider, error) {\n    return myProvider{apiKey: cfg.APIKey}, nil\n})\n```\n\n## Roadmap\n\nDistill is evolving from a dedup utility into a context intelligence layer. Here's what's next:\n\n### Context Memory\n\n| Feature | Issue | Status | Description |\n|---------|-------|--------|-------------|\n| **Context Memory Store** | [#29](https://github.com/Siddhant-K-code/distill/issues/29) | Shipped | Persistent, deduplicated memory across sessions. Write-time dedup, hierarchical decay, token-budgeted recall. See [Context Memory](#context-memory). |\n| **Session Management** | [#31](https://github.com/Siddhant-K-code/distill/issues/31) | Shipped | Stateful context windows with token budgets, hierarchical compression, and importance-based eviction. See [Session Management](#session-management). 
|\n| **PatternDetector cache_control annotations** | [#53](https://github.com/Siddhant-K-code/distill/issues/53) | Shipped | `PatternDetector` emits `CacheAnnotation` per chunk and `AnnotateChunksForCache` produces a `CacheControlPlan` with up to 4 Anthropic-compatible markers. |\n| **Session-aware cache boundary manager** | [#51](https://github.com/Siddhant-K-code/distill/issues/51) | Shipped | Auto-advances `cache_control` placement as sessions grow. Stable entries (present ≥ 2 turns unmodified) are included in the cached prefix; boundary retreats when content changes. |\n| **Cache write cost accounting** | [#52](https://github.com/Siddhant-K-code/distill/issues/52) | Shipped | 9 new Prometheus metrics covering Anthropic prompt cache token usage, hit rate, write efficiency, and boundary position. Feed API response usage via `RecordCacheUsage`. |\n| **Memory decay lifecycle events** | [#54](https://github.com/Siddhant-K-code/distill/issues/54) | Shipped | `DecayWorker` emits `EventCompressed` and `EventEvicted` on each transition. `RecallResult` includes a `CacheHint` for high-relevance entries. |\n| **Cache-aware dedup** | [#50](https://github.com/Siddhant-K-code/distill/issues/50) | Shipped | `preserve_cache_prefix` option freezes chunks before the last `cache_control` marker so dedup cannot reorder them. Prefix hash and token count reported in stats. |\n| **Prefix stability validator** | [#48](https://github.com/Siddhant-K-code/distill/issues/48) | Shipped | `StabilityValidator` tracks prefix hashes across requests and detects dynamic content (timestamps, request IDs, UUIDs) bleeding into cached prefixes. |\n| **Per-call-site hit rate tracking** | [#47](https://github.com/Siddhant-K-code/distill/issues/47) | Shipped | `CallSiteTracker` records Anthropic cache usage per call site; `AllStats()` returns worst performers first. |\n| **TTL-aware cache tracker** | [#49](https://github.com/Siddhant-K-code/distill/issues/49) | Shipped | `TTLTracker` monitors Anthropic's 5-minute cache TTL per prefix hash. `ScheduleDeadline` tells batch jobs the latest safe time to send the next request. |\n| **Multi-provider embedding abstraction** | [#33](https://github.com/Siddhant-K-code/distill/issues/33) | Shipped | `embedding.NewProvider` factory supports OpenAI, Ollama, and Cohere via a unified `ProviderConfig`. Custom providers register via `RegisterFactory`. |\n\n### Code Intelligence\n\n| Feature | Issue | Status | Description |\n|---------|-------|--------|-------------|\n| **Change Impact Graph** | [#30](https://github.com/Siddhant-K-code/distill/issues/30) | Shipped | `pkg/graph`: BFS blast-radius queries over a dependency graph built from Go imports. |\n| **Semantic Commit Analysis** | [#32](https://github.com/Siddhant-K-code/distill/issues/32) | Shipped | `pkg/commits`: Conventional Commits parser, heuristic risk scoring, cosine similarity search over commit embeddings. |\n\n### Infrastructure\n\n| Feature | Issue | Status | Description |\n|---------|-------|--------|-------------|\n| **Multi-Provider Embeddings** | [#33](https://github.com/Siddhant-K-code/distill/issues/33) | Shipped | `embedding.NewProvider` factory: OpenAI, Ollama, Cohere via unified `ProviderConfig`. |\n| **Unified Pipeline** | [#4](https://github.com/Siddhant-K-code/distill/issues/4) | Shipped | `POST /v1/pipeline` + `distill pipeline` CLI: dedup → compress → summarize in one call with per-stage stats. 
|\n| **Batch API** | [#11](https://github.com/Siddhant-K-code/distill/issues/11) | Shipped | `POST /v1/batch`: async job queue with worker pool, progress polling, 24h result retention. |\n| **Structured Logging** | [#27](https://github.com/Siddhant-K-code/distill/issues/27) | Shipped | `pkg/logging`: JSON/text slog logger with debug/info/warn/error levels, request_id and trace_id helpers. |\n| **Shell Completions** | [#26](https://github.com/Siddhant-K-code/distill/issues/26) | Shipped | `distill completion [bash\\|zsh\\|fish\\|powershell]` generates shell completion scripts. |\n| **Benchmark Suite** | [#24](https://github.com/Siddhant-K-code/distill/issues/24) | Shipped | `go test -bench=. ./...` covers cluster, MMR, selector, and compress with deterministic synthetic data. |\n| **Makefile** | [#28](https://github.com/Siddhant-K-code/distill/issues/28) | Shipped | 20+ targets: build, test, bench, lint, fmt, vet, docker, release. |\n| **Python SDK** | [#5](https://github.com/Siddhant-K-code/distill/issues/5) | Planned | `pip install distill-ai` with LangChain/LlamaIndex integrations. |\n| **OpenAPI Spec** | [#23](https://github.com/Siddhant-K-code/distill/issues/23) | Planned | Swagger UI at `/docs`, auto-generated client SDKs. |\n\nSee all open issues: [github.com/Siddhant-K-code/distill/issues](https://github.com/Siddhant-K-code/distill/issues)\n\n## Why not just use an LLM?\n\nLLMs are non-deterministic. Reliability requires deterministic preprocessing.\n\n| | LLM Compression | Distill |\n|---|---|---|\n| Latency | ~500ms | ~12ms |\n| Cost per call | $0.01+ | $0.0001 |\n| Deterministic | No | Yes |\n| Lossless | No | Yes |\n| Auditable | No | Yes |\n\nUse LLMs for reasoning. Use deterministic algorithms for reliability.\n\n## Integrations\n\nWorks with your existing AI stack:\n\n- **LLM Providers:** OpenAI, Anthropic (more via [#33](https://github.com/Siddhant-K-code/distill/issues/33))\n- **Frameworks:** LangChain, LlamaIndex (SDKs planned: [#5](https://github.com/Siddhant-K-code/distill/issues/5))\n- **Vector DBs:** Pinecone, Qdrant\n- **AI Assistants:** Claude Desktop, Cursor (via MCP)\n- **Observability:** Prometheus, Grafana, OpenTelemetry (Jaeger, Tempo)\n\n## FAQ\n\n\u003cdetails\u003e\n\u003csummary\u003eIs this just removing exact duplicates?\u003c/summary\u003e\n\u003cp\u003eNo. Exact dedup is trivial (hash comparison). Distill does \u003cem\u003esemantic\u003c/em\u003e dedup - it identifies chunks that convey the same information in different words. Two paragraphs explaining \"how JWT auth works\" with different wording will be clustered together, and only the best one is kept.\u003c/p\u003e\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eWhy agglomerative clustering instead of K-Means?\u003c/summary\u003e\n\u003cp\u003eK-Means requires specifying K upfront and assumes spherical clusters. Agglomerative clustering adapts to the data - it stops merging when the distance between the closest clusters exceeds the threshold. If your 20 chunks have 8 natural groups, you get 8 clusters. If they have 15, you get 15. No tuning required.\u003c/p\u003e\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eWhat does the threshold of 0.15 mean?\u003c/summary\u003e\n\u003cp\u003eCosine distance of 0.15 means cosine similarity of 0.85. Two chunks with 85%+ similarity are considered \"saying the same thing.\" For code, use 0.10 (stricter). 
For prose, use 0.20 (looser).\u003c/p\u003e\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eWhy cosine distance and not Euclidean?\u003c/summary\u003e\n\u003cp\u003eOpenAI embeddings (and most embedding models) are normalized to unit length. For unit vectors, cosine distance and Euclidean distance are monotonically related, but cosine is more interpretable: 0 = identical direction, 1 = orthogonal, 2 = opposite. The threshold of 0.15 means \"chunks whose embeddings point within ~32 degrees of each other.\"\u003c/p\u003e\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eHow does compression work without an LLM?\u003c/summary\u003e\n\u003cp\u003eThree rule-based strategies: (1) Extractive - scores sentences by position, length, and keyword signals, keeps the top ones. (2) Placeholder - detects JSON/XML/tables and replaces with structural summaries. (3) Pruner - removes filler phrases and intensifiers. No API calls needed.\u003c/p\u003e\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eHow does Distill work with LangChain?\u003c/summary\u003e\n\u003cp\u003eThree paths: (1) MCP - \u003ccode\u003edistill mcp\u003c/code\u003e exposes tools that become LangChain tools via \u003ca href=\"https://github.com/langchain-ai/langchain-mcp-adapters\"\u003elangchain-mcp-adapters\u003c/a\u003e. (2) HTTP API - call \u003ccode\u003ePOST /v1/dedupe\u003c/code\u003e as a post-processing step on retrieval results. (3) Python SDK (planned - \u003ca href=\"https://github.com/Siddhant-K-code/distill/issues/5\"\u003e#5\u003c/a\u003e) - a \u003ccode\u003eDistillRetriever\u003c/code\u003e that wraps any LangChain retriever.\u003c/p\u003e\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eHow is this different from LangChain's built-in MMR?\u003c/summary\u003e\n\u003cp\u003eLangChain's \u003ccode\u003esearch_type=\"mmr\"\u003c/code\u003e is a single re-ranking step at the vector DB level. Distill runs a multi-stage pipeline: cache, agglomerative clustering, representative selection, compression, then MMR. The clustering step understands group structure, not just pairwise similarity.\u003c/p\u003e\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eWhat's the time complexity?\u003c/summary\u003e\n\u003cp\u003eDistance matrix is O(N² x D) where N = chunks and D = embedding dimension. The merge loop is O(N³) worst case. For typical RAG inputs (N=20-50, D=1536), the full pipeline completes in ~12ms.\u003c/p\u003e\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eWhy not just increase the context window?\u003c/summary\u003e\n\u003cp\u003eLarger context windows don't solve redundancy. If you stuff 50 chunks into a 128K window and 20 say the same thing, the model still processes all of them. This wastes tokens, increases latency, and can confuse the model. Distill ensures the model sees unique, diverse chunks instead of overlapping ones.\u003c/p\u003e\n\u003c/details\u003e\n\nSee [FAQ.md](FAQ.md) for the full list.\n\n## Contributing\n\nContributions welcome! 
Check the [open issues](https://github.com/Siddhant-K-code/distill/issues) for things to work on.\n\n```bash\ngit clone https://github.com/Siddhant-K-code/distill.git\ncd distill\ngo build -o distill .\ngo test ./...\n```\n\n## License\n\nMIT - see [LICENSE](LICENSE)\n\nFor commercial licensing, contact: siddhantkhare2694@gmail.com\n\n## Links\n\n- [Website](https://distill.siddhantkhare.com)\n- [Playground](https://distill.siddhantkhare.com/playground)\n- [The Agentic Engineering Guide](https://agents.siddhantkhare.com) - the book behind the concepts Distill implements\n- [FAQ](FAQ.md)\n- [Blog Post](https://dev.to/siddhantkcode/the-engineering-guide-to-context-window-efficiency-202b)\n- [MCP Configuration](mcp/README.md)\n- [Book a Demo](https://meet.siddhantkhare.com)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsiddhant-k-code%2Fdistill","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsiddhant-k-code%2Fdistill","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsiddhant-k-code%2Fdistill/lists"}