{"id":49499604,"url":"https://github.com/kvsankar/dryscope","last_synced_at":"2026-05-01T12:01:39.504Z","repository":{"id":354990018,"uuid":"1176667627","full_name":"kvsankar/dryscope","owner":"kvsankar","description":"Don't Repeat Yourself Scope - code and docs duplicate finder","archived":false,"fork":false,"pushed_at":"2026-05-01T10:26:47.000Z","size":2852,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-05-01T11:04:30.223Z","etag":null,"topics":["code-quality","dry"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kvsankar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"docs/roadmap.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-03-09T08:54:50.000Z","updated_at":"2026-05-01T10:28:35.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/kvsankar/dryscope","commit_stats":null,"previous_names":["kvsankar/dryscope"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/kvsankar/dryscope","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kvsankar%2Fdryscope","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kvsankar%2Fdryscope/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kvsankar%2Fdryscope/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kvsankar%2Fdry
scope/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kvsankar","download_url":"https://codeload.github.com/kvsankar/dryscope/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kvsankar%2Fdryscope/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32495949,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-30T13:12:12.517Z","status":"online","status_checked_at":"2026-05-01T02:00:05.856Z","response_time":64,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["code-quality","dry"],"created_at":"2026-05-01T12:01:09.134Z","updated_at":"2026-05-01T12:01:39.478Z","avatar_url":"https://github.com/kvsankar.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# dryscope\n\n[![CI](https://github.com/kvsankar/dryscope/actions/workflows/ci.yml/badge.svg)](https://github.com/kvsankar/dryscope/actions/workflows/ci.yml)\n[![PyPI](https://img.shields.io/pypi/v/dryscope.svg?cacheSeconds=300)](https://pypi.org/project/dryscope/)\n[![Python](https://img.shields.io/badge/python-%3E%3D3.10-blue.svg)](https://pypi.org/project/dryscope/)\n[![License: 
MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](./LICENSE)\n[![pytest](https://img.shields.io/badge/tests-pytest-0A9EDC.svg)](https://docs.pytest.org/)\n[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen.svg)](https://pre-commit.com/)\n[![Ruff](https://img.shields.io/badge/lint-ruff-46a2f1.svg)](https://docs.astral.sh/ruff/)\n[![ty](https://img.shields.io/badge/types-ty-222222.svg)](https://docs.astral.sh/ty/)\n[![Xenon](https://img.shields.io/badge/complexity-xenon-blue.svg)](https://xenon.readthedocs.io/)\n[![uv](https://img.shields.io/badge/packaging-uv-6f42c1.svg)](https://docs.astral.sh/uv/)\n\n`dryscope` helps you find the parts of a large repository that are actually\nworth reading before you ask an AI agent, stronger model, or human reviewer to\nclean it up.\n\nIt scans code and docs for repeated implementation shapes, copy-pasted helpers,\noverlapping document sections, and scattered documentation topics. The output is\na ranked shortlist of files and sections that deserve attention first, not a\nclaim that every match should be refactored.\n\n`dryscope` is a narrowing tool:\n- for code, **Code Match** (`code-match`) surfaces structural duplicate candidates and **Code Review** (`code-review`) filters them down to a shortlist\n- for docs, it has three named tracks:\n  - **Docs Map** (`docs-map`): profiles documents, discovers canonical labels, builds a topic/facet view, and suggests multi-document consolidation clusters\n  - **Section Match** (`docs-section-match`): compares heading-based sections and ranks concrete section-level consolidation/link recommendations\n  - **Doc Pair Review** (`docs-pair-review`): uses an LLM to review selected related document pairs\n\n![dryscope process diagram](./docs/images/dryscope-process.png)\n\n## Motivation\n\nLarge-repo cleanup usually starts with a vague but painful question: \"where is\nthe duplication, and what should I look at first?\" That is hard to answer by\nsearching for 
keywords or asking an agent to inspect the whole repository.\n\nDevelopers run into a few common problems:\n\n- an agent burns context reading unrelated files before it finds the repeated pattern\n- a refactor starts from one obvious copy-paste case and misses nearby structural clones\n- duplicate-code tools report a wall of boilerplate, generated code, and harmless conventions\n- agent-driven development creates requirements, design, research, planning, and status docs spread across the repository\n- documentation may overlap in intent even when it does not repeat the same text\n- reviewers know something is repeated, but not which files form the smallest useful work batch\n\n### Code Match Motivation\n\nCoding agents are only as good as the context they can see. If an agent does\nnot notice an existing helper, service method, parser, validation path, or UI\nbranch, it may solve the same local problem again somewhere else. Fast\nagent-generated or vibe-coded projects are especially prone to this: the code\nworks, but similar logic accumulates across commands, endpoints, components,\ntests, and migration scripts.\n\nThat redundancy is a code health problem, not just an aesthetic one. DRY is\nabout keeping one reason to change in one place. When the same behavior exists\nin several functions or classes, a bug fix may land in only one copy, edge-case\nhandling can drift, and future agents may copy the wrong version because there\nis no obvious canonical implementation.\n\nCode Match is built to find those candidates before a cleanup pass starts. It\nuses tree-sitter to parse supported languages into code units such as functions,\nclasses, methods, Java constructors, and JavaScript/TypeScript function-valued\ndeclarations. It normalizes each unit by removing comments/docstrings and\nreplacing identifiers and literals with placeholders, then embeds the normalized\ncode and ranks similar units with hybrid semantic and token similarity. 
The\nresult is a shortlist of exact, near-identical, and structural duplicate\ncandidates that a human developer, coding agent, or optional Code Review pass\ncan inspect before refactoring.\n\n### Docs Map Motivation\n\nOverlapping documentation is especially common with spec-driven work using\ncoding agents. A repo can accumulate many documents that all address similar\nrequirements, designs, or research questions from different angles. When those\ndocuments are handed back to an agent as context, the model gets a large,\nunfocused pile of partially overlapping intent instead of a clear source of\ntruth.\n\nDocs Map is built for that problem. It profiles documents, normalizes\naboutness and reader-intent labels, builds a topic/facet view of the corpus,\nand suggests document groups that a human developer or agent can consolidate.\nThe goal is to make documentation context smaller, sharper, and easier to trust\nbefore it becomes input to more agent work.\n\n`dryscope` makes that first pass cheaper. It parses code into comparable units,\nchunks docs by section, ranks similar items, and can run an LLM-backed review\npass to separate likely `refactor`, `review`, and `noise` findings. 
The result\nis a concrete starting point: a smaller set of files, sections, and reasons that\nyou can hand to an agent or reviewer before spending expensive attention on the\nfull repository.\n\n### Docs Map Taxonomy Example\n\nSuppose a repo has agent-written docs like:\n\n| Document | Raw signals |\n| --- | --- |\n| `docs/search-requirements.md` | product requirements, search filters, ranking expectations |\n| `docs/search-design.md` | search architecture, indexing pipeline, query API |\n| `research/vector-search.md` | embeddings, retrieval quality, ranking experiments |\n| `plans/search-rollout.md` | rollout checklist, implementation status, open risks |\n\nDocs Map turns those local descriptions into a corpus-level taxonomy:\n\n| Taxonomy area | Example output |\n| --- | --- |\n| Canonical aboutness labels | `search experience`, `indexing pipeline`, `ranking quality`, `vector retrieval` |\n| Reader intents | `define requirements`, `explain architecture`, `compare approaches`, `track rollout` |\n| Facets | `doc_role: requirements/design/research/plan`, `lifecycle: current/draft`, `audience: maintainer/agent` |\n| Topic tree | `Search` -\u003e `Query behavior`, `Indexing`, `Ranking`, `Rollout` |\n| Consolidation cluster | group the requirements, design, research, and rollout docs around `search experience` when they should share one source of truth or cross-reference each other |\n\nThat taxonomy is what lets dryscope report \"these docs overlap in purpose\" even\nwhen the same paragraphs were not copied between files.\n\n### Section Match Example\n\nDocs Map works at the document and corpus level. Section Match works at the\nsection level.\n\nTwo documents can be different in purpose but still repeat the same supporting\nmaterial. For example, a requirements spec and a corresponding design doc should\nnot be merged just because they belong to the same feature. 
But both might\ncontain a `Configuration` section that explains the same environment variables,\n`.rc` files, feature flags, or deployment settings.\n\n| Document | Document-level purpose | Repeated section-level content |\n| --- | --- | --- |\n| `docs/search-requirements.md` | define user-visible behavior and constraints | `Configuration`: required environment variables and defaults |\n| `docs/search-design.md` | explain architecture, components, and data flow | `Configuration`: same environment variables, rc file, and feature flags |\n\nSection Match is built to find that microscopic redundancy. It points to\nspecific repeated sections where one shared reference, one canonical\nconfiguration page, or a cross-link would reduce drift. It does not imply the\nwhole documents have the same purpose.\n\n### How The Tracks Fit Together\n\n| Track | Motivation |\n| --- | --- |\n| Code Match | Find repeated implementation shapes before a refactor starts, especially in agent-generated code where similar logic may be recreated in different files. |\n| Code Review | Filter Code Match output so framework boilerplate, coincidental structure, and low-payoff matches do not consume expensive human or model attention. |\n| Docs Map | Find document-level intent overlap across requirements, designs, research, plans, and status docs so a repo can recover clearer sources of truth. |\n| Section Match | Find repeated section-level material inside otherwise distinct documents, such as duplicated configuration or deployment explanations. |\n| Doc Pair Review | Add deeper judgment for selected related document pairs when the relationship is not obvious from taxonomy or section similarity alone. |\n| Docs Report Pack | Package the docs findings as Markdown, HTML, and JSON so humans can review them and agents can consume the same narrowed context. 
|\n\n## Features\n\n- **Code Match** — Python, Go, Java, JavaScript, JSX, TypeScript, and TSX duplicate-code candidates via tree-sitter + embeddings\n- **Code Review** — optional LLM/policy pass that classifies Code Match findings as `refactor`, `review`, or `noise`\n- **Docs Map** — LLM document descriptors, canonical label taxonomy, topic tree, facets, diagnostics, and consolidation clusters\n- **Section Match** — Markdown, MDX, RST, AsciiDoc, and plaintext section-level redundancy via heading chunks and embedding similarity\n- **Doc Pair Review** — optional LLM analysis of selected related document pairs\n- **Docs Report Pack** — HTML/Markdown/JSON docs reports with numbered collapsible sections and the same structure across formats\n- **Saved report cleanup** — prune old `.dryscope/runs` outputs by count or date, dry-run first by default\n- **Hybrid similarity** — 70% embedding cosine + 30% token Jaccard with size-ratio filtering\n- **Deterministic escalation policy** — keeps `review` findings plus higher-value `refactor` findings for expensive follow-up\n- **Project profiles** — auto-detects Django and pytest-factories, applies smart exclusions\n- **Agent skills** — install as both a Claude Code and Codex skill\n- **Unified JSON output** — structured `findings[]` schema for agent consumption\n\n## Positioning\n\n`dryscope` is best used as:\n- a first-pass scanner before repository-wide refactors\n- a repo narrowing tool before handing work to an agent or stronger model\n- a Code Match candidate generator for structural refactor opportunities\n- a Docs Map and Section Match aid for answering \"how should these docs be organized?\"\n- a prefilter that helps decide what a deeper reviewer should read first\n\nIt is not positioned as:\n- a general-purpose lint replacement\n- a universal duplicate-code product for every developer workflow\n- a perfect semantic clone detector\n- a final refactor oracle\n- a complete replacement for human or stronger-model 
judgment\n\nThe strongest use case is not \"find every duplicate.\" It is \"before I ask an\nagent to clean this up, show me the small set of likely duplicate code and docs\nconsolidation targets worth spending attention on.\"\n\n## Installation\n\n`dryscope` is published on PyPI: \u003chttps://pypi.org/project/dryscope/\u003e.\n\nFor one-off CLI runs without a persistent install, use `uvx` or `pipx run`:\n\n```bash\nuvx dryscope --help\nuvx dryscope scan .\n```\n\n```bash\npipx run dryscope --help\n```\n\nFor a persistent isolated tool install, use either `uv tool install` or `pipx`:\n\n```bash\nuv tool install dryscope\ndryscope --help\n```\n\n```bash\npipx install dryscope\ndryscope --help\n```\n\nFor a project virtual environment, use `pip` or `uv pip`:\n\n```bash\npython -m venv .venv\nsource .venv/bin/activate\npython -m pip install dryscope\ndryscope --help\n```\n\n```bash\nuv venv\nuv pip install dryscope\nuv run dryscope --help\n```\n\nThere is no separate `uv pipx` command. The uv equivalents are `uvx` for\none-off tool runs and `uv tool install` for persistent tool installs.\n\nThe default install supports API embedding models through LiteLLM. Set the\nprovider API key for your embedding model, such as `OPENAI_API_KEY` for\n`text-embedding-3-small`. Local sentence-transformer embeddings are optional\nbecause they pull in PyTorch. 
Install the extra only when you need local\nembeddings:\n\n```bash\nuv tool install \"dryscope[local-embeddings]\"\npipx install \"dryscope[local-embeddings]\"\npython -m pip install \"dryscope[local-embeddings]\"\n```\n\nFor repository development from a source checkout:\n\n```bash\nuv sync --extra dev\nuv run dryscope --help\n```\n\n## Development Quality Gates\n\nFor repository development, install the dev extra and enable the checked-in\npre-commit hooks:\n\n```bash\nuv sync --extra dev\nuv run pre-commit install\nuv run pre-commit run --all-files\n```\n\nThe default commit hooks are intentionally fast and low-noise:\n\n- standard file checks for Python syntax, JSON/TOML/YAML, merge conflicts,\n  large files, private keys, case conflicts, executable/shebang consistency,\n  broken symlinks, debug statements, trailing whitespace, final newlines, and\n  LF line endings\n- `uv lock --check` when `pyproject.toml` or `uv.lock` changes\n- `ruff check --fix` for lint, import sorting, pyupgrade-style rewrites,\n  bugbear checks, comprehensions, and McCabe complexity rule coverage\n- `ruff format` for Python formatting\n\nStricter gates are available as manual hooks while the current baseline is\nbeing tightened:\n\n```bash\nuv run pre-commit run ty-check --hook-stage manual --all-files\nuv run pre-commit run xenon-complexity --hook-stage manual --all-files\n```\n\n`ty-check` runs static type analysis over `dryscope`, `tests`, and\n`benchmarks`. `xenon-complexity` reports cyclomatic-complexity hot spots and is\nconfigured as a ratchet for functions, modules, and repository average\ncomplexity. 
These manual checks are useful before larger refactors even when\nthey are not yet suitable as default commit blockers.\n\n## Quick Start\n\n```bash\n# Progressive help\ndryscope --help\ndryscope help output\ndryscope help json\ndryscope scan --help\n\n# Code Match (default)\ndryscope scan /path/to/project\n\n# Section Match (docs default)\ndryscope scan /path/to/docs --docs\n\n# Code Match with local embeddings, after installing dryscope[local-embeddings]\ndryscope scan /path/to/project --embedding-model all-MiniLM-L6-v2\n\n# Docs Report Pack: Docs Map + Section Match + Doc Pair Review\ndryscope scan /path/to/docs --docs --stage docs-report-pack --backend cli -f html\n\n# Code Match + Section Match\ndryscope scan /path/to/project --code --docs\n\n# Code Match JSON output for agents\ndryscope scan /path/to/project -f json\n\n# Code Match filtered by language\ndryscope scan /path/to/project --lang python\n\n# Code Review\ndryscope scan /path/to/project --verify\n\n# Bounded Code Review for large duplicate-rich repos\ndryscope scan /path/to/project --verify --max-findings 15\n\n# Stricter Code Match threshold and token filter\ndryscope scan /path/to/project -t 0.95 --min-tokens 15\n```\n\n## Real-World Examples\n\nPublic examples from recent validation passes:\n\n- `kvsankar/sattosat`\n  - code scan produced a 2-item shortlist\n  - one clear refactor candidate survived: duplicated TLE epoch parsing logic across two scripts and one library module\n  - docs scan produced 0 recommendations\n\n- `stellar/stellar-docs`\n  - docs scan found real overlap in repeated sequence-diagram flows\n  - grouped Section Match output reduced noisy pairwise suggestions into a compact 4-item shortlist\n\n- `gethomepage/homepage`\n  - docs scan found 0 overlap pairs\n  - with the old large-repo guard enabled, the pipeline exited early instead of spending LLM work on a large negative repo\n\nRecent AI-generated / agent-oriented public repo checks show the code path doing\nthe intended 
narrowing job:\n\n| Repo | Structural candidates | Verified shortlist from top 15 |\n| --- | ---: | ---: |\n| `CLI-Anything-WEB` | 94 | 5 |\n| `nanowave` | 82 | 10 |\n| `ClaudeCode_generated_app` | 51 | 6 |\n| `VibesOS` | 23 | 4 |\n\nThese are candidate shortlists, not precision/recall claims. The benchmark pack\nkeeps reviewed labels for selected findings, including real refactor candidates\nand at least one false-positive regression case.\n\nFor docs-heavy repositories, the current docs report is organized around named docs tracks:\n\n1. **Docs Map** (`docs-map`): document descriptors -\u003e canonical label normalization -\u003e topic tree/facets -\u003e docs map clusters.\n2. **Section Match** (`docs-section-match`): document sections -\u003e embeddings -\u003e matched section pairs -\u003e section match recommendations.\n3. **Doc Pair Review** (`docs-pair-review`): selected related document pairs -\u003e LLM relationship/action review.\n\n## Configuration\n\nGenerate a default config file:\n\n```bash\ndryscope init\n```\n\nThis creates `.dryscope.toml`:\n\n```toml\n[code]\nmin_lines = 6\nmin_tokens = 0\nmax_cluster_size = 15\nthreshold = 0.90\nembedding_model = \"text-embedding-3-small\"\nescalate_refactor_min_lines = 40\nescalate_refactor_min_actionability = 2.0\nescalate_refactor_min_units = 3\nkeep_same_file_refactors = false\n# exclude = [\"**/test_*.py\"]\n# exclude_type = [\"BaseModel\"]\n\n[docs]\ninclude = [\"*.md\", \"*.mdx\", \"*.rst\", \"*.txt\", \"*.adoc\"]\nexclude = [\"node_modules\", \"venv\", \".git\", \".dryscope\", \"*.lock\"]\nthreshold_similarity = 0.9\nthreshold_intent = 0.8\nmin_content_words = 15\ninclude_intra = false\ntoken_weight = 0.3\n# Same embedding backend choices as [code].\nembedding_model = \"text-embedding-3-small\"\nintent_max_docs = 0\nllm_max_doc_pairs = 250\nintent_skip_without_similarity_min_docs = 0\n\n[docs.map]\n# Generic seed dimensions shown to the LLM. 
These are suggestions, not a\n# product-specific taxonomy; dryscope still infers the corpus topic tree.\nfacet_dimensions = [\"doc_role\", \"audience\", \"lifecycle\", \"content_type\", \"surface\", \"canonicality\"]\n\n[docs.map.facet_values]\ndoc_role = [\"guide\", \"reference\", \"tutorial\", \"spec\", \"plan\", \"status\", \"research\", \"changelog\", \"architecture\", \"decision\", \"overview\", \"troubleshooting\"]\naudience = [\"user\", \"contributor\", \"maintainer\", \"operator\", \"internal\", \"agent\"]\nlifecycle = [\"current\", \"proposed\", \"historical\", \"deprecated\", \"draft\", \"unknown\"]\ncontent_type = [\"concept\", \"workflow\", \"api\", \"troubleshooting\", \"decision\", \"benchmark\", \"example\", \"architecture\", \"requirements\"]\nsurface = [\"public\", \"internal\", \"generated\", \"extension\", \"package\", \"integration\"]\ncanonicality = [\"primary\", \"supporting\", \"archive\", \"duplicate\", \"index\", \"unknown\"]\n\n[llm]\nmodel = \"claude-haiku-4-5-20251001\"\nbackend = \"cli\"       # \"cli\" (claude -p), \"codex-cli\", \"litellm\" (provider API keys), or \"ollama\" (local Ollama)\nmax_cost = 5.00\nconcurrency = 8\n# ollama_host = \"http://localhost:11434\"\n# cli_strip_api_key = true\n# cli_permission_mode = \"bypassPermissions\"\n# cli_dangerously_skip_permissions = false\n\n[cache]\nenabled = true\npath = \"~/.cache/dryscope/cache.db\"\n```\n\nConfiguration layers: defaults → `.dryscope.toml` → CLI flags.\n\n## LLM Backend Configuration\n\n`dryscope` supports four verification backends:\n\n- `cli`\n  - shells out to `claude -p`\n  - good when you use Claude CLI with OAuth/session auth\n- `codex-cli`\n  - shells out to `codex exec`\n  - good when you use Codex CLI directly\n- `litellm`\n  - uses provider APIs through LiteLLM\n  - good for OpenAI, Anthropic, Gemini, Azure OpenAI, Bedrock, OpenRouter, and other LiteLLM-supported providers\n- `ollama`\n  - uses the local Ollama HTTP API\n  - good for local/private verification 
without a cloud provider\n\n### Claude CLI\n\n```toml\n[llm]\nbackend = \"cli\"\nmodel = \"claude-haiku-4-5-20251001\"\n# cli_strip_api_key = true\n# cli_permission_mode = \"bypassPermissions\"\n# cli_dangerously_skip_permissions = false\n```\n\n```bash\ndryscope scan /path/to/project --verify --backend cli --llm-model claude-haiku-4-5-20251001\n```\n\n### Codex CLI\n\n```toml\n[llm]\nbackend = \"codex-cli\"\n# Use the Codex default model, or set one your Codex auth supports.\nmodel = \"gpt-5.4\"\n```\n\n```bash\ndryscope scan /path/to/project --verify --backend codex-cli --llm-model gpt-5.4\n```\n\n`codex-cli` shells out to `codex exec`. On this machine, explicit mini models like\n`gpt-4o-mini` were rejected under ChatGPT-account Codex auth, while the default\nCodex model worked. If you want mini models through Codex CLI, use API-key login\nwith `codex login --with-api-key` if your account supports them.\n\n### LiteLLM Providers\n\nUse `litellm` when you want hosted provider APIs.\n\nOpenAI example:\n\n```toml\n[llm]\nbackend = \"litellm\"\nmodel = \"gpt-4o\"\n```\n\n```bash\nOPENAI_API_KEY=... dryscope scan /path/to/project --verify --backend litellm --llm-model gpt-4o\n```\n\nAnthropic example:\n\n```toml\n[llm]\nbackend = \"litellm\"\nmodel = \"claude-3-5-sonnet-latest\"\n```\n\n```bash\nANTHROPIC_API_KEY=... 
dryscope scan /path/to/project --verify --backend litellm --llm-model claude-3-5-sonnet-latest\n```\n\n### Ollama\n\n```toml\n[llm]\nbackend = \"ollama\"\nmodel = \"qwen2.5:3b\"\n# ollama_host = \"http://localhost:11434\"\n```\n\n```bash\ndryscope scan /path/to/project --verify --backend ollama --llm-model qwen2.5:3b\n```\n\n## Agent Skills\n\n```bash\ndryscope install    # Install as both Claude Code and Codex skills\ndryscope uninstall  # Remove the skill\n```\n\n`dryscope install` creates a shared skill venv under\n`$XDG_DATA_HOME/dryscope/skill-venv` or `~/.local/share/dryscope/skill-venv`,\nthen renders `SKILL.md` into both `~/.claude/skills/dryscope` and\n`~/.codex/skills/dryscope`.\n\n## CLI Reference\n\n```\ndryscope scan \u003cpath\u003e [OPTIONS]\n```\n\n| Option | Default | Description |\n|--------|---------|-------------|\n| `--code / --no-code` | `--code` | Run Code Match |\n| `--docs / --no-docs` | off | Run docs tracks |\n| `--lang` | all | Filter: `python`, `go`, `java`, `js`, `jsx`, `ts`, `tsx` |\n| `-t, --threshold` | `0.90` | Similarity threshold (0.0-1.0) |\n| `-f, --format` | `terminal` | Output: `terminal`, `json`, `markdown`, `html` |\n| `-m, --min-lines` | `6` | Minimum lines per code unit |\n| `--min-tokens` | `0` | Minimum unique normalized tokens |\n| `--max-cluster-size` | `15` | Drop clusters larger than this |\n| `--max-findings` | | Limit Code Match/Code Review to the top N code findings |\n| `-e, --exclude` | | Glob patterns to exclude; applies to Code Match and docs tracks |\n| `--exclude-type` | | Base class types to exclude (code) |\n| `--embedding-model` | `text-embedding-3-small` | Embedding model; API models use LiteLLM, local sentence-transformers such as `all-MiniLM-L6-v2` require the `dryscope[local-embeddings]` extra |\n| `--verify` | off | Run Code Review for code; run full docs tracks for docs |\n| `--llm-model` | `claude-haiku-4-5-20251001` | LLM model for Code Review and Doc Pair Review |\n| `--stage` | 
`docs-section-match` | Docs stage: `docs-section-match` runs Section Match; `docs-report-pack` adds Docs Map and Doc Pair Review |\n| `--resume` | off | Resume from latest docs run |\n| `--intra` | off | Include intra-document overlap (docs) |\n| `--threshold-intent` | `0.8` | Docs Map topic-pair threshold |\n| `--llm-max-doc-pairs` | config | Maximum document pairs for Doc Pair Review |\n| `--concurrency` | config | Max parallel LLM calls for docs full stage |\n| `--backend` | config | LLM backend: `cli`, `codex-cli`, `litellm`, or `ollama` |\n\nReport cleanup:\n\n| Command | Description |\n|---------|-------------|\n| `dryscope reports clean \u003cpath\u003e --keep-last N` | Keep the newest N saved report runs |\n| `dryscope reports clean \u003cpath\u003e --keep-since YYYY-MM-DD` | Keep runs on or after a calendar date |\n| `dryscope reports clean \u003cpath\u003e --keep-since YYYY-MM` | Keep runs on or after the first day of a month |\n| `dryscope reports clean \u003cpath\u003e --keep-days N` | Keep runs from the last N days |\n| `--force` | Actually delete runs; without this, cleanup is preview-only |\n\n```\ndryscope init         # Generate .dryscope.toml\ndryscope install      # Install Claude Code and Codex skills\ndryscope uninstall    # Remove Claude Code and Codex skills\ndryscope cache stats  # Show cache statistics\ndryscope cache clear  # Clear the cache\ndryscope reports clean /path/to/project --keep-last 5          # Preview deleting older saved runs\ndryscope reports clean /path/to/project --keep-days 30 --force # Delete runs older than 30 days\n```\n\n### Saved Report Cleanup\n\nDocs scans are saved under `.dryscope/runs/\u003crun-id\u003e/` with `report.md`,\n`report.html`, `report.json`, and resumable stage artifacts. 
Cleanup is dry-run\nby default:\n\n```bash\n# Keep the newest 10 runs; preview only\ndryscope reports clean /path/to/project --keep-last 10\n\n# Keep reports from April 2026 onward; preview only\ndryscope reports clean /path/to/project --keep-since 2026-04-01\n\n# Keep reports from the last 30 days and actually delete older runs\ndryscope reports clean /path/to/project --keep-days 30 --force\n```\n\nWhen multiple keep rules are supplied, dryscope keeps the union. For example,\n`--keep-last 5 --keep-days 30` preserves the newest five runs plus any run from\nthe last 30 days. After deletion, `.dryscope/latest` is repointed to the newest\nremaining run.\n\n### Report Format Structure\n\n`report.md`, `report.html`, and `report.json` use the same top-level section\norder: Run Overview, Docs Map, Docs Map Clusters, Section Match, optional Doc\nPair Review, Docs Map Taxonomy, and Methodology.\n\nFor machine-readable output contracts, see\n[JSON output](./docs/json-output.md).\n\nAt the top level, JSON keeps only run metadata, a compact summary, and the\nordered `report_structure`; detailed payloads live under their owning sections.\n\nEach detailed list is owned by one section. For example, topic documents live\ninside Docs Map, consolidation documents live inside Docs Map Clusters, and\ncanonical labels/aliases live inside Docs Map Taxonomy. The report avoids\n\"sample first, full list later\" output; long lists\nare collapsible in Markdown/HTML and nested under the corresponding section in\nJSON.\n\nCode findings use file paths relative to the scan root. That keeps JSON output\nstable across clone locations and prevents external artifact directories from\naffecting Code Review context.\n\n## How It Works\n\n### Code Pipeline\n1. **Parse** — tree-sitter extracts functions, classes, and methods\n2. **Normalize** — identifiers/literals replaced with placeholders; comments stripped\n3. **Embed** — API embeddings through LiteLLM or local sentence-transformers embeddings\n4. 
**Compare** — hybrid similarity (70% cosine + 30% token Jaccard) with size-ratio filtering\n5. **Cluster** — Union-Find groups similar pairs, scored by actionability\n6. **Code Review** _(optional)_ — LLM classifies each cluster as `refactor`, `review`, or `noise`\n7. **Escalate** _(with `--verify`)_ — deterministic policy keeps all `review` findings and only higher-value `refactor` findings\n\n### Docs Pipeline\n1. **Chunk** — split documents into heading-based sections\n2. **Embed** — API embeddings through LiteLLM or local sentence-transformers embeddings\n3. **Section Match** — hybrid similarity finds cross-document section overlap\n4. **Docs Map descriptors** _(full stage)_ — LLM profiles each document with title, summary, aboutness labels, reader intents, role, audience, lifecycle, content type, surface, and canonicality\n5. **Docs Map taxonomy** _(full stage)_ — deterministic matching plus optional LLM canonicalization turns raw aboutness/intent labels into a corpus-level canonical label taxonomy\n6. **Docs Map discovery** _(full stage)_ — LLM builds a candidate topic tree, facets, diagnostics, and consolidation clusters\n7. **Match intent pairs** _(full stage)_ — canonical labels are embedded to find related document pairs for optional deeper pair analysis\n8. **Doc Pair Review** _(full stage)_ — LLM classifies selected related document pairs with action recommendations when within cost limits\n9. 
**Docs Report Pack** — markdown, HTML, and JSON share the same top-down structure: run overview, Docs Map, Docs Map Clusters, Section Match, optional Doc Pair Review, and Docs Map Taxonomy\n\n## What Good Output Looks Like\n\nFor code:\n- a small shortlist of `refactor` and `review` findings\n- exact or near-exact helpers extracted across files\n- borderline same-file or low-payoff duplicates left as `review`\n\nFor docs:\n- a Docs Map section showing topic groups, facets, and diagnostics\n- Docs Map clusters from canonical labels shared by multiple documents\n- Section Match recommendations only when section-level overlap exists\n- 0 Section Match recommendations on clean negative repos, while Docs Map may still report organizational signals\n- a few grouped Section Match recommendations on docs-heavy repos\n- one family recommendation for many near-identical sibling docs, rather than many pairwise duplicates\n\n## Documentation\n\n- [Architecture](./docs/architecture.md) — how the code, docs, reporting, cache, and CLI pieces fit together\n- [Analysis](./docs/analysis.md) — positioning, alternatives, benchmark notes, and product-readiness context\n- [Process image brief](./docs/dryscope-process-image.md) — single-file brief for generating the dryscope engineering process diagram\n- [JSON output](./docs/json-output.md) — machine-readable output contracts for agents and scripts\n- [Roadmap](./docs/roadmap.md) — forward-looking planning notes kept separate from architecture\n- [Synthetic examples](./docs/synthetic-examples.md) — small exposition-only examples for similarity, Code Match, Docs Map, and Section Match\n- [Benchmark pack](./benchmarks/README.md) — public benchmark harness, artifact locations, and refresh commands\n- [Benchmark quality report](./benchmarks/quality_report.md) — readable TP/FP/FN summary generated from public labels\n- [Agent guidance](./AGENTS.md) and [Claude guidance](./CLAUDE.md) — repository-specific instructions for coding agents\n- 
[Packaged dryscope skill](./dryscope/skill/SKILL.md) — skill instructions installed for agent workflows\n\n## Benchmarking\n\n`dryscope` includes a checked-in public benchmark pack under [benchmarks/README.md](./benchmarks/README.md).\n\nIt only references public repositories and reviewed public labels. Private repo evaluation should remain local and out of the checked-in benchmark files.\n\nThe current benchmark evidence supports public alpha positioning: dryscope can\nfind and narrow repeated implementation shapes in AI-generated or\nagent-oriented repositories. The labels are still intentionally sparse, so the\nbenchmark pack should be read as regression evidence for the narrowing workflow,\nnot as a precision/recall claim.\n\nFor quality assessment, run:\n\n```bash\nuv run python benchmarks/run_quality_report.py\n```\n\nThat report scores generated benchmark outputs against curated public labels\nusing TP/FP/FN, labeled precision, curated recall, F1, and precision@K/recall@K.\nTrue negatives are intentionally omitted because the non-duplicate search space\nis too large to enumerate.\n\nBenchmark clones and generated outputs are kept outside the repo under\n`${DRYSCOPE_BENCHMARK_ROOT:-~/.dryscope/benchmarks}` by default. Result and\nreport directories are run-specific, and filenames/metadata identify benchmark\ninputs as `\u003crepo\u003e@\u003ccommit\u003e`. The runners refuse to reuse a non-empty result\ndirectory unless `--overwrite` is passed.\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkvsankar%2Fdryscope","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkvsankar%2Fdryscope","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkvsankar%2Fdryscope/lists"}