{"id":51182491,"url":"https://github.com/Rafaelpta/dupehound","last_synced_at":"2026-06-29T19:00:48.993Z","repository":{"id":364076040,"uuid":"1266290680","full_name":"Rafaelpta/dupehound","owner":"Rafaelpta","description":"Finds the code your AI wrote twice. Fast, offline duplicate-code detector: scan, history chart, CI gate. No AI required.","archived":false,"fork":false,"pushed_at":"2026-06-19T10:03:01.000Z","size":524,"stargazers_count":43,"open_issues_count":8,"forks_count":8,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-06-19T12:06:54.316Z","etag":null,"topics":["ai","cli","code-quality","developer-tools","duplicate-detection","rust","static-analysis","tree-sitter"],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Rafaelpta.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-11T13:36:17.000Z","updated_at":"2026-06-18T22:37:13.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/Rafaelpta/dupehound","commit_stats":null,"previous_names":["rafaelpta/dupehound"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/Rafaelpta/dupehound","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rafaelpta%2Fdupehound","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rafaelpta%2Fdupehound/tags","releases_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/Rafaelpta%2Fdupehound/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rafaelpta%2Fdupehound/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Rafaelpta","download_url":"https://codeload.github.com/Rafaelpta/dupehound/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rafaelpta%2Fdupehound/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34939227,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-29T02:00:05.398Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","cli","code-quality","developer-tools","duplicate-detection","rust","static-analysis","tree-sitter"],"created_at":"2026-06-27T08:00:25.431Z","updated_at":"2026-06-29T19:00:48.783Z","avatar_url":"https://github.com/Rafaelpta.png","language":"Rust","funding_links":[],"categories":["Development tools"],"sub_categories":["Static analysis"],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/hound.png\" alt=\"dupehound\" width=\"200\"\u003e\n\u003c/p\u003e\n\n\u003ch1 align=\"center\"\u003edupehound\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003eFinds functions duplicated by AI, even after every identifier is renamed.\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca 
href=\"./LICENSE\"\u003e\u003cimg alt=\"Open Source\" src=\"https://img.shields.io/badge/open%20source-100%25-success\"\u003e\u003c/a\u003e\n  \u003cimg alt=\"Platform\" src=\"https://img.shields.io/badge/platform-Windows%20%7C%20macOS%20%7C%20Linux-blue\"\u003e\n  \u003ca href=\"https://github.com/Rafaelpta/dupehound/actions/workflows/ci.yml\"\u003e\u003cimg alt=\"CI\" src=\"https://github.com/Rafaelpta/dupehound/actions/workflows/ci.yml/badge.svg\"\u003e\u003c/a\u003e\n  \u003ca href=\"./LICENSE\"\u003e\u003cimg alt=\"License: MIT\" src=\"https://img.shields.io/github/license/Rafaelpta/dupehound?color=blue\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/Rafaelpta/dupehound/stargazers\"\u003e\u003cimg alt=\"Stars\" src=\"https://img.shields.io/github/stars/Rafaelpta/dupehound\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\ndupehound is a duplicate-code detector built for codebases where agents write most of the code. It finds functions that exist more than once, even after every identifier and literal has been renamed, because it fingerprints the structure of the code instead of its text.\n\n| Command | What it does |\n|---------|--------------|\n| `scan` | reports every duplicate cluster and a repo-level slop score |\n| `history` | charts duplication across the git log and pinpoints when it took off |\n| `check` | fails CI when a change duplicates code that already exists, naming the original to reuse |\n\nEverything runs locally and deterministically (no network, API keys, or AI required).\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/pipeline.svg\" alt=\"The pipeline: discover files, fingerprint every function via tree-sitter parsing and winnowing, match through an inverted index, report\" width=\"900\"\u003e\n\u003c/p\u003e\n\n## Install\n\nPrebuilt binaries for macOS, Linux and Windows are on the [releases page](https://github.com/Rafaelpta/dupehound/releases), or:\n\n```\ncargo install dupehound\n```\n\nOn macOS or Linux with 
Homebrew:\n\n```\nbrew install rafaelpta/dupehound/dupehound\n```\n\n`history` and `check` require `git` on PATH. `scan` works on any directory.\n\n## Usage\n\n### `scan`\n\n`dupehound scan [path]` ranks duplicate clusters by deletable lines:\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/scan.png\" alt=\"dupehound scan on the vscode source tree: 2.8 percent slop, grade A, listing the top duplicate clusters\" width=\"880\"\u003e\n\u003c/p\u003e\n\nThe slop score is the percentage of code you could delete if every cluster kept only one copy; the largest copy is exempt and test files are excluded by default, since table-driven tests are repetitive by design. On Rust, trait-impl methods (`From`, `Display`, ...) are also kept out of the score, since each impl is required and cannot be merged.\n\n\u003cul\u003e\n\u003cli\u003e\u003ccode\u003e--explain N\u003c/code\u003e prints a cluster's code as proof\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003e--json\u003c/code\u003e emits a versioned schema\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003e--card\u003c/code\u003e writes a score card as SVG and PNG\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003e--include-classes\u003c/code\u003e flags C# classes with near-duplicate property and method signatures (experimental, opt-in, never affects the slop score)\u003c/li\u003e\n\u003c/ul\u003e\n\nLanguages: TypeScript, TSX, JavaScript, Python, Rust, Go, Java, Ruby, Swift, C, C++, PHP, C#, Kotlin.\n\n### `history`\n\n`dupehound history` measures the slop score at monthly snapshots, reading blobs straight from the object database (no checkouts), and reports when duplication took off:\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/history.png\" alt=\"dupehound history charting the slop score across monthly snapshots, with the grade and the inflection point where duplication took off\" width=\"880\"\u003e\n\u003c/p\u003e\n\n### `check`\n\n`dupehound check` gates CI and pre-commit. 
It indexes the codebase at the base revision and probes only the functions a change adds or touches. \u003cbr\u003e\u003cbr\u003e Moved functions and in-place edits don't fire. \u003cbr\u003e\u003cbr\u003eExit codes: 0 clean, 1 findings, 2 error.\n\n```\n$ dupehound check --diff main .\nsrc/api/orders.ts:1 calculateOrderAmount() is a 100% duplicate of src/billing/invoice.ts:1 computeInvoiceTotal() — reuse it\n```\n\nA GitHub Actions recipe and a pre-commit setup are in [docs/ci.md](docs/ci.md). \u003cbr\u003e\u003cbr\u003eTo make a coding agent reuse code instead of rewriting it, feed the output of `check` back to it from `CLAUDE.md` or `AGENTS.md`; the snippet is there too.\n\n### `mcp`\n\n`dupehound mcp` runs as an MCP server over stdio, exposing `check` and `scan` as tools an AI coding agent can call itself, mid-edit, to reuse existing code instead of rebuilding it. It stays local, offline (stdio is a local pipe) and deterministic, and it uses no AI. Add it to Claude Code with:\n\n```\nclaude mcp add dupehound -- dupehound mcp\n```\n\nThe agent then has a `check_duplication` tool (did this change duplicate existing code, and where is the original) and a `scan_duplication` tool (the repo's duplication score and clusters).\n\n## How it works\n\nFunction bodies are parsed with tree-sitter and normalized: identifiers, strings and numbers become sentinels, comments are dropped, structure stays. 
k-grams of 10 tokens are rolling-hashed and selected by robust winnowing ([Schleimer, Wilkerson \u0026 Aiken, SIGMOD 2003](https://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf)), which guarantees any shared run of 17 normalized tokens is caught.\u003cbr\u003e\u003cbr\u003e An inverted fingerprint index generates candidate pairs, boilerplate fingerprints are culled, similarity is exact Jaccard, union-find builds the clusters.\n\nThe defaults are conservative about false positives: generated, minified and vendored files are skipped, functions under 40 normalized tokens are ignored, and every match is verifiable with `--explain`. \u003cbr\u003e\u003cbr\u003eGrade buckets were calibrated against express (0.0%), gin (0.2%), tokio (1.1%), fastapi (1.7%) and vscode (2.8%), all grade A. vscode, at 3.0M lines and 54k functions, scans in 2.3s on a laptop. Full design notes in [docs/design.md](docs/design.md).\n\n## Why dupehound\n\nCoding agents don't know what a codebase already contains, so they re-implement it. `formatDate` becomes `renderTimestamp`, then `stringifyDate`: the same logic under several names, each copy aging independently. \u003cbr\u003e\u003cbr\u003eGitClear's [analysis of 211 million changed lines](https://www.gitclear.com/ai_assistant_code_quality_2025_research) found duplicated code blocks grew 8x in 2024, the first year copy-pasted lines outnumbered moved ones.\n\nAn LLM doesn't do this job well.\u003cbr\u003e\u003cbr\u003e Duplicate detection compares every function against every other; a model samples what fits in context, an index checks everything. A merge gate must be reproducible: same input, same verdict, an algorithm you can read. dupehound is the deterministic side of the loop: the agent writes, the index remembers.\n\n## Bugs\n\nPlease file issues on [the issue tracker](https://github.com/Rafaelpta/dupehound/issues). 
\u003cbr\u003e\u003cbr\u003eThe most useful false-positive report is a small code pair that matches but shouldn't, plus the `--explain` output; these become regression fixtures directly.\n\n## Contributing\n\nPRs welcome. Adding a language is the most wanted contribution and is roughly one tree-sitter query file; see [CONTRIBUTING.md](CONTRIBUTING.md).\n\n## License\n\n[MIT](./LICENSE). Bundled [JetBrains Mono](https://www.jetbrains.com/lp/mono/) subsets are under the [SIL OFL 1.1](assets/fonts/OFL.txt). The diagram uses Excalidraw's [Virgil](https://github.com/excalidraw/virgil) font (OFL).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FRafaelpta%2Fdupehound","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FRafaelpta%2Fdupehound","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FRafaelpta%2Fdupehound/lists"}