{"id":50472414,"url":"https://github.com/semcod/redup","last_synced_at":"2026-06-01T11:03:25.878Z","repository":{"id":346371324,"uuid":"1188700137","full_name":"semcod/redup","owner":"semcod","description":null,"archived":false,"fork":false,"pushed_at":"2026-05-31T10:37:33.000Z","size":67709,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-31T11:14:41.926Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/semcod.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-22T13:16:04.000Z","updated_at":"2026-05-31T10:37:37.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/semcod/redup","commit_stats":null,"previous_names":["semcod/redup"],"tags_count":69,"template":false,"template_full_name":null,"purl":"pkg:github/semcod/redup","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/semcod%2Fredup","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/semcod%2Fredup/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/semcod%2Fredup/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/semcod%2Fredup/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/semcod","download_url":"https://codeload.github
.com/semcod/redup/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/semcod%2Fredup/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33771630,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-01T02:00:06.963Z","response_time":115,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-01T11:03:25.804Z","updated_at":"2026-06-01T11:03:25.869Z","avatar_url":"https://github.com/semcod.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# reDUP\n\n**Code duplication analyzer and refactoring planner for LLMs.**\n\n[![PyPI](https://img.shields.io/pypi/v/redup)](https://pypi.org/project/redup/)\n[![License: Apache-2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n[![Python](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://python.org)\n[![Version](https://img.shields.io/badge/version-0.4.30-green.svg)](https://pypi.org/project/redup/)\n\n\n## AI Cost Tracking\n\n![PyPI](https://img.shields.io/badge/pypi-costs-blue) ![Version](https://img.shields.io/badge/version-0.4.30-blue) ![Python](https://img.shields.io/badge/python-3.9+-blue) ![License](https://img.shields.io/badge/license-Apache--2.0-green)\n![AI 
Cost](https://img.shields.io/badge/AI%20Cost-$31.52-orange) ![Human Time](https://img.shields.io/badge/Human%20Time-26.1h-blue) ![Model](https://img.shields.io/badge/Model-openrouter%2Fqwen%2Fqwen3--coder--next-lightgrey)\n\n- 🤖 **LLM usage:** $31.5200 (74 commits)\n- 👤 **Human dev:** ~$2609 (26.1h @ $100/h, 30min dedup)\n\nGenerated on 2026-05-31 using [openrouter/qwen/qwen3-coder-next](https://openrouter.ai/qwen/qwen3-coder-next)\n\n---\n\nreDUP scans codebases for duplicated functions, blocks, and structural patterns — then builds a prioritized refactoring map that LLMs can consume to eliminate redundancy systematically.\n\n## Features\n\n- **Exact duplicate detection** via SHA-256 block hashing\n- **Structural clone detection** — same AST shape, different variable names\n- **LSH near-duplicate detection** for large code blocks (\u003e50 lines)\n- **Multi-language support** — 35+ languages via tree-sitter (Python, JavaScript, TypeScript, Go, Rust, Java, C/C++, C#, Ruby, PHP, Bash, SQL, HTML, CSS, Lua, Scala, Kotlin, Swift, Objective-C, JSON, YAML, TOML, XML, Markdown, GraphQL, Dockerfile, Makefile, Nginx, Vim, Svelte, Vue, and more)\n- **Parallel scanning** for large projects (2x+ performance improvement)\n- **Incremental scan cache** (`--incremental`) for faster repeat runs\n- **Changed-only scan mode** (`--changed-only`) for git-diff focused analysis\n- **Fuzzy near-duplicate matching** via SequenceMatcher / rapidfuzz\n- **Function-level analysis** using Python AST and tree-sitter extraction\n- **Impact scoring** — prioritizes duplicates by `saved_lines × similarity`\n- **Refactoring planner** — generates concrete extract/inline suggestions\n- **Multiple output formats**: JSON, YAML, TOON, Markdown\n- **Configuration system** — TOML files and environment variables\n- **CLI commands**: `scan`, `compare`, `diff`, `check`, `config`, `info`\n- **Cross-project comparison** — detect shared code between projects with merge/extract recommendations\n- **CI 
integration** with configurable quality gates\n- **Clean output** — no syntax warnings from external libraries\n\n## New Features (v0.4.20)\n\n### 🤖 MCP Server\n\nFull MCP (Model Context Protocol) server for AI assistant integration:\n\n```bash\n# Start MCP server\nredup-mcp\n\n# Or HTTP mode\nredup-mcp --transport http --port 8000\n```\n\n**Available Tools:**\n- `analyze_project` — Full duplication analysis\n- `find_duplicates` — Quick duplicate detection\n- `check_project` — Quality gate check\n- `compare_projects` — Cross-project comparison\n- `suggest_refactoring` — AI-powered refactoring suggestions\n- `project_info` — Project metadata\n\n### 🌐 Universal Fuzzy Similarity Detection\n\nCross-language duplicate detection across all 35+ supported languages:\n\n```bash\n# Detect similar code across languages\nredup scan . --fuzzy --fuzzy-threshold 0.65\n```\n\n**Cross-Language Matching:**\n- JavaScript ↔ Python functions: ~65% similarity\n- Docker ↔ YAML configs: ~40% similarity\n- Auth patterns across languages: ~70% similarity\n\n**Supported Patterns:**\n- Functions, classes, API endpoints\n- Database queries, web components\n- Auth/validation, error handling, logging\n- Configuration, infrastructure code\n\n### 🌳 Modular Tree-Sitter Extractor\n\nRefactored tree-sitter extraction with clean, modular architecture:\n\n```\nts_extractor/\n├── extractors/          # Modular per-language extractors\n│   ├── c_family.py      # C, C++, C#, Objective-C\n│   ├── go.py            # Go\n│   ├── java.py          # Java, Scala, Kotlin\n│   ├── markup.py        # HTML, XML, Svelte, Vue\n│   ├── web.py           # JavaScript, TypeScript\n│   └── ...\n├── dispatcher.py        # Smart language routing\n├── config.py            # Language registry\n└── main.py              # Unified API\n```\n\n**Benefits:**\n- Easier to add new languages\n- Better testability\n- Cleaner separation of concerns\n- 35+ languages supported\n\n---\n\n## New Features (v0.5.0+)\n\n### 🌐 Universal Fuzzy 
Similarity Detection\n\nCross-language fuzzy matching for detecting similar code patterns across **all 35+ supported languages**:\n\n```bash\n# Detect similar patterns across different languages\nredup scan . --fuzzy --ext .py,.js,.ts\n\n# Cross-project comparison with fuzzy matching\nredup compare ./project-a ./project-b --fuzzy --threshold 0.65\n```\n\n**Features:**\n- Detects similar functions, API endpoints, validation logic across languages (e.g., JS ↔ Python)\n- Pattern recognition: authentication, error handling, database queries, web components\n- Language-agnostic signature generation with identifier normalization\n- Complexity scoring (0.0-1.0) for each detected pattern\n\n**Example patterns detected:**\n- Express.js route handler ↔ Flask endpoint (70% similarity)\n- Docker Compose service ↔ Kubernetes deployment (40% similarity)\n- Auth middleware patterns across frameworks\n\n### 🧩 Modular ts_extractor Architecture\n\nThe tree-sitter multi-language extractor has been refactored from a 782-line god module into a clean package:\n\n```\nredup/core/ts_extractor/\n├── extractors/\n│   ├── web.py        # JavaScript/TypeScript\n│   ├── c_family.py   # C/C++\n│   ├── dotnet.py     # C#\n│   ├── ruby.py       # Ruby\n│   ├── php.py        # PHP\n│   └── ...           
# 10+ language-specific modules\n```\n\n**Benefits:**\n- Better maintainability (avg 100 lines per module vs 782)\n- Easier to add new language extractors\n- Shared base utilities for common operations\n- Full backward compatibility maintained\n\n### 🎯 Enriched TOON Reporter\n\nThe TOON format now includes actionable sections for practical refactoring:\n\n- **HOTSPOTS** — Top 7 files with most duplicated lines (where to focus effort)\n- **QUICK_WINS** — Low-risk, high-savings suggestions (do first)\n- **DEPENDENCY_RISK** — Duplicates spanning multiple packages (cross-module risk)\n- **EFFORT_ESTIMATE** — Time estimates per task with difficulty (easy/medium/hard)\n\n### 🤖 LLM-Powered Refactoring Plans\n\nGenerate AI-assisted refactoring TODO lists from cross-project comparisons:\n\n```bash\nredup compare ./project-a ./project-b --refactor-plan --env .env --output report.json\n```\n\n- Uses `litellm` for flexible LLM provider support\n- Compact metadata-only prompts for efficiency\n- Structured JSON output with prioritized tasks\n- Token usage tracking\n\n### 📊 Simplified Compare Reports\n\nCross-project comparison reports are now more compact and human-readable:\n\n- Relative file paths instead of absolute\n- Matches deduplicated by function pair\n- Communities with compact member dicts\n- Filtered trivial entries to reduce noise\n- ~60% smaller JSON size\n\n## Installation\n\n```bash\npip install redup\n```\n\nWith optional dependencies:\n\n```bash\npip install redup[all]       # Everything\npip install redup[fuzzy]     # rapidfuzz for better similarity matching\npip install redup[ast]       # tree-sitter for multi-language AST\npip install redup[lsh]       # datasketch for LSH near-duplicate detection\npip install redup[compare]   # networkx for cross-project community detection\npip install redup[llm]       # litellm for LLM-powered refactoring plans\n```\n\n## Quick Start\n\n### CLI\n\n```bash\n# Scan current directory, output TOON to stdout\nredup scan .\n\n# 
Scan with JSON output saved to file\nredup scan ./src --format json --output ./reports/\n\n# Parallel scanning for large projects\nredup scan . --parallel --max-workers 4\n\n# Reuse cache between runs for faster rescans\nredup scan . --incremental\n\n# Scan only files changed vs branch tip (git diff based)\nredup scan . --changed-only --base-ref origin/main --incremental\n\n# Multi-language scanning with 35+ supported languages\nredup scan . --ext \".py,.js,.ts,.go,.rs,.java,.rb,.php,.html,.css,.sql,.lua,.scala,.kt,.swift,.m,.json,.yaml,.toml,.xml,.md,.graphql,.dockerfile,.svelte,.vue\"\n\n# CI gate with thresholds\nredup check . --max-groups 10 --max-lines 100\n\n# Compare two scans\nredup diff before.json after.json\n\n# Cross-project comparison (merge vs extract decision)\nredup compare ./project-a ./project-b --threshold 0.75\n\n# With LLM-powered refactoring plan (requires litellm + .env with API keys)\nredup compare ./project-a ./project-b --refactor-plan --env .env --output comparison.json\n\n# Specify custom LLM model\nredup compare ./project-a ./project-b --refactor-plan --llm-model openrouter/anthropic/claude-3.5-sonnet\n\n# Initialize configuration\nredup config --init\n```\n\n```bash\n# Scan with all formats\nredup scan . --format all --output ./redup_output/\n\n# Only function-level duplicates (faster)\nredup scan . --functions-only\n\n# Custom thresholds\nredup scan . 
--min-lines 5 --min-sim 0.9\n\n# Show installed optional dependencies\nredup info\n\n# Export duplications as tasks to TODO.md (requires: pip install redup[tasks])\nredup tasks ./my-project\n\n# Export with GitHub sync\nredup tasks ./my-project --backend github --milestone \"Sprint 1\"\n\n# Export with GitLab sync and custom output\nredup tasks ./my-project -b gitlab -o refactoring-tasks.md\n\n# Preview tasks without creating files\nredup tasks ./my-project --dry-run\n```\n\n### Task Management with Planfile (Optional)\n\nWhen you install `redup[tasks]`, you can export duplication findings as\nactionable tasks in TODO.md format with synchronization to GitHub, GitLab,\nor Jira:\n\n```bash\n# Install with planfile support\npip install redup[tasks]\n\n# Generate TODO.md from duplications\nredup tasks ./my-project --output TODO.md\n\n# The generated TODO.md includes:\n# - Priority-based task organization (critical/major/minor)\n# - Difficulty estimation (easy/medium/hard)\n# - Line savings potential\n# - Detailed refactoring suggestions\n# - Planfile export configuration\n```\n\nExample TODO.md output:\n```markdown\n# TODO - Duplication Refactoring Tasks\n\n## CRITICAL (3 tasks)\n- [ ] **Refactor: process_file (4x duplication)** 🔴\n   Priority: critical | Savings: 124L\n   \u003cdetails\u003e\n   Extract function to shared utility module.\n   Files: src/core/scanner.py, src/core/planner.py, ...\n   \u003c/details\u003e\n\n## MAJOR (5 tasks)\n- [ ] **Refactor: validate_input (3x duplication)** 🟡\n   Priority: major | Savings: 45L\n   ...\n```\n\n### Configuration\n\nCreate a `redup.toml` file:\n\n```toml\n[scan]\nextensions = \".py,.js,.ts,.go,.rs,.java,.rb,.php,.html,.css,.sql,.lua,.scala,.kt,.swift,.m,.json,.yaml,.toml,.xml,.md,.graphql,.dockerfile,.svelte,.vue\"\nmin_lines = 3\nmin_similarity = 0.85\ninclude_tests = false\n\n[lsh]\nenabled = true\nmin_lines = 50\nthreshold = 0.8\n\n[check]\nmax_groups = 10\nmax_lines = 100\n\n[output]\nformat = \"toon\"\noutput = 
\"redup_output\"\n\n[reporting]\ninclude_snippets = true\ngenerate_suggestions = true\n```\n\nOr use `[tool.redup]` in `pyproject.toml`. Environment variables with `REDUP_` prefix override file settings.\n\n### Python API\n\n```python\nfrom pathlib import Path\nfrom redup import ScanConfig, analyze\nfrom redup.reporters.toon_reporter import to_toon\nfrom redup.reporters.json_reporter import to_json\n\nconfig = ScanConfig(\n    root=Path(\"./my_project\"),\n    extensions=[\".py\", \".js\", \".ts\", \".go\", \".rs\", \".java\", \".rb\", \".php\", \".html\", \".css\"],\n    min_block_lines=3,\n    min_similarity=0.85,\n)\n\nresult = analyze(config=config, function_level_only=True)\n\nprint(f\"Found {result.total_groups} duplicate groups\")\nprint(f\"Lines recoverable: {result.total_saved_lines}\")\n\n# For LLM consumption\nprint(to_toon(result))\n\n# For tooling / CI\nPath(\"duplication.json\").write_text(to_json(result))\n```\n\n## Output Formats\n\n### TOON (LLM-optimized)\n\n```\n# redup/duplication | 15 groups | 86f 10453L | 2026-04-16\n\nSUMMARY:\n  files_scanned: 86\n  total_lines:   10453\n  dup_groups:    15\n  dup_fragments: 36\n  saved_lines:   217\n  scan_ms:       3620\n\nHOTSPOTS[7] (files with most duplication):\n  src/redup/core/ts_extractor.py  dup=74L  groups=4  frags=11  (0.7%)\n  src/redup/core/scanner_utils.py  dup=70L  groups=3  frags=3  (0.7%)\n  src/redup/core/scanner_loader.py  dup=52L  groups=1  frags=1  (0.5%)\n\nDUPLICATES[15] (ranked by impact):\n  [E0001] ! 
EXAC  _preload_files  L=52 N=2 saved=52 sim=1.00\n      src/redup/core/scanner_loader.py:9-60  (_preload_files)\n      src/redup/core/scanner_utils.py:53-104  (_preload_files)\n\nREFACTOR[15] (ranked by priority):\n  [1] ◐ extract_module     → src/redup/core/utils/_preload_files.py\n      WHY: 2 occurrences of 52-line block across 2 files — saves 52 lines\n      FILES: src/redup/core/scanner_loader.py, src/redup/core/scanner_utils.py\n\nQUICK_WINS[8] (low risk, high savings — do first):\n  [3] extract_function   saved=26L  → src/redup/core/utils/find_exact_duplicates_lazy.py\n      FILES: lazy_grouper.py\n  [4] extract_function   saved=21L  → src/redup/core/utils/_extract_functions_go.py\n      FILES: ts_extractor.py\n\nDEPENDENCY_RISK[3] (duplicates spanning multiple packages):\n  validate_input  packages=2  files=2\n      api/routes/users.py\n      services/auth/validate.py\n\nEFFORT_ESTIMATE (total ≈ 8.7h):\n  hard   _preload_files                      saved=52L  ~156min\n  hard   __init__                            saved=36L  ~108min\n  medium find_exact_duplicates_lazy          saved=26L  ~52min\n  easy   _is_test_file                       saved=12L  ~24min\n\nMETRICS-TARGET:\n  dup_groups:  15 → 0\n  saved_lines: 217 lines recoverable\n```\n\n### JSON (machine-readable)\n\n```json\n{\n  \"summary\": {\n    \"total_groups\": 3,\n    \"total_saved_lines\": 84\n  },\n  \"groups\": [\n    {\n      \"id\": \"E0001\",\n      \"type\": \"exact\",\n      \"normalized_name\": \"calculate_tax\",\n      \"fragments\": [\n        {\"file\": \"billing.py\", \"line_start\": 1, \"line_end\": 8},\n        {\"file\": \"shipping.py\", \"line_start\": 1, \"line_end\": 8}\n      ],\n      \"saved_lines_potential\": 16\n    }\n  ],\n  \"refactor_suggestions\": [\n    {\n      \"priority\": 1,\n      \"action\": \"extract_function\",\n      \"new_module\": \"utils/calculate_tax.py\",\n      \"risk_level\": \"low\"\n    }\n  ]\n}\n```\n\n## Cross-Project Comparison\n\nThe `redup 
compare` command analyzes two separate projects to detect shared code and recommends a refactoring strategy:\n\n- **Merge projects** — if \u003e60% code overlap\n- **Extract shared library** — if 5-60% overlap with well-defined clusters\n- **Keep separate** — if \u003c5% overlap\n\n### CLI Usage\n\n```bash\n# Basic comparison\nredup compare ./project-a ./project-b --threshold 0.75\n\n# With semantic similarity (slower, more accurate)\nredup compare ./project-a ./project-b --semantic --threshold 0.70\n\n# Multi-language projects\nredup compare ./backend ./frontend --ext \".py,.js,.ts\" --threshold 0.80\n\n# Skip community detection (faster, no networkx required)\nredup compare ./a ./b --no-community\n\n# Generate LLM-powered refactoring plan (requires redup[llm])\nredup compare ./a ./b --refactor-plan --env .env --output plan.json\n```\n\n### Sample Output\n\n```\nComparing project-a ↔ project-b (threshold=0.75)\n┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃      Cross-Project Comparison                        ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ Metric                  │ Value                      │\n├─────────────────────────┼────────────────────────────┤\n│ Project A files         │ 42                         │\n│ Project B files         │ 38                         │\n│ Project A lines         │ 8500                       │\n│ Project B lines         │ 7200                       │\n│ Cross matches           │ 15                         │\n│ Shared LOC (potential)  │ 1200                       │\n└─────────────────────────┴────────────────────────────┘\n\nRecommendation: extract_shared_lib\n15% overlap (1200 shared lines, 5 clusters). 
Extract to shared library.\nConfidence: 80%\n\nTop Communities (shared code candidates):\n┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━┳━━━━━━━━━━┓\n┃ ID ┃ Name                 ┃ Similarity ┃ LOC ┃ Members  ┃\n┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━╇━━━━━━━━━━┩\n│  0 │ validate_input       │ 0.89       │ 180 │ 5        │\n│  1 │ parse_config         │ 0.82       │ 140 │ 4        │\n│  2 │ format_response      │ 0.76       │ 100 │ 3        │\n└────┴──────────────────────┴────────────┴─────┴──────────┘\n```\n\n### Report JSON Structure\n\n```json\n{\n  \"project_a\": \"./project-a\",\n  \"project_b\": \"./project-b\",\n  \"stats\": {\n    \"a\": {\"files\": 42, \"lines\": 8500},\n    \"b\": {\"files\": 38, \"lines\": 7200}\n  },\n  \"total_matches\": 15,\n  \"shared_loc_potential\": 1200,\n  \"recommendation\": {\n    \"decision\": \"extract_shared_lib\",\n    \"rationale\": \"15% overlap (1200 shared lines, 5 clusters). Extract to shared library.\",\n    \"overlap_pct\": 0.1523,\n    \"shared_loc\": 1200,\n    \"confidence\": 0.8\n  },\n  \"communities\": [\n    {\n      \"name\": \"validate_input\",\n      \"similarity\": 0.89,\n      \"loc\": 180,\n      \"members\": [\n        {\"project\": \"A\", \"file\": \"api/validators.py\", \"function\": \"validate_input\"},\n        {\"project\": \"B\", \"file\": \"utils/validation.py\", \"function\": \"validate_input\"}\n      ]\n    }\n  ],\n  \"matches\": [...]\n}\n```\n\n### Algorithm Overview\n\nThe comparison uses a **3-tier similarity detection**:\n\n1. **Structural hash** — exact AST matches (fast, O(n+m))\n2. **LSH (Locality Sensitive Hashing)** — near-duplicates via MinHash\n3. 
**Semantic similarity** — CodeBERT embeddings (optional, slowest)\n\nMatches are deduplicated by `(function_a, function_b, file_a, file_b)` with the highest similarity score retained.\n\n### Community Detection\n\nRequires `networkx` (`pip install redup[compare]`).\n\nUses **greedy modularity communities** on a similarity graph where:\n- Nodes = functions from both projects\n- Edges = similarity score (filtered by `--threshold`)\n- Communities = clusters of mutually similar functions\n\nEach community gets a generated name based on the longest common prefix of its member functions (e.g., `validate_*` → `validate_input`).\n\n## Architecture\n\n```\nsrc/redup/\n├── __init__.py            # Public API\n├── __main__.py            # python -m redup\n├── mcp_server.py          # MCP server entry point (re-exports from mcp package)\n├── mcp/                   # MCP server package\n│   ├── __init__.py        # Public MCP API\n│   ├── handlers.py        # Tool handlers\n│   ├── schemas.py         # JSON-RPC schemas\n│   ├── server.py          # JSON-RPC server core\n│   └── utils.py           # Shared utilities\n├── core/\n│   ├── models.py          # Pydantic data models\n│   ├── scanner.py         # File discovery + block extraction\n│   ├── scanner/           # Scanner package\n│   │   ├── __init__.py    # Public scanner API\n│   │   ├── cache.py       # Memory cache\n│   │   ├── filters.py     # File filtering\n│   │   ├── loader.py      # File preloading\n│   │   └── types.py       # Scanner types\n│   ├── hasher.py          # SHA-256 / structural fingerprinting\n│   ├── matcher.py         # Fuzzy similarity comparison\n│   ├── planner.py         # Refactoring suggestion generator\n│   ├── pipeline.py        # Legacy: re-exports from pipeline package\n│   ├── pipeline/          # Pipeline package (new)\n│   │   ├── __init__.py    # analyze(), analyze_optimized(), analyze_parallel()\n│   │   ├── phases.py      # scan_phase(), process_blocks()\n│   │   ├── 
duplicate_finder.py  # Duplicate finding phases\n│   │   └── groups.py      # Group creation, deduplication\n│   └── ts_extractor/      # Tree-sitter extraction (35+ languages)\n│       ├── __init__.py    # Public API\n│       ├── main.py        # Core extraction API\n│       ├── dispatcher.py  # Language routing\n│       ├── config.py      # Language registry\n│       └── extractors/    # Per-language extractors\n├── reporters/\n│   ├── json_reporter.py   # JSON output\n│   ├── yaml_reporter.py   # YAML output\n│   └── toon_reporter.py   # TOON output (LLM-optimized)\n└── cli_app/\n    └── main.py            # Typer CLI\n```\n\n## Analysis Pipeline\n\n```\n1. SCAN      Walk project, read files, extract function-level + sliding-window blocks\n2. HASH      Generate exact (SHA-256) and structural (normalized AST) fingerprints\n3. GROUP     Bucket by hash, keep only groups with 2+ blocks from different locations\n4. MATCH     Verify candidates with fuzzy similarity (SequenceMatcher / rapidfuzz)\n5. DEDUP     Remove overlapping groups (keep highest-impact)\n6. PLAN      Generate prioritized refactoring suggestions with risk assessment\n7. 
REPORT    Export to JSON / YAML / TOON\n```\n\n## Recent Improvements (v0.5.0)\n\n### 🏗️ **Modular Architecture Refactoring**\n\nMajor internal restructuring for better maintainability and extensibility:\n\n#### MCP Server Package\nThe MCP server has been split from a 675-line monolith into a clean package:\n```\nredup/mcp/\n├── __init__.py      # Public API\n├── handlers.py      # 8 tool handlers\n├── schemas.py       # JSON-RPC schemas\n├── server.py        # Server core\n└── utils.py         # Utilities\n```\n- **82% code reduction** in main file\n- **Backward compatible**: `mcp_server.py` re-exports all APIs\n- **Better testability**: Isolated handlers can be tested independently\n\n#### Pipeline Package\nThe analysis pipeline (714 lines) now lives in a modular package:\n```\nredup/core/pipeline/\n├── __init__.py          # analyze(), analyze_optimized(), analyze_parallel()\n├── phases.py            # scan_phase(), process_blocks()\n├── duplicate_finder.py  # find_exact_groups(), find_structural_groups(), etc.\n└── groups.py            # deduplicate_groups(), blocks_to_group(), etc.\n```\n- **66% reduction** in main orchestrator file\n- **Phases can be used independently** for custom workflows\n- **Cleaner separation** of concerns\n\n#### Scanner Improvements\nThe scanner has been refactored with extracted helpers:\n- `_init_strategy()` - Strategy initialization\n- `_process_single_file()` - Per-file processing\n- `_extract_blocks_for_file()` - Block extraction\n- **Reduced CC** and **fan-out** in main `scan_project()` function\n\n### 🎯 **Sprint 1 Refactoring Complete**\n- **Reduced cyclomatic complexity** from CC̄=4.2 to CC̄=3.5\n- **Eliminated all critical functions** (CC \u003e 10): 2 → 0\n- **Achieved HEALTHY status** with no structural issues\n- **Dispatch pattern implementation** for AST node processing\n- **Modular TOON reporter** split into 5 focused functions\n- **CLI refactoring** with helper functions for better maintainability\n\n### 🚀 **Technical 
Achievements**\n- **`_process_ast_node`**: CC=14 → CC=6 (dispatch dict pattern)\n- **`to_toon`**: CC=12 → CC=8 (5 helper functions)\n- **CLI `scan()`**: fan-out=18 → ≤10 (4 helper functions)\n- **Code quality**: 0 high-complexity functions\n- **Test pass rate**: 64/64 tests passing (100%)\n\n### 📊 **Quality Metrics**\n- **Health status**: ✅ HEALTHY (no critical issues)\n- **Cyclomatic complexity**: CC̄=3.5 (target ≤ 3.0, not yet reached)\n- **Maximum CC**: 9 (target ≤ 10 achieved)\n- **Code maintainability**: Significantly improved\n- **Duplication**: Minimal (2 groups, 6 lines; acceptable patterns)\n\n### 🔧 **Code Architecture**\n- **Dispatch tables** for extensible AST processing\n- **Single responsibility** functions throughout codebase\n- **Clean separation** of concerns in CLI pipeline\n- **Type safety** improvements with proper annotations\n- **Error handling** enhanced for edge cases\n\n---\n\n## Integration with wronai Toolchain\n\nreDUP is part of the [wronai](https://github.com/wronai) developer toolchain:\n\n- **[code2llm](https://github.com/wronai/code2llm)** — static analysis engine (health diagnostics, complexity)\n- **reDUP** — deep duplication analysis and refactoring planning\n- **[code2docs](https://github.com/wronai/code2docs)** — automatic documentation generation\n- **[vallm](https://github.com/semcod/vallm)** — validation of LLM-generated code proposals\n\n### 📈 **Typical workflow:**\n\n1. `code2llm` analyzes the project → `.toon` diagnostics\n2. `redup` finds duplicates → `duplication.toon.yaml`\n3. Feed both to an LLM for targeted refactoring\n4. 
`vallm` validates the LLM's proposals before merging\n\n### 🎯 **Why reDUP?**\n\n- **LLM-ready**: TOON format optimized for LLM consumption\n- **Actionable**: Generates concrete refactoring suggestions\n- **Prioritized**: Ranks duplicates by impact and risk\n- **Integrated**: Works with the wronai toolchain\n- **Fast**: Scans 1000+ lines in \u003c 1 second\n- **Clean**: No syntax warnings, professional output\n\n---\n\n## Development\n\n```bash\ngit clone https://github.com/semcod/redup.git\ncd redup\npip install -e \".[dev]\"\npytest\n```\n\n## License\n\nLicensed under Apache-2.0.\n\n## Author\n\nTom Sapletta\n\n## Status\n\n_Last updated by [taskill](https://github.com/oqlos/taskill) at 2026-04-25 13:46 UTC_\n\n| Metric | Value |\n|---|---|\n| HEAD | `7055183` |\n| Coverage | 42.9% |\n| Failing tests | — |\n| Commits in last cycle | 50 |\n\n\u003e Added markdown output and a configuration management system, with numerous docs and code-analysis refactors and some test additions. Several refactors target the code analysis engine and TypeScript extractor components.\n\n\u003c!-- taskill:status:end --\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsemcod%2Fredup","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsemcod%2Fredup","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsemcod%2Fredup/lists"}