{"id":50870091,"url":"https://github.com/fadhlidev/dedupx","last_synced_at":"2026-06-15T04:30:50.483Z","repository":{"id":354654535,"uuid":"1224084762","full_name":"fadhlidev/dedupx","owner":"fadhlidev","description":"A fast CLI tool for deduplicating database records using configurable similarity rules","archived":false,"fork":false,"pushed_at":"2026-04-29T12:34:01.000Z","size":63,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-29T14:39:58.565Z","etag":null,"topics":["bun","cli","database","deduplication","typescript"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fadhlidev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-04-29T00:19:22.000Z","updated_at":"2026-04-29T12:34:05.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/fadhlidev/dedupx","commit_stats":null,"previous_names":["fadhlidev/dedupx"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/fadhlidev/dedupx","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fadhlidev%2Fdedupx","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fadhlidev%2Fdedupx/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fadhlidev%2Fdedupx/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fadhlidev%2Fdedupx/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fadhlidev","download_url":"https://codeload.github.com/fadhlidev/dedupx/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fadhlidev%2Fdedupx/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34348291,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-15T02:00:07.085Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bun","cli","database","deduplication","typescript"],"created_at":"2026-06-15T04:30:49.789Z","updated_at":"2026-06-15T04:30:50.474Z","avatar_url":"https://github.com/fadhlidev.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DedupX\n\nA fast CLI tool for deduplicating database records using configurable similarity rules. Built with Bun.\n\n## Features\n\n- **Multiple Comparison Algorithms**: exact, fuzzy, soundex, ngram, numeric\n- **Flexible Configuration**: YAML-based configuration with weighted rules\n- **Parallel Processing**: Worker pool with configurable concurrency\n- **Blocking Strategies**: Reduce comparison space with blocking columns\n- **Progress Tracking**: Real-time progress bars and detailed logging\n- **Database Support**: PostgreSQL, MySQL, SQLite\n\n## Install\n\n```bash\nbun install\n```\n\n## Quick Start\n\n1. Create a configuration file (see `config.example.yaml` for reference):\n\n```yaml\nsource:\n  connection: \"postgres://user:password@localhost:5432/mydb\"\n  driver: \"postgres\"\n  table: \"users\"\n\nrules:\n  - name: \"email_match\"\n    columns: [\"email\"]\n    comparator: \"exact\"\n    weight: 1.0\n  - name: \"name_similarity\"\n    columns: [\"first_name\", \"last_name\"]\n    comparator: \"fuzzy\"\n    weight: 0.8\n\nthreshold: 0.85\n\nprocessing:\n  strategy: \"block\"\n  blocking_column: \"email_domain\"\n  batch_size: 500\n  concurrency: 4\n```\n\n2. Run deduplication:\n\n```bash\nbun run index.ts run -c config.yaml\n```\n\nOr run the built binary:\n\n```bash\n./dedupx run -c config.yaml\n```\n\n## Commands\n\n### `run`\n\nRuns the deduplication process based on the provided configuration.\n\n```bash\nbun run index.ts run -c config.yaml\n```\n\n### `check`\n\nValidates the configuration file and tests the database connection.\n\n```bash\nbun run index.ts check -c config.yaml\n```\n\n### `rules`\n\nDry-run: shows which rules would fire for a sample of data without writing results.\n\n```bash\nbun run index.ts rules -c config.yaml -s 100\n```\n\nOptions:\n- `-s, --sample-size \u003cnumber\u003e` - Number of rows to fetch for dry-run (default: 100)\n\n## Configuration Reference\n\n### Source\n\n| Parameter | Type | Required | Description |\n|-----------|------|----------|-------------|\n| `connection` | string | Yes | Database connection string (URI format) |\n| `driver` | string | Yes | Database driver: `postgres`, `mysql`, or `sqlite` |\n| `table` | string | Yes | Source table name to deduplicate |\n\n### Rules\n\nArray of rules defining how to compare records. Each rule consists of:\n\n| Parameter | Type | Required | Description |\n|-----------|------|----------|-------------|\n| `name` | string | Yes | Human-readable name for the rule |\n| `columns` | array | Yes | List of column names to compare (concatenated for comparison) |\n| `comparator` | string | Yes | Comparison algorithm: `exact`, `fuzzy`, `soundex`, `ngram`, `numeric` |\n| `weight` | number | No | Weight for weighted averaging (0-1), default: 1.0 |\n| `options` | object | No | Comparator-specific options |\n\n#### Comparator Details\n\n| Comparator | Description | Options |\n|------------|-------------|---------|\n| `exact` | Exact string match | None |\n| `fuzzy` | Levenshtein distance-based similarity | `threshold`: minimum similarity (default: 0.8) |\n| `soundex` | Phonetic algorithm matching | None |\n| `ngram` | N-gram based similarity | `n`: n-gram size (default: 3), `threshold`: minimum similarity |\n| `numeric` | Numeric comparison | `tolerance`: allowed difference for numbers |\n\n### Threshold\n\n| Parameter | Type | Required | Description |\n|-----------|------|----------|-------------|\n| `threshold` | number | Yes | Similarity threshold (0-1) to consider records as duplicates |\n\n### Processing\n\n| Parameter | Type | Required | Description |\n|-----------|------|----------|-------------|\n| `strategy` | string | No | Comparison strategy: `block` or `full_scan` (default: `block`) |\n| `blocking_column` | string | Only if strategy is `block` | Column to group records for blocking |\n| `batch_size` | number | No | Number of rows to process in one batch (default: 500) |\n| `concurrency` | number | No | Number of parallel workers (default: 4) |\n\n### Output\n\n| Parameter | Type | Required | Description |\n|-----------|------|----------|-------------|\n| `schema` | string | No | Output schema name (for databases that support schemas) |\n\n## Sample Configuration\n\n### Basic Email Deduplication\n\n```yaml\nsource:\n  connection: \"postgres://user:pass@localhost:5432/db\"\n  driver: \"postgres\"\n  table: \"customers\"\n\nrules:\n  - name: \"email_exact\"\n    columns: [\"email\"]\n    comparator: \"exact\"\n    weight: 1.0\n\nthreshold: 1.0\n\nprocessing:\n  strategy: \"block\"\n  blocking_column: \"email\"\n```\n\n### Multi-Column Person Deduplication\n\n```yaml\nsource:\n  connection: \"sqlite://data.db\"\n  driver: \"sqlite\"\n  table: \"people\"\n\nrules:\n  - name: \"full_name\"\n    columns: [\"first_name\", \"last_name\"]\n    comparator: \"fuzzy\"\n    weight: 0.9\n    options:\n      threshold: 0.85\n  - name: \"phone_exact\"\n    columns: [\"phone\"]\n    comparator: \"exact\"\n    weight: 1.0\n  - name: \"address_soundex\"\n    columns: [\"address\"]\n    comparator: \"soundex\"\n    weight: 0.7\n\nthreshold: 0.8\n\nprocessing:\n  strategy: \"block\"\n  blocking_column: \"zip_code\"\n  concurrency: 8\n```\n\n### Fuzzy Product Deduplication\n\n```yaml\nsource:\n  connection: \"mysql://user:pass@localhost:3306/inventory\"\n  driver: \"mysql\"\n  table: \"products\"\n\nrules:\n  - name: \"product_name\"\n    columns: [\"name\", \"brand\"]\n    comparator: \"ngram\"\n    weight: 1.0\n    options:\n      n: 3\n      threshold: 0.7\n  - name: \"sku_exact\"\n    columns: [\"sku\"]\n    comparator: \"exact\"\n    weight: 1.0\n  - name: \"price_numeric\"\n    columns: [\"price\"]\n    comparator: \"numeric\"\n    weight: 0.5\n    options:\n      tolerance: 1.0\n\nthreshold: 0.75\n\nprocessing:\n  strategy: \"full_scan\"\n  concurrency: 4\n  batch_size: 1000\n```\n\n## Output\n\nAfter running deduplication, results are written to a new table with the format `{table}_dedup_{timestamp}`. The output contains:\n\n- Original record IDs\n- Group IDs (all records in the same group are considered duplicates)\n- Canonical (representative) record for each group\n\n## Build\n\n```bash\nbun run build   # outputs 'dedupx' binary\nbun run clean  # removes binary\n```\n\n## Run with Hot Reload\n\n```bash\nbun --hot src/index.ts run -c config.yaml\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffadhlidev%2Fdedupx","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffadhlidev%2Fdedupx","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffadhlidev%2Fdedupx/lists"}