{"id":30772182,"url":"https://github.com/query-farm/rapidfuzz","last_synced_at":"2025-09-05T00:52:42.257Z","repository":{"id":309042965,"uuid":"1034970100","full_name":"Query-farm/rapidfuzz","owner":"Query-farm","description":null,"archived":false,"fork":false,"pushed_at":"2025-08-09T12:06:13.000Z","size":14,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-09T14:15:26.131Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Query-farm.png","metadata":{"files":{"readme":"docs/README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-08-09T11:26:14.000Z","updated_at":"2025-08-09T12:06:16.000Z","dependencies_parsed_at":"2025-08-09T14:15:31.020Z","dependency_job_id":"f7627188-ba31-4d11-85f9-775504cafc35","html_url":"https://github.com/Query-farm/rapidfuzz","commit_stats":null,"previous_names":["query-farm/rapidfuzz"],"tags_count":null,"template":false,"template_full_name":"duckdb/extension-template","purl":"pkg:github/Query-farm/rapidfuzz","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Query-farm%2Frapidfuzz","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Query-farm%2Frapidfuzz/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Query-farm%2Frapidfuzz/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Query-farm%2Frapidfuzz/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/owners/Query-farm","download_url":"https://codeload.github.com/Query-farm/rapidfuzz/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Query-farm%2Frapidfuzz/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273695251,"owners_count":25151484,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-04T02:00:08.968Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-09-05T00:52:39.465Z","updated_at":"2025-09-05T00:52:42.245Z","avatar_url":"https://github.com/Query-farm.png","language":"Python","readme":"\n# RapidFuzz Extension for DuckDB\n\nThis `rapidfuzz` extension adds high-performance fuzzy string matching functions to DuckDB, powered by the RapidFuzz C++ library.\n\n## Installation\n\n**`rapidfuzz` is a [DuckDB Community Extension](https://github.com/duckdb/community-extensions).**\n\nYou can use it in DuckDB SQL:\n\n```sql\ninstall rapidfuzz from community;\nload rapidfuzz;\n```\n\n## What is Fuzzy String Matching?\n\nFuzzy string matching allows you to compare strings and measure their similarity, even when they are not exactly the same. 
This is useful for:\n\n- Data cleaning and deduplication\n- Record linkage\n- Search and autocomplete\n- Spell checking\n\nRapidFuzz provides fast, high-quality algorithms for string similarity and matching.\n\n## Available Functions\n\nThis extension exposes several core RapidFuzz algorithms as DuckDB scalar functions:\n\n### `rapidfuzz_ratio(a, b)`\n- **Returns**: `DOUBLE` (similarity score between 0 and 100)\n- **Description**: Computes the similarity ratio between two strings.\n\n```sql\nSELECT rapidfuzz_ratio('hello world', 'helo wrld');\n┌─────────────────────────────────────────────┐\n│ rapidfuzz_ratio('hello world', 'helo wrld') │\n│                   double                    │\n├─────────────────────────────────────────────┤\n│                    90.0                     │\n└─────────────────────────────────────────────┘\n```\n\n### `rapidfuzz_partial_ratio(a, b)`\n- **Returns**: `DOUBLE`\n- **Description**: Computes the best partial similarity score between substrings of the two inputs.\n\n```sql\nSELECT rapidfuzz_partial_ratio('hello world', 'world');\n┌─────────────────────────────────────────────────┐\n│ rapidfuzz_partial_ratio('hello world', 'world') │\n│                     double                      │\n├─────────────────────────────────────────────────┤\n│                      100.0                      │\n└─────────────────────────────────────────────────┘\n```\n\n### `rapidfuzz_token_sort_ratio(a, b)`\n- **Returns**: `DOUBLE`\n- **Description**: Compares strings after sorting their tokens (words), useful for matching strings with reordered words.\n\n```sql\nSELECT rapidfuzz_token_sort_ratio('world hello', 'hello world');\n┌──────────────────────────────────────────────────────────┐\n│ rapidfuzz_token_sort_ratio('world hello', 'hello world') │\n│                          double                          │\n├──────────────────────────────────────────────────────────┤\n│                          100.0                           
│\n└──────────────────────────────────────────────────────────┘\n```\n\n### `rapidfuzz_token_set_ratio(a, b)`\n- **Returns**: `DOUBLE`\n- **Description**: A similarity metric that compares sets of tokens between two strings, ignoring duplicated words and word order.\n\n```sql\nSELECT rapidfuzz_token_set_ratio('new york new york city', 'new york city');\n┌──────────────────────────────────────────────────────────────────────┐\n│ rapidfuzz_token_set_ratio('new york new york city', 'new york city') │\n│                                double                                │\n├──────────────────────────────────────────────────────────────────────┤\n│                                100.0                                 │\n└──────────────────────────────────────────────────────────────────────┘\n```\n\n\n## Supported Data Types\n\nAll functions support DuckDB `VARCHAR` type. For best results, use with textual data.\n\n## Usage Examples\n\n### Basic Similarity\n\n```sql\nSELECT rapidfuzz_ratio('database', 'databse');\nSELECT rapidfuzz_partial_ratio('duckdb extension', 'extension');\nSELECT rapidfuzz_token_sort_ratio('fuzzy string match', 'string fuzzy match');\nSELECT rapidfuzz_token_set_ratio('fuzzy string match', 'string fuzzy match');\n```\n\n### Data Deduplication\n\n```sql\nSELECT name, rapidfuzz_ratio(name, 'Jon Smith') AS similarity\nFROM users\nWHERE rapidfuzz_ratio(name, 'Jon Smith') \u003e 80;\n```\n\n### Record Linkage\n\n```sql\nSELECT a.id, b.id, rapidfuzz_ratio(a.name, b.name) AS score\nFROM table_a a\nJOIN table_b b ON rapidfuzz_ratio(a.name, b.name) \u003e 85;\n```\n\n### Search and Autocomplete\n\n```sql\nSELECT query, candidate, rapidfuzz_partial_ratio(query, candidate) AS score\nFROM search_candidates\nORDER BY score DESC\nLIMIT 10;\n```\n\n## Algorithm Selection Guide\n\n- **General similarity**: Use `rapidfuzz_ratio` for overall similarity.\n- **Partial matches**: Use `rapidfuzz_partial_ratio` for substring matches.\n- **Reordered words**: Use 
`rapidfuzz_token_sort_ratio` for strings with the same words in different orders.\n- **Duplicated or extra words**: Use `rapidfuzz_token_set_ratio` to ignore repeated words and word order.\n\n## Performance Tips\n\n1. RapidFuzz algorithms are highly optimized for speed and accuracy.\n2. For large datasets, use `WHERE` clauses to filter by a similarity threshold.\n3. Preprocess your data (e.g., lowercase, trim) for best results.\n\n## License\n\nMIT Licensed\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquery-farm%2Frapidfuzz","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fquery-farm%2Frapidfuzz","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquery-farm%2Frapidfuzz/lists"}