{"id":17486079,"url":"https://github.com/pydatablog/simstring_rs","last_synced_at":"2025-04-09T22:42:48.661Z","repository":{"id":247464205,"uuid":"820578689","full_name":"PyDataBlog/simstring_rs","owner":"PyDataBlog","description":"A native Rust implementation of the CPMerge algorithm, designed for approximate string matching","archived":false,"fork":false,"pushed_at":"2025-01-23T09:35:46.000Z","size":87,"stargazers_count":1,"open_issues_count":6,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-06T13:18:48.246Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://docs.rs/simstring_rust/latest/simstring_rust/","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PyDataBlog.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-26T18:51:42.000Z","updated_at":"2025-01-23T09:35:51.000Z","dependencies_parsed_at":"2024-07-08T23:45:25.195Z","dependency_job_id":"c7687427-d4ce-46d4-9b85-f0af7f84a792","html_url":"https://github.com/PyDataBlog/simstring_rs","commit_stats":null,"previous_names":["pydatablog/simstring_rs"],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PyDataBlog%2Fsimstring_rs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PyDataBlog%2Fsimstring_rs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PyDataBlog%2Fsimstring_rs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PyDataBlo
g%2Fsimstring_rs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PyDataBlog","download_url":"https://codeload.github.com/PyDataBlog/simstring_rs/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248124929,"owners_count":21051757,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-19T02:09:31.867Z","updated_at":"2025-04-09T22:42:48.634Z","avatar_url":"https://github.com/PyDataBlog.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# simstring_rust\n\n[![Build Status](https://github.com/PyDataBlog/simstring_rs/actions/workflows/CI.yml/badge.svg)](https://github.com/PyDataBlog/simstring_rs/actions)\n[![Crates.io](https://img.shields.io/crates/v/simstring_rust.svg)](https://crates.io/crates/simstring_rust)\n[![Documentation](https://docs.rs/simstring_rust/badge.svg)](https://docs.rs/simstring_rust)\n[![Rust](https://img.shields.io/badge/rust-1.63.0%2B-blue.svg?maxAge=3600)](https://github.com/PyDataBlog/simstring_rs)\n\nA native Rust implementation of the CPMerge algorithm, designed for approximate string matching. This crate is particularly useful for natural language processing tasks that require retrieving strings from very large text corpora. 
Currently, this crate supports both character-based and word-based N-gram feature generation, with plans to allow custom user-defined feature generation methods.\n\n## Features\n\n- ✅ Fast algorithm for string matching\n- ✅ 100% exact retrieval\n- ✅ Support for Unicode\n- [ ] Support for building databases directly from text files\n- [ ] MeCab-based tokenizer support\n\n## Supported String Similarity Measures\n\n- ✅ Dice coefficient\n- ✅ Jaccard coefficient\n- ✅ Cosine coefficient\n- ✅ Overlap coefficient\n- ✅ Exact match\n\n## Installation\n\nAdd `simstring_rust` to your `Cargo.toml`:\n\n```toml\n[dependencies]\nsimstring_rust = \"0.1.0\" # change version accordingly\n```\n\nFor the latest features, depend on the `main` branch of the Git repository:\n\n```toml\n[dependencies]\nsimstring_rust = { git = \"https://github.com/PyDataBlog/simstring_rs.git\", branch = \"main\" }\n```\n\nNote: The `main` branch may include experimental features and breaking changes. Use with caution!\n\nTo revert to a stable release, make sure your `Cargo.toml` specifies a version number instead of the Git repository.\n\n## Usage\n\nHere is a basic example of how to use `simstring_rust` in your Rust project:\n\n```rust\nuse simstring_rust::database::HashDB;\nuse simstring_rust::extractors::CharacterNGrams;\nuse simstring_rust::measures::Cosine;\n\nfn main() {\n    let feature_extractor = CharacterNGrams {\n        n: 2,\n        padder: \" \".to_string(),\n    };\n    let measure = Cosine::new();\n    let mut db = HashDB::new(feature_extractor, measure);\n\n    db.insert(\"hello\".to_string());\n    db.insert(\"help\".to_string());\n    db.insert(\"halo\".to_string());\n    db.insert(\"world\".to_string());\n\n    let threshold = 0.5;\n    let results = db.search(\"hell\", threshold);\n\n    if results.is_empty() {\n        println!(\"No results found with threshold {}\", threshold);\n    } else {\n        println!(\"Results with threshold {}:\", threshold);\n        for 
result in results {\n            println!(\"Match: '{}' (score: {})\", result.value, result.score);\n        }\n    }\n}\n```\n\n## Contributing\n\nContributions are welcome! Please open an issue or submit a pull request on GitHub.\n\n## License\n\nThis project is licensed under the MIT License.\n\n## Acknowledgements\n\nInspired by the [SimString.jl](https://github.com/PyDataBlog/SimString.jl) project.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpydatablog%2Fsimstring_rs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpydatablog%2Fsimstring_rs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpydatablog%2Fsimstring_rs/lists"}