https://github.com/yutanagano/symscan
Fast discovery of similar strings in bulk
edit-distance levenshtein-distance string-matching string-search string-similarity
- Host: GitHub
- URL: https://github.com/yutanagano/symscan
- Owner: yutanagano
- License: apache-2.0
- Created: 2024-11-03T17:42:28.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-12-12T23:29:49.000Z (7 months ago)
- Last Synced: 2025-12-13T00:08:34.735Z (7 months ago)
- Topics: edit-distance, levenshtein-distance, string-matching, string-search, string-similarity
- Language: Rust
- Homepage:
- Size: 10.5 MB
- Stars: 1
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE-APACHE
README
# SymScan
### Check out the [documentation page](https://symscan.readthedocs.io).
**SymScan** enables extremely fast discovery of pairs of similar strings within
and across large collections.
SymScan is a variation on the
[symmetric deletion](https://seekstorm.com/blog/1000x-spelling-correction/)
algorithm that is optimised for bulk-searching similar strings within one or
across two large
string collections at once (e.g. searching for similar protein sequences among
a collection of 10M). The key algorithmic difference between SymScan and
traditional symmetric deletion is the use of a [sort-merge
join](https://en.wikipedia.org/wiki/Sort-merge_join) approach in place of hash
maps to discover input strings that share common deletion variants. This
sort-and-scan approach accepts an additional factor of O(log N) in expected
time complexity (where N is the total number of strings being compared) in
exchange for improved cache locality and effective parallelization, and ends
up being much faster for the above use case.
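
The idea described above can be illustrated with a minimal, unoptimised Python sketch. This is not SymScan's implementation or API: the function names (`deletion_variants`, `levenshtein`, `similar_pairs`) are hypothetical, the final verification step with a plain Levenshtein check is an assumption about how candidates might be confirmed, and nothing here is parallelised. It only shows how sorting (variant, index) records and scanning runs of equal variants can stand in for a hash map lookup.

```python
from itertools import combinations, groupby


def deletion_variants(s, max_deletions):
    """Return s together with every string made by deleting up to max_deletions characters."""
    variants = {s}
    for k in range(1, min(max_deletions, len(s)) + 1):
        for keep in combinations(range(len(s)), len(s) - k):
            variants.add("".join(s[i] for i in keep))
    return variants


def levenshtein(a, b):
    """Plain dynamic-programming edit distance, used here to verify candidates."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def similar_pairs(strings, max_distance=1):
    """Return sorted (i, j, distance) triples for strings within max_distance of each other."""
    # 1. Emit one (variant, index) record per deletion variant of each string.
    records = [
        (variant, idx)
        for idx, s in enumerate(strings)
        for variant in deletion_variants(s, max_distance)
    ]
    # 2. Sort so that records sharing a variant become adjacent (the "sort" step).
    records.sort()
    # 3. Scan runs of equal variants (the "merge"/scan step); strings that
    #    appear in the same run are candidate pairs.
    candidates = set()
    for _, run in groupby(records, key=lambda record: record[0]):
        indices = sorted({idx for _, idx in run})
        candidates.update(combinations(indices, 2))
    # 4. Verify each candidate, since sharing a variant does not guarantee
    #    that the pair is within max_distance.
    result = []
    for i, j in candidates:
        distance = levenshtein(strings[i], strings[j])
        if distance <= max_distance:
            result.append((i, j, distance))
    return sorted(result)


print(similar_pairs(["kitten", "sitten", "mitten", "banana"], max_distance=1))
# prints [(0, 1, 1), (0, 2, 1), (1, 2, 1)]
```

In this sketch the sort is where the extra O(log N) factor comes from, while the scan in step 3 reads the records sequentially, which is the access pattern that favours cache locality.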
## Installing
### CLI
```sh
brew install yutanagano/tap/symscan-cli
```
### Rust library
```sh
cargo add symscan
```
### Python package
```sh
pip install symscan
```
## Licensing
SymScan is dual-licensed under the MIT and Apache 2.0 licenses. Unless
explicitly stated otherwise, any contribution submitted by you, as defined in
the Apache license, shall be dual-licensed as above, without any additional
terms and conditions.