https://github.com/yutanagano/symscan

Fast discovery of similar strings in bulk
edit-distance levenshtein-distance string-matching string-search string-similarity

Last synced: 6 months ago

Host: GitHub
URL: https://github.com/yutanagano/symscan
Owner: yutanagano
License: apache-2.0
Created: 2024-11-03T17:42:28.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-12-12T23:29:49.000Z (7 months ago)
Last Synced: 2025-12-13T00:08:34.735Z (7 months ago)
Topics: edit-distance, levenshtein-distance, string-matching, string-search, string-similarity
Language: Rust
Homepage:
Size: 10.5 MB
Stars: 1
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE-APACHE

README

# SymScan

### Check out the [documentation page](https://symscan.readthedocs.io).

**SymScan** enables extremely fast discovery of pairs of similar strings within
and across large collections.

SymScan is a variation on the
[symmetric deletion](https://seekstorm.com/blog/1000x-spelling-correction/)
algorithm, optimised for bulk-searching similar strings within one large
string collection or across two at once (e.g. searching for similar protein
sequences among a collection of 10M). The key algorithmic difference between
SymScan and traditional symmetric deletion is the use of a [sort-merge
join](https://en.wikipedia.org/wiki/Sort-merge_join) approach in place of hash
maps to discover input strings that share common deletion variants. This
sort-and-scan approach costs an additional factor of O(log N) in expected time
complexity (with N the total number of strings being compared), but in return
it gains better cache locality and parallelises effectively, and ends up being
much faster for the above use case.
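
The sort-and-scan idea can be illustrated with a minimal, single-threaded
Python sketch. This is not SymScan's implementation or API: the function names
(`deletion_variants`, `levenshtein`, `similar_pairs`) are hypothetical, and the
sketch assumes a single collection and a Levenshtein-distance threshold. It
generates deletion variants, sorts the `(variant, index)` records so that
strings sharing a variant become adjacent, scans the sorted runs for candidate
pairs, and then verifies each candidate.

```python
from itertools import combinations


def deletion_variants(s: str, max_deletions: int) -> set[str]:
    """Every string obtainable from s by deleting up to max_deletions characters."""
    variants = {s}
    for k in range(1, min(max_deletions, len(s)) + 1):
        for keep in combinations(range(len(s)), len(s) - k):
            variants.add("".join(s[i] for i in keep))
    return variants


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance, used to verify candidate pairs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def similar_pairs(strings: list[str], max_distance: int = 1) -> list[tuple[int, int, int]]:
    """Return (i, j, distance) for all i < j with distance <= max_distance."""
    # 1. Emit one (variant, index) record per deletion variant of each string.
    records = [
        (variant, index)
        for index, s in enumerate(strings)
        for variant in deletion_variants(s, max_distance)
    ]
    # 2. Sort so that records sharing a variant sit next to each other.
    records.sort()
    # 3. Scan each run of equal variants and collect candidate index pairs.
    candidates: set[tuple[int, int]] = set()
    start = 0
    while start < len(records):
        end = start
        while end < len(records) and records[end][0] == records[start][0]:
            end += 1
        run = [index for _, index in records[start:end]]
        candidates.update(combinations(run, 2))
        start = end
    # 4. Sharing a variant only bounds the distance by 2 * max_distance,
    #    so verify each candidate with the true edit distance.
    hits = []
    for i, j in sorted(candidates):
        distance = levenshtein(strings[i], strings[j])
        if distance <= max_distance:
            hits.append((i, j, distance))
    return hits


print(similar_pairs(["kitten", "sitten", "mitten", "banana"], max_distance=1))
# prints [(0, 1, 1), (0, 2, 1), (1, 2, 1)]
```

The sort in step 2 is where the extra O(log N) factor comes from; the scan in
step 3 then reads memory sequentially instead of probing a hash map.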

## Installing

### CLI

```sh
brew install yutanagano/tap/symscan-cli
```

### Rust library

```sh
cargo add symscan
```

### Python package

```sh
pip install symscan
```

## Licensing

SymScan is dual-licensed under the MIT and Apache 2.0 licenses. Unless
explicitly stated otherwise, any contribution submitted by you, as defined in
the Apache license, shall be dual-licensed as above, without any additional
terms and conditions.
