Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/divvun/divvunspell
Spell checking library for ZHFST/BHFST spellers, with case handling and tokenization support. (Spell checking derived from hfst-ospell)
box fst hfst rust spellchecking
- Host: GitHub
- URL: https://github.com/divvun/divvunspell
- Owner: divvun
- License: apache-2.0
- Created: 2017-04-11T11:43:22.000Z (almost 8 years ago)
- Default Branch: main
- Last Pushed: 2024-11-12T00:28:05.000Z (3 months ago)
- Last Synced: 2024-11-12T01:23:05.729Z (3 months ago)
- Topics: box, fst, hfst, rust, spellchecking
- Language: Rust
- Size: 793 KB
- Stars: 14
- Watchers: 3
- Forks: 7
- Open Issues: 13
Metadata Files:
- Readme: README.md
- License: LICENSE-APACHE
- Support: support/accuracy-viewer/.gitignore
Awesome Lists containing this project
- low-resource-languages - divvunspell - `hfst-ospell` (below) rewritten in Rust, for robust concurrency and memory management. In practical use it is about 10x faster than `hfst-ospell`. It uses the same zhfst files as `hfst-ospell`, which are available for all languages in the [GiellaLT](https://github.com/giellalt/) GitHub org (see below). (Software / Utilities)
README
# divvunspell
An implementation of [hfst-ospell](https://github.com/hfst/hfst-ospell) in Rust, with added features like tokenization, case handling, and parallelisation.
[![CI](https://github.com/divvun/divvunspell/actions/workflows/ci.yml/badge.svg)](https://github.com/divvun/divvunspell/actions/workflows/ci.yml)
## Building and installing commandline tools
```sh
# For the `divvunspell` binary:
cargo install divvunspell-bin

# For `thfst-tools` binary (most people can skip this one):
cargo install thfst-tools

# To build the development version from this source, cd into the relevant directory and:
cargo install --path .
```

### Building with `gpt2` support on macOS aarch64
(Skip this if you are not experimenting with gpt2 support. So skip. Now.)
Clone this repo then:
```bash
brew install libtorch
LIBTORCH=/opt/homebrew/opt/libtorch cargo build --features gpt2 --bin divvunspell
```

### No Rust?
```sh
curl https://sh.rustup.rs -sSf | sh
source $HOME/.cargo/env
rustup default stable
cargo build --release
```

### divvunspell
Usage:

```sh
Usage: divvunspell SUBCOMMAND [OPTIONS]

Optional arguments:
  -h, --help  print help message

Available subcommands:
  suggest   get suggestions for provided input
  tokenize  print input in word-separated tokenized form
  predict   predict next words using GPT2 model

$ divvunspell suggest -h
Usage: divvunspell suggest [OPTIONS]

Positional arguments:
  inputs                 words to be processed

Optional arguments:
  -h, --help             print help message
  -a, --archive ARCHIVE  BHFST or ZHFST archive to be used
  -S, --always-suggest   always show suggestions even if word is correct
  -w, --weight WEIGHT    maximum weight limit for suggestions
  -n, --nbest NBEST      maximum number of results
  --no-case-handling     disables case-handling algorithm (makes results more like hfst-ospell)
  --json                 output in JSON format
```

### accuracy
Building:
```sh
cd accuracy/
cargo install --path .
```

The resulting binary `accuracy` is placed in `$HOME/.cargo/bin/`; make sure that directory is on your `PATH`.
Usage:
```
divvunspell-accuracy 1.0.0-beta.1
Accuracy testing for DivvunSpell.

USAGE:
    accuracy [OPTIONS] [ARGS]

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -c    Provide JSON config file to override test defaults
    -o    The file path for the JSON report output
    -w    Truncate typos list to max number of words specified
    -t    The file path for the TSV line append

ARGS:
    The 'input -> expected' list in tab-delimited value file (TSV)
    Use the given ZHFST file
```

### thfst-tools
Convert hfst and zhfst files to thfst and bhfst formats.
- **thfst**: byte-aligned hfst for fast and efficient loading and memory mapping, required to run `divvunspell` on ARM processors
- **bhfst**: thfst files wrapped in a [box](https://github.com/bbqsrc/box) container; in the case of zhfst files converted to bhfst, the metadata file (`index.xml` in the zhfst archive) is converted to a json file for faster and leaner processing by the `divvunspell` library.

Usage:
```
thfst-tools 1.0.0-alpha.5
Tromsø-Helsinki Finite State Transducer toolkit.

USAGE:
    thfst-tools

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    bhfst-info         Print metadata for BHFST
    help               Prints this message or the help of the given subcommand(s)
    hfst-to-thfst      Convert an HFST file to THFST
    thfsts-to-bhfst    Convert a THFST acceptor/errmodel pair to BHFST
    zhfst-to-bhfst     Convert a ZHFST file to BHFST
```

## Speller testing
There's a prototype-level testing tool in `support/accuracy-viewer`. Use it like:
```
accuracy -o support/accuracy-viewer/public/report.json typos.txt sma.zhfst
cd support/accuracy-viewer
npm i && npm run dev
```

View in `http://localhost:5000`.
`typos.txt` is a TSV file with typos in the first column and expected correction in the second.
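
To make the `typos.txt` layout concrete, here is a minimal Rust sketch that reads such a file's contents into typo/expected pairs and scores a speller against them. It is an illustration only, not how the `accuracy` binary is implemented: the names `TypoPair`, `parse_typos` and `first_hit_rate` are hypothetical, the `suggest` closure is a stand-in for a real speller, and treating blank lines as skippable is an assumption.

```rust
/// One row of the typos file: the misspelling and the correction we expect.
/// (Hypothetical type, not part of the `divvunspell` crate.)
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TypoPair {
    pub typo: String,
    pub expected: String,
}

/// Parse tab-separated text: first column is the typo, second column is the
/// expected correction. Blank lines are skipped (an assumption of this sketch);
/// a line without two non-empty columns is reported as an error.
pub fn parse_typos(tsv: &str) -> Result<Vec<TypoPair>, String> {
    let mut pairs = Vec::new();
    for (idx, line) in tsv.lines().enumerate() {
        if line.trim().is_empty() {
            continue;
        }
        let mut cols = line.split('\t');
        match (cols.next(), cols.next()) {
            (Some(typo), Some(expected)) if !typo.is_empty() && !expected.is_empty() => {
                pairs.push(TypoPair {
                    typo: typo.to_string(),
                    expected: expected.to_string(),
                });
            }
            _ => {
                return Err(format!(
                    "line {}: expected two tab-separated columns",
                    idx + 1
                ))
            }
        }
    }
    Ok(pairs)
}

/// Fraction of pairs whose expected correction is the first suggestion.
/// `suggest` is supplied by the caller and stands in for a real speller.
pub fn first_hit_rate<F>(pairs: &[TypoPair], suggest: F) -> f64
where
    F: Fn(&str) -> Vec<String>,
{
    if pairs.is_empty() {
        return 0.0;
    }
    let hits = pairs
        .iter()
        .filter(|p| suggest(&p.typo).first() == Some(&p.expected))
        .count();
    hits as f64 / pairs.len() as f64
}
```

A caller would pass the file contents (for example from `std::fs::read_to_string("typos.txt")`) to `parse_typos`, then hand the pairs and a suggestion function to `first_hit_rate`. A real report would likely track more than first-position hits; this sketch covers only that single measure.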
For more information, run `accuracy --help`.

## License
The crate `divvunspell` is licensed under either of

* Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
* MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)

at your option.
The `divvunspell`, `thfst-tools` and `accuracy` binaries are licensed under the GPL version 3 license.