https://github.com/k4black/fast-aug
Fast Augmentation library for NLP
- Host: GitHub
- URL: https://github.com/k4black/fast-aug
- Owner: k4black
- License: mit
- Created: 2023-12-17T08:30:50.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2025-02-17T14:44:07.000Z (8 months ago)
- Last Synced: 2025-02-28T10:35:22.821Z (7 months ago)
- Topics: augmentation, natural-language-processing, nlp, python, rust, text-augmentation
- Language: Rust
- Homepage:
- Size: 802 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
# fast-aug
`fast-aug` is a library for fast text augmentation, available for both Rust and Python as `fast-aug`.
It is designed with a focus on performance and real-time usage (e.g. during training), while providing a wide range of text augmentation methods.

---
Please refer to the respective READMEs for details:
* `rust` - see [fast_aug/README.md](fast_aug/README.md);
* `python` - see [bindings/python/README.md](bindings/python/README.md).

## Features and TODO
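To make the idea behind the flow and random text augmenters in this section concrete, here is a minimal pure-Python sketch. It does not use the `fast-aug` API: every function name below (`swap_adjacent_chars`, `delete_random_word`, `chance`, `selector`, `sequential`) is a hypothetical stand-in, and the behaviour of each wrapper is only an assumption based on the names `ChanceAugmenter`, `SelectorAugmenter` and `SequentialAugmenter`. See the respective READMEs for the real interfaces.

```python
import random
from typing import Callable, Sequence

# Hypothetical stand-ins, NOT the fast-aug API: an "augmenter" here is
# simply a function (text, rng) -> text.
Augmenter = Callable[[str, random.Random], str]


def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Swap one randomly chosen pair of adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def delete_random_word(text: str, rng: random.Random) -> str:
    """Delete one random whitespace-separated word (whitespace is normalized)."""
    words = text.split()
    if len(words) < 2:
        return text
    del words[rng.randrange(len(words))]
    return " ".join(words)


def chance(augmenter: Augmenter, probability: float) -> Augmenter:
    """Assumed semantics: apply `augmenter` with the given probability."""
    def apply(text: str, rng: random.Random) -> str:
        return augmenter(text, rng) if rng.random() < probability else text
    return apply


def selector(augmenters: Sequence[Augmenter]) -> Augmenter:
    """Assumed semantics: apply exactly one augmenter, chosen uniformly."""
    def apply(text: str, rng: random.Random) -> str:
        return rng.choice(augmenters)(text, rng)
    return apply


def sequential(augmenters: Sequence[Augmenter]) -> Augmenter:
    """Assumed semantics: apply all augmenters in order."""
    def apply(text: str, rng: random.Random) -> str:
        for augmenter in augmenters:
            text = augmenter(text, rng)
        return text
    return apply


pipeline = sequential([
    chance(swap_adjacent_chars, 0.5),
    selector([delete_random_word, swap_adjacent_chars]),
])

rng = random.Random(0)  # seeded so repeated runs give the same result
print(pipeline("fast text augmentation during training", rng))
```

Passing the random generator explicitly keeps the sketch reproducible; whether and how the library exposes seeding is not covered here.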
Flow
- [x] ChanceAugmenter
- [x] SelectorAugmenter
- [x] SequentialAugmenter

Text
- [ ] RandomWordsAugmenter
  - [x] Base - swaps/deletions
  - [x] Insertions/Substitutions (from alphabet)
- [ ] RandomCharsAugmenter
  - [x] Base - swaps/deletions
  - [x] Insertions/Substitutions (from provided list)
  - [ ] Insertions/Substitutions (from vocab by language tag)
- [ ] RandomSpellingAugmenter
- [ ] RandomKeyboardAugmenter
- [ ] RandomEmbeddingsAugmenter
- [ ] RandomTfIdfAugmenter
- [ ] RandomPosAugmenter
- [ ] [EmojiNormalizer](https://github.com/unicode-org/cldr-json/blob/858baad63c1d51e1d576ef99dccc229d92cedda4/cldr-json/cldr-annotations-full/annotations/en-AU/annotations.json#L1498)
- [ ] Keep labels (e.g. POS tags) unchanged

Models and utils
- [ ] Models lazy loading
  - [ ] At creation time
  - [ ] At first use
  - [ ] Background after creation
- [ ] [candle](https://github.com/huggingface/candle) support for DL models loading
  - [ ] HF loading
  - [ ] ONNX loading
  - [ ] Optimizations (fp16/int8/int4/layers/etc)
  - [ ] GPU support
- [ ] TF-IDF model
  - [ ] json file loading
  - [ ] sklearn model loading
- [x] Alphabet model
- [ ] Language Vocab model
- [ ] Embeddings model
  - [ ] fasttext model loading
  - [ ] word2vec model loading
- [ ] WordNet model
  - [ ] English
  - [ ] German
  - [ ] More?

Rust
- [x] Formatting
  - [x] rustfmt
  - [x] clippy
- [x] [rust flamegraph profiling](https://www.jibbow.com/posts/criterion-flamegraphs/)
- [x] Unit tests
- [ ] Integration tests
- [x] CI build and tests
- [ ] CI publish to crates.io

Python
- [ ] Custom Python Augmenter class (user provided to use in pipelines)
- [ ] Bindings with
  - [x] Base pyo3 bindings
  - [x] [maturin](https://github.com/PyO3/maturin) auto build from pyproject.toml
  - [x] Stubs (.pyi) files generation
  - [ ] Auto generate stubs on maturin build
  - [x] Text
  - [x] Flow
  - [ ] Auto generate return type in stubs, see [pyo3 issue](https://github.com/PyO3/pyo3/issues/1112)
- [x] [flamegraph profiling](https://ohadravid.github.io/posts/2023-03-rusty-python/)
- [ ] Optimizations - see [this](https://ohadravid.github.io/posts/2023-03-rusty-python/)
- [ ] Integration tests
- [x] CI build and tests
- [ ] CI publish to pypi

## Development
### Prerequisites
Clone the repository:
```shell
git clone git@github.com:k4black/fast-aug.git
cd fast-aug
```

For rust library development:
* [rustup](https://rustup.rs/)
* [cargo](https://doc.rust-lang.org/cargo/getting-started/installation.html)

For python bindings development:
* All rust library prerequisites
* `cd bindings/python && python -m venv .venv`
* `pip >= 23.1` to use `--config-settings`, see [pip issue](https://github.com/pypa/pip/issues/11859)

### Make
The `Makefile` contains all the commands needed for development.
```shell
make help
```

- `*-rust` - all targets related to rust library (`fast_aug/` folder)
- `*-python` - all targets related to python bindings (`bindings/python/` folder)

### Benchmarks
All text benchmarks are run on the [tweet_eval dataset](https://github.com/cardiffnlp/tweeteval/) - sentiment task, test set, 12k rows.
```shell
cat test_data/tweet_eval_sentiment_test_text.txt | wc
12284 182576 1156877
```

## License
This project and respective libraries are licensed under the MIT License - see the [LICENSE](LICENSE) file for details.