Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/huu4ontocord/rio

Text pre-processing for NLP datasets
https://github.com/huu4ontocord/rio

language natural-language-processing nlp

Last synced: 9 days ago
JSON representation

Text pre-processing for NLP datasets

Host: GitHub
URL: https://github.com/huu4ontocord/rio
Owner: huu4ontocord
License: apache-2.0
Created: 2022-01-12T04:16:15.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2022-12-26T23:31:02.000Z (almost 2 years ago)
Last Synced: 2024-07-29T12:16:06.058Z (4 months ago)
Topics: language, natural-language-processing, nlp
Language: Python
Homepage:
Size: 526 MB
Stars: 11
Watchers: 10
Forks: 6
Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

Awesome Lists containing this project

README

# Intro
Rio (spanish for river) is a library for using backtranslation or round trip translation to do text pre-processing, filtering, and augmentation. It is intended to be used to process text datasets for training NLP models. This is based on the original Muliwai repo but the PII code has been refactored to live in its own repo at https://www.github.com/piisa/muliwai. Rio no longer does PII processing. Please use https://www.github.com/piisa/muliwai instead.

# Installing
If you want to be able to do gender detection and coref detection, you will need to load neuralcoref below. However, you will only be able to use spacy english if you load neural coref. You can also load a larger spacy model for more accuracy but more memory.
```
git clone https://github.com/ontocord/rio
pip install https://github.com/kpu/kenlm/archive/master.zip
pip install spacy==2.1.0 regex==2022.3.2 dateparser python-stdnum protobuf neuralcoref cdifflib transformers datasets langid faker sentencepiece fsspec tqdm sentence-transformers nltk
python -m nltk.downloader punkt wordnet
```

# License
- The source code authored by Ontocord LLC and contributed by contributors of this project is licensed under Apache 2.0.

# Contributors

We welcome all contributions. Please feel free to send a PR. Please follow the code of conduct: https://github.com/ontocord/rio/blob/main/CODE_OF_CONDUCT.md
Special thanks to these people not just for code contributions but for comments and reviews (in no particular order) from the original Muliwai repo:
- @dadelani
- @edugp
- @vumichien
- @ianyu93
- @j-chim
- @justinphan3110
- @mapama247
- @paulovn
- @PierreColombo
- @piesauce
- @mmitchellai
- @shamikbose

# Acknowledgements

We heavily use the models trained by @dadelani and the excelent work by https://github.com/masakhane-io.