https://github.com/andythefactory/ro-diacritics
Python package for Romanian diacritics restoration
- Host: GitHub
- URL: https://github.com/andythefactory/ro-diacritics
- Owner: AndyTheFactory
- License: mit
- Created: 2022-04-07T19:43:03.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2024-01-03T21:26:12.000Z (over 2 years ago)
- Last Synced: 2024-10-11T09:26:34.928Z (over 1 year ago)
- Topics: bert, diacritics, diacritics-removal, diacritics-restoration, nlp, romanian, romanian-bert, romanian-diacritics-restoration, romanian-language, transformers, transformers-models
- Language: Python
- Homepage:
- Size: 43 KB
- Stars: 4
- Watchers: 5
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
# RO Diacritics module
**RO Diacritics** is a straightforward diacritics restoration module for the Romanian language. Pass a string written without diacritics to `restore_diacritics` to get the restored text:
```python
from ro_diacritics import restore_diacritics
print(restore_diacritics("fara poezie, viata e pustiu"))
```
The same function can be applied to a column of a pandas DataFrame:
```python
import pandas as pd
from ro_diacritics import restore_diacritics
df = pd.DataFrame({'text': ["fara poezie, viata e pustiu"]})  # sample data
df['text-diacritice'] = df['text'].apply(restore_diacritics)
```
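To try the module on your own text, it can help to start from a sentence that already has diacritics, strip them, and compare the restored output with the original. The following is a minimal sketch using only the standard library. `strip_diacritics` and `char_accuracy` are hypothetical helper names, not part of `ro_diacritics`, and the comparison assumes the restored string has the same length as the reference, which this README does not state.

```python
# Hypothetical helpers for a quick sanity check; not part of ro_diacritics.

# Map Romanian diacritics (comma-below and legacy cedilla forms) to base letters.
_RO_MAP = str.maketrans({
    "ă": "a", "â": "a", "î": "i", "ș": "s", "ț": "t",
    "Ă": "A", "Â": "A", "Î": "I", "Ș": "S", "Ț": "T",
    "ş": "s", "ţ": "t", "Ş": "S", "Ţ": "T",
})


def strip_diacritics(text: str) -> str:
    """Return text with Romanian diacritics replaced by plain ASCII letters."""
    return text.translate(_RO_MAP)


def char_accuracy(reference: str, restored: str) -> float:
    """Fraction of character positions where restored matches reference.

    Assumes both strings have the same length; raises ValueError otherwise.
    """
    if len(reference) != len(restored):
        raise ValueError("strings differ in length; cannot compare by position")
    if not reference:
        return 1.0
    matches = sum(a == b for a, b in zip(reference, restored))
    return matches / len(reference)
```

With the package installed, `char_accuracy(reference, restore_diacritics(strip_diacritics(reference)))` would give a rough score for a single sentence.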
## Installing
```console
$ python -m pip install ro-diacritics
```
or
```console
$ pip install ro-diacritics
```
## Requirements
* torch and torchtext
* numpy
* nltk and scikit-learn (for training)
* the NLTK `punkt` tokenizer data, downloaded with `nltk.download('punkt')`
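The requirements above can be checked before first use. The sketch below is one possible way to do it: `missing_requirements` and `ensure_punkt` are hypothetical helper names, and the module list assumes the usual import names (`sklearn` for scikit-learn).

```python
import importlib.util


def missing_requirements(names=("torch", "torchtext", "numpy", "nltk", "sklearn")):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def ensure_punkt():
    """Download the NLTK 'punkt' tokenizer data if it is not already present."""
    import nltk  # imported here so the check above works without nltk installed

    try:
        nltk.data.find("tokenizers/punkt")
    except LookupError:
        nltk.download("punkt")
```

Calling `missing_requirements()` returns an empty list when everything is importable; `ensure_punkt()` needs network access the first time it runs.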
## References
- Ruseti, S., Cotet, T. M., & Dascalu, M. (2020). Romanian Diacritics Restoration Using Recurrent Neural Networks. arXiv preprint arXiv:2009.02743.
- https://github.com/teodor-cotet/DiacriticsRestoration