https://github.com/ndamulelonemakh/our-stopwords
Auto-generated stopwords for South African Bantu Languages
african-languages africanlp dataset low-resource-languages natural-language-processing nlp stopwords tshivenda
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/ndamulelonemakh/our-stopwords
- Owner: ndamulelonemakh
- Created: 2024-06-01T14:35:15.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2024-06-01T16:16:13.000Z (7 months ago)
- Last Synced: 2024-06-02T16:51:25.359Z (7 months ago)
- Topics: african-languages, africanlp, dataset, low-resource-languages, natural-language-processing, nlp, stopwords, tshivenda
- Language: Python
- Homepage:
- Size: 23.4 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
# Our Stopwords
* Auto-generated stopwords for South African Languages
## Introduction
* We present a list of stopwords that were auto-translated from English and adapted to native [South African Bantu Languages](https://pubs.cs.uct.ac.za/id/eprint/1334/1/icadl_2019_banturecognition.pdf).
## Usage
* The data is provided in [JSON Lines](https://jsonlines.org/) format. Here is an example of using the stopwords in Python:
### Python Example
```python
import json

# Load the stop words from the JSON Lines file
stop_words = []
with open('za_stopwords.main.jsonl', 'r', encoding='utf-8') as file:
    for line in file:
        stop_words.append(json.loads(line.strip()))

# Example: Print stop words in Zulu
for word in stop_words:
    print(f"English: {word['eng']}, Zulu: {word['zul']}")
```
> Refer to [`training_example.py`](./training_example.py) for a full working example of how to use these stopwords in your model training scripts.
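
A common next step is to filter tokens against the stopwords of one language. The following is a minimal sketch, not taken from the repository: it assumes each JSON Lines record is an object keyed by language code (as with `eng` and `zul` above), and the helper names `load_stopword_set` and `remove_stopwords` are hypothetical. The sample records are placeholders, not real entries from `za_stopwords.main.jsonl`.

```python
import io
import json


def load_stopword_set(lines, lang):
    """Collect the non-empty entries for one language code into a lowercase set."""
    words = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        value = record.get(lang)  # assumes records are keyed by language code
        if value:
            words.add(value.lower())
    return words


def remove_stopwords(tokens, stopwords):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]


# Placeholder records, NOT real entries from za_stopwords.main.jsonl.
sample = io.StringIO(
    '{"eng": "the", "zul": "zul_stop_a"}\n'
    '{"eng": "and", "zul": "zul_stop_b"}\n'
)
zul_stopwords = load_stopword_set(sample, "zul")
filtered = remove_stopwords(
    ["zul_stop_a", "content", "Zul_Stop_B", "word"], zul_stopwords
)
print(filtered)
```

To use the real data, pass an open file handle instead of the `io.StringIO` sample, for example `open('za_stopwords.main.jsonl', encoding='utf-8')`. A set is used so that each membership check is a constant-time lookup.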
## Supported Languages
* ven - Tshivenda
* tso - Xitsonga
* sot - Southern Sotho
* nso - Northern Sotho
* tsn - Setswana
* zul - IsiZulu
* xho - IsiXhosa
* ...
* Coming soon:
  * nbl
  * ssw
## Contributions
* Feel free to create a pull request if you have a suggestion
## License
* This work is published under the CC BY-NC 4.0 license
## Citation
```bibtex
@misc{multilingual_stop_words,
author = {Ndamulelo Nemakhavhani},
title = {Autogenerated Stop Words for South African Bantu Languages},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ndamulelonemakh/our-stopwords}}
}
```
## Contact
* If you have any questions or suggestions, please feel free to open an issue or [contact us](mailto:endeesa@yahoo.com).