https://github.com/dsfsi/puodata

Curated corpora for Setswana. Used to train PuoBERTa.

african-languages african-nlp corpora dsfsi-datasets natural-language-processing setswana south-africa tn tsn

# PuoData: Curated corpora for Setswana

[![arXiv](https://img.shields.io/badge/arXiv-2310.09141-b31b1b.svg)](https://arxiv.org/abs/2310.09141)

Give Feedback 📑: [DSFSI Resource Feedback Form](https://docs.google.com/forms/d/e/1FAIpQLSf7S36dyAUPx2egmXbFpnTBuzoRulhL5Elu-N1eoMhaO7v10w/formResponse)

We believe that PuoData is a valuable resource for the Setswana language community. We hope that PuoData will be used to develop new and innovative applications that benefit the Setswana-speaking community.

## Dataset Curation

| Dataset Name | Kind | Num. of Tokens |
|---|---|---|
| *PuoData* | | |
| NCHLT Setswana (Eiselen and Puttkammer, 2014) | Government Documents | 1,010,147 |
| Nalibali Setswana | Children's Books | 57,654 |
| Setswana Bible | Book(s) | 879,630 |
| SA Constitution | Official Document | 56,194 |
| Leipzig Setswana Corpus BW | Curated Dataset | 219,149 |
| Leipzig Setswana Corpus ZA | Curated Dataset | 218,037 |
| SABC Dikgang tsa Setswana FB (Facebook) | News Headlines | 167,119 |
| SABC MotswedingFM FB | Online Content | 33,092 |
| Leipzig Setswana Wiki | Online Content | 230,333 |
| Setswana Wiki | Online Content | 183,168 |
| Vukuzenzele Monolingual TSN | Government News | 157,798 |
| gov-za Cabinet speeches TSN | Government Speeches | 591,920 |
| Department Basic Education TSN | Education Material | 708,965 |
| **PuoData Total** | 25MB on disk | **4,513,206** |
| *PuoData+JW300* | | |
| JW300 Setswana | Book(s) | 19,782,122 |
| **PuoData+JW300** | 124MB on disk | **24,295,328** |
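
The totals in the table can be checked with a few lines of Python. This is a minimal sketch: the counts are copied from the table above, while the names `PUODATA_TOKENS`, `JW300_TOKENS`, and `share` are illustrative and not part of the PuoData release.

```python
# Token counts copied from the table above (source name -> number of tokens).
# The dictionary and helper names are illustrative, not part of PuoData itself.
PUODATA_TOKENS = {
    "NCHLT Setswana": 1_010_147,
    "Nalibali Setswana": 57_654,
    "Setswana Bible": 879_630,
    "SA Constitution": 56_194,
    "Leipzig Setswana Corpus BW": 219_149,
    "Leipzig Setswana Corpus ZA": 218_037,
    "SABC Dikgang tsa Setswana FB": 167_119,
    "SABC MotswedingFM FB": 33_092,
    "Leipzig Setswana Wiki": 230_333,
    "Setswana Wiki": 183_168,
    "Vukuzenzele Monolingual TSN": 157_798,
    "gov-za Cabinet speeches TSN": 591_920,
    "Department Basic Education TSN": 708_965,
}
JW300_TOKENS = 19_782_122


def share(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    return 100.0 * count / total


puodata_total = sum(PUODATA_TOKENS.values())
combined_total = puodata_total + JW300_TOKENS

print(f"PuoData total:       {puodata_total:,} tokens")
print(f"PuoData+JW300 total: {combined_total:,} tokens")
print(f"JW300 share of combined corpus: {share(JW300_TOKENS, combined_total):.1f}%")

# List the PuoData sources from largest to smallest with their share of PuoData.
for name, count in sorted(PUODATA_TOKENS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:32s} {count:>10,} {share(count, puodata_total):5.1f}%")
```

The per-source shares make it easy to see how strongly the corpus leans on a few sources, which is worth keeping in mind when comparing models trained on PuoData alone against those trained on PuoData+JW300.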

## Dataset Uses

We used this corpus to train [PuoBERTa](https://github.com/dsfsi/PuoBERTa), available on 🤗 Hugging Face at [https://huggingface.co/dsfsi/PuoBERTa](https://huggingface.co/dsfsi/PuoBERTa). It is also part of the corpus used to train [PuoBERTaJW300](https://huggingface.co/dsfsi/PuoBERTaJW300).

## Citation Information

BibTeX reference:

```
@inproceedings{marivate2023puoberta,
title = {PuoBERTa: Training and evaluation of a curated language model for Setswana},
author = {Vukosi Marivate and Moseli Mots'Oehli and Valencia Wagner and Richard Lastrucci and Isheanesu Dzingirai},
year = {2023},
booktitle= {SACAIR 2023 (To Appear)},
keywords = {NLP},
preprint_url = {https://arxiv.org/abs/2310.09141},
dataset_url = {https://github.com/dsfsi/PuoBERTa},
software_url = {https://huggingface.co/dsfsi/PuoBERTa}
}
```

## License

PuoData is licensed under CC-BY-SA-4.0. The monolingual data have different licenses, depending on the license of the source news website.
* License for Data - [CC-BY-SA-4.0](LICENSE)

## Dataset Contact

For more details, reach out or check our [website](https://dsfsi.github.io/).

Email: vukosi.marivate@cs.up.ac.za

**Enjoy exploring Setswana through AI!**