{"id":24480852,"url":"https://github.com/dsfsi/puodata","last_synced_at":"2026-01-30T09:25:47.064Z","repository":{"id":200588895,"uuid":"703986644","full_name":"dsfsi/PuoData","owner":"dsfsi","description":"Curated corpora for Setswana. Used to train PuoBERTa.","archived":false,"fork":false,"pushed_at":"2023-10-26T07:19:26.000Z","size":8720,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-01-21T11:17:27.062Z","etag":null,"topics":["african-languages","african-nlp","corpora","dsfsi-datasets","natural-language-processing","setswana","south-africa","tn","tsn"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc-by-sa-4.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dsfsi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-10-12T10:00:03.000Z","updated_at":"2024-04-06T19:38:07.000Z","dependencies_parsed_at":"2023-10-26T08:30:37.803Z","dependency_job_id":null,"html_url":"https://github.com/dsfsi/PuoData","commit_stats":null,"previous_names":["dsfsi/puodata"],"tags_count":0,"template":false,"template_full_name":"dsfsi/dsfsi-project-starter","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dsfsi%2FPuoData","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dsfsi%2FPuoData/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dsfsi%2FPuoData/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dsfsi%2FPuoData/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dsfsi","download_url":"https://codelo
ad.github.com/dsfsi/PuoData/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243624103,"owners_count":20321029,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["african-languages","african-nlp","corpora","dsfsi-datasets","natural-language-processing","setswana","south-africa","tn","tsn"],"created_at":"2025-01-21T11:17:33.110Z","updated_at":"2026-01-30T09:25:47.035Z","avatar_url":"https://github.com/dsfsi.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# PuoData: A curated corpora for Setswana\n\n[![arXiv](https://img.shields.io/badge/arXiv-2310.09141-b31b1b.svg)](https://arxiv.org/abs/2310.09141)\n\nGive Feedback 📑: [DSFSI Resource Feedback Form](https://docs.google.com/forms/d/e/1FAIpQLSf7S36dyAUPx2egmXbFpnTBuzoRulhL5Elu-N1eoMhaO7v10w/formResponse)\n\nWe believe that PuoData is a valuable resource for the Setswana language community. We hope that PuoData will be used to develop new and innovative applications that benefit the Setswana-speaking community.\n\n## Dataset Curation\n\n| Dataset Name | Kind | Num. 
of Tokens |\n|---|---|---|\n| *PuoData* |  |  |\n| NCHLT Setswana \\cite{eiselen2014developing} | Government Documents | 1,010,147 |\n| Nalibali Setswana | Children's Books | 57,654 |\n| Setswana Bible | Book(s) | 879,630 |\n| SA Constitution | Official Document | 56,194 |\n| Leipzig Setswana Corpus BW | Curated Dataset | 219,149 |\n| Leipzig Setswana Corpus ZA | Curated Dataset | 218,037 |\n| SABC Dikgang tsa Setswana FB (Facebook) | News Headlines | 167,119 |\n| SABC MotswedingFM FB | Online Content | 33,092 |\n| Leipzig Setswana Wiki | Online Content | 230,333 |\n| Setswana Wiki | Online Content | 183,168 |\n| Vukuzenzele Monolingual TSN | Government News | 157,798 |\n| gov-za Cabinet speeches TSN | Government Speeches | 591,920 |\n| Department Basic Education TSN | Education Material | 708,965 |\n| **PuoData Total** | 25MB on disk | **4,513,206** |\n| *PuoData+JW300* |  |  |\n| JW300 Setswana | Book(s) | 19,782,122 |\n| **PuoData+JW300** | 124MB on disk | **24,295,328** |\n\n## Dataset Uses\n\nWe used this corpus to train [PuoBERTa](https://github.com/dsfsi/PuoBERTa), 🤗 [https://huggingface.co/dsfsi/PuoBERTa](https://huggingface.co/dsfsi/PuoBERTa). It is also part of the corpus used for [PuoBERTaJW300](https://huggingface.co/dsfsi/PuoBERTaJW300). \n\n## Citation Information\n\nBibTeX Reference\n\n```\n@inproceedings{marivate2023puoberta,\n  title   = {PuoBERTa: Training and evaluation of a curated language model for Setswana},\n  author  = {Vukosi Marivate and Moseli Mots'Oehli and Valencia Wagner and Richard Lastrucci and Isheanesu Dzingirai},\n  year    = {2023},\n  booktitle = {SACAIR 2023 (To Appear)},\n  keywords = {NLP},\n  preprint_url = {https://arxiv.org/abs/2310.09141},\n  dataset_url = {https://github.com/dsfsi/PuoBERTa},\n  software_url = {https://huggingface.co/dsfsi/PuoBERTa}\n}\n```\n\n## License\n\nPuoData is licensed under CC-BY-SA-4.0.  
The monolingual data have different licenses, depending on the license of each source news website.\n* License for Data - [CC-BY-SA-4.0](LICENSE)\n  \n## Dataset Contact\n\nFor more details, reach out or check our [website](https://dsfsi.github.io/).\n\nEmail: vukosi.marivate@cs.up.ac.za\n\n**Enjoy exploring Setswana through AI!**\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdsfsi%2Fpuodata","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdsfsi%2Fpuodata","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdsfsi%2Fpuodata/lists"}