# Thai Text Classification Benchmarks

We provide four datasets for Thai text classification with different styles, objectives, and numbers of labels. We also provide preliminary benchmarks using [fastText](https://fasttext.cc), linear models (LinearSVC and logistic regression), and [thai2fit](https://github.com/cstorm125/thai2fit)'s implementation of [ULMFit](https://arxiv.org/abs/1801.06146).

`prachathai-67k`, `truevoice-intent`, and all code in this repository are released under the Apache License 2.0 by [PyThaiNLP](https://github.com/PyThaiNLP/). `wisesight-sentiment` is released into the public domain under the Creative Commons Zero v1.0 Universal license by [Wisesight](https://wisesight.com/).
`wongnai-corpus` is released under the GNU Lesser General Public License v3.0 by [Wongnai](https://www.wongnai.com/).

## Dataset Description

| Dataset | Style | Objective | Labels | Size |
|---------|-------|-----------|--------|------|
| [prachathai-67k](https://github.com/PyThaiNLP/prachathai-67k): body_text | Formal (online newspapers), news | Topic | 12 | 67k |
| [truevoice-intent](https://github.com/PyThaiNLP/truevoice-intent): destination | Informal (call center transcription), customer service | Intent | 7 | 16k |
| [wisesight-sentiment](https://github.com/PyThaiNLP/wisesight-sentiment) | Informal (social media), conversation/opinion | Sentiment | 4 | 28k |
| [wongnai-corpus](https://github.com/wongnai/wongnai-corpus) | Informal (review site), restaurant review | Sentiment | 5 | 40k |

## [prachathai-67k](https://github.com/PyThaiNLP/prachathai-67k): body_text

We benchmark [prachathai-67k](https://github.com/PyThaiNLP/prachathai-67k) using `body_text` as the text feature in a 12-label multi-label classification task. Performance is measured by macro-averaged accuracy and F1 score. The results can be reproduced with this [notebook](https://github.com/PyThaiNLP/prachathai-67k/blob/master/classification.ipynb).
We also provide per-class metrics in the notebook.

| Model | Macro-accuracy | Macro-F1 |
|-----------|----------------|----------|
| fastText | 0.9302 | 0.5529 |
| LinearSVC | 0.513277 | 0.552801 |
| **ULMFit** | **0.948737** | **0.744875** |
| [USE](https://tfhub.dev/google/universal-sentence-encoder-multilingual/3) | 0.856091 | 0.696172 |

## [truevoice-intent](https://github.com/PyThaiNLP/truevoice-intent): destination

We benchmark [truevoice-intent](https://github.com/PyThaiNLP/truevoice-intent) using `destination` as the target in a 7-class classification task. Performance is measured by micro- and macro-averaged accuracy and F1 score. The results can be reproduced with this [notebook](https://github.com/PyThaiNLP/truevoice-intent/blob/master/classification.ipynb). We also provide per-class metrics in the notebook.

| Model | Macro-accuracy | Micro-accuracy | Macro-F1 | Micro-F1 |
|-----------|----------------|----------------|----------|----------|
| **LinearSVC** | **0.957806** | **0.95747712** | **0.869411** | **0.85116993** |
| ULMFit | 0.955066 | 0.84273111 | 0.852149 | 0.84273111 |
| [BERT](https://github.com/KongpolC/thai_intent_classification_using_bert) | 0.8921 | 0.85 | 0.87 | 0.85 |
| USE | 0.943559 | 0.94355855 | 0.787686 | 0.802455 |

## [wisesight-sentiment](https://github.com/PyThaiNLP/wisesight-sentiment)

Performance on [wisesight-sentiment](https://github.com/PyThaiNLP/wisesight-sentiment) is measured on the test set of the [WISESIGHT Sentiment Analysis](https://www.kaggle.com/account/login?ReturnUrl=/t/0b22205d288143bb8672527b04690a97) Kaggle competition. The results can be reproduced with this [notebook](https://github.com/PyThaiNLP/wisesight-sentiment/blob/master/kaggle-competition/competition.ipynb).
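The tables above report both macro- and micro-averaged scores. The distinction matters for imbalanced label sets: macro-averaging weights every class equally, while micro-averaging pools all decisions. A minimal plain-Python sketch of the two (toy labels for illustration, not benchmark data or the repository's code):

```python
def per_class_f1(y_true, y_pred, cls):
    """F1 for a single class, treating `cls` as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 — rare classes count as much as frequent ones."""
    classes = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, c) for c in classes) / len(classes)

def micro_f1(y_true, y_pred):
    """Pool all decisions; for single-label multi-class this equals accuracy."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 0, 0]
print(macro_f1(y_true, y_pred))  # 0.5333... — dragged down by class 1, which is never predicted
print(micro_f1(y_true, y_pred))  # 0.6666... — 4 of 6 predictions correct
```

This is why a model can post a high micro score yet a low macro score on skewed datasets such as `prachathai-67k`.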
**Disclaimer**: the labels were assigned manually and are prone to errors. If you plan to apply these models in a real-world application, benchmark them on **your own dataset** first.

| Model | Public Accuracy | Private Accuracy |
|---------------------|-----------------|------------------|
| Logistic Regression | 0.72781 | 0.7499 |
| fastText | 0.63144 | 0.6131 |
| ULMFit | 0.71259 | 0.74194 |
| ULMFit Semi-supervised | 0.73119 | 0.75859 |
| **[ULMFit Semi-supervised Repeated One Time](https://github.com/PyThaiNLP/wisesight-sentiment/blob/master/competition.ipynb)** | **0.73372** | **0.75968** |
| [USE](https://tfhub.dev/google/universal-sentence-encoder-multilingual/3) | 0.63987\* | |

\* Evaluated after the competition, on a test set cleaned from 3,946 rows to 2,674 rows.

## [wongnai-corpus](https://github.com/wongnai/wongnai-corpus)

Performance on [wongnai-corpus](https://github.com/wongnai/wongnai-corpus) is measured on the test set of the [Wongnai Challenge: Review Rating Prediction](https://www.kaggle.com/account/login?ReturnUrl=%2Ft%2F5db04b4da3264e1091d83463b110153b) Kaggle competition.
The results can be reproduced with this [notebook](https://github.com/cstorm125/thai2fit/blob/master/wongnai_cls/classification.ipynb).

| Model | Public Micro-F1 | Private Micro-F1 |
|-----------|-----------------|------------------|
| [**ULMFit Knight**](https://www.facebook.com/photo.php?fbid=10215789035573261&set=pcb.795048317543327&type=3&theater&ifg=1) | **0.61109** | **0.62580** |
| [ULMFit](https://github.com/cstorm125/thai2fit/) | 0.59313 | 0.60322 |
| fastText | 0.5145 | 0.5109 |
| LinearSVC | 0.5022 | 0.4976 |
| Kaggle Score | 0.59139 | 0.58139 |
| [BERT](https://github.com/ThAIKeras/bert) | 0.56612 | 0.57057 |
| [USE](https://tfhub.dev/google/universal-sentence-encoder-multilingual/3) | 0.42688 | 0.41031 |

## BibTeX

```
@software{cstorm125_2020_3852912,
  author       = {cstorm125 and
                  lukkiddd},
  title        = {PyThaiNLP/classification-benchmarks: v0.1-alpha},
  month        = may,
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v0.1-alpha},
  doi          = {10.5281/zenodo.3852912},
  url          = {https://doi.org/10.5281/zenodo.3852912}
}
```

## Acknowledgements

* [Ekapol Chuangsuwanich](https://github.com/ekapolc) for pioneering [wongnai-corpus](https://github.com/wongnai/wongnai-corpus), [wisesight-sentiment](https://github.com/PyThaiNLP/wisesight-sentiment), and [truevoice-intent](https://github.com/PyThaiNLP/truevoice-intent) in his [NLP classes](https://github.com/ekapolc/nlp_course) at Chulalongkorn University.
* [@lukkiddd](https://github.com/lukkiddd) for data exploration and the linear model code.
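For readers who want a starting point, here is a minimal sketch of the TF-IDF + LinearSVC baseline family that appears throughout the tables above. It uses toy English stand-ins and is not the repository's actual code; for real Thai input, which has no spaces between words, you would first segment the text, for example by passing a Thai word segmenter (such as pythainlp's `word_tokenize`) as the vectorizer's `tokenizer` argument:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, already-segmented stand-ins for Thai text; illustrative only.
texts = ["great food", "great service", "terrible food", "terrible service"]
labels = ["pos", "pos", "neg", "neg"]

# TF-IDF features feeding a linear support vector classifier.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

print(clf.predict(["great food and service"]))  # ['pos']
```

The same pipeline shape extends to the multi-class benchmarks here, since `LinearSVC` trains one-vs-rest classifiers automatically when given more than two labels.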