{"id":20457872,"url":"https://github.com/pythainlp/thairath-228k","last_synced_at":"2026-02-15T03:33:16.811Z","repository":{"id":104622723,"uuid":"218714484","full_name":"PyThaiNLP/thairath-228k","owner":"PyThaiNLP","description":"A Large Dataset for Thai Text Summarization from thairath.co.th","archived":false,"fork":false,"pushed_at":"2019-10-30T21:32:00.000Z","size":373,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-10-08T22:13:04.053Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PyThaiNLP.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-10-31T08:03:49.000Z","updated_at":"2020-03-08T14:12:54.000Z","dependencies_parsed_at":"2023-05-31T01:30:28.054Z","dependency_job_id":null,"html_url":"https://github.com/PyThaiNLP/thairath-228k","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/PyThaiNLP/thairath-228k","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PyThaiNLP%2Fthairath-228k","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PyThaiNLP%2Fthairath-228k/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PyThaiNLP%2Fthairath-228k/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PyThaiNLP%2Fthairath-228k/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners
/PyThaiNLP","download_url":"https://codeload.github.com/PyThaiNLP/thairath-228k/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PyThaiNLP%2Fthairath-228k/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29466929,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-15T01:01:38.065Z","status":"online","status_checked_at":"2026-02-15T02:00:07.449Z","response_time":118,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-15T12:09:33.061Z","updated_at":"2026-02-15T03:33:16.795Z","avatar_url":"https://github.com/PyThaiNLP.png","language":null,"readme":"# thairath-228k\n**A Large Dataset for Thai Text Summarization from thairath.co.th.**\n\t Download the dataset [here](https://dl.orangedox.com/OKBIWku5Nv6gi2LBkH).\n\n\nThe `thairath-228k` dataset was crawled from the news site [Thairath](https://www.thairath.co.th/home \"Thairath\"). It was scraped specifically for evaluating Thai NLP tasks, especially text summarization and classification benchmarks. 
We filtered out articles that match at least one of the following conditions:\n- The article contains any of the following tags: `นิยาย` (novel), `อินสตราแกรมดารา` (celebrity Instagram), `คลิปสุดฮา` (funny clip), `สรุปข่าว` (highlight news), `ดวง` (horoscope).\n- The article body contains fewer than 230 words.\n- The summary contains fewer than 8 words.\n- The abstractedness of the summary at the 1-gram level is below 65%.\n\nAfter filtering, the dataset contains 228,937 articles with 388,383 tags, published from October 1, 2014 to October 21, 2019. The dataset was crawled and cleaned by [Nakhun Chumpolsathien](https://github.com/nakhunchumpolsathien) and [Tanachat Arayachutinan](https://github.com/caramelWaffle). A preliminary exploration is available in `exploration.ipynb`.\n\n### `thairath-228k` Dataset Statistics\n\n| Properties | Value |\n| :--------- | -----:|\n| Dataset Size | 228,937 |\n| Average Article Length | 478.44 |\n| Average Summary Length | 46.54 |\n| Average Title Length | 12.43 |\n| Unique Tag Size | 388,383 |\n| Vocabulary Size | To be updated |\n\n### Level of Abstractedness\nThe abstractedness of the dataset is measured by the share of unique n-grams in the reference summary that do not appear in the article. We compare the abstractedness level of the `thairath-228k` dataset with that of the `CNN/Daily Mail` and `WikiHow` datasets. The comparison is shown in the figure below.\n\n![](data/comparison.png)\n\n\u003e ※ The abstractedness at sentence level of `thairath-228k` is to be updated.\n\n### Experimental Results\n\n#### Classification Benchmarks\n \u003e※ To be updated \n#### Thai Text Summarization\n \u003e※ To be updated ","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpythainlp%2Fthairath-228k","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpythainlp%2Fthairath-228k","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpythainlp%2Fthairath-228k/lists"}