https://github.com/pythainlp/thairath-228k
A Large Dataset for Thai Text Summarization from thairath.co.th
- Host: GitHub
- URL: https://github.com/pythainlp/thairath-228k
- Owner: PyThaiNLP
- License: apache-2.0
- Created: 2019-10-31T08:03:49.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2019-10-30T21:32:00.000Z (over 6 years ago)
- Last Synced: 2025-10-08T22:13:04.053Z (6 months ago)
- Homepage:
- Size: 364 KB
- Stars: 0
- Watchers: 0
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# thairath-228k
**A Large Dataset for Thai Text Summarization from thairath.co.th.**
Download the dataset [here](https://dl.orangedox.com/OKBIWku5Nv6gi2LBkH).
The `thairath-228k` dataset was crawled from the news site [Thairath](https://www.thairath.co.th/home "Thairath"). It was scraped specifically for evaluating Thai NLP tasks, especially text summarization and classification benchmarks. We filtered out articles that match at least one of the following conditions:
- The article is tagged with any of the following: `นิยาย` (novel), `อินสตราแกรมดารา` (celebrity Instagram), `คลิปสุดฮา` (funny clip), `สรุปข่าว` (highlight news), `ดวง` (horoscope).
- The article body contains fewer than 230 words.
- The summary contains fewer than 8 words.
- The abstractedness of the summary at the 1-gram level is less than 65%.
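
The four conditions above can be sketched as a single predicate. This is only an illustration, not the authors' actual cleaning script: the record keys (`tags`, `body_tokens`, `summary_tokens`) are hypothetical, the text is assumed to be word-segmented already, and "abstractedness at 1-grams" is read here as the share of unique summary words that do not occur in the article body.

```python
# Tags whose articles are dropped (taken from the list above).
EXCLUDED_TAGS = {"นิยาย", "อินสตราแกรมดารา", "คลิปสุดฮา", "สรุปข่าว", "ดวง"}

MIN_BODY_WORDS = 230
MIN_SUMMARY_WORDS = 8
MIN_UNIGRAM_ABSTRACTEDNESS = 0.65


def unigram_abstractedness(summary_tokens, body_tokens):
    """Share of unique summary words that never appear in the body (assumed definition)."""
    summary_vocab = set(summary_tokens)
    if not summary_vocab:
        return 0.0
    novel = summary_vocab - set(body_tokens)
    return len(novel) / len(summary_vocab)


def keep_article(article):
    """Return True if the article matches none of the four filter conditions.

    `article` is a dict with hypothetical keys: `tags`, `body_tokens`, `summary_tokens`.
    """
    if EXCLUDED_TAGS & set(article["tags"]):
        return False
    if len(article["body_tokens"]) < MIN_BODY_WORDS:
        return False
    if len(article["summary_tokens"]) < MIN_SUMMARY_WORDS:
        return False
    score = unigram_abstractedness(article["summary_tokens"], article["body_tokens"])
    return score >= MIN_UNIGRAM_ABSTRACTEDNESS
```

Because an article is removed as soon as one condition matches, the checks are ordered from cheapest to most expensive.
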
After filtering, the dataset contains 228,937 articles with 388,383 tags, dated from October 1, 2014 to October 21, 2019. It was crawled and cleaned by [Nakhun Chumpolsathien](https://github.com/nakhunchumpolsathien) and [Tanachat Arayachutinan](https://github.com/caramelWaffle). A preliminary exploration is available in `exploration.ipynb`.
### `thairath-228k` Dataset Statistics
| Properties | Value |
| :--------- | -----:|
| Dataset Size | 228,937 |
| Average Article Length | 478.44 |
| Average Summary Length | 46.54 |
| Average Title Length | 12.43 |
| Unique Tag Size | 388,383 |
| Vocabulary Size | To be updated |
### Level of Abstractedness
Abstractedness is measured as the proportion of unique n-grams in the reference summary that do not appear in the article. We compare the abstractedness of `thairath-228k` with that of the `CNN/Daily Mail` and `WikiHow` datasets. The comparison is shown in the figure below.

> ※ The abstractedness at sentence level of `thairath-228k` is to be updated.
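
As a rough illustration of the n-gram measure described in this section, the sketch below computes the percentage of unique summary n-grams that are absent from the article. The function names are hypothetical, and the inputs are assumed to be lists of word tokens. Thai text has no spaces between words, so it would need word segmentation first (for example with PyThaiNLP's `word_tokenize`); the English tokens here are stand-ins only.

```python
def ngrams(tokens, n):
    """Return the set of unique n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def abstractedness(summary_tokens, article_tokens, n=1):
    """Percentage of unique summary n-grams that do not occur in the article."""
    summary_ngrams = ngrams(summary_tokens, n)
    if not summary_ngrams:
        return 0.0
    novel = summary_ngrams - ngrams(article_tokens, n)
    return 100.0 * len(novel) / len(summary_ngrams)


# Toy example with stand-in tokens.
article = "a b c d e".split()
summary = "a b x e".split()
print(abstractedness(summary, article, n=1))  # 25.0 (only "x" is new)
print(round(abstractedness(summary, article, n=2), 2))  # 66.67 (2 of 3 bigrams are new)
```

Higher values of `n` generally give higher scores, since a longer n-gram is less likely to be copied verbatim from the article.
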
### Experimental Results
#### Classification-benchmarks
>※ To be updated
#### Thai Text Summarization
>※ To be updated