{"id":20457876,"url":"https://github.com/pythainlp/thaiqa_squad","last_synced_at":"2025-07-06T02:35:34.308Z","repository":{"id":104622725,"uuid":"319562003","full_name":"PyThaiNLP/thaiqa_squad","owner":"PyThaiNLP","description":"SQuAD version of thaiqa (https://aiforthai.in.th/corpus.php)","archived":false,"fork":false,"pushed_at":"2020-12-08T08:02:11.000Z","size":9592,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-03-05T10:48:10.778Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PyThaiNLP.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-12-08T07:36:44.000Z","updated_at":"2021-12-13T10:15:02.000Z","dependencies_parsed_at":"2023-05-31T02:23:48.467Z","dependency_job_id":null,"html_url":"https://github.com/PyThaiNLP/thaiqa_squad","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PyThaiNLP%2Fthaiqa_squad","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PyThaiNLP%2Fthaiqa_squad/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PyThaiNLP%2Fthaiqa_squad/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PyThaiNLP%2Fthaiqa_squad/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PyThaiNLP","download_url":"https://codeload.
github.com/PyThaiNLP/thaiqa_squad/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PyThaiNLP%2Fthaiqa_squad/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259200417,"owners_count":22820615,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-15T12:09:35.151Z","updated_at":"2025-06-11T04:34:36.604Z","avatar_url":"https://github.com/PyThaiNLP.png","language":"Jupyter Notebook","readme":"---\nannotations_creators:\n- expert-generated\nlanguage_creators:\n- found\nlanguages:\n- th\nlicenses:\n- cc-by-nc-sa-3.0\nmultilinguality:\n- monolingual\nsize_categories:\n- 1K\u003cn\u003c10K\nsource_datasets:\n- extended|other-thaiqa\ntask_categories:\n- question-answering\ntask_ids:\n- extractive-qa\n- open-domain-qa\n---\n\n# Dataset Card for `thaiqa-squad`\n\n## Table of Contents\n- [Dataset Description](#dataset-description)\n  - [Dataset Summary](#dataset-summary)\n  - [Supported Tasks](#supported-tasks-and-leaderboards)\n  - [Languages](#languages)\n- [Dataset Structure](#dataset-structure)\n  - [Data Instances](#data-instances)\n  - [Data Fields](#data-fields)\n  - [Data Splits](#data-splits)\n- [Dataset Creation](#dataset-creation)\n  - [Curation Rationale](#curation-rationale)\n  - [Source Data](#source-data)\n  - [Annotations](#annotations)\n  - [Personal and Sensitive Information](#personal-and-sensitive-information)\n- [Considerations for Using the Data](#considerations-for-using-the-data)\n  - [Social Impact of Dataset](#social-impact-of-dataset)\n  - 
[Discussion of Biases](#discussion-of-biases)\n  - [Other Known Limitations](#other-known-limitations)\n- [Additional Information](#additional-information)\n  - [Dataset Curators](#dataset-curators)\n  - [Licensing Information](#licensing-information)\n  - [Citation Information](#citation-information)\n\n## Dataset Description\n\n- **Homepage:** http://github.com/pythainlp/thaiqa_squad (original `thaiqa` at https://aiforthai.in.th/)\n- **Repository:** http://github.com/pythainlp/thaiqa_squad\n- **Paper:**\n- **Leaderboard:**\n- **Point of Contact:** http://github.com/pythainlp/ (original `thaiqa` at https://aiforthai.in.th/)\n\n### Dataset Summary\n\n`thaiqa_squad` is an open-domain, extractive question answering dataset (4,000 questions in `train` and 74 questions in `dev`) in [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) format, originally created by [NECTEC](https://www.nectec.or.th/en/) from Wikipedia articles and adapted to [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) format by [PyThaiNLP](https://github.com/PyThaiNLP/).\n\n### Supported Tasks and Leaderboards\n\nextractive question answering\n\n### Languages\n\nThai\n\n## Dataset Structure\n\n### Data Instances\n\n[More Information Needed]\n\n### Data Fields\n\n[More Information Needed]\n\n### Data Splits\n\n|                         | train       | valid       |\n|-------------------------|-------------|-------------|\n| # questions             | 4000        | 74          |\n| # avg words in context  | 1186.740750 | 1016.459459 |\n| # avg words in question | 14.325500   | 12.743243   |\n| # avg words in answer   | 3.279750    | 4.608108    |\n\n## Dataset Creation\n\n### Curation Rationale\n\n[PyThaiNLP](https://github.com/PyThaiNLP/) created `thaiqa_squad` as a [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) version of [thaiqa](http://copycatch.in.th/thai-qa-task.html). 
[thaiqa](https://aiforthai.in.th/corpus.php) is part of [The 2nd Question answering program from Thai Wikipedia](http://copycatch.in.th/thai-qa-task.html) of [National Software Contest 2020](http://nsc.siit.tu.ac.th/GENA2/login.php).\n\n### Source Data\n\n#### Initial Data Collection and Normalization\n\n[More Information Needed]\n\n#### Who are the source language producers?\n\nWikipedia authors for contexts and [NECTEC](https://www.nectec.or.th/en/) for questions and answer annotations\n\n### Annotations\n\n#### Annotation process\n\n[More Information Needed]\n\n#### Who are the annotators?\n\n[NECTEC](https://www.nectec.or.th/en/)\n\n### Personal and Sensitive Information\n\nAll contents are from Wikipedia. No personal or sensitive information is expected to be included.\n\n## Considerations for Using the Data\n\n### Social Impact of Dataset\n\n- open-domain, extractive question answering in Thai\n\n### Discussion of Biases\n\n[More Information Needed]\n\n### Other Known Limitations\n\n- The contexts include `\u003cdoc\u003e` tags at the start and at the end\n\n## Additional Information\n\n### Dataset Curators\n\n[NECTEC](https://www.nectec.or.th/en/) for original [thaiqa](https://aiforthai.in.th/corpus.php). SQuAD formatting by [PyThaiNLP](https://github.com/PyThaiNLP/).\n\n### Licensing Information\n\nCC-BY-NC-SA 3.0\n\n### Citation Information\n\n[More Information Needed]\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpythainlp%2Fthaiqa_squad","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpythainlp%2Fthaiqa_squad","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpythainlp%2Fthaiqa_squad/lists"}