{"id":15749082,"url":"https://github.com/aashrafh/anees-dataset","last_synced_at":"2026-03-18T16:57:15.351Z","repository":{"id":49806041,"uuid":"516778599","full_name":"aashrafh/anees-dataset","owner":"aashrafh","description":"The dataset used to fine-tune the GPT-2 model used in Anees for the multi-turn dialogue generation.","archived":false,"fork":false,"pushed_at":"2022-07-27T16:32:43.000Z","size":48239,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-06T11:54:13.466Z","etag":null,"topics":["arabic-nlp","dialogue-generation","gpt-2","multi-turn","multi-turn-dialogue","nlp"],"latest_commit_sha":null,"homepage":"https://github.com/aashrafh/Anees","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aashrafh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-07-22T14:22:11.000Z","updated_at":"2023-07-12T04:07:59.000Z","dependencies_parsed_at":"2022-09-01T14:51:43.177Z","dependency_job_id":null,"html_url":"https://github.com/aashrafh/anees-dataset","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aashrafh%2Fanees-dataset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aashrafh%2Fanees-dataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aashrafh%2Fanees-dataset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aashrafh%2Fanees-dataset/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aashrafh
","download_url":"https://codeload.github.com/aashrafh/anees-dataset/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246429493,"owners_count":20775808,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arabic-nlp","dialogue-generation","gpt-2","multi-turn","multi-turn-dialogue","nlp"],"created_at":"2024-10-04T06:01:30.912Z","updated_at":"2026-01-12T06:20:29.471Z","avatar_url":"https://github.com/aashrafh.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Anees Dataset\nThe dataset used to fine-tune the GPT-2 model used in Anees for the multi-turn dialogue generation.\n\n## Introduction\nThe dataset is a combination of 4 multi-turn dialogue datasets:\n  - DailyDialog: a high-quality multi-turn open-domain English dialog dataset. 
On average, there are around 8 speaker turns per dialogue with around 15 tokens per turn.\n  - EmpatheticDialogues: a large-scale multi-turn empathetic dialogue dataset collected on Amazon Mechanical Turk, containing 24,850 one-to-one open-domain conversations.\n  - Persona-Chat: crowd-sourced dialogues where each participant plays the part of an assigned persona, and each persona has a word-distinct paraphrase.\n  - BlendedSkillTalk: an English-language dataset blending three conversation skills in balanced proportions (demonstrating knowledge, empathy, or ability to talk about oneself).\n\n## Analysis\n\n| Dataset               | # of training dialogues | # of training utterances | # of validation dialogues | # of validation utterances |\n| --------------------- | ----------------------- | ------------------------ | ------------------------- | -------------------------- |\n| DailyDialog | 11150 | 87467 | 1968 | 15512 |\n| EmpatheticDialogues | 19628 | 84674 | 3464 | 14912 |\n| Persona-Chat | 16046 | 212873 | 2832 | 37788 |\n| BlendedSkillTalk | 5786 | 76435 | 1022 | 13482 |\n| Total | 52610 | 461449 | 9286 | 81694 |\n\n## Tokenization and Translation\n  - The English dataset was tokenized using the [GPT2 Tokenizer](https://huggingface.co/gpt2).\n  - The Arabic dataset was tokenized using the [AraGPT2 Tokenizer](https://huggingface.co/aubmindlab/aragpt2-base).\n  - The translation from English to Arabic was done using [Opus-MT](https://huggingface.co/Helsinki-NLP/opus-mt-ar-en) on [Colab](https://colab.research.google.com/drive/1d-ynR5qfv22zRwRs3QNLKC8-s0ux2PRC?usp=sharing).\n  - The preprocessing and loading details of the data can be found in the [Anees repository](https://github.com/aashrafh/Anees).\n  \n## References\n  - [Li, Y., Su, H., Shen, X., Li, W., Cao, Z., \u0026 Niu, S. (2017). Dailydialog: A manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957.](https://arxiv.org/abs/1710.03957)\n  - [Rashkin, H., Smith, E. 
M., Li, M., \u0026 Boureau, Y. L. (2018). Towards empathetic open-domain conversation models: A new benchmark and dataset. arXiv preprint arXiv:1811.00207.](https://arxiv.org/abs/1811.00207)\n  - [Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., \u0026 Weston, J. (2018). Personalizing dialogue agents: I have a dog, do you have pets too?. arXiv preprint arXiv:1801.07243.](https://arxiv.org/abs/1801.07243)\n  - [Smith, E. M., Williamson, M., Shuster, K., Weston, J., \u0026 Boureau, Y. L. (2020). Can You Put it All Together: Evaluating Conversational Agents' Ability to Blend Skills. arXiv preprint arXiv:2004.08449.](https://arxiv.org/abs/2004.08449)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faashrafh%2Fanees-dataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faashrafh%2Fanees-dataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faashrafh%2Fanees-dataset/lists"}