{"id":13771293,"url":"https://github.com/sileod/tasksource","last_synced_at":"2026-01-26T15:11:55.806Z","repository":{"id":65043388,"uuid":"575023263","full_name":"sileod/tasksource","owner":"sileod","description":"Datasets collection and preprocessings framework for NLP extreme multitask learning","archived":false,"fork":false,"pushed_at":"2025-07-09T13:04:28.000Z","size":385,"stargazers_count":186,"open_issues_count":4,"forks_count":11,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-09-29T19:32:48.253Z","etag":null,"topics":["benchmark","bigbench","crossfit","curated-datasets","dataset-collection","discriminative","extreme-mtl","extreme-multi-task-learning","glue","huggingface","instruction-tuning","meta-learning","multi-task-learning","multi-task-learning-scaling","natural-language-inference","nlp","preprocessings","reward-modeling","scaling","text-classification"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc-by-4.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sileod.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2022-12-06T15:29:02.000Z","updated_at":"2025-08-05T07:36:19.000Z","dependencies_parsed_at":"2024-05-03T12:45:16.539Z","dependency_job_id":"a177208d-3f34-4f88-bbbb-effbdd478a66","html_url":"https://github.com/sileod/tasksource","commit_stats":{"total_commits":132,"total_committers":2,"mean_commits":66.0,"dds":"0.037878787878787845","last_synced_commit":"85fda12ee1e81c86149497
e076efc2c633f02487"},"previous_names":[],"tags_count":48,"template":false,"template_full_name":null,"purl":"pkg:github/sileod/tasksource","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sileod%2Ftasksource","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sileod%2Ftasksource/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sileod%2Ftasksource/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sileod%2Ftasksource/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sileod","download_url":"https://codeload.github.com/sileod/tasksource/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sileod%2Ftasksource/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28781308,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-26T13:55:28.044Z","status":"ssl_error","status_checked_at":"2026-01-26T13:55:26.068Z","response_time":59,"last_error":"SSL_read: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","bigbench","crossfit","curated-datasets","dataset-collection","discriminative","extreme-mtl","extreme-multi-task-learning","glue","huggingface","instruction-tuning","meta-learning","multi-task-learning","multi-task-learning-scaling","natural-language-inference","nlp","preprocessings","reward-modeling","scaling","text-classification"],"created_at":"2024-08-03T17:00:49.952Z","updated_at":"2026-01-26T15:11:55.791Z","avatar_url":"https://github.com/sileod.png","language":"Python","funding_links":[],"categories":["Benchmark \u0026 Dataset","Python"],"sub_categories":["NLP"],"readme":"## tasksource ![](https://aeiljuispo.cloudimg.io/v7/https://s3.amazonaws.com/moonup/production/uploads/5fc0bcb41160c47d1d43856b/j06-U5e2Tifi2xOnTudqS.jpeg?w=20\u0026h=20\u0026f=face) 600+ curated datasets and preprocessings for instant and interchangeable use\n\nHuggingface Datasets is an excellent library, but it lacks standardization, and datasets often require preprocessing work to be used interchangeably.\n`tasksource` streamlines interchangeable datasets usage to scale evaluation or multi-task learning.\n\nEach dataset is standardized to a `MultipleChoice`, `Classification`, or `TokenClassification` template with canonical fields. We focus on discriminative tasks (= with negative examples or classes) for our annotations but also provide a `SequenceToSequence` template. 
All implemented preprocessings are in [tasks.py](https://github.com/sileod/tasksource/blob/main/src/tasksource/tasks.py) or [tasks.md](https://github.com/sileod/tasksource/blob/main/tasks.md). A preprocessing is a function that accepts a dataset and returns the standardized dataset. Preprocessing code is concise and human-readable.\n\n### Installation and usage:\n`pip install tasksource`\n```python\nfrom tasksource import list_tasks, load_task\ndf = list_tasks(multilingual=False) # takes some time\n\n# iterate over task ids; `task_id` avoids shadowing the built-in `id`\nfor task_id in df[df.task_type==\"MultipleChoice\"].id:\n    dataset = load_task(task_id) # all yielded datasets can be used interchangeably\n```\n\nBrowse the 500+ curated tasks in tasks.md (200+ MultipleChoice tasks, 200+ Classification tasks), and feel free to request a new task. Datasets are downloaded to `$HF_DATASETS_CACHE` (like any Hugging Face dataset), so ensure you have more than 100GB of space available.\n\nYou can also load a preprocessed task directly with Hugging Face Datasets:\n```python\nfrom datasets import load_dataset\nload_dataset(\"tasksource/data\", \"glue/rte\", max_rows=30_000)\n```\n\n### Pretrained models:\n\nA text encoder pretrained on tasksource reached state-of-the-art results: [🤗/deberta-v3-base-tasksource-nli](https://hf.co/sileod/deberta-v3-base-tasksource-nli)\n\nTasksource pretraining is notably helpful for RLHF reward modeling or any kind of classification, including zero-shot. 
You can also find a large and a multilingual version.\n\n### tasksource-instruct\n\nThe repo also contains recasting code to convert tasksource datasets to instructions, providing one of the richest instruction-tuning datasets:\n[🤗/tasksource-instruct-v0](https://hf.co/datasets/tasksource/tasksource-instruct-v0)\n\n\n### tasksource-label-nli\n\nWe also recast all classification tasks as natural language inference to improve entailment-based zero-shot classification:\n[🤗/zero-shot-label-nli](https://huggingface.co/datasets/tasksource/zero-shot-label-nli)\n\n### Write and use custom preprocessings\n\n```python\nfrom tasksource import MultipleChoice\n\ncodah = MultipleChoice('question_propmt',choices_list='candidate_answers',\n    labels='correct_answer_idx',\n    dataset_name='codah', config_name='codah')\n\nwinogrande = MultipleChoice('sentence',['option1','option2'],'answer',\n    dataset_name='winogrande',config_name='winogrande_xl',\n    splits=['train','validation',None]) # test labels are not usable\n\ntasks = [winogrande.load(), codah.load()] # Aligned datasets (same columns) can be used interchangeably\n```\n\n### Citation and contact\n\nFor more details, refer to this [article](https://arxiv.org/abs/2301.05948):\n```bib\n@inproceedings{sileo-2024-tasksource,\n    title = \"tasksource: A Large Collection of {NLP} tasks with a Structured Dataset Preprocessing Framework\",\n    author = \"Sileo, Damien\",\n    booktitle = \"Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)\",\n    month = may,\n    year = \"2024\",\n    address = \"Torino, Italia\",\n    publisher = \"ELRA and ICCL\",\n    url = \"https://aclanthology.org/2024.lrec-main.1361\",\n    pages = \"15655--15684\",\n}\n```\nFor help integrating tasksource into your experiments, please contact [damien.sileo@inria.fr](mailto:damien.sileo@inria.fr).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsileod%2Ftasksource","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsileod%2Ftasksource","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsileod%2Ftasksource/lists"}