{"id":15014039,"url":"https://github.com/explosion/ml-datasets","last_synced_at":"2025-10-19T14:31:53.789Z","repository":{"id":36841273,"uuid":"230664027","full_name":"explosion/ml-datasets","owner":"explosion","description":"🌊 Machine learning dataset loaders for testing and example scripts","archived":false,"fork":false,"pushed_at":"2022-03-29T15:14:27.000Z","size":76,"stargazers_count":47,"open_issues_count":0,"forks_count":15,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-01-29T18:38:17.554Z","etag":null,"topics":["datasets","machine-learning","machine-learning-datasets","spacy","testing","thinc"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/explosion.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-12-28T20:37:11.000Z","updated_at":"2024-11-29T04:34:03.000Z","dependencies_parsed_at":"2022-08-08T17:31:38.556Z","dependency_job_id":null,"html_url":"https://github.com/explosion/ml-datasets","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fml-datasets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fml-datasets/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fml-datasets/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fml-datasets/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/explosion","download_url":"https://codeload.github.com/explosion/m
l-datasets/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":237152765,"owners_count":19263780,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["datasets","machine-learning","machine-learning-datasets","spacy","testing","thinc"],"created_at":"2024-09-24T19:45:06.126Z","updated_at":"2025-10-19T14:31:53.461Z","avatar_url":"https://github.com/explosion.png","language":"Python","readme":"\u003ca href=\"https://explosion.ai\"\u003e\u003cimg src=\"https://explosion.ai/assets/img/logo.svg\" width=\"125\" height=\"125\" align=\"right\" /\u003e\u003c/a\u003e\n\n# Machine learning dataset loaders for testing and examples\n\nLoaders for various machine learning datasets for testing and example scripts.\nPreviously in `thinc.extra.datasets`.\n\n[![PyPi Version](https://img.shields.io/pypi/v/ml-datasets.svg?style=flat-square\u0026logo=pypi\u0026logoColor=white)](https://pypi.python.org/pypi/ml-datasets)\n\n## Setup and installation\n\nThe package can be installed via pip:\n\n```bash\npip install ml-datasets\n```\n\n## Loaders\n\nLoaders can be imported directly or used via their string name (which is useful if they're set via command line arguments). 
Some loaders may take arguments – see the source for details.\n\n```python\n# Import directly\nfrom ml_datasets import imdb\ntrain_data, dev_data = imdb()\n```\n\n```python\n# Load via registry\nfrom ml_datasets import loaders\nimdb_loader = loaders.get(\"imdb\")\ntrain_data, dev_data = imdb_loader()\n```\n\n### Available loaders\n\n#### NLP datasets\n\n| ID / Function        | Description                                  | NLP task                                  | From URL |\n| -------------------- | -------------------------------------------- | ----------------------------------------- | :------: |\n| `imdb`               | IMDB sentiment dataset                       | Binary classification: sentiment analysis |    ✓     |\n| `dbpedia`            | DBPedia ontology dataset                     | Multi-class, single-label classification  |    ✓     |\n| `cmu`                | CMU movie genres dataset                     | Multi-class, multi-label classification   |    ✓     |\n| `quora_questions`    | Duplicate Quora questions dataset            | Detecting duplicate questions             |    ✓     |\n| `reuters`            | Reuters dataset (texts not included)         | Multi-class, multi-label classification   |    ✓     |\n| `snli`               | Stanford Natural Language Inference corpus   | Recognizing textual entailment            |    ✓     |\n| `stack_exchange`     | Stack Exchange dataset                       | Question answering                        |          |\n| `ud_ancora_pos_tags` | Universal Dependencies Spanish AnCora corpus | POS tagging                               |    ✓     |\n| `ud_ewtb_pos_tags`   | Universal Dependencies English EWT corpus    | POS tagging                               |    ✓     |\n| `wikiner`            | WikiNER data                                 | Named entity recognition                  |          |\n\n#### Other ML datasets\n\n| ID / Function | Description | ML task           | From URL |\n| ------------- 
| ----------- | ----------------- | :------: |\n| `mnist`       | MNIST data  | Image recognition |    ✓     |\n\n### Dataset details\n\n#### IMDB\n\nEach instance contains the text of a movie review, and a sentiment expressed as `0` or `1`.\n\n```python\ntrain_data, dev_data = ml_datasets.imdb()\nfor text, annot in train_data[0:5]:\n    print(f\"Review: {text}\")\n    print(f\"Sentiment: {annot}\")\n```\n\n- Download URL: [http://ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/)\n- Citation: [Andrew L. Maas et al., 2011](https://www.aclweb.org/anthology/P11-1015/)\n\n| Property            | Training         | Dev              |\n| ------------------- | ---------------- | ---------------- |\n| # Instances         | 25000            | 25000            |\n| Label values        | {`0`, `1`}       | {`0`, `1`}       |\n| Labels per instance | Single           | Single           |\n| Label distribution  | Balanced (50/50) | Balanced (50/50) |\n\n#### DBPedia\n\nEach instance contains an ontological description, and a classification into one of the 14 distinct labels.\n\n```python\ntrain_data, dev_data = ml_datasets.dbpedia()\nfor text, annot in train_data[0:5]:\n    print(f\"Text: {text}\")\n    print(f\"Category: {annot}\")\n```\n\n- Download URL: [Via fast.ai](https://course.fast.ai/datasets)\n- Original citation: [Xiang Zhang et al., 2015](https://arxiv.org/abs/1509.01626)\n\n| Property            | Training | Dev      |\n| ------------------- | -------- | -------- |\n| # Instances         | 560000   | 70000    |\n| Label values        | `1`-`14` | `1`-`14` |\n| Labels per instance | Single   | Single   |\n| Label distribution  | Balanced | Balanced |\n\n#### CMU\n\nEach instance contains a movie description, and a classification into a list of appropriate genres.\n\n```python\ntrain_data, dev_data = ml_datasets.cmu()\nfor text, annot in train_data[0:5]:\n    print(f\"Text: {text}\")\n    print(f\"Genres: {annot}\")\n```\n\n- 
Download URL: [http://www.cs.cmu.edu/~ark/personas/](http://www.cs.cmu.edu/~ark/personas/)\n- Original citation: [David Bamman et al., 2013](https://www.aclweb.org/anthology/P13-1035/)\n\n| Property            | Training                                                                                       | Dev |\n| ------------------- | ---------------------------------------------------------------------------------------------- | --- |\n| # Instances         | 41793                                                                                          | 0   |\n| Label values        | 363 different genres                                                                           | -   |\n| Labels per instance | Multiple                                                                                       | -   |\n| Label distribution  | Imbalanced: 147 labels with fewer than 20 examples, while `Drama` occurs more than 19000 times | -   |\n\n#### Quora\n\n```python\ntrain_data, dev_data = ml_datasets.quora_questions()\nfor questions, annot in train_data[0:50]:\n    q1, q2 = questions\n    print(f\"Question 1: {q1}\")\n    print(f\"Question 2: {q2}\")\n    print(f\"Similarity: {annot}\")\n```\n\nEach instance contains two Quora questions, and a label indicating whether or not they are duplicates (`0`: no, `1`: yes).\nThe ground-truth labels contain some amount of noise: they are not guaranteed to be perfect.\n\n- Download URL: [http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv](http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv)\n- Original citation: [Kornél Csernai et al., 2017](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs)\n\n| Property            | Training                  | Dev                       |\n| ------------------- | ------------------------- | ------------------------- |\n| # Instances         | 363859                    | 40429                     |\n| Label values        | {`0`, `1`}                | {`0`, 
`1`}                |\n| Labels per instance | Single                    | Single                    |\n| Label distribution  | Imbalanced: 63% label `0` | Imbalanced: 63% label `0` |\n\n### Registering loaders\n\nLoaders can be registered externally using the `loaders` registry as a decorator. For example:\n\n```python\n@ml_datasets.loaders(\"my_custom_loader\")\ndef my_custom_loader():\n    return load_some_data()\n\nassert \"my_custom_loader\" in ml_datasets.loaders\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexplosion%2Fml-datasets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fexplosion%2Fml-datasets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexplosion%2Fml-datasets/lists"}