{"id":15014077,"url":"https://github.com/explosion/span-labeling-datasets","last_synced_at":"2026-02-01T22:01:40.658Z","repository":{"id":66040076,"uuid":"495221749","full_name":"explosion/span-labeling-datasets","owner":"explosion","description":"Loaders for various span labeling datasets","archived":false,"fork":false,"pushed_at":"2023-02-21T04:43:02.000Z","size":132,"stargazers_count":2,"open_issues_count":2,"forks_count":0,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-06-21T16:44:18.104Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/explosion.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-05-23T01:32:45.000Z","updated_at":"2024-12-31T17:46:53.000Z","dependencies_parsed_at":"2023-06-13T06:00:28.994Z","dependency_job_id":null,"html_url":"https://github.com/explosion/span-labeling-datasets","commit_stats":{"total_commits":101,"total_committers":3,"mean_commits":"33.666666666666664","dds":"0.17821782178217827","last_synced_commit":"a35c3948d04892e53465bb9b874298a6de609743"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/explosion/span-labeling-datasets","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspan-labeling-datasets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspan-labeling-datasets/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspan-labeling
-datasets/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspan-labeling-datasets/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/explosion","download_url":"https://codeload.github.com/explosion/span-labeling-datasets/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspan-labeling-datasets/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28992633,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-01T20:57:35.821Z","status":"ssl_error","status_checked_at":"2026-02-01T20:57:29.580Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-09-24T19:45:09.637Z","updated_at":"2026-02-01T22:01:40.643Z","avatar_url":"https://github.com/explosion.png","language":"Python","readme":"\u003c!-- SPACY PROJECT: AUTO-GENERATED DOCS START (do not remove) --\u003e\n\n# 🪐 spaCy Project: Spancat datasets\n\nThis project compiles various spancat datasets and their converters into the\n[spaCy format](https://spacy.io/api/data-formats). 
You can use this in tandem\nwith the [`spancat-encoders`](https://github.com/explosion/spancat-encoders)\nrepository to run various experiments on these datasets.\n\n\n## 📋 project.yml\n\nThe [`project.yml`](project.yml) defines the data assets required by the\nproject, as well as the available commands and workflows. For details, see the\n[spaCy projects documentation](https://spacy.io/usage/projects).\n\n### ⏯ Commands\n\nThe following commands are defined by the project. They\ncan be executed using [`spacy project run [name]`](https://spacy.io/api/cli#project-run).\nCommands are only re-run if their inputs have changed.\n\n| Command | Description |\n| --- | --- |\n| `convert-wnut17-ents` | Convert WNUT17 dataset into the spaCy format |\n| `convert-wnut17-spans` | Convert WNUT17 dataset into the spaCy format |\n| `clean-wikineural` | Remove unnecessary indices from WikiNeural data |\n| `convert-wikineural-spans` | Convert WikiNeural dataset (de, en, es, nl) into the spaCy format |\n| `convert-wikineural-ents` | Convert WikiNeural dataset (de, en, es, nl) into the spaCy format |\n| `clean-conll` | Remove unnecessary indices from CoNLL data |\n| `convert-conll-spans` | Convert CoNLL dataset (de, en, es, nl) into the spaCy format |\n| `convert-conll-ents` | Convert CoNLL dataset (de, en, es, nl) into the spaCy format |\n| `convert-archaeo-spans` | Convert Dutch Archaeology dataset into the spaCy format |\n| `convert-archaeo-ents` | Convert Dutch Archaeology dataset into the spaCy format |\n| `convert-anem-spans` | Convert AnEM dataset into the spaCy format |\n| `convert-anem-ents` | Convert AnEM dataset into the spaCy format |\n| `clean` | Remove intermediary files |\n\n### ⏭ Workflows\n\nThe following workflows are defined by the project. They\ncan be executed using [`spacy project run [name]`](https://spacy.io/api/cli#project-run)\nand will run the specified commands in order. 
Commands are only re-run if their\ninputs have changed.\n\n| Workflow | Steps |\n| --- | --- |\n| `wnut17` | `convert-wnut17-ents` \u0026rarr; `convert-wnut17-spans` |\n| `wikineural` | `clean-wikineural` \u0026rarr; `convert-wikineural-ents` \u0026rarr; `convert-wikineural-spans` |\n| `conll` | `clean-conll` \u0026rarr; `convert-conll-spans` \u0026rarr; `convert-conll-ents` |\n| `archaeo` | `convert-archaeo-ents` \u0026rarr; `convert-archaeo-spans` |\n| `anem` | `convert-anem-ents` \u0026rarr; `convert-anem-spans` |\n\n### 🗂 Assets\n\nThe following assets are defined by the project. They can\nbe fetched by running [`spacy project assets`](https://spacy.io/api/cli#project-assets)\nin the project directory.\n\n| File | Source | Description |\n| --- | --- | --- |\n| `assets/wnut17-train.iob` | URL | WNUT17 training dataset for Emerging and Rare Entities Task from Derczynski et al., 2017 |\n| `assets/wnut17-dev.iob` | URL | WNUT17 dev dataset for Emerging and Rare Entities Task from Derczynski et al., 2017 |\n| `assets/wnut17-test.iob` | URL | WNUT17 test dataset for Emerging and Rare Entities Task from Derczynski et al., 2017 |\n| `assets/raw-en-wikineural-train.iob` | URL | WikiNeural (en) training dataset from Tedeschi et al. (EMNLP 2021) |\n| `assets/raw-en-wikineural-dev.iob` | URL | WikiNeural (en) dev dataset from Tedeschi et al. (EMNLP 2021) |\n| `assets/raw-en-wikineural-test.iob` | URL | WikiNeural (en) test dataset from Tedeschi et al. (EMNLP 2021) |\n| `assets/raw-de-wikineural-train.iob` | URL | WikiNeural (de) training dataset from Tedeschi et al. (EMNLP 2021) |\n| `assets/raw-de-wikineural-dev.iob` | URL | WikiNeural (de) dev dataset from Tedeschi et al. (EMNLP 2021) |\n| `assets/raw-de-wikineural-test.iob` | URL | WikiNeural (de) test dataset from Tedeschi et al. (EMNLP 2021) |\n| `assets/raw-es-wikineural-train.iob` | URL | WikiNeural (es) training dataset from Tedeschi et al. 
(EMNLP 2021) |\n| `assets/raw-es-wikineural-dev.iob` | URL | WikiNeural (es) dev dataset from Tedeschi et al. (EMNLP 2021) |\n| `assets/raw-es-wikineural-test.iob` | URL | WikiNeural (es) test dataset from Tedeschi et al. (EMNLP 2021) |\n| `assets/raw-nl-wikineural-train.iob` | URL | WikiNeural (nl) training dataset from Tedeschi et al. (EMNLP 2021) |\n| `assets/raw-nl-wikineural-dev.iob` | URL | WikiNeural (nl) dev dataset from Tedeschi et al. (EMNLP 2021) |\n| `assets/raw-nl-wikineural-test.iob` | URL | WikiNeural (nl) test dataset from Tedeschi et al. (EMNLP 2021) |\n| `assets/raw-en-conll-train.iob` | URL | CoNLL 2003 (en) training dataset |\n| `assets/raw-en-conll-dev.iob` | URL | CoNLL 2003 (en) dev dataset |\n| `assets/raw-en-conll-test.iob` | URL | CoNLL 2003 (en) test dataset |\n| `assets/raw-de-conll-train.iob` | URL | CoNLL 2003 (de) training dataset |\n| `assets/raw-de-conll-dev.iob` | URL | CoNLL 2003 (de) dev dataset |\n| `assets/raw-de-conll-test.iob` | URL | CoNLL 2003 (de) test dataset |\n| `assets/raw-es-conll-train.iob` | URL | CoNLL 2002 (es) training dataset |\n| `assets/raw-es-conll-dev.iob` | URL | CoNLL 2002 (es) dev dataset |\n| `assets/raw-es-conll-test.iob` | URL | CoNLL 2002 (es) test dataset |\n| `assets/raw-nl-conll-train.iob` | URL | CoNLL 2002 (nl) training dataset |\n| `assets/raw-nl-conll-dev.iob` | URL | CoNLL 2002 (nl) dev dataset |\n| `assets/raw-nl-conll-test.iob` | URL | CoNLL 2002 (nl) test dataset |\n| `assets/archaeo.bio` | URL | Dutch Archaeological NER dataset by Alex Brandsen (LREC 2020) |\n| `assets/anem-train.iob` | URL | Anatomical Entity Mention (AnEM) training corpus containing abstracts and full-text biomedical papers from Ohta et al. (ACL 2012) |\n| `assets/anem-test.iob` | URL | Anatomical Entity Mention (AnEM) test corpus containing abstracts and full-text biomedical papers from Ohta et al. 
(ACL 2012) |\n\n\u003c!-- SPACY PROJECT: AUTO-GENERATED DOCS END (do not remove) --\u003e","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexplosion%2Fspan-labeling-datasets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fexplosion%2Fspan-labeling-datasets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexplosion%2Fspan-labeling-datasets/lists"}