{"id":18256484,"url":"https://github.com/baderlab/biomedical-corpora","last_synced_at":"2026-02-02T18:04:13.145Z","repository":{"id":146738136,"uuid":"137799835","full_name":"BaderLab/Biomedical-Corpora","owner":"BaderLab","description":"A collection of annotated biomedical corpora, which can be used for training supervised machine learning methods for various tasks in biomedical text-mining and information extraction.","archived":false,"fork":false,"pushed_at":"2018-09-18T22:38:41.000Z","size":46721,"stargazers_count":36,"open_issues_count":5,"forks_count":4,"subscribers_count":17,"default_branch":"master","last_synced_at":"2025-04-08T22:32:19.624Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BaderLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-06-18T19:56:49.000Z","updated_at":"2025-01-12T04:36:04.000Z","dependencies_parsed_at":null,"dependency_job_id":"a7a16485-3bbb-4b5e-9e67-0dae6cea91e5","html_url":"https://github.com/BaderLab/Biomedical-Corpora","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/BaderLab/Biomedical-Corpora","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BaderLab%2FBiomedical-Corpora","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BaderLab%2FBiomedical-Corpora/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BaderLab%2FBiomedical-Corpora/releases","manifests_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/BaderLab%2FBiomedical-Corpora/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BaderLab","download_url":"https://codeload.github.com/BaderLab/Biomedical-Corpora/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BaderLab%2FBiomedical-Corpora/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261601527,"owners_count":23183097,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-05T10:22:09.125Z","updated_at":"2026-02-02T18:04:13.064Z","avatar_url":"https://github.com/BaderLab.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Biomedical Corpora\n\nA collection of annotated, freely distributable, biomedical corpora, which can be used for training supervised machine learning methods for various tasks in biomedical text-mining and information extraction.\n\nAll corpora are provided in `corpora`. They are divided into subdirectories `NER`, for corpora which can be used to train **named entity recognition** (NER) solutions, and `Relation Extraction`, for corpora which can be used to train **relation/event extraction** solutions. Corpora are provided in both a CoNLL-like format and a [Standoff](http://brat.nlplab.org/standoff.html) format.\n\nMost corpora in the CoNLL-like format were originally collected [here](https://github.com/cambridgeltl/MTL-Bioinformatics-2016). 
In many cases, the tags were mapped to 4-letter codes:\n\n| Old tag  | New tag |\n| ------------- | ------------- |\n| `Chemical`, `Simple_chemical`  | `CHED`  |\n| `Disease` | `DISO`  |\n| `Organism`, `Species`, `NCBITaxon`, `Taxon`  | `LIVB`  |\n| `Cellular_component` | `COMP`  |\n| `Cell`, `cell_type`  | `CLTP`  |\n| `cell_line` | `CLLN`  |\n| `Gene`, `Protein`, `Gene_or_gene_product`, `GGP`  | `PRGE`  |\n\n\u003e Mappings were largely inspired by this [API](http://bioinformatics.ua.pt/becas/#!/api).\n\nCorpora names (loosely) follow the naming scheme: `\u003ccorpus_name\u003e_\u003centity\u003e_\u003ctagset\u003e`.\n\n## Download\n\nTo download the corpora, simply clone the repository locally:\n\n```bash\n$ git clone https://github.com/BaderLab/Biomedical-Corpora.git\n```\n\nOr click the green `Clone or download` button and select `Download ZIP`.\n\n## Resources\n\nhttps://github.com/spyysalo provides many useful repositories for working with these corpora. Many of the most popular corpora have their own repositories (e.g. [S800](https://github.com/spyysalo/s800), [NCBI-Disease](https://github.com/spyysalo/ncbi-disease)) which contain code for collecting the corpus from its original source and converting it into a format suitable for training a machine learning classifier (e.g. 
CoNLL or [Standoff](http://brat.nlplab.org/standoff.html)).\n\n## Table of Corpora\n\nA list of biomedical corpora and their corresponding publications:\n\n| Corpora | Text Genre | Standard | Entities (Count) | Publication |\n| --- | --- | --- | --- | --- |\n| [AnatEM](http://nactem.ac.uk/anatomytagger/) | Scientific Article | Gold | 12 Anatomical entities | [link](https://academic.oup.com/bioinformatics/article/30/6/868/285282) |\n| [AZDC](http://diego.asu.edu/downloads/AZDC_6-26-2009.txt) | Scientific Article | Gold | Disease | [link](https://scholar.google.com/citations?view_op=view_citation\u0026hl=en\u0026user=FLnUx4cAAAAJ\u0026citation_for_view=FLnUx4cAAAAJ:ufrVoPGSRksC) |\n| [BioCreative II GM](https://sourceforge.net/projects/biocreative/files/biocreative2entitytagging/1.1/) | Scientific Article | Gold | Genes/proteins (24,583) | [link](https://doi.org/10.1186/gb-2008-9-s2-s2) |\n| [BioInfer](http://mars.cs.utu.fi/BioInfer/?q=download) | Scientific Article | Gold | Genes/proteins | [link](http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-50) |\n| [BioSemantics](https://biosemantics.org/index.php/resources/chemical-patent-corpus) | Patent | Gold | Chemicals, Disease | [link](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0107477) |\n| BC4CHEMD | Scientific Article | Gold | Chemicals (84,310) | [link](https://www.ncbi.nlm.nih.gov/pubmed/25810766) |\n| [BC5CDR](http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/) | Scientific Article | Gold | Chemicals (15,935), Disease (12,852) | [link](https://academic.oup.com/database/article/doi/10.1093/database/baw068/2630414) |\n| BioNLP09 | Scientific Article | Gold | Genes/proteins (14,963) | [link](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-10) |\n| BioNLP11EPI | Scientific Article | Gold | Genes/proteins (15,811) | [link](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-S11-S2) |\n| BioNLP11ID | Scientific Article 
| Gold | Genes/proteins (6,551), Organisms (3,471), Chemicals (973), Regulon-operon (87) | [link](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-S11-S2) |\n| BioNLP13GE | Scientific Article | Gold | Genes/proteins (12,057) | [link](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.380.5420) |\n| BioNLP13PC | Scientific Article | Gold | Genes/proteins (10,891), Chemicals (2,487), Complexes (1,502), Cellular component (1,013) | [link](http://www.aclweb.org/anthology/W/W13/W13-2009.pdf) |\n| [CRAFT](https://github.com/UCDenver-ccp/CRAFT) | Scientific Article | Gold | Sequence Ontology (18,974), Genes/proteins (16,064), Taxonomy (6,868), Chemicals of biological interest (6,053), Cell lines (5,495), GO-CC (4,180) | [link](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-161) |\n| [CellFinder](https://www.informatik.hu-berlin.de/de/forschung/gebiete/wbi/resources/cellfinder) | Scientific Article | Gold | Species, Genes/proteins, Cell type, Anatomy | [link](https://www.informatik.hu-berlin.de/de/forschung/gebiete/sar/wbi/research/publications/2012/lrec2012_corpus.pdf) |\n| [CHEMDNER Patent](http://www.biocreative.org/tasks/biocreative-v/track-2-chemdner/) | Patent | Gold | Chemicals | [link](https://jcheminf.springeropen.com/articles/10.1186/1758-2946-7-S1-S2) |\n| [DECA](http://www.nactem.ac.uk/deca/) | Scientific Article | Gold | Genes/proteins | [link](http://bioinformatics.oxfordjournals.org/content/26/5/661.abstract?keytype=ref\u0026ijkey=6nc2iFEN0sYYYz1) |\n| Ex-PTM | Scientific Article | Gold | Genes/proteins (4,698) | [link](https://dl.acm.org/citation.cfm?id=2002920) |\n| [FSU-PRGE](http://pubannotation.org/projects/FSU-PRGE) | Scientific Article | Gold | Genes/proteins | [link](http://aclweb.org/anthology/W/W10/W10-1838.pdf) |\n| JNLPBA | Scientific Article | Gold | Genes/proteins (35,336), DNA (10,589), Cell type (8,639), Cell line (4,330), RNA (1,069) | 
[link](https://dl.acm.org/citation.cfm?id=1567610) |\n| [Linnaeus](http://linnaeus.sourceforge.net/) | Scientific Article | Gold | Organisms (4,263) | [link](http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-85) |\n| [LocText](https://www.tagtog.net/-corpora/loctext) | Scientific Article | Gold | Organisms, Genes/proteins | [link](http://bmcproc.biomedcentral.com/articles/10.1186/1753-6561-9-S5-A4) |\n| [IEPA](http://corpora.informatik.hu-berlin.de/corpora/brat2bioc/iepa_bioc.xml.zip) | Scientific Article | Gold | Genes/proteins | [link](http://psb.stanford.edu/psb-online/proceedings/psb02/abstracts/p326.html) |\n| [miRNA](http://www.scai.fraunhofer.de/mirna-corpora.html) | Scientific Article | Gold | Disease, Organisms, Genes/proteins | [link](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4602280/) |\n| [NCBI disease](https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/) | Scientific Article | Gold | Disease (6,881) | [link](http://www.sciencedirect.com/science/article/pii/S1532046413001974) |\n| [S800](http://species.jensenlab.org/) | Scientific Article | Gold | Organisms (3,708) | [link](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0065390) |\n| [Variome](http://www.opennicta.com.au/home/health/variome) | Scientific Article | Gold | Disease, Organisms, Genes/proteins | [link](http://database.oxfordjournals.org/content/2013/bat019.abstract) |\n\n\u003e Note: some corpora listed in this table are not available for download from this repository because they are not freely distributable.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbaderlab%2Fbiomedical-corpora","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbaderlab%2Fbiomedical-corpora","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbaderlab%2Fbiomedical-corpora/lists"}