{"id":13727654,"url":"https://github.com/chanzuckerberg/MedMentions","last_synced_at":"2025-05-07T23:31:54.007Z","repository":{"id":89777061,"uuid":"132938133","full_name":"chanzuckerberg/MedMentions","owner":"chanzuckerberg","description":"A corpus of Biomedical papers annotated with mentions of UMLS entities.","archived":false,"fork":false,"pushed_at":"2021-11-09T16:31:59.000Z","size":25327,"stargazers_count":312,"open_issues_count":10,"forks_count":31,"subscribers_count":25,"default_branch":"master","last_synced_at":"2024-11-14T10:31:17.397Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chanzuckerberg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-05-10T18:16:14.000Z","updated_at":"2024-11-02T19:21:57.000Z","dependencies_parsed_at":null,"dependency_job_id":"82249e8f-20f8-4f0e-88a2-638e1ab9586e","html_url":"https://github.com/chanzuckerberg/MedMentions","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chanzuckerberg%2FMedMentions","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chanzuckerberg%2FMedMentions/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chanzuckerberg%2FMedMentions/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chanzuckerberg%2FMedMentions/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chanzuckerberg","download_url":"https://codeload.github.com/chanzuckerberg/
MedMentions/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224672801,"owners_count":17350811,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T02:00:28.876Z","updated_at":"2024-11-14T18:31:16.358Z","avatar_url":"https://github.com/chanzuckerberg.png","language":null,"funding_links":[],"categories":["Information Extraction and NLP","Ranked by starred repositories","Datasets"],"sub_categories":[],"readme":"# MedMentions: A UMLS Annotated Dataset\n\nThis is a preliminary release of the _MedMentions_ dataset, a corpus of Biomedical papers\nannotated with mentions of UMLS entities. 
[CZI Meta](https://www.chanzuckerberg.com/science/projects-meta) \nis releasing this data to promote NLP research on Biomedical text.\n\nThis data is being released under the [CC0 license](https://creativecommons.org/publicdomain/zero/1.0/).\nThe papers in the corpus were selected from those available from [PubMed\u0026reg; / Medline\u0026reg;](https://www.nlm.nih.gov/databases/download/pubmed_medline.html).\nUsers are referred to that source for the most current and accurate version of the text for the corresponding papers.\n\n\n## Introduction\n\n**Corpus:** The _MedMentions_ corpus consists of 4,392 papers (Titles and Abstracts) randomly selected\nfrom among papers released on [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/) in 2016 that \nwere in the biomedical field, published in the English language, and had both a Title and \nan Abstract.\n\n**Annotators:** We recruited a team of professional annotators with rich experience in biomedical content \ncuration to exhaustively annotate all [UMLS\u0026reg;](https://uts.nlm.nih.gov/home.html) \n(2017AA full version) entity mentions in these papers.\n\n**Annotation quality:** We did not collect stringent IAA (Inter-annotator agreement) data. \nTo gain insight into the annotation quality of *MedMentions*, we randomly selected eight \npapers from the annotated corpus, containing a total of 469 concepts. Two biologists \n('Reviewers') who did not participate in the annotation task then each reviewed four papers. 
\nThe agreement between Reviewers and Annotators, an estimate of the *Precision* of the \nannotations, was 97.3%.\n\n\n## The Full Dataset, and Subsets\n\n* [full](full/): This is the full dataset\n* [ST21pv](st21pv/): This is the ST21pv subset, containing a subset of the full annotations, \n    targeting information retrieval.\n\n\n## The PubTator format\n\nThe annotated data is published in [PubTator](http://bioportal.bioontology.org/ontologies/EDAM?p=classes\u0026conceptid=format_3783)\nformat:\n\nEach paper or document ends with a blank line, and is represented as (without the spaces):\n```\nPMID | t | Title text\nPMID | a | Abstract text\nPMID TAB StartIndex TAB EndIndex TAB MentionTextSegment TAB SemanticTypeID TAB EntityID\n...\n```\n\nThe first two lines present the Title and Abstract texts (no line-breaks or tabs in the _text_). \nSubsequent lines present the mentions, one per line.\nThe _StartIndex_ and _EndIndex_ are 0-based character indices into the document text, constructed\nby concatenating the Title and Abstract, separated by a SPACE character. The _MentionTextSegment_\nis the actual mention between those character positions. The _EntityID_ is the UMLS entity \n(concept) id, and the _SemanticTypeID_ is the id for the Semantic Type that entity is linked \nto in UMLS. If the UMLS entity is linked to more than one semantic type, then this field \ncontains a comma-separated list of all these type IDs. All UMLS concepts that are not in the 2017-AA Active release are linked to the\nspecial semantic type _UnknownType_.\n\nHere is an example:\n```\n25763772|t|DCTN4 as a modifier of chronic Pseudomonas aeruginosa infection in cystic fibrosis\n25763772|a|Pseudomonas aeruginosa (Pa) infection in cystic fibrosis (CF) patients is associated with worse long-term pulmonary disease and shorter survival, and chronic Pa infection (CPA) is associated with reduced lung function, faster rate of lung decline, increased rates of exacerbations and shorter survival. 
By using exome sequencing and extreme phenotype design, it was recently shown that isoforms of dynactin 4 (DCTN4) may influence Pa infection in CF, leading to worse respiratory disease. The purpose of this study was to investigate the role of DCTN4 missense variants on Pa infection incidence, age at first Pa infection and chronic Pa infection incidence in a cohort of adult CF patients from a single centre. Polymerase chain reaction and direct sequencing were used to screen DNA samples for DCTN4 variants. A total of 121 adult CF patients from the Cochin Hospital CF centre have been included, all of them carrying two CFTR defects: 103 developed at least 1 pulmonary infection with Pa, and 68 patients of them had CPA. DCTN4 variants were identified in 24% (29/121) CF patients with Pa infection and in only 17% (3/18) CF patients with no Pa infection. Of the patients with CPA, 29% (20/68) had DCTN4 missense variants vs 23% (8/35) in patients without CPA. Interestingly, p.Tyr263Cys tend to be more frequently observed in CF patients with CPA than in patients without CPA (4/68 vs 0/35), and DCTN4 missense variants tend to be more frequent in male CF patients with CPA bearing two class II mutations than in male CF patients without CPA bearing two class II mutations (P = 0.06). Our observations reinforce that DCTN4 missense variants, especially p.Tyr263Cys, may be involved in the pathogenesis of CPA in male CF.\n25763772        0       5       DCTN4   T116,T123    C4308010\n25763772        23      63      chronic Pseudomonas aeruginosa infection        T047    C0854135\n25763772        67      82      cystic fibrosis T047    C0010674\n25763772        83      120     Pseudomonas aeruginosa (Pa) infection   T047    C0854135\n...\n```\nIn this example, the Title is 82 characters long. The first mention is for the UMLS concept\n\"DCTN4 protein, human\" whose UMLS id is _C4308010_. 
This entity is linked to two semantic \ntypes: \"Amino Acid, Peptide, or Protein\" (T116) and \"Biologically Active Substance\" (T123). \n\n\n## How to cite\n\nIf you use MedMentions, please cite the following paper:\n\nSunil Mohan and Donghui Li. 2019.\n*MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts*.\nIn Proceedings of the 2019 Conference on Automated Knowledge Base Construction (AKBC 2019).\nAmherst, Massachusetts, USA. May 2019.\n[[Preprint](https://arxiv.org/abs/1902.09476)]\n\n### Our Latest Model\n\nOur model achieves SOTA results (2021) on UMLS recognition (ST21pv subset): a lower bound F1 score of **0.570** for *mention level* entity recognition (detection and linking), and an F1 score of **0.657** for recognizing UMLS concepts at the *document level*. For details, please see the following paper:\n\nSunil Mohan, Rico Angell, Nicholas Monath, Andrew McCallum. 2021.\n*Low Resource Recognition and Linking of Biomedical Concepts from a Large Ontology*. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB), 2021. [[doi](https://doi.org/10.1145/3459930.3469524)] [[Preprint](https://arxiv.org/abs/2101.10587)]\n\n### Other papers on MedMentions\n\nShikhar Murty, Patrick Verga, Luke Vilnis, Irena Radovanovic and Andrew McCallum. 2018.\n*Hierarchical Losses and New Resources for Fine-grained Entity Typing and Linking*.\nThe 56th Annual Meeting of the Association for Computational Linguistics (ACL). \nMelbourne, Australia. 
July 2018.\n\n\n## Feedback, Questions\n\nIf you have any comments, questions or issues, please post a note in \n[GitHub issues](https://github.com/chanzuckerberg/MedMentions/issues).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchanzuckerberg%2FMedMentions","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchanzuckerberg%2FMedMentions","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchanzuckerberg%2FMedMentions/lists"}