{"id":18011581,"url":"https://github.com/titipata/pubmed_parser","last_synced_at":"2025-05-14T07:08:47.690Z","repository":{"id":28195030,"uuid":"31697087","full_name":"titipata/pubmed_parser","owner":"titipata","description":":clipboard: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset","archived":false,"fork":false,"pushed_at":"2024-12-27T20:28:04.000Z","size":63299,"stargazers_count":662,"open_issues_count":14,"forks_count":172,"subscribers_count":21,"default_branch":"master","last_synced_at":"2025-05-14T03:27:59.367Z","etag":null,"topics":["article","doi","medline-xml","nlp","parse","parser","pmid","pubmed-central","pubmed-parser","python","xml"],"latest_commit_sha":null,"homepage":"http://titipata.github.io/pubmed_parser/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/titipata.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2015-03-05T05:13:57.000Z","updated_at":"2025-05-13T10:47:49.000Z","dependencies_parsed_at":"2023-02-16T01:00:49.803Z","dependency_job_id":"7e9bba5c-82cf-432e-b600-0a7513be1bcb","html_url":"https://github.com/titipata/pubmed_parser","commit_stats":{"total_commits":305,"total_committers":35,"mean_commits":8.714285714285714,"dds":0.6721311475409837,"last_synced_commit":"327403ffd043989076374de9bcd6e02e301f4347"},"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/titipata%2Fpubmed_parser","tags_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/titipata%2Fpubmed_parser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/titipata%2Fpubmed_parser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/titipata%2Fpubmed_parser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/titipata","download_url":"https://codeload.github.com/titipata/pubmed_parser/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254092776,"owners_count":22013290,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["article","doi","medline-xml","nlp","parse","parser","pmid","pubmed-central","pubmed-parser","python","xml"],"created_at":"2024-10-30T03:11:45.978Z","updated_at":"2025-05-14T07:08:42.675Z","avatar_url":"https://github.com/titipata.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset\n\n[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/titipata/pubmed_parser/blob/master/LICENSE) [![DOI](https://joss.theoj.org/papers/10.21105/joss.01979/status.svg)](https://doi.org/10.21105/joss.01979)\n [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3660006.svg)](https://doi.org/10.5281/zenodo.3660006) [![Build Status](https://travis-ci.com/titipata/pubmed_parser.svg?branch=master)](https://travis-ci.com/titipata/pubmed_parser)\n\nPubmed Parser is a Python library for parsing the [PubMed Open-Access (OA) 
subset](http://www.ncbi.nlm.nih.gov/pmc/tools/ftp/), [MEDLINE XML](https://www.nlm.nih.gov/bsd/licensee/) repositories, and [Entrez Programming Utilities (E-utils)](https://eutils.ncbi.nlm.nih.gov/). It uses the `lxml` library to parse this information into a Python dictionary that can be used directly in research, such as text mining and natural language processing pipelines.\n\nFor available APIs and details about the dataset, please see our [wiki page](https://github.com/titipata/pubmed_parser/wiki) or\n [documentation page](http://titipata.github.io/pubmed_parser/). Below, we list some of the core functionalities and code examples.\n\n## Available Parsers\n\n* The `path` provided to a function can be the path to a compressed or uncompressed XML file. We provide example files in the [ `data` ](data/) folder.\n* For website parsing, pause between requests. Please see the [copyright notice](https://www.ncbi.nlm.nih.gov/pmc/about/copyright/#copy-PMC), because your IP can get blocked if you try to download in bulk.\n\nBelow, we list available parsers from `pubmed_parser`.\n\n  * [Parse PubMed OA XML information](#parse-pubmed-oa-xml-information)\n  * [Parse PubMed OA citation references](#parse-pubmed-oa-citation-references)\n  * [Parse PubMed OA images and captions](#parse-pubmed-oa-images-and-captions)\n  * [Parse PubMed OA Paragraph](#parse-pubmed-oa-paragraph)\n  * [Parse PubMed OA Table [WIP]](#parse-pubmed-oa-table-wip)\n  * [Parse MEDLINE XML](#parse-medline-xml)\n  * [Parse MEDLINE Grant ID](#parse-medline-grant-id)\n  * [Parse MEDLINE XML from eutils website](#parse-medline-xml-from-eutils-website)\n  * [Parse MEDLINE XML citations from website](#parse-medline-xml-citations-from-website)\n  * [Parse Outgoing XML citations from website](#parse-outgoing-xml-citations-from-website)\n\n### Parse PubMed OA XML information\n\nWe created a simple parser for the PubMed Open Access Subset where you can give an XML path or string to the 
function called `parse_pubmed_xml` which will return a dictionary with the following information:\n\n* `full_title` : article's title\n* `abstract` : abstract\n* `journal` : journal name\n* `pmid` : PubMed ID\n* `pmc` : PubMed Central ID\n* `doi` : DOI of the article\n* `publisher_id` : publisher ID\n* `author_list` : list of authors with affiliation keys in the following format\n\n``` python\n [['last_name_1', 'first_name_1', 'aff_key_1'],\n  ['last_name_1', 'first_name_1', 'aff_key_2'],\n  ['last_name_2', 'first_name_2', 'aff_key_1'], ...]\n```\n\n* `affiliation_list` : list of affiliation keys and affiliation strings in the following format\n\n``` python\n [['aff_key_1', 'affiliation_1'],\n  ['aff_key_2', 'affiliation_2'], ...]\n```\n\n* `publication_year` : publication year\n* `subjects` : subjects listed in the article, separated by semicolons. Sometimes, it only contains the type of the article, such as a research article, review proceedings, etc.\n\n``` python\nimport pubmed_parser as pp\ndict_out = pp.parse_pubmed_xml(path)\n```\n\n### Parse PubMed OA citation references\n\nThe function `parse_pubmed_references` will process a PubMed Open Access XML file and return a list of dictionaries, one for each reference the article cites. Each dictionary has the following keys:\n\n* `pmid` : PubMed ID of the article\n* `pmc` : PubMed Central ID of the article\n* `article_title` : title of cited article\n* `journal` : journal name\n* `journal_type` : type of journal\n* `pmid_cited` : PubMed ID of the article that the article cites\n* `doi_cited` : DOI of the article that the article cites\n* `year` : publication year as it appears in the reference (may include a letter suffix, e.g. 2007a)\n\n``` python\ndicts_out = pp.parse_pubmed_references(path) # returns a list of dictionaries\n```\n\n### Parse PubMed OA images and captions\n\nThe function `parse_pubmed_caption` can parse image captions from a given path to an XML file. It returns a reference index that you can use to refer back to the actual images. 
The function returns a list of dictionaries, each with the following keys:\n\n* `pmid` : PubMed ID\n* `pmc` : PubMed Central ID\n* `fig_caption` : string of caption\n* `fig_id` : reference ID for the figure (used to refer to it in the XML article)\n* `fig_label` : label of the figure\n* `graphic_ref` : reference to the image file name provided by PubMed OA\n\n``` python\ndicts_out = pp.parse_pubmed_caption(path) # returns a list of dictionaries\n```\n\n### Parse PubMed OA Paragraph\n\nFor someone who might be interested in parsing the text surrounding a citation, the library also provides that functionality. You can use `parse_pubmed_paragraph` to parse text and reference PMIDs. This function will return a list of dictionaries, where each entry has the following keys:\n\n* `pmid` : PubMed ID\n* `pmc` : PubMed Central ID\n* `text` : full text of the paragraph\n* `reference_ids` : list of reference codes within that paragraph. These IDs can be merged with the output from `parse_pubmed_references`.\n* `section` : section of paragraph (e.g. Background, Discussion, Appendix, etc.)\n\n``` python\ndicts_out = pp.parse_pubmed_paragraph('data/6605965a.nxml', all_paragraph=False)\n```\n\n### Parse PubMed OA Table [WIP]\n\nYou can use `parse_pubmed_table` to parse tables from an XML file. This function will return a list of dictionaries, each with the following keys:\n\n* `pmid` : PubMed ID\n* `pmc` : PubMed Central ID\n* `caption` : caption of the table\n* `label` : label of the table\n* `table_columns` : list of column names\n* `table_values` : list of values inside the table\n* `table_xml` : raw XML text of the table (returned if `return_xml=True`)\n\n``` python\n# pass a PubMed OA XML file here, not a MEDLINE file\ndicts_out = pp.parse_pubmed_table('data/6605965a.nxml', return_xml=False)\n```\n\n### Parse MEDLINE XML\n\nMEDLINE XML has a different XML format than PubMed Open Access. The structure of XML files can be found in MEDLINE/PubMed DTD [here](https://www.nlm.nih.gov/databases/dtd/). 
You can use the function `parse_medline_xml` to parse that format. This function will return a list of dictionaries, where each element contains:\n\n* `pmid` : PubMed ID\n* `pmc` : PubMed Central ID\n* `doi` : DOI\n* `other_id` : other IDs found, each separated by `;`\n* `title` : title of the article\n* `abstract` : abstract of the article\n* `authors` : authors, each separated by `;`\n* `mesh_terms` : list of MeSH terms with corresponding MeSH ID, each separated by `;` e.g. `'D000161:Acoustic Stimulation; D000328:Adult; ...'`\n* `publication_types` : list of publication types, each separated by `;` e.g. `'D016428:Journal Article'`\n* `keywords` : list of keywords, each separated by `;`\n* `chemical_list` : list of chemical terms, each separated by `;`\n* `pubdate` : publication date. Defaults to year information only.\n* `journal` : journal of the given paper\n* `medline_ta` : abbreviation of the journal name\n* `nlm_unique_id` : NLM unique identification\n* `issn_linking` : ISSN linkage, typically used to link with the Web of Science dataset\n* `country` : country extracted from the journal information field\n* `reference` : string of PMIDs, each separated by `;`, or list of references made to the article\n* `delete` : boolean; if `False`, the paper got updated, so you might have two XMLs for the same paper. You can delete the record of the deleted paper because it got updated.\n* `languages` : list of languages, separated by `;`\n* `vernacular_title`: vernacular title. Defaults to an empty string when not available.\n\n``` python\ndicts_out = pp.parse_medline_xml('data/medline16n0902.xml.gz',\n                                 year_info_only=False,\n                                 nlm_category=False,\n                                 author_list=False,\n                                 reference_list=False) # returns a list of dictionaries\n```\n\nTo extract month and day information from PubDate, set `year_info_only=False`. 
The parser also handles structured abstracts; the `nlm_category` argument controls how each section or label is displayed.\n\n### Parse MEDLINE Grant ID\n\nUse `parse_grant_id` to parse MEDLINE grant IDs from an XML file. This will return a list of dictionaries, each containing\n\n* `pmid` : PubMed ID\n* `grant_id` : Grant ID\n* `grant_acronym` : acronym of the grant\n* `country` : country the grant funding comes from\n* `agency` : grant agency\n\nIf no Grant ID is found, it will return `None`.\n\n### Parse MEDLINE XML from eutils website\n\nYou can use Pubmed Parser to parse an XML file from [E-Utilities](http://www.ncbi.nlm.nih.gov/books/NBK25501/) using `parse_xml_web`. For this function, you can provide a single `pmid` as input and get a dictionary with the following keys\n\n* `title` : title\n* `abstract` : abstract\n* `journal` : journal\n* `affiliation` : affiliation of first author\n* `authors` : string of authors, separated by `;`\n* `year` : publication year\n* `keywords` : keywords or MeSH terms of the article\n\n``` python\ndict_out = pp.parse_xml_web(pmid, save_xml=False)\n```\n\n### Parse MEDLINE XML citations from website\n\nThe function `parse_citation_web` allows you to get the citations to a given PubMed ID or PubMed Central ID. This will return a dictionary which contains the following keys\n\n* `pmc` : PubMed Central ID\n* `pmid` : PubMed ID\n* `doi` : DOI of the article\n* `n_citations` : number of citations for the given article\n* `pmc_cited` : list of PMCs that cite the given PMC\n\n``` python\ndict_out = pp.parse_citation_web(doc_id, id_type='PMC')\n```\n\n### Parse Outgoing XML citations from website\n\nThe function `parse_outgoing_citation_web` allows you to get the articles a given article cites, given a PubMed ID or PubMed Central ID. This will return a dictionary which contains the following keys\n\n* `n_citations` : number of cited articles\n* `doc_id` : the document identifier given\n* `id_type` : the type of identifier given. 
Either `'PMID'` or `'PMC'`\n* `pmid_cited` : list of PMIDs cited by the article\n\n``` python\ndict_out = pp.parse_outgoing_citation_web(doc_id, id_type='PMID')\n```\n\nIdentifiers should be passed as strings. PubMed Central IDs are the default, and should be passed as strings *without* the `'PMC'` prefix. If no citations are found, or if no article is found matching `doc_id` in the indicated database, it will return `None`.\n\n## Installation\n\nYou can install the most up-to-date version of the package directly from the repository\n\n``` bash\npip install git+https://github.com/titipata/pubmed_parser.git\n```\n\nor install the latest release from [PyPI](https://pypi.org/project/pubmed-parser/) using\n\n``` bash\npip install pubmed-parser\n```\n\nor clone the repository and install using `pip`\n\n``` bash\ngit clone https://github.com/titipata/pubmed_parser\npip install ./pubmed_parser\n```\n\nYou can test your installation by running `pytest --cov=pubmed_parser tests/ --verbose`\nin the root of the repository.\n\n## Example snippet to parse PubMed OA dataset\n\nAn example usage is shown as follows\n\n``` python\nimport pubmed_parser as pp\npath_xml = pp.list_xml_path('data') # list all xml paths under directory\npubmed_dict = pp.parse_pubmed_xml(path_xml[0]) # dictionary output\nprint(pubmed_dict)\n\n# Example output:\n{'abstract': u\"Background Despite identical genotypes and ...\",\n 'affiliation_list':\n  [['I1', 'Department of Biological Sciences, ...'],\n   ['I2', 'Biology Department, Queens College, and the Graduate Center ...']],\n 'author_list':\n  [['Dennehy', 'John J', 'I1'],\n   ['Dennehy', 'John J', 'I2'],\n   ['Wang', 'Ing-Nang', 'I1']],\n 'full_title': u'Factors influencing lysis time stochasticity in bacteriophage \\u03bb',\n 'journal': 'BMC Microbiology',\n 'pmc': '3166277',\n 'pmid': '21810267',\n 'publication_year': '2011',\n 'publisher_id': '1471-2180-11-174',\n 'subjects': 'Research Article'}\n```\n\n## Example Usage with PySpark\n\nThis is a snippet to parse the full PubMed 
Open Access subset using [PySpark 2.1](https://spark.apache.org/docs/latest/api/python/index.html)\n\n``` python\nimport os\nimport pubmed_parser as pp\nfrom pyspark.sql import Row\n\n# assumes an active SparkSession named `spark` (e.g. in the PySpark shell)\npath_all = pp.list_xml_path('/path/to/xml/folder/')\npath_rdd = spark.sparkContext.parallelize(path_all, numSlices=10000)\nparse_results_rdd = path_rdd.map(lambda x: Row(file_name=os.path.basename(x),\n                                               **pp.parse_pubmed_xml(x)))\npubmed_oa_df = parse_results_rdd.toDF() # Spark dataframe\npubmed_oa_df_sel = pubmed_oa_df[['full_title', 'abstract', 'doi',\n                                 'file_name', 'pmc', 'pmid',\n                                 'publication_year', 'publisher_id',\n                                 'journal', 'subjects']] # select columns\npubmed_oa_df_sel.write.parquet('pubmed_oa.parquet', mode='overwrite') # write dataframe\n```\n\nSee the [scripts](https://github.com/titipata/pubmed_parser/tree/master/scripts)\nfolder for more information.\n\n## Core Members\n\n* [Titipat Achakulvisut](http://titipata.github.io)\n* [Daniel E. Acuna](http://scienceofscience.org/about)\n\nand [contributors](https://github.com/titipata/pubmed_parser/graphs/contributors)\n\n## Dependencies\n\n* [lxml](http://lxml.de/)\n* [unidecode](https://pypi.python.org/pypi/Unidecode)\n* [requests](http://docs.python-requests.org/en/master/)\n\n## Citation\n\nIf you use Pubmed Parser, please cite it from [JOSS](https://joss.theoj.org/papers/10.21105/joss.01979) as follows\n\n\u003e Achakulvisut et al., (2020). Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset. 
Journal of Open Source Software, 5(46), 1979, https://doi.org/10.21105/joss.01979\n\nor using BibTeX\n\n```\n@article{Achakulvisut2020,\n  doi = {10.21105/joss.01979},\n  url = {https://doi.org/10.21105/joss.01979},\n  year = {2020},\n  publisher = {The Open Journal},\n  volume = {5},\n  number = {46},\n  pages = {1979},\n  author = {Titipat Achakulvisut and Daniel Acuna and Konrad Kording},\n  title = {Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset},\n  journal = {Journal of Open Source Software}\n}\n```\n\n## Contributions\n\nWe welcome contributions from anyone who would like to improve Pubmed Parser. You can create [GitHub issues](https://github.com/titipata/pubmed_parser/issues) to discuss questions or issues relating to the repository. We suggest reading our [Contributing Guidelines](https://github.com/titipata/pubmed_parser/blob/master/CONTRIBUTING.md) before creating issues, reporting bugs, or making a contribution to the repository.\n\n## Acknowledgement\n\nThis package was developed in [Konrad Kording's Lab](http://kordinglab.com/) at the University of Pennsylvania. We would like to thank the reviewers and the editor from [JOSS](https://joss.readthedocs.io/en/latest/), including [`tleonardi`](https://github.com/tleonardi), [`timClicks`](https://github.com/timClicks), and [`majensen`](https://github.com/majensen). They made our repository much better!\n\n## License\n\nMIT License Copyright (c) 2015-2020 Titipat Achakulvisut, Daniel E. Acuna\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftitipata%2Fpubmed_parser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftitipata%2Fpubmed_parser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftitipata%2Fpubmed_parser/lists"}