Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/titipata/pubmed_parser
:clipboard: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
https://github.com/titipata/pubmed_parser
article doi medline-xml nlp parse parser pmid pubmed-central pubmed-parser python xml
Last synced: 1 day ago
JSON representation
:clipboard: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
- Host: GitHub
- URL: https://github.com/titipata/pubmed_parser
- Owner: titipata
- License: mit
- Created: 2015-03-05T05:13:57.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2024-12-27T20:28:04.000Z (about 1 month ago)
- Last Synced: 2025-01-23T07:04:39.451Z (8 days ago)
- Topics: article, doi, medline-xml, nlp, parse, parser, pmid, pubmed-central, pubmed-parser, python, xml
- Language: Python
- Homepage: http://titipata.github.io/pubmed_parser/
- Size: 60.4 MB
- Stars: 608
- Watchers: 21
- Forks: 170
- Open Issues: 12
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Citation: CITATION.cff
Awesome Lists containing this project
README
# Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/titipata/pubmed_parser/blob/master/LICENSE) [![DOI](https://joss.theoj.org/papers/10.21105/joss.01979/status.svg)](https://doi.org/10.21105/joss.01979)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3660006.svg)](https://doi.org/10.5281/zenodo.3660006) [![Build Status](https://travis-ci.com/titipata/pubmed_parser.svg?branch=master)](https://travis-ci.com/titipata/pubmed_parser)Pubmed Parser is a Python library for parsing the [PubMed Open-Access (OA) subset](http://www.ncbi.nlm.nih.gov/pmc/tools/ftp/)
, [MEDLINE XML](https://www.nlm.nih.gov/bsd/licensee/) repositories, and [Entrez Programming Utilities (E-utils)](https://eutils.ncbi.nlm.nih.gov/). It uses the `lxml` library to parse this information into a Python dictionary which can be easily used for research, such as in text mining and natural language processing pipelines.For available APIs and details about the dataset, please see our [wiki page](https://github.com/titipata/pubmed_parser/wiki) or
[documentation page](http://titipata.github.io/pubmed_parser/) for more details. Below, we list some of the core funtionalities and code examples.## Available Parsers
* `path` provided to a function can be the path to a compressed or uncompressed XML file. We provide example files in the [ `data` ](data/) folder.
* for website parsing, you should scrape with pause. Please see the [copyright notice](https://www.ncbi.nlm.nih.gov/pmc/about/copyright/#copy-PMC) because your IP can get blocked if you try to download in bulk.Below, we list available parsers from `pubmed_parser`.
* [Parse PubMed OA XML information](#parse-pubmed-oa-xml-information)
* [Parse PubMed OA citation references](#parse-pubmed-oa-citation-references)
* [Parse PubMed OA images and captions](#parse-pubmed-oa-images-and-captions)
* [Parse PubMed OA Paragraph](#parse-pubmed-oa-paragraph)
* [Parse PubMed OA Table [WIP]](#parse-pubmed-oa-table-wip)
* [Parse MEDLINE XML](#parse-medline-xml)
* [Parse MEDLINE Grant ID](#parse-medline-grant-id)
* [Parse MEDLINE XML from eutils website](#parse-medline-xml-from-eutils-website)
* [Parse MEDLINE XML citations from website](#parse-medline-xml-citations-from-website)
* [Parse Outgoing XML citations from website](#parse-outgoing-xml-citations-from-website)### Parse PubMed OA XML information
We created a simple parser for the PubMed Open Access Subset where you can give an XML path or string to the function called `parse_pubmed_xml` which will return a dictionary with the following information:
* `full_title` : article's title
* `abstract` : abstract
* `journal` : Journal name
* `pmid` : PubMed ID
* `pmc` : PubMed Central ID
* `doi` : DOI of the article
* `publisher_id` : publisher ID
* `author_list` : list of authors with affiliation keys in the following format``` python
[['last_name_1', 'first_name_1', 'aff_key_1'],
['last_name_1', 'first_name_1', 'aff_key_2'],
['last_name_2', 'first_name_2', 'aff_key_1'], ...]
```* `affiliation_list` : list of affiliation keys and affiliation strings in the following format
``` python
[['aff_key_1', 'affiliation_1'],
['aff_key_2', 'affiliation_2'], ...]
```* `publication_year` : publication year
* `subjects` : list of subjects listed in the article separated by semicolon. Sometimes, it only contains the type of the article, such as a research article, review proceedings, etc.``` python
import pubmed_parser as pp
dict_out = pp.parse_pubmed_xml(path)
```### Parse PubMed OA citation references
The function `parse_pubmed_references` will process a Pubmed Open Access XML file and return a list of the PMIDs it cites. Each dictionary has keys as follows
* `pmid` : PubMed ID of the article
* `pmc` : PubMed Central ID of the article
* `article_title` : title of cited article
* `journal` : journal name
* `journal_type` : type of journal
* `pmid_cited` : PubMed ID of article that article cites
* `doi_cited` : DOI of article that article cites
* `year` : Publication year as it appears in the reference (may include letter suffix, e.g.2007a)``` python
dicts_out = pp.parse_pubmed_references(path) # return list of dictionary
```### Parse PubMed OA images and captions
The function `parse_pubmed_caption` can parse image captions from a given path to XML file. It will return reference index that you can refer back to actual images. The function will return list of dictionary which has following keys
* `pmid` : PubMed ID
* `pmc` : PubMed Central ID
* `fig_caption` : string of caption
* `fig_id` : reference id for figure (use to refer in XML article)
* `fig_label` : label of the figure
* `graphic_ref` : reference to image file name provided from Pubmed OA``` python
dicts_out = pp.parse_pubmed_caption(path) # return list of dictionary
```### Parse PubMed OA Paragraph
For someone who might be interested in parsing the text surrounding a citation, the library also provides that functionality. You can use `parse_pubmed_paragraph` to parse text and reference PMIDs. This function will return a list of dictionaries, where each entry will have following keys:
* `pmid` : PubMed ID
* `pmc` : PubMed Central ID
* `text` : full text of the paragraph
* `reference_ids` : list of reference code within that paragraph.This IDs can merge with output from `parse_pubmed_references` .
* `section` : section of paragraph (e.g. Background, Discussion, Appendix, etc.)
``` python
dicts_out = pp.parse_pubmed_paragraph('data/6605965a.nxml', all_paragraph=False)
```### Parse PubMed OA Table [WIP]
You can use `parse_pubmed_table` to parse table from XML file. This function will return list of dictionaries where each has following keys.
* `pmid` : PubMed ID
* `pmc` : PubMed Central ID
* `caption` : caption of the table
* `label` : lable of the table
* `table_columns` : list of column name
* `table_values` : list of values inside the table
* `table_xml` : raw xml text of the table (return if `return_xml=True`)``` python
dicts_out = pp.parse_pubmed_table('data/medline16n0902.xml.gz', return_xml=False)
```### Parse MEDLINE XML
MEDLINE XML has a different XML format than PubMed Open Access. The structure of XML files can be found in MEDLINE/PubMed DTD [here](https://www.nlm.nih.gov/databases/dtd/). You can use the function `parse_medline_xml` to parse that format. This function will return list of dictionaries, where each element contains:
* `pmid` : PubMed ID
* `pmc` : PubMed Central ID
* `doi` : DOI
* `other_id` : Other IDs found, each separated by `;`
* `title` : title of the article
* `abstract` : abstract of the article
* `authors` : authors, each separated by `;`
* `mesh_terms` : list of MeSH terms with corresponding MeSH ID, each separated by `;` e.g. `'D000161:Acoustic Stimulation; D000328:Adult; ...`
* `publication_types` : list of publication type list each separated by `;` e.g. `'D016428:Journal Article'`
* `keywords` : list of keywords, each separated by `;`
* `chemical_list` : list of chemical terms, each separated by `;`
* `pubdate` : Publication date. Defaults to year information only.
* `journal` : journal of the given paper
* `medline_ta` : this is abbreviation of the journal name
* `nlm_unique_id` : NLM unique identification
* `issn_linking` : ISSN linkage, typically use to link with Web of Science dataset
* `country` : Country extracted from journal information field
* `reference` : string of PMID each separated by `;` or list of references made to the article
* `delete` : boolean if `False` means paper got updated so you might have two
* `languages` : list of languages, separated by `;`
* `vernacular_title`: vernacular title. Defaults to empty string whenever non-available.XMLs for the same paper. You can delete the record of deleted paper because it got updated.
``` python
dicts_out = pp.parse_medline_xml('data/medline16n0902.xml.gz',
year_info_only=False,
nlm_category=False,
author_list=False,
reference_list=False) # return list of dictionary
```To extract month and day information from PubDate, set `year_info_only=True`. We also allow parsing structured abstract and we can control display of each section or label by changing `nlm_category` argument.
### Parse MEDLINE Grant ID
Use `parse_grant_id` in order to parse MEDLINE grant IDs from XML file. This will return a list of dictionaries, each containing
* `pmid` : PubMed ID
* `grant_id` : Grant ID
* `grant_acronym` : Acronym of grant
* `country` : Country where grant funding from
* `agency` : Grant agencyIf no Grant ID is found, it will return `None`
### Parse MEDLINE XML from eutils website
You can use PubMed parser to parse XML file from [E-Utilities](http://www.ncbi.nlm.nih.gov/books/NBK25501/) using `parse_xml_web` . For this function, you can provide a single `pmid` as an input and get a dictionary with following keys
* `title` : title
* `abstract` : abstract
* `journal` : journal
* `affiliation` : affiliation of first author
* `authors` : string of authors, separated by `;`
* `year` : Publication year
* `keywords` : keywords or MESH terms of the article``` python
dict_out = pp.parse_xml_web(pmid, save_xml=False)
```### Parse MEDLINE XML citations from website
The function `parse_citation_web` allows you to get the citations to a given PubMed ID or PubMed Central ID. This will return a dictionary which contains the following keys
* `pmc` : PubMed Central ID
* `pmid` : PubMed ID
* `doi` : DOI of the article
* `n_citations` : number of citations for given articles
* `pmc_cited` : list of PMCs that cite the given PMC``` python
dict_out = pp.parse_citation_web(doc_id, id_type='PMC')
```### Parse Outgoing XML citations from website
The function `parse_outgoing_citation_web` allows you to get the articles a given article cites, given a PubMed ID or PubMed Central ID. This will return a dictionary which contains the following keys
* `n_citations` : number of cited articles
* `doc_id` : the document identifier given
* `id_type` : the type of identifier given. Either `'PMID'` or `'PMC'`
* `pmid_cited` : list of PMIDs cited by the article``` python
dict_out = pp.parse_outgoing_citation_web(doc_id, id_type='PMID')
```Identifiers should be passed as strings. PubMed Central ID's are default, and should be passed as strings *without* the `'PMC'` prefix. If no citations are found, or if no article is found matching `doc_id` in the indicated database, it will return `None`.
## Installation
You can install the most update version of the package directly from the repository
``` bash
pip install git+https://github.com/titipata/pubmed_parser.git
```or install recent release with [PyPI](https://pypi.org/project/pubmed-parser/) using
``` bash
pip install pubmed-parser
```or clone the repository and install using `pip`
``` bash
git clone https://github.com/titipata/pubmed_parser
pip install ./pubmed_parser
```You can test your installation by running `pytest --cov=pubmed_parser tests/ --verbose`
in the root of the repository.## Example snippet to parse PubMed OA dataset
An example usage is shown as follows
``` python
import pubmed_parser as pp
path_xml = pp.list_xml_path('data') # list all xml paths under directory
pubmed_dict = pp.parse_pubmed_xml(path_xml[0]) # dictionary output
print(pubmed_dict){'abstract': u"Background Despite identical genotypes and ...",
'affiliation_list':
[['I1': 'Department of Biological Sciences, ...'],
['I2': 'Biology Department, Queens College, and the Graduate Center ...']],
'author_list':
[['Dennehy', 'John J', 'I1'],
['Dennehy', 'John J', 'I2'],
['Wang', 'Ing-Nang', 'I1']],
'full_title': u'Factors influencing lysis time stochasticity in bacteriophage \u03bb',
'journal': 'BMC Microbiology',
'pmc': '3166277',
'pmid': '21810267',
'publication_year': '2011',
'publisher_id': '1471-2180-11-174',
'subjects': 'Research Article'}
```## Example Usage with PySpark
This is a snippet to parse all PubMed Open Access subset using [PySpark 2.1](https://spark.apache.org/docs/latest/api/python/index.html)
``` python
import os
import pubmed_parser as pp
from pyspark.sql import Rowpath_all = pp.list_xml_path('/path/to/xml/folder/')
path_rdd = spark.sparkContext.parallelize(path_all, numSlices=10000)
parse_results_rdd = path_rdd.map(lambda x: Row(file_name=os.path.basename(x),
**pp.parse_pubmed_xml(x)))
pubmed_oa_df = parse_results_rdd.toDF() # Spark dataframe
pubmed_oa_df_sel = pubmed_oa_df[['full_title', 'abstract', 'doi',
'file_name', 'pmc', 'pmid',
'publication_year', 'publisher_id',
'journal', 'subjects']] # select columns
pubmed_oa_df_sel.write.parquet('pubmed_oa.parquet', mode='overwrite') # write dataframe
```See [scripts](https://github.com/titipata/pubmed_parser/tree/master/scripts)
folder for more information.## Core Members
* [Titipat Achakulvisut](http://titipata.github.io)
* [Daniel E. Acuna](http://scienceofscience.org/about)and [contributors](https://github.com/titipata/pubmed_parser/graphs/contributors)
## Dependencies
* [lxml](http://lxml.de/)
* [unidecode](https://pypi.python.org/pypi/Unidecode)
* [requests](http://docs.python-requests.org/en/master/)## Citation
If you use Pubmed Parser, please cite it from [JOSS](https://joss.theoj.org/papers/10.21105/joss.01979) as follows
> Achakulvisut et al., (2020). Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset. Journal of Open Source Software, 5(46), 1979, https://doi.org/10.21105/joss.01979
or using BibTex
```
@article{Achakulvisut2020,
doi = {10.21105/joss.01979},
url = {https://doi.org/10.21105/joss.01979},
year = {2020},
publisher = {The Open Journal},
volume = {5},
number = {46},
pages = {1979},
author = {Titipat Achakulvisut and Daniel Acuna and Konrad Kording},
title = {Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset},
journal = {Journal of Open Source Software}
}
```## Contributions
We welcome contributions from anyone who would like to improve Pubmed Parser. You can create [GitHub issues](https://github.com/titipata/pubmed_parser/issues) to discuss questions or issues relating to the repository. We suggest you to read our [Contributing Guidelines](https://github.com/titipata/pubmed_parser/blob/master/CONTRIBUTING.md) before creating issues, reporting bugs, or making a contribution to the repository.
## Acknowledgement
This package is developed in [Konrad Kording's Lab](http://kordinglab.com/) at the University of Pennsylvania. We would like to thank reviewers and the editor from [JOSS](https://joss.readthedocs.io/en/latest/) including [`tleonardi`](https://github.com/tleonardi), [`timClicks`](https://github.com/timClicks), and [`majensen`](https://github.com/majensen). They made our repository much better!
## License
MIT License Copyright (c) 2015-2020 Titipat Achakulvisut, Daniel E. Acuna