https://github.com/titipata/pubmed_parser

:clipboard: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
https://github.com/titipata/pubmed_parser
article doi medline-xml nlp parse parser pmid pubmed-central pubmed-parser python xml
Last synced: 9 months ago
JSON representation
:clipboard: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
Host: GitHub
URL: https://github.com/titipata/pubmed_parser
Owner: titipata
License: mit
Created: 2015-03-05T05:13:57.000Z (almost 11 years ago)
Default Branch: master
Last Pushed: 2024-12-27T20:28:04.000Z (about 1 year ago)
Last Synced: 2025-05-14T03:27:59.367Z (9 months ago)
Topics: article, doi, medline-xml, nlp, parse, parser, pmid, pubmed-central, pubmed-parser, python, xml
Language: Python
Homepage: http://titipata.github.io/pubmed_parser/
Size: 60.4 MB
Stars: 662
Watchers: 21
Forks: 172
Open Issues: 14
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Citation: CITATION.cff
Awesome Lists containing this project

README

          # Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset

[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/titipata/pubmed_parser/blob/master/LICENSE) [![DOI](https://joss.theoj.org/papers/10.21105/joss.01979/status.svg)](https://doi.org/10.21105/joss.01979)

 [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3660006.svg)](https://doi.org/10.5281/zenodo.3660006) [![Build Status](https://travis-ci.com/titipata/pubmed_parser.svg?branch=master)](https://travis-ci.com/titipata/pubmed_parser)

Pubmed Parser is a Python library for parsing the [PubMed Open-Access (OA) subset](http://www.ncbi.nlm.nih.gov/pmc/tools/ftp/)

 , [MEDLINE XML](https://www.nlm.nih.gov/bsd/licensee/) repositories, and [Entrez Programming Utilities (E-utils)](https://eutils.ncbi.nlm.nih.gov/). It uses the `lxml` library to parse this information into a Python dictionary which can be easily used for research, such as in text mining and natural language processing pipelines.

For available APIs and details about the dataset, please see our [wiki page](https://github.com/titipata/pubmed_parser/wiki) or

 [documentation page](http://titipata.github.io/pubmed_parser/) for more details. Below, we list some of the core funtionalities and code examples.

## Available Parsers

* `path` provided to a function can be the path to a compressed or uncompressed XML file. We provide example files in the [ `data` ](data/) folder.

* for website parsing, you should scrape with pause. Please see the [copyright notice](https://www.ncbi.nlm.nih.gov/pmc/about/copyright/#copy-PMC) because your IP can get blocked if you try to download in bulk.

Below, we list available parsers from `pubmed_parser`.

  * [Parse PubMed OA XML information](#parse-pubmed-oa-xml-information)

  * [Parse PubMed OA citation references](#parse-pubmed-oa-citation-references)

  * [Parse PubMed OA images and captions](#parse-pubmed-oa-images-and-captions)

  * [Parse PubMed OA Paragraph](#parse-pubmed-oa-paragraph)

  * [Parse PubMed OA Table [WIP]](#parse-pubmed-oa-table-wip)

  * [Parse MEDLINE XML](#parse-medline-xml)

  * [Parse MEDLINE Grant ID](#parse-medline-grant-id)

  * [Parse MEDLINE XML from eutils website](#parse-medline-xml-from-eutils-website)

  * [Parse MEDLINE XML citations from website](#parse-medline-xml-citations-from-website)

  * [Parse Outgoing XML citations from website](#parse-outgoing-xml-citations-from-website)

### Parse PubMed OA XML information

We created a simple parser for the PubMed Open Access Subset where you can give an XML path or string to the function called `parse_pubmed_xml` which will return a dictionary with the following information:

* `full_title` : article's title

* `abstract` : abstract

* `journal` : Journal name

* `pmid` : PubMed ID

* `pmc` : PubMed Central ID

* `doi` : DOI of the article

* `publisher_id` : publisher ID

* `author_list` : list of authors with affiliation keys in the following format

``` python

 [['last_name_1', 'first_name_1', 'aff_key_1'],

  ['last_name_1', 'first_name_1', 'aff_key_2'],

  ['last_name_2', 'first_name_2', 'aff_key_1'], ...]

 ```

* `affiliation_list` : list of affiliation keys and affiliation strings in the following format

``` python

 [['aff_key_1', 'affiliation_1'],

  ['aff_key_2', 'affiliation_2'], ...]

```

* `publication_year` : publication year

* `subjects` : list of subjects listed in the article separated by semicolon. Sometimes, it only contains the type of the article, such as a research article, review proceedings, etc.

``` python

import pubmed_parser as pp

dict_out = pp.parse_pubmed_xml(path)

```

### Parse PubMed OA citation references

The function `parse_pubmed_references` will process a Pubmed Open Access XML file and return a list of the PMIDs it cites. Each dictionary has keys as follows

* `pmid` : PubMed ID of the article

* `pmc` : PubMed Central ID of the article

* `article_title` : title of cited article

* `journal` : journal name

* `journal_type` : type of journal

* `pmid_cited` : PubMed ID of article that article cites

* `doi_cited` : DOI of article that article cites

* `year` : Publication year as it appears in the reference (may include letter suffix, e.g.2007a)

``` python

dicts_out = pp.parse_pubmed_references(path) # return list of dictionary

```

### Parse PubMed OA images and captions

The function `parse_pubmed_caption` can parse image captions from a given path to XML file. It will return reference index that you can refer back to actual images. The function will return list of dictionary which has following keys

* `pmid` : PubMed ID

* `pmc` : PubMed Central ID

* `fig_caption` : string of caption

* `fig_id` : reference id for figure (use to refer in XML article)

* `fig_label` : label of the figure

* `graphic_ref` : reference to image file name provided from Pubmed OA

``` python

dicts_out = pp.parse_pubmed_caption(path) # return list of dictionary

```

### Parse PubMed OA Paragraph

For someone who might be interested in parsing the text surrounding a citation, the library also provides that functionality. You can use `parse_pubmed_paragraph` to parse text and reference PMIDs. This function will return a list of dictionaries, where each entry will have following keys:

* `pmid` : PubMed ID

* `pmc` : PubMed Central ID

* `text` : full text of the paragraph

* `reference_ids` : list of reference code within that paragraph.

This IDs can merge with output from `parse_pubmed_references` .

* `section` : section of paragraph (e.g. Background, Discussion, Appendix, etc.)

``` python

dicts_out = pp.parse_pubmed_paragraph('data/6605965a.nxml', all_paragraph=False)

```

### Parse PubMed OA Table [WIP]

You can use `parse_pubmed_table` to parse table from XML file. This function will return list of dictionaries where each has following keys.

* `pmid` : PubMed ID

* `pmc` : PubMed Central ID

* `caption` : caption of the table

* `label` : lable of the table

* `table_columns` : list of column name

* `table_values` : list of values inside the table

* `table_xml` : raw xml text of the table (return if `return_xml=True`)

``` python

dicts_out = pp.parse_pubmed_table('data/medline16n0902.xml.gz', return_xml=False)

```

### Parse MEDLINE XML

MEDLINE XML has a different XML format than PubMed Open Access. The structure of XML files can be found in MEDLINE/PubMed DTD [here](https://www.nlm.nih.gov/databases/dtd/). You can use the function `parse_medline_xml` to parse that format. This function will return list of dictionaries, where each element contains:

* `pmid` : PubMed ID

* `pmc` : PubMed Central ID

* `doi` : DOI

* `other_id` : Other IDs found, each separated by `;`

* `title` : title of the article

* `abstract` : abstract of the article

* `authors` : authors, each separated by `;`

* `mesh_terms` : list of MeSH terms with corresponding MeSH ID, each separated by `;` e.g. `'D000161:Acoustic Stimulation; D000328:Adult; ...`

* `publication_types` : list of publication type list each separated by `;` e.g. `'D016428:Journal Article'`

* `keywords` : list of keywords, each separated by `;`

* `chemical_list` : list of chemical terms, each separated by `;`

* `pubdate` : Publication date. Defaults to year information only.

* `journal` : journal of the given paper

* `medline_ta` : this is abbreviation of the journal name

* `nlm_unique_id` : NLM unique identification

* `issn_linking` : ISSN linkage, typically use to link with Web of Science dataset

* `country` : Country extracted from journal information field

* `reference` : string of PMID each separated by `;` or list of references made to the article

* `delete` : boolean if `False` means paper got updated so you might have two

* `languages` : list of languages, separated by `;`

* `vernacular_title`: vernacular title. Defaults to empty string whenever non-available.

XMLs for the same paper. You can delete the record of deleted paper because it got updated.

``` python

dicts_out = pp.parse_medline_xml('data/medline16n0902.xml.gz',

                                 year_info_only=False,

                                 nlm_category=False,

                                 author_list=False,

                                 reference_list=False) # return list of dictionary

```

To extract month and day information from PubDate, set `year_info_only=True`. We also allow parsing structured abstract and we can control display of each section or label by changing `nlm_category` argument.

### Parse MEDLINE Grant ID

Use `parse_grant_id` in order to parse MEDLINE grant IDs from XML file. This will return a list of dictionaries, each containing

* `pmid` : PubMed ID

* `grant_id` : Grant ID

* `grant_acronym` : Acronym of grant

* `country` : Country where grant funding from

* `agency` : Grant agency

If no Grant ID is found, it will return `None`

### Parse MEDLINE XML from eutils website

You can use PubMed parser to parse XML file from [E-Utilities](http://www.ncbi.nlm.nih.gov/books/NBK25501/) using `parse_xml_web` . For this function, you can provide a single `pmid` as an input and get a dictionary with following keys

* `title` : title

* `abstract` : abstract

* `journal` : journal

* `affiliation` : affiliation of first author

* `authors` : string of authors, separated by `;`

* `year` : Publication year

* `keywords` : keywords or MESH terms of the article

``` python

dict_out = pp.parse_xml_web(pmid, save_xml=False)

```

### Parse MEDLINE XML citations from website

The function `parse_citation_web` allows you to get the citations to a given PubMed ID or PubMed Central ID. This will return a dictionary which contains the following keys

* `pmc` : PubMed Central ID

* `pmid` : PubMed ID

* `doi` : DOI of the article

* `n_citations` : number of citations for given articles

* `pmc_cited` : list of PMCs that cite the given PMC

``` python

dict_out = pp.parse_citation_web(doc_id, id_type='PMC')

```

### Parse Outgoing XML citations from website

The function `parse_outgoing_citation_web` allows you to get the articles a given article cites, given a PubMed ID or PubMed Central ID. This will return a dictionary which contains the following keys

* `n_citations` : number of cited articles

* `doc_id` : the document identifier given

* `id_type` : the type of identifier given. Either `'PMID'` or `'PMC'`

* `pmid_cited` : list of PMIDs cited by the article

``` python

dict_out = pp.parse_outgoing_citation_web(doc_id, id_type='PMID')

```

Identifiers should be passed as strings. PubMed Central ID's are default, and should be passed as strings *without* the `'PMC'` prefix. If no citations are found, or if no article is found matching `doc_id` in the indicated database, it will return `None`.

## Installation

You can install the most update version of the package directly from the repository

``` bash

pip install git+https://github.com/titipata/pubmed_parser.git

```

or install recent release with [PyPI](https://pypi.org/project/pubmed-parser/) using

``` bash

pip install pubmed-parser

```

or clone the repository and install using `pip`

``` bash

git clone https://github.com/titipata/pubmed_parser

pip install ./pubmed_parser

```

You can test your installation by running `pytest --cov=pubmed_parser tests/ --verbose`

in the root of the repository.

## Example snippet to parse PubMed OA dataset

An example usage is shown as follows

``` python

import pubmed_parser as pp

path_xml = pp.list_xml_path('data') # list all xml paths under directory

pubmed_dict = pp.parse_pubmed_xml(path_xml[0]) # dictionary output

print(pubmed_dict)

{'abstract': u"Background Despite identical genotypes and ...",

 'affiliation_list':

  [['I1': 'Department of Biological Sciences, ...'],

   ['I2': 'Biology Department, Queens College, and the Graduate Center ...']],

  'author_list':

  [['Dennehy', 'John J', 'I1'],

   ['Dennehy', 'John J', 'I2'],

   ['Wang', 'Ing-Nang', 'I1']],

 'full_title': u'Factors influencing lysis time stochasticity in bacteriophage \u03bb',

 'journal': 'BMC Microbiology',

 'pmc': '3166277',

 'pmid': '21810267',

 'publication_year': '2011',

 'publisher_id': '1471-2180-11-174',

 'subjects': 'Research Article'}

```

## Example Usage with PySpark

This is a snippet to parse all PubMed Open Access subset using [PySpark 2.1](https://spark.apache.org/docs/latest/api/python/index.html)

``` python

import os

import pubmed_parser as pp

from pyspark.sql import Row

path_all = pp.list_xml_path('/path/to/xml/folder/')

path_rdd = spark.sparkContext.parallelize(path_all, numSlices=10000)

parse_results_rdd = path_rdd.map(lambda x: Row(file_name=os.path.basename(x),

                                               **pp.parse_pubmed_xml(x)))

pubmed_oa_df = parse_results_rdd.toDF() # Spark dataframe

pubmed_oa_df_sel = pubmed_oa_df[['full_title', 'abstract', 'doi',

                                 'file_name', 'pmc', 'pmid',

                                 'publication_year', 'publisher_id',

                                 'journal', 'subjects']] # select columns

pubmed_oa_df_sel.write.parquet('pubmed_oa.parquet', mode='overwrite') # write dataframe

```

See [scripts](https://github.com/titipata/pubmed_parser/tree/master/scripts)

folder for more information.

## Core Members

* [Titipat Achakulvisut](http://titipata.github.io)

* [Daniel E. Acuna](http://scienceofscience.org/about)

and [contributors](https://github.com/titipata/pubmed_parser/graphs/contributors)

## Dependencies

* [lxml](http://lxml.de/)

* [unidecode](https://pypi.python.org/pypi/Unidecode)

* [requests](http://docs.python-requests.org/en/master/)

## Citation

If you use Pubmed Parser, please cite it from [JOSS](https://joss.theoj.org/papers/10.21105/joss.01979) as follows

> Achakulvisut et al., (2020). Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset. Journal of Open Source Software, 5(46), 1979, https://doi.org/10.21105/joss.01979

or using BibTex

```

@article{Achakulvisut2020,

  doi = {10.21105/joss.01979},

  url = {https://doi.org/10.21105/joss.01979},

  year = {2020},

  publisher = {The Open Journal},

  volume = {5},

  number = {46},

  pages = {1979},

  author = {Titipat Achakulvisut and Daniel Acuna and Konrad Kording},

  title = {Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset},

  journal = {Journal of Open Source Software}

}

```

## Contributions

We welcome contributions from anyone who would like to improve Pubmed Parser. You can create [GitHub issues](https://github.com/titipata/pubmed_parser/issues) to discuss questions or issues relating to the repository. We suggest you to read our [Contributing Guidelines](https://github.com/titipata/pubmed_parser/blob/master/CONTRIBUTING.md) before creating issues, reporting bugs, or making a contribution to the repository.

## Acknowledgement

This package is developed in [Konrad Kording's Lab](http://kordinglab.com/) at the University of Pennsylvania. We would like to thank reviewers and the editor from [JOSS](https://joss.readthedocs.io/en/latest/) including [`tleonardi`](https://github.com/tleonardi), [`timClicks`](https://github.com/timClicks), and [`majensen`](https://github.com/majensen). They made our repository much better!

## License

MIT License Copyright (c) 2015-2020 Titipat Achakulvisut, Daniel E. Acuna
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/titipata/pubmed_parser

Awesome Lists containing this project

README