https://github.com/krassowski/easy-entrez

Retrieve PubMed articles, text-mining annotations, or molecular data from >35 Entrez databases via easy to use Python package - built on top of Entrez E-utilities API.
entrez entrez-eutilities eutilities gene-annotations literature-mining literature-search meta-analysis pubmed pubmed-central

Host: GitHub
URL: https://github.com/krassowski/easy-entrez
Owner: krassowski
License: lgpl-3.0
Created: 2020-06-14T10:48:26.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2023-11-02T21:59:16.000Z (about 1 year ago)
Last Synced: 2024-12-10T04:51:28.480Z (24 days ago)
Topics: entrez, entrez-eutilities, eutilities, gene-annotations, literature-mining, literature-search, meta-analysis, pubmed, pubmed-central
Language: Python
Homepage: https://easy-entrez.readthedocs.io/en/latest/
Size: 120 KB
Stars: 71
Watchers: 4
Forks: 6
Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE

# easy-entrez

![Tests](https://github.com/krassowski/easy-entrez/workflows/tests/badge.svg)

![CodeQL](https://github.com/krassowski/easy-entrez/workflows/CodeQL/badge.svg)

[![Documentation Status](https://readthedocs.org/projects/easy-entrez/badge/?version=latest)](https://easy-entrez.readthedocs.io/en/latest/?badge=latest)

[![DOI](https://zenodo.org/badge/272182307.svg)](https://zenodo.org/badge/latestdoi/272182307)

![Python](https://img.shields.io/badge/python-3.7%20%7C%203.8%20%7C%203.9%20%7C%203.10%20%7C%203.11%20%7C%203.12-blue)

A Python client for the Entrez E-Utilities REST API, aiming to be easy to use and reliable.

Easy-entrez:

- makes common tasks easy thanks to a simple Pythonic API,
- is typed and integrates well with mypy,
- is tested on Windows, Mac and Linux across Python 3.7 to 3.12,
- is limited in scope, which keeps the focus on the reliability of the core code,
- does not use the stateful API, as it is [error-prone](https://gitlab.com/ncbipy/entrezpy/-/issues/7), as seen in the alternative *entrezpy*.

### Examples

```python
from easy_entrez import EntrezAPI

entrez_api = EntrezAPI(
    'your-tool-name',
    'your.email@example.org',  # placeholder: use your own contact e-mail
    # optional
    return_type='json'
)

# find up to 10 000 results for cancer in human
result = entrez_api.search('cancer AND human[organism]', max_results=10_000)

# data will be populated with JSON or XML (depending on the `return_type` value)
result.data
```

See more in the [Demo notebook](./Demo.ipynb) and [documentation](https://easy-entrez.readthedocs.io/en/latest).

For a real-world example (used for [this publication](https://www.frontiersin.org/articles/10.3389/fgene.2020.610798/full)), see the notebooks in the [multi-omics-state-of-the-field](https://github.com/krassowski/multi-omics-state-of-the-field) repository.

#### Fetching genes for a variant from dbSNP

Fetch the SNP record for `rs6311`:

```python
rs6311 = entrez_api.fetch(['rs6311'], max_results=1, database='snp').data[0]
rs6311
```

Display the result:

```python
from easy_entrez.parsing import xml_to_string

print(xml_to_string(rs6311))
```

Find the gene names for `rs6311`:

```python
namespaces = {'ns0': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'}
genes = [
    name.text
    for name in rs6311.findall('.//ns0:GENE_E/ns0:NAME', namespaces)
]
print(genes)
```

> `['HTR2A']`

Fetch data for multiple variants at once:

```python
result = entrez_api.fetch(['rs6311', 'rs662138'], max_results=10, database='snp')
gene_names = {
    'rs' + document_summary.get('uid'): [
        element.text
        for element in document_summary.findall('.//ns0:GENE_E/ns0:NAME', namespaces)
    ]
    for document_summary in result.data
}
print(gene_names)
```

> `{'rs6311': ['HTR2A'], 'rs662138': ['SLC22A1']}`
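
To see how the namespaced lookups above behave without calling the API, here is a minimal offline sketch. The XML fragment is hand-written for illustration: it only assumes the element names used in the queries above (`GENE_E`, `NAME`) and the `uid` attribute, while the `DocumentSummary` and `GENES` wrapper tags are assumptions. Real dbSNP records contain many more fields.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for one dbSNP document summary (not real API output).
SAMPLE = """
<DocumentSummary xmlns="https://www.ncbi.nlm.nih.gov/SNP/docsum" uid="6311">
  <GENES>
    <GENE_E>
      <NAME>HTR2A</NAME>
    </GENE_E>
  </GENES>
</DocumentSummary>
"""

namespaces = {'ns0': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'}
document_summary = ET.fromstring(SAMPLE.strip())

# The prefix (`ns0`) is arbitrary; only the namespace URI has to match the document.
sample_genes = [
    name.text
    for name in document_summary.findall('.//ns0:GENE_E/ns0:NAME', namespaces)
]
# Attributes are not namespaced, so `uid` is read without a prefix.
sample_id = 'rs' + document_summary.get('uid')
print(sample_id, sample_genes)
```

Omitting the `namespaces` mapping (or searching for plain `GENE_E/NAME`) would return an empty list here, because every element in the fragment lives in the default namespace.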

#### Obtaining the chromosomal position from SNP rsID number

```python
from pandas import DataFrame

result = entrez_api.fetch(['rs6311', 'rs662138'], max_results=10, database='snp')

variant_positions = DataFrame([
    {
        'id': 'rs' + document_summary.get('uid'),
        'chromosome': chromosome,
        'position': position
    }
    for document_summary in result.data
    for chrom_and_position in document_summary.findall('.//ns0:CHRPOS', namespaces)
    for chromosome, position in [chrom_and_position.text.split(':')]
])

variant_positions
```

> |    | id       |   chromosome |   position |
> |---:|:---------|-------------:|-----------:|
> |  0 | rs6311   |           13 |   46897343 |
> |  1 | rs662138 |            6 |  160143444 |
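
Splitting the `CHRPOS` text on `:` yields two strings. If you need numeric positions (for sorting or range comparisons), a small helper along these lines may be useful. `split_chrpos` is a hypothetical name, not part of easy-entrez, and the sketch assumes the `chromosome:position` layout shown above.

```python
from typing import Tuple


def split_chrpos(chrpos: str) -> Tuple[str, int]:
    """Split a `chromosome:position` string (hypothetical helper)."""
    chromosome, position = chrpos.split(':')
    # Keep the chromosome as text so that non-numeric names are preserved.
    return chromosome, int(position)


print(split_chrpos('13:46897343'))
print(split_chrpos('6:160143444'))
```

Keeping the chromosome as a string is a deliberate choice in this sketch, since not every chromosome name is a number.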

#### Converting full variation/mutation data to tabular format

Parsing utilities can quickly extract the data into a `VariantSet` object holding pandas `DataFrame`s with coordinates and alternative allele frequencies:

```python
from easy_entrez.parsing import parse_dbsnp_variants

variants = parse_dbsnp_variants(result)
variants
```

> ``

To get the coordinates:

```python
variants.coordinates
```

> | rs_id    | ref   | alts   |   chrom |       pos |   chrom_prev |   pos_prev | consequence                                                                  |
> |:---------|:------|:-------|--------:|----------:|-------------:|-----------:|:-----------------------------------------------------------------------------|
> | rs6311   | C     | A,T    |      13 |  46897343 |           13 |   47471478 | upstream_transcript_variant,intron_variant,genic_upstream_transcript_variant |
> | rs662138 | C     | G      |       6 | 160143444 |            6 |  160564476 | intron_variant                                                               |

For frequencies:

```python
variants.alt_frequencies.head(5)  # using head to only display first 5 for brevity
```

> |    | rs_id   | allele   |   source_frequency |   total_count | study       |     count |
> |---:|:--------|:---------|-------------------:|--------------:|:------------|----------:|
> |  0 | rs6311  | T        |           0.44349  |          2221 | 1000Genomes |   984.991 |
> |  1 | rs6311  | T        |           0.411261 |          1585 | ALSPAC      |   651.849 |
> |  2 | rs6311  | T        |           0.331696 |          1486 | Estonian    |   492.9   |
> |  3 | rs6311  | T        |           0.35     |            14 | GENOME_DK   |     4.9   |
> |  4 | rs6311  | T        |           0.402529 |         56309 | GnomAD      | 22666     |

#### Obtaining the SNP rs ID number from chromosomal position

You can use the query string directly:

```python
results = entrez_api.search(
    '13[CHROMOSOME] AND human[ORGANISM] AND 31873085[POSITION]',
    database='snp',
    max_results=10
)
print(results.data['esearchresult']['idlist'])
```

> `['59296319', '17076752', '7336701', '4']`

Or pass a dictionary (no validation of arguments is performed, `AND` conjunction is used):

```python
results = entrez_api.search(
    dict(chromosome=13, organism='human', position=31873085),
    database='snp',
    max_results=10
)
print(results.data['esearchresult']['idlist'])
```

> `['59296319', '17076752', '7336701', '4']`
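
The two calls above are described as equivalent, so the dictionary form can be thought of as shorthand for the query string. The sketch below shows one way such a mapping could be turned into a query. `build_query` is a hypothetical helper written for illustration, and it is an assumption that easy-entrez builds the string in a comparable way. Like the library, it performs no validation of field names.

```python
def build_query(terms: dict) -> str:
    """Join `value[FIELD]` terms with AND (illustrative, not the library's code)."""
    return ' AND '.join(
        f'{value}[{field.upper()}]'
        for field, value in terms.items()
    )


query = build_query(dict(chromosome=13, organism='human', position=31873085))
print(query)
```

Because dictionaries preserve insertion order, the terms appear in the order in which they were given.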

The base position should use the latest genome assembly (GRCh38 at the time of writing); you can use the position in previous assembly coordinates by replacing `POSITION` with `POSITION_GRCH37`.

For more information on the arguments accepted by the SNP database, see the [entrez help page](https://www.ncbi.nlm.nih.gov/snp/docs/entrez_help/) on the NCBI website.

#### Obtaining amino acid change information for variants in a given range

First, search for the dbSNP rs identifiers of variants in a given region:

```python
dbsnp_ids = (
    entrez_api
    .search(
        '12[CHROMOSOME] AND human[ORGANISM] AND 21178600:21178720[POSITION]',
        database='snp',
        max_results=100
    )
    .data
    ['esearchresult']
    ['idlist']
)
```

Then fetch the variant data for these identifiers:

```python
variant_data = entrez_api.fetch(
    ['rs' + rs_id for rs_id in dbsnp_ids],
    max_results=10,
    database='snp'
)
```

Finally, parse the data, extracting the HGVS notation from the summary:

```python
from easy_entrez.parsing import parse_dbsnp_variants
from pandas import Series


def select_protein_hgvs(items):
    # keep only the protein-level (`p.`) entries, split into [sequence, hgvs]
    return [
        [sequence, hgvs]
        for entry in items
        for sequence, hgvs in [entry.split(':')]
        if hgvs.startswith('p.')
    ]


protein_hgvs = (
    parse_dbsnp_variants(variant_data)
    .summary
    .HGVS
    .apply(select_protein_hgvs)
    .explode()
    .dropna()
    .apply(Series)
    .rename(columns={0: 'sequence', 1: 'hgvs'})
)
protein_hgvs.head()
```

> | rs_id        | sequence    | hgvs        |
> |:-------------|:------------|:------------|
> | rs1940853486 | NP_006437.3 | p.Gly203Ter |
> | rs1940853414 | NP_006437.3 | p.Glu202Gly |
> | rs1940853378 | NP_006437.3 | p.Glu202Lys |
> | rs1940853299 | NP_006437.3 | p.Lys201Thr |
> | rs1940852987 | NP_006437.3 | p.Asp198Glu |
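
To check what `select_protein_hgvs` keeps without fetching anything, it can be run on a short hand-written list. In the sketch below the protein entry is taken from the table above, while the genomic (`g.`) and coding (`c.`) entries use invented placeholder accessions and coordinates purely for illustration.

```python
def select_protein_hgvs(items):
    # keep only the protein-level (`p.`) entries, split into [sequence, hgvs]
    return [
        [sequence, hgvs]
        for entry in items
        for sequence, hgvs in [entry.split(':')]
        if hgvs.startswith('p.')
    ]


sample_hgvs = [
    'NC_000000.0:g.123C>A',     # invented placeholder (genomic level)
    'NM_000000.0:c.45G>T',      # invented placeholder (coding level)
    'NP_006437.3:p.Gly203Ter',  # protein-level entry from the table above
]

selected = select_protein_hgvs(sample_hgvs)
print(selected)
```

Only the protein-level entry survives the filter. Note that the unpacking assumes exactly one `:` per entry; an entry with a different shape would raise a `ValueError`.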

#### Fetching more than 10 000 entries

Use the `in_batches_of` method to fetch more than 10k entries (e.g. `variant_ids`):

```python
snps_result = (
    entrez_api
    .in_batches_of(1_000)
    .fetch(variant_ids, max_results=5_000, database='snp')
)
```

The result is a dictionary with keys being the identifiers used in each batch (because the Entrez API does not always return the identifiers back) and values representing the result. You can use `parse_dbsnp_variants` directly on this dictionary.
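
The shape of such a batched result can be illustrated offline. The sketch below is not the library's implementation: `in_batches` and `fake_fetch` are hypothetical stand-ins, and using a tuple of identifiers as the dictionary key is an assumption made here because keys have to be hashable.

```python
from typing import Iterator, List, Tuple


def in_batches(ids: List[str], size: int) -> Iterator[Tuple[str, ...]]:
    """Yield consecutive chunks of at most `size` identifiers."""
    for start in range(0, len(ids), size):
        yield tuple(ids[start:start + size])


def fake_fetch(batch: Tuple[str, ...]) -> List[str]:
    """Stand-in for a real request; returns one dummy record per identifier."""
    return [f'record for {identifier}' for identifier in batch]


variant_ids = [f'rs{n}' for n in range(1, 8)]

# Key each result by the identifiers that were requested in that batch,
# so results can be matched to inputs even if the response omits the ids.
results_by_batch = {
    batch: fake_fetch(batch)
    for batch in in_batches(variant_ids, 3)
}

for batch, records in results_by_batch.items():
    print(batch, len(records))
```

With seven identifiers and a batch size of three, the last batch holds the single remaining identifier.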

#### Find PubMed ID from DOI

Searching the GWAS Catalog requires a PMID rather than a DOI. You can convert one to the other using:

```python
def doi_term(doi: str) -> str:
    """Build a PubMed search term from a DOI, removing any URL prefix."""
    doi = (
        doi
        .replace('http://', 'https://')
        .replace('https://doi.org/', '')
    )
    return f'"{doi}"[Publisher ID]'


result = entrez_api.search(
    doi_term('https://doi.org/10.3389/fcell.2021.626821'),
    database='pubmed',
    max_results=1
)
print(result.data['esearchresult']['idlist'])
```

> `['33834021']`
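
DOIs are often written in other forms as well, such as a `dx.doi.org` URL or a `doi:` prefix. A slightly more tolerant variant could look like the sketch below. `doi_term_tolerant` and `DOI_PREFIX` are hypothetical names introduced here; the set of prefixes handled is an assumption about common inputs, not an exhaustive list.

```python
import re

# Matches http(s)://doi.org/, http(s)://dx.doi.org/ and a bare `doi:` prefix.
DOI_PREFIX = re.compile(r'^(?:https?://(?:dx\.)?doi\.org/|doi:)', re.IGNORECASE)


def doi_term_tolerant(doi: str) -> str:
    """Strip a known prefix (if any) and wrap the DOI as a search term."""
    bare = DOI_PREFIX.sub('', doi.strip())
    return f'"{bare}"[Publisher ID]'


terms = [
    doi_term_tolerant(raw)
    for raw in [
        'https://doi.org/10.3389/fcell.2021.626821',
        'http://dx.doi.org/10.3389/fcell.2021.626821',
        'doi:10.3389/fcell.2021.626821',
        '10.3389/fcell.2021.626821',
    ]
]
print(terms)
```

All four inputs produce the same search term, which can then be passed to `entrez_api.search` exactly as above.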

### Installation

Requires Python 3.6+ (though only 3.7+ is tested). Install with:

```bash
pip install easy-entrez
```

If you wish to enable (optional, `tqdm`-based) progress bars use:

```bash
pip install easy-entrez[with_progress_bars]
```

If you wish to enable (optional, `pandas`-based) parsing utilities use:

```bash
pip install easy-entrez[with_parsing_utils]
```

### Contributing

To build the documentation locally:

```bash
pip install -e .[docs]
sphinx-build docs docs/_build
open docs/_build/index.html
```

### Alternatives

You might want to try:

- [biopython.Entrez](https://biopython.org/docs/1.74/api/Bio.Entrez.html) - biopython is a heavy dependency, but probably a good choice if you already use it
- [pubmedpy](https://github.com/dhimmel/pubmedpy) - provides interesting utilities for parsing the responses
- [entrez](https://github.com/jordibc/entrez) - appears to have a comparable scope but quite a different API
- [entrezpy](https://gitlab.com/ncbipy/entrezpy) - this one did not work well for me (hence this package), but may have improved since