https://github.com/etowahadams/pmc-figure-downloader

Extract figures and figure captions from PMC open access papers

bioinformatics entrez pubmed pubmed-central

# PMC Figure Downloader

This is a simple script to download figures from open access papers in the [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc/) database. It uses the [PMC API](https://www.ncbi.nlm.nih.gov/pmc/tools/developers/) to search for articles and the [PMC Open Access Web Service API](https://www.ncbi.nlm.nih.gov/pmc/tools/oa-service/) to get the XML of each paper, from which it determines the figure information and URLs.

## Installation
Create a virtual environment and install the dependencies with pip:
```bash
python -m venv env
source env/bin/activate
pip install -r requirements.txt
```

## Usage
See `example.ipynb` for a Jupyter notebook example.

### Query PMC for open access papers
```python
query = '"Nature Genetics"[Journal] AND "open access"[filter]'
# Returns a list of paper IDs
# Replace the placeholder address with your own email
result_ids = search_pmc(query, email="your.email@example.com", max_results=10)
```
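For readers curious about what such a search involves, here is a minimal sketch of one way to query PMC through the NCBI E-utilities ESearch endpoint. It is not necessarily how `search_pmc` is implemented: the helper names `build_esearch_url` and `parse_id_list` are hypothetical, the email address is a placeholder, and the sample response is hand-written in the shape ESearch returns with `retmode=json`. No network request is made.

```python
import json
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(query, email, max_results=10):
    """Build an ESearch request URL for the PMC database (hypothetical helper)."""
    params = {
        "db": "pmc",            # search PubMed Central
        "term": query,          # the query string, e.g. with [Journal] or [filter] tags
        "retmax": max_results,  # maximum number of IDs to return
        "retmode": "json",      # ask for a JSON response
        "email": email,         # contact address, as NCBI requests
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_id_list(response_text):
    """Pull the list of IDs out of an ESearch JSON response body (hypothetical helper)."""
    return json.loads(response_text)["esearchresult"]["idlist"]


query = '"Nature Genetics"[Journal] AND "open access"[filter]'
url = build_esearch_url(query, email="your.email@example.com")

# Hand-written sample response, trimmed to the fields used here
sample_response = '{"esearchresult": {"count": "2", "idlist": ["10937393", "10864173"]}}'
ids = parse_id_list(sample_response)
```

Keeping URL construction and response parsing separate from the HTTP call makes both easy to check without hitting the NCBI servers.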
### Extract figure information and URLs from a list of paper IDs
```python
# A list of PMC IDs
result_ids = ["10937393", "10864173"]
# A dataframe with figure information
figure_data = extract_pmc_figures(result_ids)
# Save the dataframe to a parquet file if you want to use it later
figure_data.write_parquet("figure_data.parquet")
```
The `figure_data` dataframe looks like this:
```
┌──────────┬────────┬───────────┬──────────────────────┬─────────────────────┬─────────────────────┐
│ pmcid    ┆ fig_id ┆ fig_label ┆ fig_title            ┆ fig_desc            ┆ image_url           │
│ ---      ┆ ---    ┆ ---       ┆ ---                  ┆ ---                 ┆ ---                 │
│ str      ┆ str    ┆ str       ┆ str                  ┆ str                 ┆ str                 │
╞══════════╪════════╪═══════════╪══════════════════════╪═════════════════════╪═════════════════════╡
│ 10937393 ┆ Fig1   ┆ Fig. 1    ┆ FANS-based isolation ┆ a, Schematic        ┆ https://www.ncbi.nl │
│          ┆        ┆           ┆ of nuclei o…         ┆ representation of   ┆ m.nih.gov/pmc…      │
│          ┆        ┆           ┆                      ┆ t…                  ┆                     │
│ 10937393 ┆ Fig2   ┆ Fig. 2    ┆ Purity and           ┆ a, Heatmaps depict  ┆ https://www.ncbi.nl │
│          ┆        ┆           ┆ reproducibility of   ┆ log2-transfor…      ┆ m.nih.gov/pmc…      │
│          ┆        ┆           ┆ th…                  ┆                     ┆                     │
```
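The columns above correspond to parts of the `<fig>` elements in an article's XML. As a rough illustration, the sketch below pulls the same kind of fields out of a small hand-written sample with the standard library. It assumes JATS-style markup (`<fig>`, `<label>`, `<caption>`, `<graphic xlink:href>`); the helper names `text_of` and `extract_figures` are hypothetical, and this is not necessarily how `extract_pmc_figures` works. Turning the `xlink:href` value into a full image URL is left out, since the source does not say how that is done.

```python
import xml.etree.ElementTree as ET

# The href attribute on <graphic> lives in the XLink namespace
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hand-written sample in JATS style, not taken from a real paper
SAMPLE_XML = """\
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <fig id="Fig1">
      <label>Fig. 1</label>
      <caption>
        <title>Example title</title>
        <p>Example <italic>caption</italic> text.</p>
      </caption>
      <graphic xlink:href="example_Fig1_HTML"/>
    </fig>
  </body>
</article>
"""


def text_of(element):
    """Return an element's text including nested tags, with whitespace collapsed."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def extract_figures(xml_text):
    """Return one dict per <fig> element found in the XML (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = []
    for fig in root.iter("fig"):
        graphic = fig.find("graphic")
        rows.append({
            "fig_id": fig.get("id", ""),
            "fig_label": text_of(fig.find("label")),
            "fig_title": text_of(fig.find("caption/title")),
            "fig_desc": text_of(fig.find("caption/p")),
            "image_href": graphic.get(XLINK_HREF, "") if graphic is not None else "",
        })
    return rows


figures = extract_figures(SAMPLE_XML)
```

Using `itertext()` keeps caption text that sits inside inline tags such as `<italic>`, which a plain `.text` lookup would drop.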

### Download figures to a directory
```python
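import polars as pl

output_dir = "img"
# Download every figure listed in `figure_data` into `output_dir`
download_status = download_imgs(figure_data, output_dir)
# Rows with a status other than 200 are failed downloads
print("Failed downloads:")
print(download_status.filter(pl.col("status") != 200))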
```
The `download_status` dataframe will look like this:
```
┌──────────┬────────┬────────┐
│ pmcid    ┆ fig_id ┆ status │
│ ---      ┆ ---    ┆ ---    │
│ str      ┆ str    ┆ i64    │
╞══════════╪════════╪════════╡
│ 10937393 ┆ Fig1   ┆ 200    │
│ 10937393 ┆ Fig2   ┆ 200    │
```