Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/etowahadams/pmc-figure-downloader
Extract figures and figure captions from PMC open access papers
bioinformatics entrez pubmed pubmed-central
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/etowahadams/pmc-figure-downloader
- Owner: etowahadams
- Created: 2024-03-19T16:28:27.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-04-03T19:55:36.000Z (9 months ago)
- Last Synced: 2024-10-18T17:26:46.927Z (3 months ago)
- Topics: bioinformatics, entrez, pubmed, pubmed-central
- Language: Jupyter Notebook
- Homepage:
- Size: 29.3 KB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# PMC Figure Downloader
This is a simple script for downloading figures from open access papers in the [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc/) database. It uses the [PMC API](https://www.ncbi.nlm.nih.gov/pmc/tools/developers/) to search for articles and the [PMC Open Access Web Service API](https://www.ncbi.nlm.nih.gov/pmc/tools/oa-service/) to fetch the XML of each paper, from which it determines the figure information and URLs.

## Installation
Install the dependencies with pip inside a virtual environment:
```bash
python -m venv env
source env/bin/activate
pip install -r requirements.txt
```

## Usage
See `example.ipynb` for a Jupyter notebook example.

### Query PMC for open access papers
```python
query = '"Nature Genetics"[Journal] AND "open access"[filter]'
# Returns a list of paper IDs
# "you@example.com" is a placeholder; replace it with your own email address
result_ids = search_pmc(query, email="you@example.com", max_results=10)
```
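For readers curious what such a search might look like at the HTTP level, here is a minimal sketch that only builds a request URL for the NCBI E-utilities ESearch endpoint. It is an assumption that `search_pmc` works along these lines; `build_esearch_url` is a hypothetical helper and not part of this repository.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(query, email, max_results=10):
    """Build an ESearch URL for the PMC database (hypothetical helper)."""
    params = {
        "db": "pmc",            # search PubMed Central
        "term": query,          # the query string, including field tags
        "retmax": max_results,  # maximum number of IDs to return
        "retmode": "json",      # ask for a JSON response
        "email": email,         # contact address passed along with the request
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url(
    '"Nature Genetics"[Journal] AND "open access"[filter]',
    email="you@example.com",  # placeholder address
    max_results=10,
)
print(url)
```

Building the URL separately from sending the request keeps the query parameters easy to inspect before any network call is made.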
### Extract figure information and URLs from a list of paper IDs
```python
# A list of PMC IDs
result_ids = ["10937393", "10864173"]
# A dataframe with figure information
figure_data = extract_pmc_figures(result_ids)
# Save the dataframe to a parquet file if you want to use it later
figure_data.write_parquet("figure_data.parquet")
```
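The extraction step relies on the figure markup in each article's XML. As a rough illustration, the sketch below reads `<fig>` elements from a JATS-style fragment using only the standard library. The sample XML is made up, `parse_figures` is a hypothetical helper, and the actual `extract_pmc_figures` implementation may differ; in particular, turning the graphic reference into a full `image_url` is not covered here.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xlink:href attribute used on <graphic>
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Made-up JATS-style fragment for illustration; not taken from a real article
SAMPLE_XML = """\
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <fig id="Fig1">
      <label>Fig. 1</label>
      <caption>
        <title>Example title</title>
        <p>a, First panel. b, Second panel.</p>
      </caption>
      <graphic xlink:href="example_Fig1_HTML"/>
    </fig>
  </body>
</article>
"""


def _text(element):
    """Return the whitespace-normalized text of an element, or "" if missing."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def parse_figures(xml_text):
    """Collect id, label, title, description and graphic reference per figure."""
    root = ET.fromstring(xml_text)
    figures = []
    for fig in root.iter("fig"):
        graphic = fig.find(".//graphic")
        figures.append({
            "fig_id": fig.get("id", ""),
            "fig_label": _text(fig.find("label")),
            "fig_title": _text(fig.find("caption/title")),
            "fig_desc": _text(fig.find("caption/p")),
            "graphic_href": graphic.get(XLINK_HREF, "") if graphic is not None else "",
        })
    return figures


figures = parse_figures(SAMPLE_XML)
print(figures)
```

The dictionary keys mirror the column names used by the dataframe, except for `graphic_href`, which is an illustrative name for the raw reference found in the XML.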
The `figure_data` dataframe looks like this:
```
┌──────────┬────────┬───────────┬──────────────────────┬─────────────────────┬─────────────────────┐
│ pmcid    ┆ fig_id ┆ fig_label ┆ fig_title            ┆ fig_desc            ┆ image_url           │
│ ---      ┆ ---    ┆ ---       ┆ ---                  ┆ ---                 ┆ ---                 │
│ str      ┆ str    ┆ str       ┆ str                  ┆ str                 ┆ str                 │
╞══════════╪════════╪═══════════╪══════════════════════╪═════════════════════╪═════════════════════╡
│ 10937393 ┆ Fig1   ┆ Fig. 1    ┆ FANS-based isolation ┆ a, Schematic        ┆ https://www.ncbi.nl │
│          ┆        ┆           ┆ of nuclei o…         ┆ representation of   ┆ m.nih.gov/pmc…      │
│          ┆        ┆           ┆                      ┆ t…                  ┆                     │
│ 10937393 ┆ Fig2   ┆ Fig. 2    ┆ Purity and           ┆ a, Heatmaps depict  ┆ https://www.ncbi.nl │
│          ┆        ┆           ┆ reproducibility of   ┆ log2-transfor…      ┆ m.nih.gov/pmc…      │
│          ┆        ┆           ┆ th…                  ┆                     ┆                     │
```

### Download figures to a directory
```python
import polars as pl

output_dir = "img"
download_status = download_imgs(figure_data, output_dir)
# Show any figures whose download did not return HTTP 200
print("Failed downloads:")
print(download_status.filter(pl.col("status") != 200))
```
The `download_status` dataframe will look like this:
```
┌──────────┬────────┬────────┐
│ pmcid    ┆ fig_id ┆ status │
│ ---      ┆ ---    ┆ ---    │
│ str      ┆ str    ┆ i64    │
╞══════════╪════════╪════════╡
│ 10937393 ┆ Fig1   ┆ 200    │
│ 10937393 ┆ Fig2   ┆ 200    │
```
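To illustrate how a per-figure status table like this could be produced, here is a minimal sketch that records the HTTP status of each download attempt. It is not the repository's `download_imgs`: `download_figures`, `http_fetch`, the `.jpg` file naming, and the example rows are all assumptions made for illustration. The fetch function is injectable so the example runs offline with a stub.

```python
import tempfile
import urllib.error
import urllib.request
from pathlib import Path


def http_fetch(url):
    """Fetch a URL and return (status_code, body); HTTP errors yield their code."""
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        return err.code, b""


def download_figures(rows, output_dir, fetch=http_fetch):
    """Save each figure that returns HTTP 200 and record every status."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    statuses = []
    for row in rows:
        status, body = fetch(row["image_url"])
        if status == 200:
            # File naming and extension are assumptions for this sketch
            (out / f"{row['pmcid']}_{row['fig_id']}.jpg").write_bytes(body)
        statuses.append(
            {"pmcid": row["pmcid"], "fig_id": row["fig_id"], "status": status}
        )
    return statuses


def fake_fetch(url):
    """Stub fetcher so the demonstration does not touch the network."""
    return (200, b"image-bytes") if url.endswith("ok.jpg") else (404, b"")


# Made-up rows shaped like the figure dataframe's pmcid, fig_id and image_url
rows = [
    {"pmcid": "1", "fig_id": "Fig1", "image_url": "https://example.org/ok.jpg"},
    {"pmcid": "1", "fig_id": "Fig2", "image_url": "https://example.org/missing.jpg"},
]

with tempfile.TemporaryDirectory() as tmp:
    statuses = download_figures(rows, tmp, fetch=fake_fetch)
    saved = sorted(p.name for p in Path(tmp).iterdir())

failed = [s for s in statuses if s["status"] != 200]
print(failed)
```

Recording a status for every row, rather than raising on the first failure, makes it possible to retry only the figures that failed. Connection-level errors such as `urllib.error.URLError` are not handled in this sketch.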