Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/jannisborn/paperscraper
Tools to scrape publication metadata from pubmed, arxiv, medrxiv and chemrxiv.
- Host: GitHub
- URL: https://github.com/jannisborn/paperscraper
- Owner: jannisborn
- License: mit
- Created: 2020-07-01T16:17:43.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2024-12-13T17:07:47.000Z (10 days ago)
- Last Synced: 2024-12-15T07:01:13.909Z (8 days ago)
- Topics: arxiv, biorxiv, chemrxiv, medrxiv, paperscraper, pubmed
- Language: Python
- Homepage:
- Size: 1.51 MB
- Stars: 273
- Watchers: 11
- Forks: 33
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
README
[![build](https://github.com/jannisborn/paperscraper/actions/workflows/test_tip.yml/badge.svg?branch=main)](https://github.com/jannisborn/paperscraper/actions/workflows/test_tip.yml?query=branch%3Amain)
[![build](https://github.com/jannisborn/paperscraper/actions/workflows/test_pypi.yml/badge.svg?branch=main)](https://github.com/jannisborn/paperscraper/actions/workflows/test_pypi.yml?query=branch%3Amain)
[![License:
MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![PyPI version](https://badge.fury.io/py/paperscraper.svg)](https://badge.fury.io/py/paperscraper)
[![Downloads](https://static.pepy.tech/badge/paperscraper)](https://pepy.tech/project/paperscraper)
[![Downloads](https://static.pepy.tech/badge/paperscraper/month)](https://pepy.tech/project/paperscraper)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![codecov](https://codecov.io/github/jannisborn/paperscraper/branch/main/graph/badge.svg?token=Clwi0pu61a)](https://codecov.io/github/jannisborn/paperscraper)
# paperscraper

`paperscraper` is a `python` package for scraping publication metadata or full PDF files from
**PubMed** or preprint servers such as **arXiv**, **medRxiv**, **bioRxiv** and **chemRxiv**.
It provides a streamlined interface for scraping metadata, lets you retrieve citation counts
from Google Scholar and impact factors for journals, and comes with simple postprocessing functions
and plotting routines for meta-analysis.

## Getting started
```console
pip install paperscraper
```

This is enough to query **PubMed**, **arXiv** or Google Scholar.
#### Download X-rxiv Dumps
However, to scrape publication data from the preprint servers [biorxiv](https://www.biorxiv.org), [medrxiv](https://www.medrxiv.org) and [chemrxiv](https://www.chemrxiv.org), the setup is different. The entire dump is downloaded and stored in the `server_dumps` folder in a `.jsonl` format (one paper per line).
```py
from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
medrxiv() # Takes ~30min and should result in ~35 MB file
biorxiv() # Takes ~1h and should result in ~350 MB file
chemrxiv() # Takes ~45min and should result in ~20 MB file
```
*NOTE*: Once the dumps are stored, make sure to restart the Python interpreter so that the changes take effect.
*NOTE*: If you experience API connection issues (`ConnectionError`), note that since v0.2.12 requests are retried automatically; you can raise the retry limit from the default of 10, e.g. `biorxiv(max_retries=20)`.

Since v0.2.5 `paperscraper` also allows you to scrape {med/bio/chem}rxiv for specific dates.
```py
medrxiv(begin_date="2023-04-01", end_date="2023-04-08")
```
But watch out: the resulting `.jsonl` file will be labelled according to the current date, and all your subsequent searches will be based on this file **only**. If you use this option, you might want to keep an eye on the source files (`paperscraper/server_dumps/*jsonl`) to ensure they contain the metadata for all papers you're interested in.
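For a quick check, here is a minimal sketch (standard library only; the dump location is taken from the path above, everything else is an assumption about your local files) that lists the available dumps and the fields of their first record:

```py
# Minimal sketch (assumption: dumps live under paperscraper/server_dumps/).
# Lists every dump file and prints the keys of its first record.
import glob
import json

for path in sorted(glob.glob('paperscraper/server_dumps/*.jsonl')):
    with open(path) as f:
        first_record = json.loads(f.readline())
    print(path, sorted(first_record.keys()))
```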
## Examples

`paperscraper` is built on top of the packages [arxiv](https://pypi.org/project/arxiv/), [pymed](https://pypi.org/project/pymed-paperscraper/), and [scholarly](https://pypi.org/project/scholarly/).
### Publication keyword search
Suppose you want to perform a publication keyword search with the query:
`COVID-19` **AND** `Artificial Intelligence` **AND** `Medical Imaging`.

* Scrape papers from PubMed:
```py
from paperscraper.pubmed import get_and_dump_pubmed_papers
covid19 = ['COVID-19', 'SARS-CoV-2']
ai = ['Artificial intelligence', 'Deep learning', 'Machine learning']
mi = ['Medical imaging']
query = [covid19, ai, mi]

get_and_dump_pubmed_papers(query, output_filepath='covid19_ai_imaging.jsonl')
```

* Scrape papers from arXiv:
```py
from paperscraper.arxiv import get_and_dump_arxiv_papers

get_and_dump_arxiv_papers(query, output_filepath='covid19_ai_imaging.jsonl')
```

* Scrape papers from bioRxiv, medRxiv or chemRxiv:
```py
from paperscraper.xrxiv.xrxiv_query import XRXivQuery

querier = XRXivQuery('server_dumps/chemrxiv_2020-11-10.jsonl')
querier.search_keywords(query, output_filepath='covid19_ai_imaging.jsonl')
```

You can also use `dump_queries` to iterate over a bunch of queries for all available databases.
```py
from paperscraper import dump_queries

queries = [[covid19, ai, mi], [covid19, ai], [ai]]
dump_queries(queries, '.')
```

Or use the harmonized interface of `QUERY_FN_DICT` to query multiple databases of your choice:
```py
from paperscraper.load_dumps import QUERY_FN_DICT
print(QUERY_FN_DICT.keys())

QUERY_FN_DICT['biorxiv'](query, output_filepath='biorxiv_covid_ai_imaging.jsonl')
QUERY_FN_DICT['medrxiv'](query, output_filepath='medrxiv_covid_ai_imaging.jsonl')
```

* Scrape papers from Google Scholar:
Thanks to [scholarly](https://pypi.org/project/scholarly/), there is an endpoint for Google Scholar too.
It does not understand Boolean expressions like the others, but should be used just like
the [Google Scholar search fields](https://scholar.google.com).

```py
from paperscraper.scholar import get_and_dump_scholar_papers
topic = 'Machine Learning'
get_and_dump_scholar_papers(topic)
```

### Scrape PDFs
`paperscraper` also allows you to download the PDF files.
```py
from paperscraper.pdf import save_pdf
paper_data = {'doi': "10.48550/arXiv.2207.03928"}
save_pdf(paper_data, filepath='gt4sd_paper.pdf')
```

If you want to batch download all PDFs for your previous metadata search, use the wrapper.
Here we scrape the PDFs for the metadata obtained in the previous example.

```py
from paperscraper.pdf import save_pdf_from_dump

# Save PDFs in current folder and name the files by their DOI
save_pdf_from_dump('medrxiv_covid_ai_imaging.jsonl', pdf_path='.', key_to_save='doi')
```
*NOTE*: This works robustly for preprint servers, but if you use it on a PubMed dump, don't expect to obtain all PDFs.
Many publishers detect and block scraping, and many publications are simply behind paywalls.
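A quick way to see how many PDFs actually arrived is to compare the files on disk with the records in the dump. A minimal sketch, standard library only, assuming the file names from the example above:

```py
# Sketch: count downloaded PDFs vs. metadata records (paths are assumptions).
import glob

with open('medrxiv_covid_ai_imaging.jsonl') as f:
    n_records = sum(1 for _ in f)
n_pdfs = len(glob.glob('*.pdf'))
print(f'Retrieved {n_pdfs} of {n_records} PDFs')
```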
### Citation search

A plus of the Scholar endpoint is that the number of citations of a paper can be fetched:
```py
from paperscraper.scholar import get_citations_from_title
title = 'Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I.'
get_citations_from_title(title)
```

*NOTE*: The Scholar endpoint does not require authentication, but since it regularly
prompts with captchas, it's difficult to apply at scale.

### Journal impact factor
You can also retrieve the impact factor for all journals:
```py
>>> from paperscraper.impact import Impactor
>>> i = Impactor()
>>> i.search("Nat Comms", threshold=85, sort_by='impact')
[
{'journal': 'Nature Communications', 'factor': 17.694, 'score': 94},
{'journal': 'Natural Computing', 'factor': 1.504, 'score': 88}
]
```
This performs a fuzzy search with a threshold of 85. `threshold` defaults to 100 in which case an exact search
is performed. You can also search by journal abbreviation, [E-ISSN](https://portal.issn.org) or [NLM ID](https://portal.issn.org).
```py
i.search("Nat Rev Earth Environ") # [{'journal': 'Nature Reviews Earth & Environment', 'factor': 37.214, 'score': 100}]
i.search("101771060") # [{'journal': 'Nature Reviews Earth & Environment', 'factor': 37.214, 'score': 100}]
i.search('2662-138X') # [{'journal': 'Nature Reviews Earth & Environment', 'factor': 37.214, 'score': 100}]

# Filter results by impact factor
i.search("Neural network", threshold=85, min_impact=1.5, max_impact=20)
# [
# {'journal': 'IEEE Transactions on Neural Networks and Learning Systems', 'factor': 14.255, 'score': 93},
# {'journal': 'NEURAL NETWORKS', 'factor': 9.657, 'score': 91},
# {'journal': 'WORK-A Journal of Prevention Assessment & Rehabilitation', 'factor': 1.803, 'score': 86},
# {'journal': 'NETWORK-COMPUTATION IN NEURAL SYSTEMS', 'factor': 1.5, 'score': 92}
# ]

# Show all fields
i.search("quantum information", threshold=90, return_all=True)
# [
# {'factor': 10.758, 'jcr': 'Q1', 'journal_abbr': 'npj Quantum Inf', 'eissn': '2056-6387', 'journal': 'npj Quantum Information', 'nlm_id': '101722857', 'issn': '', 'score': 92},
# {'factor': 1.577, 'jcr': 'Q3', 'journal_abbr': 'Nation', 'eissn': '0027-8378', 'journal': 'NATION', 'nlm_id': '9877123', 'issn': '0027-8378', 'score': 91}
# ]
```

### Plotting
When multiple query searches are performed, two types of plots can be generated
automatically: Venn diagrams and bar plots.

#### Barplots
Compare the temporal evolution of different queries across different servers.
```py
import os

from paperscraper import QUERY_FN_DICT
from paperscraper.postprocessing import aggregate_paper
from paperscraper.utils import get_filename_from_query, load_jsonl

# Define search terms and their synonyms
ml = ['Deep learning', 'Neural Network', 'Machine learning']
mol = ['molecule', 'molecular', 'drug', 'ligand', 'compound']
gnn = ['gcn', 'gnn', 'graph neural', 'graph convolutional', 'molecular graph']
smiles = ['SMILES', 'Simplified molecular']
fp = ['fingerprint', 'molecular fingerprint', 'fingerprints']

# Define queries
queries = [[ml, mol, smiles], [ml, mol, fp], [ml, mol, gnn]]

root = '../keyword_dumps'
data_dict = dict()
for query in queries:
    filename = get_filename_from_query(query)
    data_dict[filename] = dict()
    for db, _ in QUERY_FN_DICT.items():
        # Assuming the keyword search has been performed already
        data = load_jsonl(os.path.join(root, db, filename))

        # Unstructured matches are aggregated into 6 bins, 1 per year
        # from 2015 to 2020. Sanity check is performed by having
        # `filtering=True`, removing papers that don't contain all of
        # the keywords in query.
        data_dict[filename][db], filtered = aggregate_paper(
            data, 2015, bins_per_year=1, filtering=True,
            filter_keys=query, return_filtered=True
        )

# Plotting is now very simple
from paperscraper.plotting import plot_comparison

data_keys = [
    'deeplearning_molecule_fingerprint.jsonl',
    'deeplearning_molecule_smiles.jsonl',
    'deeplearning_molecule_gcn.jsonl'
]
plot_comparison(
    data_dict,
    data_keys,
    title_text="'Deep Learning' AND 'Molecule' AND X",
    keyword_text=['Fingerprint', 'SMILES', 'Graph'],
    figpath='mol_representation'
)
```

![molreps](https://github.com/jannisborn/paperscraper/blob/main/assets/molreps.png?raw=true "MolReps")
#### Venn Diagrams
```py
from paperscraper.plotting import (
    plot_venn_two, plot_venn_three, plot_multiple_venn
)

sizes_2020 = (30842, 14474, 2292, 35476, 1904, 1408, 376)
sizes_2019 = (55402, 11899, 2563)
labels_2020 = ('Medical\nImaging', 'Artificial\nIntelligence', 'COVID-19')
labels_2019 = ['Medical Imaging', 'Artificial\nIntelligence']

plot_venn_two(sizes_2019, labels_2019, title='2019', figname='ai_imaging')
```

![2019](https://github.com/jannisborn/paperscraper/blob/main/assets/ai_imaging.png?raw=true "2019")
```py
plot_venn_three(
    sizes_2020, labels_2020, title='2020', figname='ai_imaging_covid'
)
```

![2020](https://github.com/jannisborn/paperscraper/blob/main/assets/ai_imaging_covid.png?raw=true "2020")
Or plot both together:
```py
plot_multiple_venn(
    [sizes_2019, sizes_2020], [labels_2019, labels_2020],
    titles=['2019', '2020'], suptitle='Keyword search comparison',
    gridspec_kw={'width_ratios': [1, 2]}, figsize=(10, 6),
    figname='both'
)
```

![both](https://github.com/jannisborn/paperscraper/blob/main/assets/both.png?raw=true "Both")
## Citation
If you use `paperscraper`, please cite a paper that motivated our development of this tool.

```bib
@article{born2021trends,
  title={Trends in Deep Learning for Property-driven Drug Design},
  author={Born, Jannis and Manica, Matteo},
  journal={Current Medicinal Chemistry},
  volume={28},
  number={38},
  pages={7862--7886},
  year={2021},
  publisher={Bentham Science Publishers}
}
```

## Contributions
Thanks to the following contributors:
- [@memray](https://github.com/memray): Since `v0.2.12` there are automatic retries when downloading the {med/bio/chem}rxiv dumps.
- [@achouhan93](https://github.com/achouhan93): Since `v0.2.5` {med/bio/chem}rxiv can be scraped for specific dates!
- [@daenuprobst](https://github.com/daenuprobst): Since `v0.2.4` PDF files can be scraped directly (`paperscraper.pdf.save_pdf`).
- [@oppih](https://github.com/oppih): Since `v0.2.3` the chemRxiv API also provides DOI and URL if available.
- [@lukasschwab](https://github.com/lukasschwab): Bumped `arxiv` dependency to >`1.4.2` in paperscraper `v0.1.0`.
- [@juliusbierk](https://github.com/juliusbierk): Bugfixes