https://github.com/hubgit/paper-fetcher

Fetch JSON, HTML and PDF for a list of DOIs

Host: GitHub
URL: https://github.com/hubgit/paper-fetcher
Owner: hubgit
Created: 2014-04-03T08:15:00.000Z (almost 12 years ago)
Default Branch: master
Last Pushed: 2014-04-04T11:13:41.000Z (almost 12 years ago)
Last Synced: 2025-05-05T17:25:24.333Z (9 months ago)
Language: PHP
Size: 164 KB
Stars: 10
Watchers: 3
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.md

README

# Paper Fetcher

Given a file named `dois.txt` containing one DOI per line, `php fetch.php` reads each DOI and attempts to fetch bibliographic JSON from dx.doi.org and HTML from the publisher via dx.doi.org. It then parses the HTML to find a PDF URL and fetches that PDF.
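As a rough illustration of the first steps, here is a minimal sketch of reading the DOI list and requesting a DOI with a given `Accept` header. It is not the code in `fetch.php`: the helper names `read_dois`, `doi_url` and `fetch_url` are hypothetical, and the use of cURL, HTTPS, and the `application/vnd.citationstyles.csl+json` media type for the bibliographic JSON are assumptions.

```php
<?php
// Hypothetical helper: turn the contents of dois.txt into a list of DOIs,
// skipping blank lines and surrounding whitespace.
function read_dois(string $text): array
{
    $dois = [];
    foreach (preg_split('/\R/', $text) as $line) {
        $doi = trim($line);
        if ($doi !== '') {
            $dois[] = $doi;
        }
    }
    return $dois;
}

// Hypothetical helper: build the resolver URL for a DOI.
// Each path segment is percent-encoded; the "/" separators are kept.
function doi_url(string $doi): string
{
    $segments = array_map('rawurlencode', explode('/', $doi));
    return 'https://dx.doi.org/' . implode('/', $segments);
}

// Hypothetical helper: fetch a URL with a given Accept header, following
// redirects (the resolver redirects HTML requests to the publisher).
// Requires the cURL extension.
function fetch_url(string $url, string $accept): array
{
    $curl = curl_init($url);
    curl_setopt_array($curl, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_HTTPHEADER => ['Accept: ' . $accept],
    ]);
    $body = curl_exec($curl);
    $info = curl_getinfo($curl); // effective URL, status code, timings, ...
    return ['body' => $body, 'info' => $info];
}
```

Under these assumptions, `fetch_url(doi_url($doi), 'application/vnd.citationstyles.csl+json')` would request the bibliographic JSON, and the same call with `text/html` would request the publisher's page.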

If successful, the files are stored in the data folder.

Alongside each file, a JSON file is stored containing metadata for the HTTP request and response.
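One way such a sidecar file could be written is sketched below. The function name `save_with_metadata`, the `.json` suffix appended to the file name, and the shape of the metadata array are assumptions for illustration; the project's actual naming scheme and fields may differ.

```php
<?php
// Hypothetical helper: store a response body and, next to it, a JSON file
// describing the HTTP request/response. Returns the path of the JSON file.
function save_with_metadata(string $path, string $body, array $info): string
{
    $dir = dirname($path);
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true); // create nested folders as needed
    }
    file_put_contents($path, $body);

    $metaPath = $path . '.json'; // assumed naming scheme
    $json = json_encode($info, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES);
    file_put_contents($metaPath, $json);

    return $metaPath;
}
```

The `$info` argument could be, for example, the array returned by PHP's `curl_getinfo()`, which includes the effective URL and the HTTP status code.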

The XPath selectors used to find the PDF URL (or the next HTML URL, if it's an interstitial page) are in [selectors.json](selectors.json). They work for most publishers, but not all; it will almost certainly be necessary to add domain-specific rules to get 100% coverage.
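The general technique can be sketched with PHP's `DOMDocument` and `DOMXPath`. The function name `find_pdf_url` is hypothetical, and the selector shown in the usage note is an illustrative assumption (the widely used `citation_pdf_url` meta tag), not necessarily an entry in `selectors.json`.

```php
<?php
// Hypothetical helper: try each XPath selector in order and return the
// first non-empty match, or null if nothing matches.
function find_pdf_url(string $html, array $selectors): ?string
{
    if (trim($html) === '') {
        return null; // nothing to parse
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely well-formed; collect parser warnings
    // internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($selectors as $selector) {
        $nodes = $xpath->query($selector);
        if ($nodes === false) {
            continue; // invalid XPath expression
        }
        foreach ($nodes as $node) {
            $value = trim((string) $node->nodeValue);
            if ($value !== '') {
                return $value;
            }
        }
    }
    return null;
}
```

For example, `find_pdf_url($html, ['//meta[@name="citation_pdf_url"]/@content'])` returns the `content` attribute of that meta tag if the page has one. A matched URL may be relative, in which case it would need to be resolved against the page URL before fetching.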

Note that if page content is generated dynamically or fetched with JavaScript, it won't be captured.
