https://github.com/hubgit/paper-fetcher
Fetch JSON, HTML and PDF for a list of DOIs
Last synced: 9 months ago
- Host: GitHub
- URL: https://github.com/hubgit/paper-fetcher
- Owner: hubgit
- Created: 2014-04-03T08:15:00.000Z (almost 12 years ago)
- Default Branch: master
- Last Pushed: 2014-04-04T11:13:41.000Z (almost 12 years ago)
- Last Synced: 2025-05-05T17:25:24.333Z (9 months ago)
- Language: PHP
- Size: 164 KB
- Stars: 10
- Watchers: 3
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# Paper Fetcher
Given a list of DOIs, one per line, in a file named `dois.txt`, `php fetch.php` reads each DOI and attempts to fetch bibliographic JSON from dx.doi.org and HTML from the publisher (resolved via dx.doi.org). It then parses the HTML to find a PDF URL and fetches that PDF.
If successful, the files are stored in the data folder.
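The DOI-reading and fetching steps described above could be sketched roughly as follows. This is a minimal illustration, not the code in `fetch.php`: the helper names `parseDois`, `doiUrl` and `httpGet` are hypothetical, and the `Accept` header in the usage comment assumes the CSL JSON media type offered by DOI content negotiation.

```php
<?php
// Illustrative sketch only: these helpers are hypothetical, not taken from fetch.php.

/** Split the contents of dois.txt into a list of trimmed, non-empty DOIs. */
function parseDois(string $contents): array
{
    $lines = preg_split('/\R/', $contents);
    $dois = array_filter(array_map('trim', $lines), function ($line) {
        return $line !== '';
    });
    return array_values($dois);
}

/** Build the resolver URL for a DOI, percent-encoding each path segment. */
function doiUrl(string $doi): string
{
    $segments = array_map('rawurlencode', explode('/', $doi));
    return 'http://dx.doi.org/' . implode('/', $segments);
}

/** GET a URL with cURL, following redirects. Returns [body, curl_getinfo() array]. */
function httpGet(string $url, array $headers = []): array
{
    $curl = curl_init($url);
    curl_setopt_array($curl, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_HTTPHEADER => $headers,
    ]);
    $body = curl_exec($curl);
    $info = curl_getinfo($curl);
    return [$body, $info];
}

// Usage (performs network requests, so it is left commented out):
// foreach (parseDois(file_get_contents('dois.txt')) as $doi) {
//     list($json, $jsonInfo) = httpGet(doiUrl($doi), ['Accept: application/vnd.citationstyles.csl+json']);
//     list($html, $htmlInfo) = httpGet(doiUrl($doi), ['Accept: text/html']);
// }
```

Requesting the same resolver URL with different `Accept` headers is one way to obtain both the bibliographic JSON and the publisher's HTML; the real script may do this differently.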
Alongside each file, a JSON file is stored that contains metadata for the HTTP request and response.
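As a rough sketch of the sidecar idea, a fetched body and its HTTP metadata could be written side by side like this. The function name `saveWithMetadata`, the `.json` suffix convention and the demo path are assumptions for illustration; the actual file naming used by the script may differ.

```php
<?php
// Illustrative sketch only: the function name and the ".json" suffix are assumptions.

/**
 * Save a response body to $path and write its HTTP metadata to "$path.json".
 * Returns the path of the metadata file.
 */
function saveWithMetadata(string $path, string $body, array $info): string
{
    $dir = dirname($path);
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true);
    }
    file_put_contents($path, $body);

    $metaPath = $path . '.json';
    file_put_contents($metaPath, json_encode($info, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES));
    return $metaPath;
}

// Example with made-up values shaped like a subset of curl_getinfo() output.
$info = [
    'url' => 'http://example.org/article',
    'http_code' => 200,
    'content_type' => 'text/html',
];
$bodyPath = sys_get_temp_dir() . '/paper-fetcher-demo/article.html';
$metaPath = saveWithMetadata($bodyPath, '<html></html>', $info);
```

Keeping the metadata in a separate file leaves the fetched HTML or PDF byte-for-byte unchanged while still recording where it came from and how the server responded.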
The XPath selectors used to find the PDF URL (or the next HTML URL, if the page is an interstitial) are in [selectors.json](selectors.json). They work for most publishers, but not all; domain-specific rules will almost certainly need to be added to get 100% coverage.
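To illustrate how a list of XPath selectors can be applied to a fetched page, here is a minimal sketch using PHP's `DOMDocument` and `DOMXPath`. The function `findPdfUrl` and the two example selectors are assumptions and may not match the structure or contents of `selectors.json`; the `citation_pdf_url` meta tag is a convention many publishers follow, but not all.

```php
<?php
// Illustrative sketch only: the selectors below are assumptions, not the
// contents of selectors.json.

/** Return the value of the first selector that matches, or null if none match. */
function findPdfUrl(string $html, array $selectors): ?string
{
    if (trim($html) === '') {
        return null;
    }

    // Publisher HTML is often malformed, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($selectors as $selector) {
        $nodes = $xpath->query($selector);
        if ($nodes !== false && $nodes->length > 0) {
            return trim($nodes->item(0)->nodeValue);
        }
    }
    return null;
}

$selectors = [
    '//meta[@name="citation_pdf_url"]/@content',
    '//a[contains(@href, ".pdf")]/@href',
];
$html = '<html><head><meta name="citation_pdf_url" content="http://example.org/paper.pdf">'
      . '</head><body></body></html>';
$url = findPdfUrl($html, $selectors);
```

Trying selectors in order means general rules can sit first, with more specific fallbacks after them. The sketch does not resolve relative URLs against the page URL, which a real fetcher would need to do.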
Note that if page content is generated dynamically or fetched with JavaScript, it won't be captured.