Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/metachris/pdfx
Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.
- Host: GitHub
- URL: https://github.com/metachris/pdfx
- Owner: metachris
- License: apache-2.0
- Archived: true
- Created: 2015-10-15T14:49:37.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2023-06-15T04:37:39.000Z (over 1 year ago)
- Last Synced: 2024-09-20T18:50:38.598Z (3 months ago)
- Language: Python
- Homepage: http://www.metachris.com/pdfx
- Size: 1.73 MB
- Stars: 1,033
- Watchers: 39
- Forks: 113
- Open Issues: 27
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-pdf - pdfx
- awesome-starred - metachris/pdfx - Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs. (others)
README
# PDFx
![Build status for master branch](https://github.com/metachris/pdfx/workflows/Lint%20and%20test/badge.svg)
[![image](https://badge.fury.io/py/pdfx.svg)](https://pypi.python.org/pypi/pdfx)
[![image](https://img.shields.io/badge/license-Apache-blue.svg)](https://github.com/metachris/pdfx/blob/master/LICENSE)

## Introduction
Extract references (pdf, url, doi, arxiv) and metadata from a PDF.
Optionally download all referenced PDFs and check for broken links.

**Features**

- Extract references and metadata from a given PDF
- Detects pdf, url, arxiv and doi references
- **Fast, parallel download of all referenced PDFs**
- **Find broken hyperlinks** (using the `-c` flag)
([more](https://www.metachris.com/2016/03/find-broken-hyperlinks-in-a-pdf-document-with-pdfx/))
- Output as text or JSON (using the `-j` flag)
- Extract the PDF text (using the `--text` flag)
- Use as command-line tool or Python package
- Compatible with Python 2 and 3
- Works with local and online pdfs

## Getting Started

Grab a copy of the code with `easy_install` or `pip`, and run it:

    $ sudo easy_install -U pdfx
    ...
    $ pdfx

Run `pdfx -h` to see the help output:

    $ pdfx -h
    usage: pdfx [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE]
                [--version]

    Extract metadata and references from a PDF, and optionally download all
    referenced PDFs. Visit https://www.metachris.com/pdfx for more information.

    positional arguments:
      pdf                   Filename or URL of a PDF file

    optional arguments:
      -h, --help            show this help message and exit
      -d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY
                            Download all referenced PDFs into specified directory
      -c, --check-links     Check for broken links
      -j, --json            Output infos as JSON (instead of plain text)
      -v, --verbose         Print all references (instead of only PDFs)
      -t, --text            Only extract text (no metadata or references)
      -o OUTPUT_FILE, --output-file OUTPUT_FILE
                            Output to specified file instead of console
      --version             show program's version number and exit

## Examples

Let's take a look at this paper:

    $ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf
    Document infos:
    - CreationDate = D:20150821110623-04'00'
    - Creator = LaTeX with hyperref package
    - ModDate = D:20150821110805-04'00'
    - PTEX.Fullbanner = This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1
    - Pages = 13
    - Producer = pdfTeX-1.40.14
    - Title = Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice
    - Trapped = False
    - dc = {'title': {'x-default': 'Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice'}, 'creator': [None], 'description': {'x-default': None}, 'format': 'application/pdf'}
    - pdf = {'Keywords': None, 'Producer': 'pdfTeX-1.40.14', 'Trapped': 'False'}
    - pdfx = {'PTEX.Fullbanner': 'This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1'}
    - xap = {'CreateDate': '2015-08-21T11:06:23-04:00', 'ModifyDate': '2015-08-21T11:08:05-04:00', 'CreatorTool': 'LaTeX with hyperref package', 'MetadataDate': '2015-08-21T11:08:05-04:00'}
    - xapmm = {'InstanceID': 'uuid:4e570f88-cd0f-4488-85ad-03f4435a4048', 'DocumentID': 'uuid:98988d37-b43d-4c1a-965b-988dfb2944b6'}

    References: 36
    - URL: 18
    - PDF: 18

    PDF References:
    - http://www.spiegel.de/media/media-35533.pdf
    - http://www.spiegel.de/media/media-35513.pdf
    - http://www.spiegel.de/media/media-35509.pdf
    - http://www.spiegel.de/media/media-35529.pdf
    - http://www.spiegel.de/media/media-35527.pdf
    - http://cr.yp.to/factorization/smoothparts-20040510.pdf
    - http://www.spiegel.de/media/media-35517.pdf
    - http://www.spiegel.de/media/media-35526.pdf
    - http://www.spiegel.de/media/media-35519.pdf
    - http://www.spiegel.de/media/media-35522.pdf
    - http://cryptome.org/2013/08/spy-budget-fy13.pdf
    - http://www.spiegel.de/media/media-35515.pdf
    - http://www.spiegel.de/media/media-35514.pdf
    - http://www.hyperelliptic.org/tanja/SHARCS/talks06/thorsten.pdf
    - http://www.spiegel.de/media/media-35528.pdf
    - http://www.spiegel.de/media/media-35671.pdf
    - http://www.spiegel.de/media/media-35520.pdf
    - http://www.spiegel.de/media/media-35551.pdf

You can use the `-v` flag to output all references instead of just the
PDFs.

**Download all referenced pdfs** with `-d` (for `download-pdfs`) to the
specified directory (e.g. to `/tmp/`):

    $ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -d /tmp/
    ...

To **extract text**, you can use the `-t` flag:

    # Extract text to console
    $ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -t

    # Extract text to file
    $ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -t -o pdf-text.txt

To **check for broken links** use the `-c` flag:

    $ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -c

\[Example (with video) of checking for broken links\]().

## Usage as Python library
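
The library calls in this section can be combined into a short script. The sketch below is one possible way to do that and is not part of PDFx itself: `summarize_references` and `describe_pdf` are hypothetical helper names, and the code assumes that `get_references_as_dict()` returns a dict mapping each reference type (such as `pdf` or `url`) to a list of references, and that `get_metadata()` returns a dict.

```python
def summarize_references(references_by_type):
    """Return (total, per-type counts) for a {type: [references]} dict."""
    per_type = {
        reftype: len(references)
        for reftype, references in references_by_type.items()
    }
    return sum(per_type.values()), per_type


def describe_pdf(source):
    """Load a PDF (local filename or URL) and summarize its references."""
    # Imported here so that summarize_references() stays usable without PDFx.
    import pdfx

    pdf = pdfx.PDFx(source)
    metadata = pdf.get_metadata()
    # Assumption: the dict is keyed by reference type, values are lists.
    total, per_type = summarize_references(pdf.get_references_as_dict())
    return metadata, total, per_type
```

Calling `describe_pdf("filename-or-url.pdf")` requires PDFx to be installed and the PDF to be readable. The individual calls look like this in an interactive session: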

    >>> import pdfx
    >>> pdf = pdfx.PDFx("filename-or-url.pdf")
    >>> metadata = pdf.get_metadata()
    >>> references_list = pdf.get_references()
    >>> references_dict = pdf.get_references_as_dict()
    >>> pdf.download_pdfs("target-directory")

## Dev & Contributing

```bash
# Setup venv
python3 -m venv venv
. venv/bin/activate

# Install PDFx and dev deps
pip install -e .
pip install -r requirements_dev.txt

# Run tests and checks
make test
make lint
make check

# Format the code (with black)
make format
```

### Releasing

* Update version number in `setup.py` and `pdfx/__init__.py`
* Create a git tag starting with `v` (e.g. `git tag v1.5.9`)
* Push the tag to GitHub: `git push --tags`

GitHub Actions then publishes the release to PyPI.
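
Because the version number appears in two files and in the tag name, a quick consistency check before tagging may be useful. The sketch below is not part of the PDFx repository: `read_version` and `release_is_consistent` are hypothetical names, and it assumes that `setup.py` contains a `version="..."` argument and that `pdfx/__init__.py` contains a `__version__ = "..."` assignment.

```python
import re


def read_version(text, pattern):
    """Return the first version string matched by pattern, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def release_is_consistent(setup_text, init_text, tag):
    """Check that both files and the git tag (v + version) agree."""
    # Assumed declaration styles; adjust the patterns if the files differ.
    setup_version = read_version(setup_text, r"""\bversion\s*=\s*["']([^"']+)["']""")
    init_version = read_version(init_text, r"""__version__\s*=\s*["']([^"']+)["']""")
    return (
        tag.startswith("v")
        and setup_version is not None
        and setup_version == init_version == tag[1:]
    )
```

The function takes the file contents as strings, so it can be fed from `open("setup.py").read()` and `open("pdfx/__init__.py").read()` before running `git tag`.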
## Various
- Author: Chris Hager [twitter.com/metachris](https://twitter.com/metachris)
- Homepage: https://www.metachris.com/pdfx
- License: Apache

Feedback, ideas and pull requests are welcome!
## Improvement Ideas
Possible:
- Timeout (see [#43](https://github.com/metachris/pdfx/issues/43))
- Cuts off links that span two lines [#40](https://github.com/metachris/pdfx/issues/40)
- Include Check-Links Results in Output [#39](https://github.com/metachris/pdfx/issues/39)
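
As an illustration of the timeout and check-links ideas above, the following sketch checks a list of URLs in parallel with a per-request timeout and returns the results as a dict that could be included in the output. It uses only the Python 3 standard library and is not how PDFx implements `-c`; `check_link`, `check_links` and the `opener` parameter are hypothetical names chosen for this example.

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def check_link(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return (url, status): an HTTP status code, or None if unreachable."""
    try:
        with opener(url, timeout=timeout) as response:
            return url, response.status
    except urllib.error.HTTPError as error:
        return url, error.code  # the server answered, e.g. with 404
    except (OSError, ValueError):
        # Timeouts, DNS failures and refused connections are OSError
        # subclasses; malformed URLs raise ValueError.
        return url, None


def check_links(urls, timeout=5.0, workers=8, opener=urllib.request.urlopen):
    """Check many links in parallel and return {url: status}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda u: check_link(u, timeout, opener), urls)
        return dict(results)
```

The `opener` argument exists only so the function can be exercised without network access; in normal use the default `urllib.request.urlopen` is enough, and a status of `None` marks a link that timed out or could not be reached.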