Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/rpolana/pdf_to_searchable_pdf
Python command-line utility to convert any pdf having images or unsearchable text to a searchable pdf
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/rpolana/pdf_to_searchable_pdf
- Owner: rpolana
- License: unlicense
- Created: 2021-12-27T14:16:31.000Z (almost 3 years ago)
- Default Branch: master
- Last Pushed: 2022-01-14T18:41:02.000Z (almost 3 years ago)
- Last Synced: 2024-06-27T15:35:50.540Z (5 months ago)
- Language: Python
- Size: 16.6 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# pdf_to_searchable_pdf
A cross-platform Python command-line utility that converts any PDF file containing images or unsearchable fonts into a searchable text PDF file using Tesseract OCR (optical character recognition) and other open source libraries.
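The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: it assumes the `pdf2image`, `pytesseract`, and `pypdf` packages (the real dependencies are whatever `requirements.txt` lists), and the function names and the `_searchable.pdf` output naming are hypothetical.

```python
import io
from pathlib import Path


def searchable_output_path(input_filename):
    """Hypothetical naming rule: scan.pdf -> scan_searchable.pdf."""
    path = Path(input_filename)
    return path.with_name(path.stem + "_searchable.pdf")


def make_searchable(input_filename, dpi=300):
    """Rasterize each page, OCR it, and merge the OCR'd pages into one PDF."""
    # Third-party imports (assumed libraries) live inside the function so the
    # naming helper above can be used without them installed.
    from pdf2image import convert_from_path  # requires Poppler on PATH
    import pytesseract  # requires the tesseract executable
    from pypdf import PdfReader, PdfWriter

    writer = PdfWriter()
    for image in convert_from_path(str(input_filename), dpi=dpi):
        # Tesseract returns a one-page PDF as bytes: the page image plus an
        # invisible text layer that makes the page searchable.
        page_pdf = pytesseract.image_to_pdf_or_hocr(image, extension="pdf")
        writer.add_page(PdfReader(io.BytesIO(page_pdf)).pages[0])

    output_path = searchable_output_path(input_filename)
    with open(output_path, "wb") as handle:
        writer.write(handle)
    return output_path
```

Calling `make_searchable("scan.pdf")` would, under these assumptions, write `scan_searchable.pdf` next to the input file. Rasterizing at 300 dpi is a common choice for OCR, but the value the tool actually uses is not stated here.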
## Usage
`$ python pdf_to_searchable_pdf.py [-h] [-d DATA_DIR] [-t] [-i] input_filename`
positional arguments:
  input_filename        input pdf filename

optional arguments:
  -h, --help            show this help message and exit
  -d DATA_DIR, --data_dir DATA_DIR
                        input data directory
  -t, --text_flag       flag to output text file
  -i, --intermediates_flag
                        flag to output intermediate (page image and pdf) files

## Requirements/Dependencies
* Python 3.6 or later
* Python modules listed in `requirements.txt`
* Tesseract OCR
* Poppler (Windows users will have to build Poppler or download it for Windows; the [poppler-windows releases](https://github.com/oschwartz10612/poppler-windows/releases/) are the most up-to-date option. You will then have to add the `bin/` folder to [PATH](https://www.architectryan.com/2018/03/17/add-to-the-path-on-windows-10/).)
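
Before running the utility, it can help to confirm that the external tools are reachable on `PATH`. The following is a small sketch and not part of the project; the helper names are hypothetical, and the executable names are assumptions (`tesseract` for Tesseract OCR, and `pdftoppm` as one of the programs shipped in Poppler's `bin/` folder).

```python
import shutil
import sys

# Assumed executable names mapped to the dependency they belong to.
REQUIRED_TOOLS = {"tesseract": "Tesseract OCR", "pdftoppm": "Poppler"}


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the labels of dependencies whose executable is not on PATH."""
    return [label for exe, label in tools.items() if which(exe) is None]


def python_ok(version_info=sys.version_info):
    """Check the interpreter against the stated minimum of Python 3.6."""
    return tuple(version_info[:2]) >= (3, 6)
```

If `missing_tools()` returns a non-empty list, the named dependencies are either not installed or their folder has not been added to `PATH`.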