Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/soodoku/image-to-text
Images of Text to Text: Call Tesseract from Python and OCR a directory of pdfs
- Host: GitHub
- URL: https://github.com/soodoku/image-to-text
- Owner: soodoku
- Created: 2015-01-29T01:03:15.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2019-10-07T18:25:17.000Z (about 5 years ago)
- Last Synced: 2024-10-11T12:17:24.916Z (27 days ago)
- Language: Python
- Homepage:
- Size: 10.7 KB
- Stars: 15
- Watchers: 4
- Forks: 8
- Open Issues: 0
Metadata Files:
- Readme: Readme.md
README
### Image to Text
The script uses [Tesseract](https://github.com/tesseract-ocr) to extract text from PDFs. It reads PDFs from a specified directory and writes text files to another directory. Tesseract works well for documents with a simple structure and easily parsed fonts but generally struggles with more complex layouts. To fix errors in the recovered text, you may want to use [Edit Distance Based Search and Replace](https://github.com/soodoku/search-and-replace), which exploits the fact that OCR errors tend to be systematic.
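To illustrate the idea of edit-distance-based correction, here is a minimal sketch. It is not the code of the linked tool: the function names `edit_distance` and `fuzzy_replace`, the token-level matching, and the `max_distance` threshold are all assumptions made for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca with cb
        prev = curr
    return prev[-1]


def fuzzy_replace(text, target, replacement, max_distance=1):
    """Replace each whitespace-separated token within max_distance edits of target.

    Hypothetical helper: matching is case-insensitive, and runs of whitespace
    in the input are collapsed to single spaces in the output.
    """
    fixed = []
    for token in text.split():
        if edit_distance(token.lower(), target.lower()) <= max_distance:
            fixed.append(replacement)
        else:
            fixed.append(token)
    return " ".join(fixed)
```

For example, OCR often reads `m` as `rn`, so `fuzzy_replace("The Cornmittee met", "Committee", "Committee", max_distance=2)` would map the misread token back to `Committee`. A larger threshold catches more errors but also risks replacing legitimate words.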
Instead of Tesseract, you can use [Abbyy FineReader](https://github.com/soodoku/abbyyR) or [Captricity](https://github.com/soodoku/captr). To estimate the OCR error rate, you may want to use [recognize](https://github.com/soodoku/recognize).
For a general overview of how to convert paper to digital and how to optimize that process, see [A Quick Scan: From Paper to Digital](http://gbytes.gsood.com/2014/05/28/a-quick-scan-from-paper-to-digital-data/).
#### Usage
`pdf2txt.py [options] pdf_directory`
#### Command Line Options:
```
  -h, --help            show this help message and exit
  -d DPI, --dpi=DPI     JPEG Resolution in DPI (default: 400)
  -j JPGDIR, --jpgdir=JPGDIR
                        JPEG output directory (default: jpg)
  -t TXTDIR, --textdir=TXTDIR
                        Text output directory (default: text)
  -r, --resume          Resume OCR to Text (default: False)
```

#### Example:
`python pdf2txt.py pdf_dir`

The script will process all PDF files in the `pdf_dir` directory and save the output text files to the `text` directory.
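As a rough sketch of what a run like this involves, the Python below mirrors the documented options and builds the two external commands needed per PDF. It is not the actual `pdf2txt.py` source. The following are all assumptions made for illustration: the use of `argparse`, the Poppler `pdftoppm` tool for rasterizing, one text file per page, the reading of `--resume` as "skip pages that already have a text file", and every function name.

```python
import argparse
import subprocess
from pathlib import Path


def parse_args(argv=None):
    """Mirror the documented options (assumption: the real script may use another parser)."""
    parser = argparse.ArgumentParser(prog="pdf2txt.py")
    parser.add_argument("pdf_directory")
    parser.add_argument("-d", "--dpi", type=int, default=400, help="JPEG resolution in DPI")
    parser.add_argument("-j", "--jpgdir", default="jpg", help="JPEG output directory")
    parser.add_argument("-t", "--textdir", default="text", help="Text output directory")
    parser.add_argument("-r", "--resume", action="store_true", help="Resume OCR to text")
    return parser.parse_args(argv)


def rasterize_command(pdf_path, jpg_dir, dpi):
    """Build a pdftoppm call that writes one JPEG per page under jpg_dir."""
    prefix = Path(jpg_dir) / Path(pdf_path).stem
    return ["pdftoppm", "-jpeg", "-r", str(dpi), str(pdf_path), str(prefix)]


def ocr_command(jpg_path, text_dir):
    """Build a tesseract call; tesseract appends '.txt' to the output base itself."""
    output_base = Path(text_dir) / Path(jpg_path).stem
    return ["tesseract", str(jpg_path), str(output_base)]


def run(args):
    """Rasterize every PDF in the input directory, then OCR each page image."""
    jpg_dir, text_dir = Path(args.jpgdir), Path(args.textdir)
    jpg_dir.mkdir(parents=True, exist_ok=True)
    text_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(args.pdf_directory).glob("*.pdf")):
        subprocess.run(rasterize_command(pdf, jpg_dir, args.dpi), check=True)
        for jpg in sorted(jpg_dir.glob(pdf.stem + "-*.jpg")):
            if args.resume and (text_dir / (jpg.stem + ".txt")).exists():
                continue  # assumed meaning of --resume: page already OCRed
            subprocess.run(ocr_command(jpg, text_dir), check=True)
```

Calling `run(parse_args())` would then process the directory given on the command line, provided `pdftoppm` and `tesseract` are installed and on the `PATH`. Keeping the command construction in separate functions makes it easy to inspect or swap the external tools without touching the loop.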
### License
Scripts are released under the [MIT License](https://opensource.org/licenses/MIT).