https://github.com/benckx/optimize-pdf-ereaders
Optimize scanned PDFs for small ebook readers using OCR
ebook ebook-reader ebooks ereader ereader-tools ocr ocr-recognition pdf-document-processor tesseract tesseract-ocr
- Host: GitHub
- URL: https://github.com/benckx/optimize-pdf-ereaders
- Owner: benckx
- Created: 2022-07-29T09:03:00.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2022-08-02T09:12:28.000Z (over 3 years ago)
- Last Synced: 2025-01-23T13:48:06.481Z (about 1 year ago)
- Topics: ebook, ebook-reader, ebooks, ereader, ereader-tools, ocr, ocr-recognition, pdf-document-processor, tesseract, tesseract-ocr
- Language: Java
- Homepage:
- Size: 42.8 MB
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# About
PDF books and articles found online usually render poorly on small e-readers (e.g. Kindle Oasis), because each full PDF
page is scaled down to fit the small screen.
This lib uses OCR to correct the skew angle of each page, crop around the text and re-paginate, so as to optimize for
the best reading experience on small e-readers.
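
To illustrate the "crop around the text and re-paginate" idea, here is a minimal sketch. It is not this project's code: the lib locates text with OCR, whereas the sketch substitutes a plain darkness threshold on pixels. The class `CropSketch`, its methods and the threshold value are all hypothetical names chosen for the example.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

class CropSketch {

    // Assumption: pixels with an average channel value below this count as "ink".
    static final int INK_THRESHOLD = 128;

    /** Returns the smallest rectangle containing all dark pixels, or null for a blank page. */
    static Rectangle inkBounds(BufferedImage page) {
        int minX = Integer.MAX_VALUE, minY = Integer.MAX_VALUE, maxX = -1, maxY = -1;
        for (int y = 0; y < page.getHeight(); y++) {
            for (int x = 0; x < page.getWidth(); x++) {
                int rgb = page.getRGB(x, y);
                int gray = (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3;
                if (gray < INK_THRESHOLD) {
                    minX = Math.min(minX, x);
                    minY = Math.min(minY, y);
                    maxX = Math.max(maxX, x);
                    maxY = Math.max(maxY, y);
                }
            }
        }
        if (maxX < 0) {
            return null; // no dark pixel found
        }
        return new Rectangle(minX, minY, maxX - minX + 1, maxY - minY + 1);
    }

    /** Splits the cropped area into horizontal slices no taller than sliceHeight. */
    static List<Rectangle> slices(Rectangle bounds, int sliceHeight) {
        List<Rectangle> result = new ArrayList<>();
        int bottom = bounds.y + bounds.height;
        for (int y = bounds.y; y < bottom; y += sliceHeight) {
            result.add(new Rectangle(bounds.x, y, bounds.width, Math.min(sliceHeight, bottom - y)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Synthetic "scanned page": white background with one dark block standing in for text.
        BufferedImage page = new BufferedImage(200, 300, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = page.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, 200, 300);
        g.setColor(Color.BLACK);
        g.fillRect(40, 50, 100, 200);
        g.dispose();

        Rectangle bounds = inkBounds(page);
        System.out.println(bounds);
        // A 200 px tall crop cut into 80 px screens gives 3 slices (80, 80, 40).
        System.out.println(slices(bounds, 80).size());
    }
}
```

A real scan would also need noise filtering and slice boundaries that fall between text lines rather than at fixed heights, which is where the OCR output becomes useful.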
The code was initially written in Java in 2018, alongside an online converter website that I decided to take down
because it would have cost quite a bit to run (OCR and image processing are quite resource-intensive). I also couldn't
maintain it while working full time.
The project therefore probably needs some cleanup.
The unit tests that use full PDF books cannot be shared publicly, so I will re-add them later, using only individual
pages rather than complete books.
## Examples
### Example 1
#### Input
[download PDF](thumbs/baudrillard_extract.pdf)
#### Output
[download PDF](thumbs/baudrillard_output.pdf)
### Example 2
#### Input
[download PDF](thumbs/edinburgh_extract.pdf)
#### Output
[download PDF](thumbs/edinburgh_output.pdf)
### Example 3
#### Input
[download PDF](thumbs/ellul_extract.pdf)
#### Output
[download PDF](thumbs/ellul_output.pdf)
# Requirements
```shell
sudo apt-get install tesseract-ocr
```
The data in `tessdata/` comes from https://github.com/tesseract-ocr/tessdata_best
# Usage
```java
// file, minPage, maxPage and fileName are supplied by the caller
RequestConfig requestConfig = RequestConfig
        .builder()
        .pdfFile(file)
        .minPage(minPage)
        .maxPage(maxPage)
        .correctAngle(true) // enable skew angle correction
        .build();

Processor processor = new Processor(requestConfig);
processor.process();
processor.joinThread();

File outputFile = processor.writeToPDFFile(fileName + "_optimized.pdf");
```
# TODO
* ~~Move to Gradle~~
* Re-add unit tests that can be shared publicly, adapt the other ones
* Add language as a parameter
* Create a user-friendly runnable
* Move to Kotlin
* Finish picture detection