https://github.com/benckx/optimize-pdf-ereaders

Optimize scanned PDFs for small ebook readers using OCR
ebook ebook-reader ebooks ereader ereader-tools ocr ocr-recognition pdf-document-processor tesseract tesseract-ocr

Host: GitHub
URL: https://github.com/benckx/optimize-pdf-ereaders
Owner: benckx
Created: 2022-07-29T09:03:00.000Z (almost 4 years ago)
Default Branch: master
Last Pushed: 2022-08-02T09:12:28.000Z (almost 4 years ago)
Last Synced: 2025-01-23T13:48:06.481Z (over 1 year ago)
Topics: ebook, ebook-reader, ebooks, ereader, ereader-tools, ocr, ocr-recognition, pdf-document-processor, tesseract, tesseract-ocr
Language: Java
Homepage:
Size: 42.8 MB
Stars: 3
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# About

PDF books and articles found online usually render poorly on small e-readers (e.g. Kindle Oasis), because a whole PDF page is displayed on the small screen.

This library uses OCR to correct the skew angle of each page, crop around the text and re-paginate, so as to give the best reading experience on small e-readers.
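To make the cropping and re-pagination steps more concrete, here is a minimal sketch. It is not the code of this library: the class `CropAndPaginateSketch` and the methods `cropAroundText` and `paginate` are hypothetical names, and it assumes that an OCR engine such as Tesseract has already produced one bounding box per detected word or line. Skew correction is left out.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from the library.
class CropAndPaginateSketch {

    // Returns the smallest rectangle containing every OCR box, grown by
    // `margin` pixels on each side and clamped to the page bounds.
    static Rectangle cropAroundText(List<Rectangle> boxes, int pageWidth, int pageHeight, int margin) {
        Rectangle page = new Rectangle(0, 0, pageWidth, pageHeight);
        if (boxes.isEmpty()) {
            return page; // no text detected: keep the full page
        }
        Rectangle union = new Rectangle(boxes.get(0));
        for (Rectangle box : boxes) {
            union = union.union(box);
        }
        union.grow(margin, margin);
        return union.intersection(page);
    }

    // Greedily packs text lines (described by their heights in pixels) into
    // pages no taller than `screenHeight`. Each page is a list of line indexes.
    // A single line taller than the screen still gets a page of its own.
    static List<List<Integer>> paginate(List<Integer> lineHeights, int screenHeight) {
        List<List<Integer>> pages = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        int used = 0;
        for (int i = 0; i < lineHeights.size(); i++) {
            int height = lineHeights.get(i);
            if (!current.isEmpty() && used + height > screenHeight) {
                pages.add(current);
                current = new ArrayList<>();
                used = 0;
            }
            current.add(i);
            used += height;
        }
        if (!current.isEmpty()) {
            pages.add(current);
        }
        return pages;
    }

    public static void main(String[] args) {
        List<Rectangle> boxes = List.of(
                new Rectangle(100, 200, 300, 20),
                new Rectangle(120, 240, 350, 20));
        System.out.println(cropAroundText(boxes, 600, 800, 10));
        System.out.println(paginate(List.of(30, 30, 30, 30, 30), 70));
    }
}
```

The greedy packing keeps lines in reading order and never splits a line across two pages, which is the simplest policy; a real implementation would likely also have to deal with pictures and paragraph breaks.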

The code was initially written in 2018 in Java, alongside an online converter website that I decided to take down because it would have cost quite a bit to run (OCR and image processing being quite resource-intensive). I also couldn't maintain it as I was working full time.

Therefore, the project probably needs a bit of a cleanup.

The unit tests that use full PDF books cannot be shared publicly, so I will re-add them later, using only individual pages rather than complete books.

## Examples

### Example 1

#### Input

[download PDF](thumbs/baudrillard_extract.pdf)

#### Output

[download PDF](thumbs/baudrillard_output.pdf)

### Example 2

#### Input

[download PDF](thumbs/edinburgh_extract.pdf)

#### Output

[download PDF](thumbs/edinburgh_output.pdf)

### Example 3

#### Input

[download PDF](thumbs/ellul_extract.pdf)

#### Output

[download PDF](thumbs/ellul_output.pdf)

# Requirements

```shell
sudo apt-get install tesseract-ocr
```

The data in `tessdata/` comes from https://github.com/tesseract-ocr/tessdata_best

# Usage

```java
// `file`, `minPage`, `maxPage` and `fileName` are provided by the caller.
RequestConfig requestConfig = RequestConfig
    .builder()
    .pdfFile(file)
    .minPage(minPage)
    .maxPage(maxPage)
    .correctAngle(true) // correct the skewed angle of the pages
    .build();

Processor processor = new Processor(requestConfig);
processor.process();
processor.joinThread();
File outputFile = processor.writeToPDFFile(fileName + "_optimized.pdf");
```

# TODO

* ~~Move to Gradle~~
* Re-add unit tests that can be shared publicly, adapt the other ones
* Add language as a parameter
* Create a user-friendly runnable
* Move to Kotlin
* Finish picture detection
