{"id":21459511,"url":"https://github.com/benckx/optimize-pdf-ereaders","last_synced_at":"2025-03-17T04:23:05.680Z","repository":{"id":50572407,"uuid":"519149393","full_name":"benckx/optimize-pdf-ereaders","owner":"benckx","description":"Optimize scanned PDFs for small ebook readers using OCR","archived":false,"fork":false,"pushed_at":"2022-08-02T09:12:28.000Z","size":44833,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-23T13:48:06.481Z","etag":null,"topics":["ebook","ebook-reader","ebooks","ereader","ereader-tools","ocr","ocr-recognition","pdf-document-processor","tesseract","tesseract-ocr"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/benckx.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-07-29T09:03:00.000Z","updated_at":"2024-09-17T18:54:15.000Z","dependencies_parsed_at":"2022-09-24T13:38:13.544Z","dependency_job_id":null,"html_url":"https://github.com/benckx/optimize-pdf-ereaders","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/benckx%2Foptimize-pdf-ereaders","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/benckx%2Foptimize-pdf-ereaders/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/benckx%2Foptimize-pdf-ereaders/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/benckx%2Foptimize-pdf-ereaders/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/owners/benckx","download_url":"https://codeload.github.com/benckx/optimize-pdf-ereaders/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243971200,"owners_count":20376784,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ebook","ebook-reader","ebooks","ereader","ereader-tools","ocr","ocr-recognition","pdf-document-processor","tesseract","tesseract-ocr"],"created_at":"2024-11-23T06:29:29.854Z","updated_at":"2025-03-17T04:23:05.662Z","avatar_url":"https://github.com/benckx.png","language":"Java","funding_links":["https://paypal.me/benckx/2"],"categories":[],"sub_categories":[],"readme":"\u003ca href=\"https://paypal.me/benckx/2\"\u003e\n\u003cimg src=\"https://img.shields.io/badge/Donate-PayPal-green.svg\"/\u003e\n\u003c/a\u003e\n\n# About\n\nPDF books and articles found online usually render poorly on small e-readers (e.g. Kindle Oasis), because each whole PDF\npage is displayed on the small screen.\n\nThis lib uses OCR to correct the skewed angle of the page, crop around the text and re-paginate, so as to optimize the\nreading experience on small e-readers.\n\nThe code was initially written in 2018 in Java, alongside an online converter website that I decided to take down because it\nwould cost quite a bit to run (OCR and image processing being quite resource-intensive). 
I also couldn't maintain it as I was\nworking full time.\n\nAs a result, the project probably needs a bit of a cleanup.\n\nThe unit tests that use full PDF books cannot be shared publicly, so I will re-add them later using only individual pages\nrather than complete books.\n\n## Examples\n\n### Example 1\n\n#### Input\n\n\u003cp float=\"left\"\u003e\n    \u003cimg src=\"thumbs/baudrillard_input_page_1.jpg\"/\u003e\n    \u003cimg src=\"thumbs/baudrillard_input_page_2.jpg\"/\u003e\n    \u003cimg src=\"thumbs/baudrillard_input_page_3.jpg\"/\u003e\n    \u003cimg src=\"thumbs/baudrillard_input_page_4.jpg\"/\u003e\n\u003c/p\u003e\n\n[download PDF](thumbs/baudrillard_extract.pdf)\n\n#### Output\n\n\u003cp float=\"left\"\u003e\n    \u003cimg src=\"thumbs/baudrillard_output_page_1.jpg\"/\u003e\n    \u003cimg src=\"thumbs/baudrillard_output_page_2.jpg\"/\u003e\n    \u003cimg src=\"thumbs/baudrillard_output_page_3.jpg\"/\u003e\n    \u003cimg src=\"thumbs/baudrillard_output_page_4.jpg\"/\u003e\n    \u003cimg src=\"thumbs/baudrillard_output_page_5.jpg\"/\u003e\n    \u003cimg src=\"thumbs/baudrillard_output_page_6.jpg\"/\u003e\n    \u003cimg src=\"thumbs/baudrillard_output_page_7.jpg\"/\u003e\n\u003c/p\u003e\n\n[download PDF](thumbs/baudrillard_output.pdf)\n\n### Example 2\n\n#### Input\n\n\u003cp float=\"left\"\u003e\n    \u003cimg src=\"thumbs/edinburgh_input_page_1.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_input_page_2.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_input_page_3.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_input_page_4.jpg\"/\u003e\n\u003c/p\u003e\n\n[download PDF](thumbs/edinburgh_extract.pdf)\n\n#### Output\n\n\u003cp float=\"left\"\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_1.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_2.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_3.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_4.jpg\"/\u003e\n    \u003cimg 
src=\"thumbs/edinburgh_output_page_5.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_6.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_7.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_8.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_9.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_10.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_11.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_12.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_13.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_14.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_15.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_16.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_17.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_18.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_19.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_20.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_21.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_22.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_23.jpg\"/\u003e\n    \u003cimg src=\"thumbs/edinburgh_output_page_24.jpg\"/\u003e\n\u003c/p\u003e\n\n[download PDF](thumbs/edinburgh_output.pdf)\n\n### Example 3\n\n#### Input\n\n\u003cp float=\"left\"\u003e\n    \u003cimg src=\"thumbs/ellul_input_page_1.jpg\"/\u003e\n    \u003cimg src=\"thumbs/ellul_input_page_2.jpg\"/\u003e\n    \u003cimg src=\"thumbs/ellul_input_page_3.jpg\"/\u003e\n    \u003cimg src=\"thumbs/ellul_input_page_4.jpg\"/\u003e\n\u003c/p\u003e\n\n[download PDF](thumbs/ellul_extract.pdf)\n\n#### Output\n\n\u003cp float=\"left\"\u003e\n    \u003cimg src=\"thumbs/ellul_output_page_1.jpg\"/\u003e\n    \u003cimg src=\"thumbs/ellul_output_page_2.jpg\"/\u003e\n    \u003cimg src=\"thumbs/ellul_output_page_3.jpg\"/\u003e\n    \u003cimg 
src=\"thumbs/ellul_output_page_4.jpg\"/\u003e\n    \u003cimg src=\"thumbs/ellul_output_page_5.jpg\"/\u003e\n    \u003cimg src=\"thumbs/ellul_output_page_6.jpg\"/\u003e\n\u003c/p\u003e\n\n[download PDF](thumbs/ellul_output.pdf)\n\n# Requirements\n\n```shell\nsudo apt-get install tesseract-ocr\n```\n\nThe data in `tessdata/` comes from https://github.com/tesseract-ocr/tessdata_best\n\n# Usage\n\n```java\n    RequestConfig requestConfig = RequestConfig\n        .builder()\n        .pdfFile(file)\n        .minPage(minPage)\n        .maxPage(maxPage)\n        .correctAngle(true)\n        .build();\n\n    Processor processor = new Processor(requestConfig);\n    processor.process();\n    processor.joinThread();\n    File outputFile = processor.writeToPDFFile(fileName + \"_optimized.pdf\");\n```\n\n# TODO\n\n* ~~Move to Gradle~~\n* Re-add unit tests that can be shared publicly, adapt the other ones\n* Add language as a parameter\n* Create a user-friendly runnable\n* Move to Kotlin\n* Finish picture detection\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbenckx%2Foptimize-pdf-ereaders","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbenckx%2Foptimize-pdf-ereaders","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbenckx%2Foptimize-pdf-ereaders/lists"}