{"id":21047188,"url":"https://github.com/jsv4/ocrusrex","last_synced_at":"2026-03-07T15:34:40.339Z","repository":{"id":57447738,"uuid":"250360968","full_name":"JSv4/OCRUSREX","owner":"JSv4","description":"A simple Python script to turn non-OCRed PDFs into searchable, OCRed PDFs under an enterprise-friendly, open source license. ","archived":false,"fork":false,"pushed_at":"2020-04-05T05:21:08.000Z","size":10978,"stargazers_count":9,"open_issues_count":1,"forks_count":5,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-07-11T13:32:49.738Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JSv4.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-03-26T20:09:20.000Z","updated_at":"2024-12-25T12:21:01.000Z","dependencies_parsed_at":"2022-09-16T22:21:22.758Z","dependency_job_id":null,"html_url":"https://github.com/JSv4/OCRUSREX","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/JSv4/OCRUSREX","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JSv4%2FOCRUSREX","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JSv4%2FOCRUSREX/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JSv4%2FOCRUSREX/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JSv4%2FOCRUSREX/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JSv4","download_url":"https://codeload.github.com/JSv4/OCRUSREX/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/repositories/JSv4%2FOCRUSREX/sbom","scorecard":{"id":69292,"data":{"date":"2025-08-11","repo":{"name":"github.com/JSv4/OCRUSREX","commit":"2824b2a80f2496689cf6d8f3fe670d45afdbe42e"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":2.8,"checks":[{"name":"Code-Review","score":0,"reason":"Found 0/14 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous 
patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Binary-Artifacts","score":8,"reason":"binaries present in source code","details":["Warn: binary detected: OCRUSREX/__pycache__/__init__.cpython-37.pyc:1","Warn: binary detected: OCRUSREX/__pycache__/ocrusrex.cpython-37.pyc:1"],"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security 
policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: MIT License: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection 
settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}}]},"last_synced_at":"2025-08-15T03:27:03.644Z","repository_id":57447738,"created_at":"2025-08-15T03:27:03.644Z","updated_at":"2025-08-15T03:27:03.644Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271605911,"owners_count":24788968,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-22T02:00:08.480Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-19T14:35:52.828Z","updated_at":"2026-03-07T15:34:40.295Z","avatar_url":"https://github.com/JSv4.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"### What does it do?\n\nOCRUSREX is a simple, no-frills Python 3 library to take a PDF (either by path or as a file-like object), convert\nit to images, run it through Tesseract 4 and return it as a PDF.\n\n### Why Another OCR Library?\n\nThere are several excellent packages out there to OCR PDFs in Python, but their licensing can be problematic\nfor potential use in enterprise applications. 
Also, many of them had way more features than I needed or wanted.\nI just needed something to take a PDF, OCR it and give it back to me, without the overhead and potential licensing\nheadaches of other options out there.\n\n### Install\n\n`pip install ocrusrex`\n\nIn addition to installing the required Python dependencies, you'll also need Tesseract v4. You'll also want to install\nthe appropriate training data. OCRUSREX is configured to use the \"best\" English LSTM training dataset. You can change this\nconfig to use fast or standard quality by overriding the tesseract_config argument. Using best improves accuracy by about \n1 to 3 percentage points based on my benchmarking, with a performance hit of ~20% more time per page. On a VirtualBox dev\nenvironment, that meant going from 2 seconds to 2.38 seconds per page when using bytes object input and output (string paths are\nmuch slower). I was willing to take that tradeoff as I hate having to re-OCR things, but your mileage may vary.\n\nIf you want to use Tesseract's \"best\" training sets, you will likely need to install them for your \nsystem. See a description of the different Tesseract models in the Tesseract Docs:\n\n* https://tesseract-ocr.github.io/tessdoc/Data-Files.html#updated-data-files-for-version-400-september-15-2017\n\n### Dependencies\n\n###### PRODUCTION\n        \nOCRUSREX relies on four core libraries, all with MIT-like\nlicenses:\n\n1) pytesseract (MIT)\n2) PyPDF2 (MIT-like)\n3) Pillow (MIT-like)\n4) uqfoundation/multiprocess (MIT-like)\n\nObviously you will need to have already installed Tesseract 4. 
Please refer to the Tesseract documentation (or pytesseract docs) for installation instructions.\n\n###### TESTS\n\nThe tests also rely on:\n\n1) python-levenshtein\n2) tika-python\n\nNeither is required for the core library to work.\n\n### Usage\n\n###### Single Threaded\n\n    from OCRUSREX import ocrusrex\n    ocrusrex.OCRPDF(source=\"\", targetPath=None, page=None, nice=5, verbose=False, tesseract_config='--oem 1 -l eng -c preserve_interword_spaces=1 textonly_pdf=1')\n\n_RETURNS_: falsy if the task fails or truthy if it succeeds. If you specify a targetPath, returns True to indicate success\nor False to indicate failure. If you don't specify a targetPath, returns a bytes object on success or None on failure. \n\n###### Multi Threaded\n\nOCRUSREX supports multithreaded execution via the core Python 3 multithreading library. Based on initial testing, this \nexecution mode enables you to divide OCR time per page vs singlethreaded by the number of threads you specify.\nFor example, assuming you have the requisite # of cores, running threads=4 will divide OCR time per page by 4, effectively\nreducing OCR time by 75%. Multithreading performance is almost identical to running separate processes. This is likely due\nto OCRUSREX always calling separate instances of Tesseract, so it's effectively calling separate processes whether you choose\nmultithreaded or multiprocessed execution.\n\n    from OCRUSREX import ocrusrex\n    ocrusrex.Multithreaded_OCRPDF(source=\"\", targetPath=None, processes=4, nice=5, verbose=False, tesseract_config='--oem 1 -l eng -c preserve_interword_spaces=1 textonly_pdf=1')\n\n_RETURNS_: falsy if the task fails or truthy if it succeeds. If you specify a targetPath, returns True to indicate success\nor False to indicate failure. If you don't specify a targetPath, returns a bytes object on success or None on failure. \n\n###### Multi Processed\n\nOCRUSREX supports multiprocessed execution via the uqfoundation/multiprocess library. 
Based on initial testing, this \nexecution mode enables you to divide OCR time per page vs singlethreaded by the number of processes you specify.\nFor example, assuming you have the requisite # of cores, running processes=4 will divide OCR time per page by 4, effectively\nreducing OCR time by 75%. This method may be removed in the future as it appears to offer no real performance advantages\nover using multithreading, yet it comes at the expense of an additional dependency.\n\n    from OCRUSREX import ocrusrex\n    ocrusrex.Multiprocessed_OCRPDF(source=\"\", targetPath=None, threads=4, nice=5, verbose=False, tesseract_config='--oem 1 -l eng -c preserve_interword_spaces=1 textonly_pdf=1')\n\n_RETURNS_: falsy if the task fails or truthy if it succeeds. If you specify a targetPath, returns True to indicate success\nor False to indicate failure. If you don't specify a targetPath, returns a bytes object on success or None on failure. \n\n##### OPTIONS:\n\n* **processes** = 4 [**_Multiprocessed Only_**]\n  * How many processes should be spawned at one time? Default is 4. Initial guidance is this should not be \u003e your # of effective cores. \n\n* **threads** = 4 [**_Multithreaded Only_**]\n  * How many threads should be used at once? Default is 4. Because the threads spawn new Tesseract instances, the multithreaded performance is quite similar to multiprocessed.\n\n* **source** = \"\"\n  * What is the target PDF to be OCRed? This can be a String containing a valid path to a PDF or a file-like\nobject.\n\n* **targetPath** = None\n  * Where should the OCRed PDF be saved? If you don't provide a value, the PDF will be returned as a bytes\n    object.\n  \n* **page** = None [**_Singlethreaded Only_**]\n  * If you provide an integer value, only a single page will be OCRed and returned. If you leave this as None\n            the entire PDF will be OCRed. 
**PDF page starting index # is 1 NOT 0**.\n\n* **nice** = 5\n   * Sets the priority of the thread in Unix-like operating systems. 5 is slightly elevated.\n\n* **verbose** = False\n   * If this is set to True, show messages in the console.\n\n* **tesseract_config** = '--oem 1 --psm 6 -l best/eng -c preserve_interword_spaces=1'\n   * This will be passed to pytesseract as the config option. These are the command line arguments\n            you would pass to Tesseract were you calling it directly. The defaults here are\n            optimized for accuracy but require you to download an additional Tesseract data file (the \"best\" one).\n            This comes at the expense of speed.\n\n* **tesseract_config** = 'print'\n   * Pass a method in for this argument to have verbose OCR messages passed to this function as the first argument. \n   If you leave it as None, the standard print() function is used.\n\n* **logger** = None\n   * If you want to pass a specific logger in to expose inner verbose messages (particularly for multithreaded and multiprocessed versions),\n   pass the logger object in here. It must support .info and .error calls. If you leave this as None, verbose uses the standard Python print() function.\n\n### FUTURE\n\nI am pretty happy with the baseline performance at this point. The tool can OCR PDFs at a rate of 1 page every 2 seconds.\n\nThe most important next feature is reducing the output file size. Output PDFs are currently substantially larger than the\nsource files. I need to do some experimentation to see if this can be done AFTER feeding the high-quality images to Tesseract.\n\nAfter that, it would be good to do some more error checking and other automated cleanup. Other libraries out there have\nsubstantial codebases dedicated just to these kinds of cleanup tasks. 
For now, I expect I'll add these types of features\norganically over time as I discover particularly hairy situations that need to be addressed or where I need some type of output\nfor myself. Contributions are welcome!","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjsv4%2Focrusrex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjsv4%2Focrusrex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjsv4%2Focrusrex/lists"}