{"id":19721028,"url":"https://github.com/paramsiddharth/pdf2text","last_synced_at":"2025-09-18T17:05:29.993Z","repository":{"id":56220043,"uuid":"308841424","full_name":"paramsiddharth/pdf2text","owner":"paramsiddharth","description":"An application that can extract editable as well as scanned text from PDF files.","archived":false,"fork":false,"pushed_at":"2024-10-05T22:47:48.000Z","size":13,"stargazers_count":1,"open_issues_count":0,"forks_count":3,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-08-22T06:41:14.035Z","etag":null,"topics":["hacktoberfest","ocr","pdf","text-extraction"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/paramsiddharth.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-10-31T08:53:14.000Z","updated_at":"2025-08-08T03:12:55.000Z","dependencies_parsed_at":"2025-04-19T20:24:15.971Z","dependency_job_id":"aff9cb3e-c8f0-466d-b90f-91f7fc35526f","html_url":"https://github.com/paramsiddharth/pdf2text","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/paramsiddharth/pdf2text","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/paramsiddharth%2Fpdf2text","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/paramsiddharth%2Fpdf2text/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/paramsiddharth%2Fpdf2text/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositor
ies/paramsiddharth%2Fpdf2text/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/paramsiddharth","download_url":"https://codeload.github.com/paramsiddharth/pdf2text/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/paramsiddharth%2Fpdf2text/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":275690337,"owners_count":25510497,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-17T02:00:09.119Z","response_time":84,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["hacktoberfest","ocr","pdf","text-extraction"],"created_at":"2024-11-11T23:13:10.638Z","updated_at":"2025-09-18T17:05:29.952Z","avatar_url":"https://github.com/paramsiddharth.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# The Ultimate PDF to Text Converter\r\nAn application that can extract editable as well as scanned text from PDF files.\r\n\r\n## Execution\r\nExecute `main.py` using Python 3.8+.\r\n```\r\npython main.py\r\n```\r\n\r\n## Prerequisites and Dependencies\r\n- PyPI requirements can be installed via `requirements.txt`.\r\n  ```\r\n  pip3 install -r requirements.txt\r\n  ```\r\n  `sudo` might be required for allowing the modification of the PATH environment variable.\r\n- [Tesseract](https://github.com/tesseract-ocr/tesseract) must be installed and added to 
PATH for text recognition. It can be downloaded and installed from [here](https://github.com/tesseract-ocr/tesseract/releases).\r\n\r\n  If it isn't in the PATH\r\n  environment variable, add it, or set/export TESSERACT_CMD\r\n  as the location of the Tesseract executable binary.\r\n  E.g.\r\n  - Bash\r\n    ``` bash\r\n    export TESSERACT_CMD='/path/to/tesseract'\r\n    ```\r\n  - PowerShell\r\n    ``` powershell\r\n    $Env:Tesseract_Cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'\r\n    ```\r\n  - Windows Command Prompt\r\n    ``` cmd\r\n    set \"TESSERACT_CMD=C:\\Program Files\\Tesseract-OCR\\tesseract.exe\"\r\n    ```\r\n\r\n- [Poppler](https://poppler.freedesktop.org/) must be installed and added to PATH for text recognition to work. Below are instructions for various operating systems.\r\n  - Windows users can download and install it [here](https://blog.alivate.com.au/poppler-windows/).\r\n  - macOS users can install it using Homebrew.\r\n    ```\r\n    brew install poppler\r\n    ```\r\n  - Most Linux distributions already ship with Poppler. If absent, it can be installed manually, e.g. on Debian-based distributions:\r\n    ```\r\n    sudo apt install poppler-utils\r\n    ```\r\n\r\n## Behind the idea\r\n- [Tesseract PDF OCR](https://github.com/Manvendra2000/Reader) : A simple application that reads a PDF file and parses the text using Tesseract OCR. _– [Manvendra](https://github.com/Manvendra2000), [Param](http://www.paramsid.com/)._\r\n- [PDF File Reader](https://github.com/HarshMarolia/Pdf-File-Reader) : A simple Python application that reads out textual PDF files. 
_– [Harsh](https://github.com/HarshMarolia), [Param](http://www.paramsid.com/)._\r\n\r\n### Made with ❤ by [Manvendra](https://github.com/Manvendra2000), [Harsh](https://github.com/HarshMarolia), [Param](http://www.paramsid.com/), and [Medha](https://github.com/tmedha).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fparamsiddharth%2Fpdf2text","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fparamsiddharth%2Fpdf2text","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fparamsiddharth%2Fpdf2text/lists"}