{"id":47498207,"url":"https://github.com/raphaelramosds/pdf-to-txt","last_synced_at":"2026-03-27T04:02:37.303Z","repository":{"id":292406553,"uuid":"980813151","full_name":"raphaelramosds/pdf-to-txt","owner":"raphaelramosds","description":"PdfToTxt is a simple package for converting a PDF file into TXT","archived":false,"fork":false,"pushed_at":"2025-09-03T14:15:06.000Z","size":1140,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-03T22:54:05.549Z","etag":null,"topics":["imagickphp","tesseractocr"],"latest_commit_sha":null,"homepage":"https://packagist.org/packages/raphaelramosds/pdf-to-txt","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/raphaelramosds.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-05-09T19:00:02.000Z","updated_at":"2025-09-03T14:15:10.000Z","dependencies_parsed_at":"2025-09-03T16:22:18.935Z","dependency_job_id":null,"html_url":"https://github.com/raphaelramosds/pdf-to-txt","commit_stats":null,"previous_names":["raphaelramosds/pdf-to-txt"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/raphaelramosds/pdf-to-txt","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelramosds%2Fpdf-to-txt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelramosds%2Fpdf-to-txt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositor
ies/raphaelramosds%2Fpdf-to-txt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelramosds%2Fpdf-to-txt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/raphaelramosds","download_url":"https://codeload.github.com/raphaelramosds/pdf-to-txt/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelramosds%2Fpdf-to-txt/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31018544,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-27T03:51:26.850Z","status":"ssl_error","status_checked_at":"2026-03-27T03:51:09.693Z","response_time":164,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["imagickphp","tesseractocr"],"created_at":"2026-03-27T04:02:34.874Z","updated_at":"2026-03-27T04:02:37.292Z","avatar_url":"https://github.com/raphaelramosds.png","language":"PHP","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PdfToTxt\n\nPdfToTxt is a simple package for converting a PDF file into TXT with PHP\n\n```bash\ncomposer require raphaelramosds/pdf-to-txt\n```\n\n## Dependencies\n\nUnfortunately, this package can only be used in a Linux environment. 
Additionally, you will need to install the following dependencies:\n\n### Tesseract OCR package\n\n```bash\n# Install Tesseract OCR with support for the PT-BR language\nsudo apt install tesseract-ocr tesseract-ocr-por\n```\n\n### ImageMagick\n\n```bash\n# Install\nsudo apt install imagemagick php-imagick\n\n# Enable imagick extension\nsudo phpenmod imagick\n\n# (Optional) Check if it is enabled\nphp -m | grep imagick\n```\n\n## How does it work?\n\nIt uses ImageMagick to convert every PDF page into a JPG image, extracts the text of each image using Tesseract OCR, and compiles the results into a single TXT file.\n\n### Ghostscript support?\n\nWhile some PDF files use standard fonts that can be easily mapped to text, others rely on custom fonts, which often store characters as vector graphics. In such cases, OCR becomes necessary to extract readable content. Therefore, in the future, I plan to add Ghostscript support to this package as an alternative method for handling these PDFs without relying solely on OCR.\n\nYou can use the following Ghostscript command to convert a PDF into a plain text file:\n\n```bash\ngs -sDEVICE=txtwrite -o file.txt file.pdf\n```\n\nBefore using this approach, it's recommended to check which fonts are used in the PDF. You can do that with the following command:\n\n```bash\ngs -DPDFINFO file.pdf\n```\n\n## Example\n\nConvert `file.pdf` into `file.txt` and save it in the `path/to/txt` directory:\n\n```php\n$ptt = new PdfToTxt('path/to/file.pdf', 'path/to/txt', 'file');\n$ptt-\u003econvert();\n```\n\n## Tests\n\nUnit tests are written with [PHPUnit](https://phpunit.de/):\n\n```bash\n./vendor/bin/phpunit tests\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraphaelramosds%2Fpdf-to-txt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fraphaelramosds%2Fpdf-to-txt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraphaelramosds%2Fpdf-to-txt/lists"}