{"id":14982519,"url":"https://github.com/drmccoy/pdftextorizer","last_synced_at":"2026-01-07T01:40:39.571Z","repository":{"id":238591329,"uuid":"796904655","full_name":"DrMcCoy/pdftextorizer","owner":"DrMcCoy","description":"Interactively extract text from multi-column PDFs","archived":false,"fork":false,"pushed_at":"2024-07-28T17:42:08.000Z","size":182,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-27T23:45:01.166Z","etag":null,"topics":["gui","pdf","pdf-extractor","pdf-files","pdf2text","pdftotext","pyqt5","qt5"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DrMcCoy.png","metadata":{"files":{"readme":"README.md","changelog":"ChangeLog","contributing":null,"funding":null,"license":"COPYING","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS","dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-06T20:54:22.000Z","updated_at":"2024-07-28T17:44:02.000Z","dependencies_parsed_at":"2024-05-08T23:32:23.792Z","dependency_job_id":"991e980b-c66f-4ca1-a01f-1d60525a4b98","html_url":"https://github.com/DrMcCoy/pdftextorizer","commit_stats":{"total_commits":50,"total_committers":1,"mean_commits":50.0,"dds":0.0,"last_synced_commit":"f9cbc17d04c75723d1ccfcb4f78dfc77e5beffa2"},"previous_names":["drmccoy/pdftextorizer"],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DrMcCoy%2Fpdftextorizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DrMcCoy%2Fpdftextorizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/DrMcCoy%2Fpdftextorizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DrMcCoy%2Fpdftextorizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DrMcCoy","download_url":"https://codeload.github.com/DrMcCoy/pdftextorizer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245944062,"owners_count":20697948,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gui","pdf","pdf-extractor","pdf-files","pdf2text","pdftotext","pyqt5","qt5"],"created_at":"2024-09-24T14:05:34.223Z","updated_at":"2026-01-07T01:40:39.544Z","avatar_url":"https://github.com/DrMcCoy.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"PDF Textorizer README\n=====================\n\n[TOC]\n\nPDF Textorizer is a GUI application to interactively extract text from\nmulti-column PDFs, licensed under the terms of the\n[GNU Affero General Public License version 3](https://www.gnu.org/licenses/agpl.html)\n(or later), written in Python.\n\n\nWhat does PDF Textorizer do?\n----------------------------\n\nPDF Textorizer loads PDF files and displays them, page by page. It\nautomatically detects columns of text in a PDF with a multi-column layout and\nmarks these regions on the currently displayed page.\n\nThe user can click on a region to select it. It can then be modified: the user\ncan move it across the page, change its size, move it up and down in the order\nof regions and even delete it. 
Additionally, the user can add new regions by\ndrawing them over the current page. And to save the current progress, the whole\nset of regions in the PDF file can be saved into a JSON file and loaded back in\nlater.\n\nOnce the user is happy with the current setup, the text found in the regions\ncan be grabbed and exported, in order. As a quick preview, the current\npage alone can be either saved into a file or printed to standard output, and for the\nfinal pass, the whole PDF can be converted.\n\nWhy?\n----\n\nPDFs with complex layouts, especially multiple columns, are notoriously\ndifficult to copy text from. Often, copying grabs text from neighbouring\ncolumns or runs into similar issues.\n\nAll that and worse can be found in tabletop roleplaying PDFs, with their\nuber-complex layouts that include text flowing around images and styled\nstatblocks. If you want to, for example, copy flavour text for ease of\ntranslation or to have it readily available during a session, those issues\nmake that pretty annoying.\n\n[PyMuPDF has example code to detect columns](https://artifex.com/blog/extract-text-from-a-multi-column-document-using-pymupdf-inpython),\nand while the results are promising, they're not perfect. You really want\nto fix the remaining issues before grabbing the text. And that's best\ndone interactively in a GUI. 
Hence, PDF Textorizer.\n\nKeyboard shortcuts\n------------------\n\nPDF Textorizer offers global keyboard shortcuts for all the operations\nin the main menu.\n\n| Shortcut        | Command                   | Explanation                                                              |\n| --------------- | --------------------------|--------------------------------------------------------------------------|\n| Ctrl+O          | Open PDF                  | Open a new PDF file                                                      |\n| Ctrl+W          | Close PDF                 | Close the currently opened PDF file                                      |\n| Ctrl+Shift+O    | Load Regions              | Load a previously saved regions file                                     |\n| Ctrl+Shift+S    | Save Regions As...        | Save the current regions into a new file                                 |\n| Ctrl+S          | Save Regions              | Save the current regions to the current regions file                     |\n| Ctrl+P          | Convert Page to Text      | Convert all regions of the current page to text and print it to stdout   |\n| Ctrl+T          | Save Page to Text         | Convert all regions of the current page to text and write it into a file |\n| Ctrl+Shift+T    | Save All Pages to Text    | Convert all regions of all pages to text and write it into a file        |\n| Ctrl+Q          | Quit                      | Quit PDF Textorizer                                                      |\n| Shift+F1        | About PDF Textorizer      | Show an about box                                                        |\n\nKeyboard commands\n-----------------\n\nIn addition to the global keyboard shortcuts advertised in the main menu,\nPDF Textorizer supports a set of keyboard shortcuts for modifying regions.\n\nNOTE: These shortcuts only work when the page view has the focus. Give it\n      the focus by, for example, clicking on it first!\n\n| Key Combination | Command              
                        |\n| --------------- | ---------------------------------------------|\n| K               | Move to the first page in the PDF            |\n| L               | Move to the previous page in the PDF         |\n| J               | Move to the next page in the PDF             |\n| H               | Move to the last page in the PDF             |\n| Shift+K         | Select the first region on the page          |\n| Shift+L         | Select the previous region on the page       |\n| Shift+J         | Select the next region on the page           |\n| Shift+H         | Select the last region on the page           |\n| Left            | Move the selected region left                |\n| Right           | Move the selected region right               |\n| Up              | Move the selected region up                  |\n| Down            | Move the selected region down                |\n| Shift+Left      | Grow the selected region horizontally        |\n| Shift+Right     | Shrink the selected region horizontally      |\n| Shift+Up        | Grow the selected region vertically          |\n| Shift+Down      | Shrink the selected region vertically        |\n| Page Up         | Move the selected region \"up\" in the stack   |\n| Page Down       | Move the selected region \"down\" in the stack |\n| Insert          | Add a new region                             |\n| Delete          | Remove the selected region                   |\n| R               | Redraw the selected region                   |\n\nInstallation\n------------\n\nTo install PDF Textorizer system-wide, use\n```\npip install .\n```\n\nTo install PDF Textorizer for the current user only, use\n```\npip install --user .\n```\n\nTo install PDF Textorizer in a virtualenv, use\n```\npython -m venv env\nsource env/bin/activate\npip install .\n```\n\nOptionally, the included [Makefile](Makefile) can be leveraged to install and\nrun PDF Textorizer from a virtualenv. 
Please read the [Makefile](Makefile) itself\nto understand what it can do.\n\nA typical example would be:\n\n```\nmake run\n```\n\nThis would install PDF Textorizer into a virtualenv and run it.\n\nA more elaborate example:\n\n```\nPYTHON=python3 make run arg=\"-h\"\n```\n\nThis would install PDF Textorizer into a virtualenv and run it using \"python3\" as\nthe Python environment, with the command line parameter \"-h\" (thus showing\nthe help text).\n\n\nCommand line usage\n------------------\n\n```\nusage: pdftextorizer [-h] [-v] [-p PAGE] [pdf_file] [regions_file]\n\npdftextorizer -- Interactively extract text from multi-column PDFs\n\npositional arguments:\n  pdf_file              PDF file to open\n  regions_file          Regions file to load\n\noptions:\n  -h, --help            Show this help message and exit\n  -v, --version         Print the version and exit\n\nPDF arguments:\n  -p PAGE, --page PAGE  Open the PDF file directly on this page\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdrmccoy%2Fpdftextorizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdrmccoy%2Fpdftextorizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdrmccoy%2Fpdftextorizer/lists"}