{"id":30336557,"url":"https://github.com/homebackend/pdf-title-page-splitter","last_synced_at":"2026-05-04T09:31:31.967Z","repository":{"id":310236961,"uuid":"1032607048","full_name":"homebackend/pdf-title-page-splitter","owner":"homebackend","description":"Splits a pdf based on identified title pages using ML trained model","archived":false,"fork":false,"pushed_at":"2025-08-16T16:54:58.000Z","size":1257,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-12T23:03:58.171Z","etag":null,"topics":["machine-learning","opencv","pdf-splitter","pdf2image","pypdf2","scikit-learn","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/homebackend.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-08-05T14:50:49.000Z","updated_at":"2025-08-16T18:08:15.000Z","dependencies_parsed_at":"2025-08-16T18:31:15.935Z","dependency_job_id":"a8e62b3d-bd1f-4c97-8d42-d73c4e856ae1","html_url":"https://github.com/homebackend/pdf-title-page-splitter","commit_stats":null,"previous_names":["homebackend/pdf-title-page-splitter"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/homebackend/pdf-title-page-splitter","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/homebackend%2Fpdf-title-page-splitter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/homebackend%2Fpdf-title-page-splitter/tags","releases_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/homebackend%2Fpdf-title-page-splitter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/homebackend%2Fpdf-title-page-splitter/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/homebackend","download_url":"https://codeload.github.com/homebackend/pdf-title-page-splitter/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/homebackend%2Fpdf-title-page-splitter/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32601491,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-03T22:12:39.696Z","status":"online","status_checked_at":"2026-05-04T02:00:06.625Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","opencv","pdf-splitter","pdf2image","pypdf2","scikit-learn","tensorflow"],"created_at":"2025-08-18T05:22:17.147Z","updated_at":"2026-05-04T09:31:31.951Z","avatar_url":"https://github.com/homebackend.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Contributors][contributors-shield]][contributors-url]\n[![Forks][forks-shield]][forks-url]\n[![Stargazers][stars-shield]][stars-url]\n[![Issues][issues-shield]][issues-url]\n[![MIT License][license-shield]][license-url]\n[![LinkedIn][linkedin-shield]][linkedin-url]\n\n# 
pdf-title-page-splitter\n**pdf-title-page-splitter** is a command line tool (with limited UI support) that splits a pdf based on identified title pages. The title pages are identified using a machine learning model. The tool supports both training a model and using a trained model to split pdf files.\n\n## But first, what is the need for this tool anyway?\nThere is a plethora of tools that allow splitting of pdf files, so why this tool?\n\nConsider a pdf like this: [Economic and Political Weekly - Volume 26](https://archive.org/details/dli.bengal.10689.12140/). It has 1200 pages and 210.4 MB of data, which makes such pdf files notoriously difficult to read and handle. This pdf file actually contains multiple issues of volume 26 of [Economic and Political Weekly](https://www.epw.in/). So ideally it should be split into multiple pdf files, one for each issue of the volume.\n\nBut this still does not answer the original question. One can easily split a pdf using any other tool. However, if you have to do this for [hundreds of such files](https://archive.org/search?query=economic+and+political+weekly) the task becomes daunting. 
This is where the tool helps: it trains an ML model that identifies title pages within a given pdf and splits the pdf into multiple pdfs, one for each issue.\n\nIn the above specific example, the following is a title page:\n\n\u003cimg src=\"https://ia800301.us.archive.org/BookReader/BookReaderImages.php?zip=/13/items/dli.bengal.10689.12140/10689.12140_jp2.zip\u0026file=10689.12140_jp2/10689.12140_0005.jp2\u0026id=dli.bengal.10689.12140\u0026scale=2\u0026rotate=0\" alt=\"Title page\" width=\"200\"/\u003e\n\nUsing **pdf-title-page-splitter** we train a model to identify such title pages and split the pdf into multiple issues.\n\n# Installation\n## Requirements\n\n**pdf-title-page-splitter** requires *python 3* and a number of other dependencies listed in the `requirements.txt` file.\n\n## Setup\nBefore running **pdf-title-page-splitter** the python environment needs to be set up correctly. Here we create a python virtual environment and install all the dependencies. The instructions are provided for *Linux*, but they should be identical for any *UNIX*-like operating system.\n\n### Create virtual environment and install dependencies\n\nChange to the folder/directory containing `requirements.txt` and run the following:\n\n```bash\npython -m venv venv\n. venv/bin/activate\npip install -r requirements.txt\n```\n### Activating virtual environment\n\nCreating the virtual environment and installing dependencies is a one-time process. In subsequent runs you just need to activate the virtual environment:\n```bash\n. venv/bin/activate\n```\n\nTo deactivate the virtual environment run the command: `deactivate`.\n\n# Usage\n\n## Creating model file\n\nThe model is trained using the **create** command. 
It supports the following command line options:\n\n```bash\n$ python pdf-title-page-splitter.py create -h\nusage: pdf-title-page-splitter.py create [-h] [-s SAVE_PATH] [-p PARALLELISM] [--pdf pdf [title-pages ...]] files [files ...]\n\npositional arguments:\n  files                 Pdf files to be used\n\noptions:\n  -h, --help            show this help message and exit\n  -s, --save-path SAVE_PATH\n                        Save path (default: model.pkl)\n  -p, --parallelism PARALLELISM\n                        Number of parallel pages to process (default: number of cores)\n  --pdf pdf [title-pages ...]\n                        Specify a file and comma separated title pages pair. Can be used multiple times.\n```\n\nThe above command creates a model file **model.pkl**, trained on the specified pdf files, that predicts title pages for an unseen pdf.\n\n### Example run\n\nConsider the case where you have identified the title pages of some pdfs. You can create a model like so:\n\n```bash\npython3 pdf-title-page-splitter.py \\\n    create \\\n    --save-path model.pkl \\\n    --pdf 'first-pdf.pdf' 5 69 143 201 239 312 \\\n    --pdf 'second-pdf.pdf' 2 45 100 189 234 301\n```\n\nHere page numbers 5, 69, 143, 201, 239 and 312 are the title pages identified in *first-pdf.pdf*. Likewise for the other pdf.\n\n### Another run\n\nIf instead you have a bunch of pdf files whose first page is the title page (essentially files you have already split), you can use the following command to create a model file **model.pkl**.\n\n```bash\npython3 pdf-title-page-splitter.py \\\n    create \\\n    --save-path model.pkl \\\n    'first-pdf.pdf' \\\n    'second-pdf.pdf' \\\n    'third-pdf.pdf' \\\n    'fourth-pdf.pdf'\n```\n\n## Predicting title pages\n\nThe **predict** command can be used to predict *title pages* for a bunch of pdfs given a model file generated by the **create** command. 
The title pages identified are saved in JSON format (by default filename is *titles.json*) for subsequent processing.\n\n```bash\n$ python pdf-title-page-splitter.py predict -h\nusage: pdf-title-page-splitter.py predict [-h] [-m MODEL_PATH] [-s SAVE_PATH] [-b BEGINNING_PAGE] [-e ENDING_PAGE]\n                                          [-p PARALLELISM]\n                                          pdf [pdf ...]\n\npositional arguments:\n  pdf                   Pdf files to be used\n\noptions:\n  -h, --help            show this help message and exit\n  -m, --model-path MODEL_PATH\n                        Model file path (default: model.pkl)\n  -s, --save-path SAVE_PATH\n                        Save path (default: titles.json)\n  -b, --beginning-page BEGINNING_PAGE\n                        Starting page of pdf file (default=1)\n  -e, --ending-page ENDING_PAGE\n                        Ending page of pdf file (default=last page)\n  -p, --parallelism PARALLELISM\n                        Number of parallel pages to process (default: number of cores)\n```\n\n### Example run\nThe following command identifies title pages in all pdf files in */tmp* directory and saves the result to *my-titles.json*.\n\n```bash\npython3 pdf-title-page-splitter.py \\\n    predict \\\n    --model-path model.pkl \\\n    --save-path my-titles.json \\\n    /tmp/*.pdf\n```\n\nA sample *titles.json* file generated could be:\n\n```json\n{\n    \"Economic.And.Political.Volume.xviii.No27.pdf\": [],\n    \"Economic And Political Weekly Vol.-xxii-no.49-inernetdli2015121078.pdf\": [\n        2,\n        58,\n        114,\n        170\n    ],\n    \"Economic And Political Weekly Vol-XXVIII -- Sachin Chaudhuri -- 1995 -- Economic And Political Weekly Vol-XXVIII -- aa94fb82ae80ea6297ec3a739a400d22 -- Anna\\u2019s Archive.pdf\": [\n        5,\n        77,\n        149,\n        261,\n        325,\n        389,\n        495,\n        555,\n        623\n    ]\n}\n```\n\nEssentially it contains page numbers of 
identified *title pages*. If no *title page* was identified, the list is empty (for example: *Economic.And.Political.Volume.xviii.No27.pdf* above).\n\n## Show title pages\n\nOnce the *title pages* have been identified by **pdf-title-page-splitter** the next step is to visually check whether the identified *title pages* are correct and to fix any mistakes.\n\nThe **show** command presents each *title page* to you, and you can accept, reject or substitute the given *title page* for each pdf file in *titles.json*. The **show** command itself is split into two sub commands, viz. **run** and **from**.\n\nThe subcommand **run** combines the **predict** and **show** commands into a single step.\n\nThe subcommand **from** reads the *titles.json* file produced by the **predict** command, as described in the prior section, and presents the UI described in this section.\n\nNote that, though this is not recommended, you can skip this step and go directly to splitting the pdfs.\n\nCommand line options for the show command:\n\n```bash\n$ python pdf-title-page-splitter.py show -h\nusage: pdf-title-page-splitter.py show [-h] {run,from} ...\n\npositional arguments:\n  {run,from}  Available sub commands\n    run       Run predict and show pages\n    from      Load saved data from file and show pages\n\noptions:\n  -h, --help  show this help message and exit\n```\n\n### Show title pages from *titles.json*\n\nThe supported command line options are as follows:\n\n```bash\n$ python pdf-title-page-splitter.py show from -h\nusage: pdf-title-page-splitter.py show from [-h] [-l LOAD_FROM_FILE] [-s SAVE_PATH]\n\noptions:\n  -h, --help            show this help message and exit\n  -l, --load-from-file LOAD_FROM_FILE\n                        Load title pages and pdf from file (default: titles.json)\n  -s, --save-path SAVE_PATH\n                        Save path (default: splits.json)\n```\n\nHere the **-l** option loads predicted *title page* data as generated 
using the **predict** command.\n\nThe **-s** option specifies the file (by default *splits.json*) that stores the *title pages* remaining after the user has accepted, rejected and/or substituted *title pages*. This file is then used to split the pdf files. Note that *splits.json* has the same format as *titles.json*.\n\n### UI and user interaction\n\nFor each *title page* the user is presented with a window showing it. The user can take the following actions:\n\n- **-\u003e** (right arrow key): moves to next *title page* (current page is retained)\n- **\u003c-** (left arrow key): moves to previous *title page* (current page is retained)\n- **x**: delete current *title page* (during split this will be treated as non *title page*)\n- **r**: replace current *title page* (this puts the user into page replacement mode)\n- **n**: moves to next pdf file (if there is no next file - you will be asked if you want to save changes)\n- **s**: save and quit (all changes are saved into file specified using **-s** command line option)\n- **q**: quit without saving (no changes are saved)\n\nIn replacement mode (using **r** key above) the following actions are supported:\n\n- **-\u003e** (right arrow key): moves to next page\n- **\u003c-** (left arrow key): moves to previous page\n- **s**: save current page as replacement page for the *title page*\n- **q**: quit page replacement mode (the same *title page* is retained - you will be dropped to same *title page* - which can be retained, rejected or substituted again, if desired)\n\n[Title Page](doc/title-page.png \"Title page\")\n[Wrong Title Page](doc/wrong-title-page.png \"Wrong Title Page\")\n[Replacement Page](doc/replacement-page.png \"Replacement page\")\n\n\n## Split pdf files\n\nAs a final step you can proceed to split the pdf files. 
Till now no actual pdf files were written.\n\nThe supported command line options are as follows:\n\n```bash\n$ python pdf-title-page-splitter.py split -h\nusage: pdf-title-page-splitter.py split [-h] {run,from} ...\n\npositional arguments:\n  {run,from}  Available sub commands\n    run       Run predict and split pages\n    from      Load saved data from file and show pages\n\noptions:\n  -h, --help  show this help message and exit\n```\n\nThe **run** command executes **predict**, and then splits the files. There is no **show** command.\n\nThe **from** command splits the pdf files based on *splits.json* file.\n\n### **from** command\n\n**from** command supports the following options:\n\n```bash\n$ python pdf-title-page-splitter.py split from -h\nusage: pdf-title-page-splitter.py split from [-h] [-l LOAD_FROM_FILE] [--force] [--move-original-to MOVE_ORIGINAL_TO]\n                                             [--split-destination SPLIT_DESTINATION] [--noop] [--move-singles MOVE_SINGLES]\n\noptions:\n  -h, --help            show this help message and exit\n  -l, --load-from-file LOAD_FROM_FILE\n                        Load title pages and pdf from file (default: splits.json)\n  --force               Force overwriting of split files (default skips file if it exists)\n  --move-original-to MOVE_ORIGINAL_TO\n                        Post split move the file to specified diretory (default do not move)\n  --split-destination SPLIT_DESTINATION\n                        Destination directory for split files (default same as source file)\n  --noop                Make no actual changes (default make changes)\n  --move-singles MOVE_SINGLES\n                        Move files that contain only single title page and that too as first page into the specified\n                        directory (default is to not move)\n```\n\n### Example run\n\n```bash\npython3 pdf-title-page-splitter.py split \\\n    from \\\n    --move-original-to splitted \\\n    --split-destination splits \\\n    
--move-singles splits \\\n    --load-from-file splits.json\n```\n\nAfter running the command the output would be like so:\n\n```bash\n$ ls -1 splits\n'Economic \u0026 Political Weekly  June 16-23 1900: Vol 25 24-25-economicpoliticalweekly_june16231900_25_2425.pdf'\nEconomic.And.Political.Weekly.Vol-XXV.No-27_split_0000.pdf\nEconomic.And.Political.Weekly.Vol-XXV.No-27_split_0001.pdf\nEconomic.And.Political.Weekly.Vol-XXV.No-27_split_0002.pdf\nEconomic.And.Political.Weekly.Vol-XXV.No-27_split_0003.pdf\nEconomic.And.Political.Weekly.Vol-XXV.No-27_split_0004.pdf\nEconomic.And.Political.Weekly.Vol-XXV.No-27_split_0005.pdf\nEconomic.And.Political.Weekly.Vol-XXV.No-27_split_0006.pdf\nEconomic.And.Political.Weekly.Vol-XXV.No-27_split_0007.pdf\nEconomic.And.Political.Weekly.Vol-XXV.No-27_split_0008.pdf\nEconomic.And.Political.Weekly.Vol-XXV.No-27_split_0009.pdf\nEconomic.And.Political.Weekly.Vol-XXV.No-27_split_0010.pdf\nEconomic.And.Political.Weekly.Vol-XXV.No-27_split_0011.pdf\nEconomic.And.Political.Weekly.Vol-XXV.No-27_split_0012.pdf\nEconomic.And.Political.Weekly.Vol-XXV_split_0001.pdf\nEconomic.And.Political.Weekly.Vol-XXV_split_0002.pdf\nEconomic.And.Political.Weekly.Vol-XXV_split_0003.pdf\nEconomic.And.Political.Weekly.Vol-XXV_split_0004.pdf\nEconomic.And.Political.Weekly.Vol-XXV_split_0005.pdf\nEconomic.And.Political.Weekly.Vol-XXV_split_0006.pdf\nEconomic.And.Political.Weekly.Vol-XXV_split_0007.pdf\nEconomic.And.Political.Weekly.Vol-XXV_split_0008.pdf\nEconomic.And.Political.Weekly.Vol-XXV_split_0009.pdf\nEconomic.And.Political.Weekly.Vol-XXV_split_0010.pdf\nEconomic.And.Political.Weekly.Vol-XXV_split_0011.pdf\nEconomic.And.Political.Weekly.Vol-XXV_split_0012.pdf\n```\n\nNote in the above output files named **_0000.pdf** are files created from page number 1 to first *title page*. 
If the first *title page* of a file is page number 1, no **_0000.pdf** file is created for it.\nAlso, if a file has no splits defined it will be kept as is.\n\n\u003c!-- MARKDOWN LINKS \u0026 IMAGES --\u003e\n\u003c!-- https://www.markdownguide.org/basic-syntax/#reference-style-links --\u003e\n[contributors-shield]: https://img.shields.io/github/contributors/homebackend/pdf-title-page-splitter.svg?style=for-the-badge\n[contributors-url]: https://github.com/homebackend/pdf-title-page-splitter/graphs/contributors\n[forks-shield]: https://img.shields.io/github/forks/homebackend/pdf-title-page-splitter.svg?style=for-the-badge\n[forks-url]: https://github.com/homebackend/pdf-title-page-splitter/network/members\n[stars-shield]: https://img.shields.io/github/stars/homebackend/pdf-title-page-splitter.svg?style=for-the-badge\n[stars-url]: https://github.com/homebackend/pdf-title-page-splitter/stargazers\n[issues-shield]: https://img.shields.io/github/issues/homebackend/pdf-title-page-splitter.svg?style=for-the-badge\n[issues-url]: https://github.com/homebackend/pdf-title-page-splitter/issues\n[license-shield]: https://img.shields.io/github/license/homebackend/pdf-title-page-splitter.svg?style=for-the-badge\n[license-url]: https://github.com/homebackend/pdf-title-page-splitter/blob/main/LICENSE\n[linkedin-shield]: https://img.shields.io/badge/-LinkedIn-black.svg?style=for-the-badge\u0026logo=linkedin\u0026colorB=555\n[linkedin-url]: https://linkedin.com/in/neeraj-jakhar-39686212b\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhomebackend%2Fpdf-title-page-splitter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhomebackend%2Fpdf-title-page-splitter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhomebackend%2Fpdf-title-page-splitter/lists"}