{"id":27133128,"url":"https://github.com/erikkastelec/pdfscraper","last_synced_at":"2025-06-26T09:03:28.432Z","repository":{"id":37670920,"uuid":"254577667","full_name":"erikkastelec/PDFScraper","owner":"erikkastelec","description":"CLI program for searching inside text and tables in PDF documents and displaying results in HTML. ","archived":false,"fork":false,"pushed_at":"2024-02-07T18:49:55.000Z","size":126,"stargazers_count":5,"open_issues_count":4,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-02T06:51:27.546Z","etag":null,"topics":["camelot","ocr","ocr-analysis","pdf-documents","pdfminer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/erikkastelec.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2020-04-10T07:59:11.000Z","updated_at":"2023-07-07T12:21:39.000Z","dependencies_parsed_at":"2024-02-07T19:59:52.988Z","dependency_job_id":null,"html_url":"https://github.com/erikkastelec/PDFScraper","commit_stats":{"total_commits":41,"total_committers":3,"mean_commits":"13.666666666666666","dds":"0.12195121951219512","last_synced_commit":"0c3e9131b5d3db2dc021f50528fb9f00caaf052b"},"previous_names":[],"tags_count":16,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erikkastelec%2FPDFScraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erikkastelec%2FPDFScraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erikkastelec%2FPDFScraper/release
s","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erikkastelec%2FPDFScraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/erikkastelec","download_url":"https://codeload.github.com/erikkastelec/PDFScraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247744135,"owners_count":20988778,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["camelot","ocr","ocr-analysis","pdf-documents","pdfminer"],"created_at":"2025-04-07T22:38:39.125Z","updated_at":"2025-04-07T22:38:39.959Z","avatar_url":"https://github.com/erikkastelec.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PDFScraper\n[![PyPI version](https://badge.fury.io/py/PDFScraper.svg)](https://badge.fury.io/py/PDFScraper)\n\nCLI program and library for extraction of PDF elements, which implements a search functionality that outputs summary in an HTML format. 
It combines [Pdfminer.six](https://github.com/pdfminer/pdfminer.six), [Camelot](https://github.com/camelot-dev/camelot) and [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) in a single, simple-to-use program.\n\n# How to use\n### Install using pip\n\nUse pip to install PDFScraper:\n\n\u003cpre\u003e\n$ pip install PDFScraper\n\u003c/pre\u003e\n\n### Arguments\n\u003cpre\u003e\noptional arguments:\n  -h, --help            show this help message and exit\n  --path PATH           path to pdf folder or file\n  --out OUT             path to output file location\n  --log_level {critical,error,warning,info,debug}\n                        logger level to use (default: info)\n  --search SEARCH       word to search for\n  --tessdata TESSDATA   location of tesseract data files\n  --tables TABLES       should tables be extracted and searched\n  --search_mode SEARCH_MODE\n                        And or Or search, when multiple search words are\n                        provided\n  --multiprocessing MULTIPROCESSING\n                        should multiprocessing be enabled\n\u003c/pre\u003e\n\n\n\n`path`, by default \".\", specifies the location of the PDF file or folder.\n\n`out`, by default \".\", specifies the output directory in which the `summary.html` file is created.\n\n`search` specifies the word or sentence to search for in the PDF documents.\n\n`tessdata` can be used to specify a custom tessdata location for OCR analysis.\n\n`tables`, by default True, specifies whether tables are extracted and searched for the search term. Disabling table search improves speed significantly.\n\n`search_mode`, by default 'and', specifies how multiple search terms are matched against a paragraph. In 'or' mode, the paragraph is returned if it contains any of the terms. 
In 'and' mode, the paragraph is returned only if it contains all of the terms.\n\n`multiprocessing`, by default True, runs the processing in parallel to speed it up. **Do not use it with OCR, as this significantly decreases performance.**\n\n### OCR\n\n**Pretrained tessdata language [files](https://github.com/tesseract-ocr/tessdata_best) must be added manually to the tessdata directory.**\n\n\nOCR analysis of PDF documents currently supports English and Slovenian.\nThe language of the document is detected automatically using the [langdetect library](https://github.com/Mimino666/langdetect).\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ferikkastelec%2Fpdfscraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ferikkastelec%2Fpdfscraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ferikkastelec%2Fpdfscraper/lists"}