{"id":18268721,"url":"https://github.com/whoiskatrin/financial-statement-pdf-extractor","last_synced_at":"2025-04-04T23:31:13.966Z","repository":{"id":40966365,"uuid":"251388095","full_name":"whoiskatrin/financial-statement-pdf-extractor","owner":"whoiskatrin","description":"Python script to extract as much structured information as possible from annual/quarterly reports.","archived":false,"fork":false,"pushed_at":"2024-01-15T11:54:33.000Z","size":18,"stargazers_count":98,"open_issues_count":0,"forks_count":24,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-20T20:17:43.040Z","etag":null,"topics":["balance-sheet","cash-flow","cash-flow-statement","data-processing","extract","financial-analysis","financial-statements","pdf","quarterly-reports"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/whoiskatrin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-03-30T18:05:55.000Z","updated_at":"2025-02-02T20:58:57.000Z","dependencies_parsed_at":"2024-11-05T11:39:40.689Z","dependency_job_id":"cbe61fc7-1fdf-44b2-92b5-6f1cff8294c3","html_url":"https://github.com/whoiskatrin/financial-statement-pdf-extractor","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/whoiskatrin%2Ffinancial-statement-pdf-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/whoiskatrin%2Ffinancial-statement-pdf-extractor/tags","releases_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/whoiskatrin%2Ffinancial-statement-pdf-extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/whoiskatrin%2Ffinancial-statement-pdf-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/whoiskatrin","download_url":"https://codeload.github.com/whoiskatrin/financial-statement-pdf-extractor/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247266475,"owners_count":20910831,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["balance-sheet","cash-flow","cash-flow-statement","data-processing","extract","financial-analysis","financial-statements","pdf","quarterly-reports"],"created_at":"2024-11-05T11:33:03.121Z","updated_at":"2025-04-04T23:31:11.434Z","avatar_url":"https://github.com/whoiskatrin.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PDF Financial Statement Extractor 📚🔍\n\nThis Python script extracts tables containing specific keywords, such as \"Revenue\" and \"Income,\" from a collection of PDF files in the specified input directory and saves the extracted tables as Excel files in the specified output directory.\n\n## Features ✨\n\n- Extract tables with specific keywords from PDF files\n- Parallel processing for faster extraction\n- Customizable regex pattern for keyword search\n- Error handling and logging for better traceability\n- Supports specifying input and output directories\n\n## Installation 🛠️\n\n### Dependencies\n\n- Python 
3.7 or higher\n- [pdfgrep](https://pdfgrep.org/) (system package)\n\n### Steps\n\n1. Clone the repository or download the script:\n\n```\ngit clone https://github.com/whoiskatrin/financial-statement-pdf-extractor.git\n```\n\n2. Install the Python dependencies using pip:\n\n```\npip install -r requirements.txt\n```\n\n3. Install the pdfgrep package using your system's package manager.\n\nFor Ubuntu:\n\n```\nsudo apt-get install pdfgrep\n```\n\nFor macOS:\n\n```\nbrew install pdfgrep\n```\n\n## Usage\n\nIn the command below, replace `input_directory` with the path to the directory containing the PDF files you want to process, and `output_directory` with the path to the directory where you want to save the extracted tables.\n\n### Optional Arguments\n\n- `-p`, `--processes`: Number of parallel processes (default: number of CPU cores)\n- `-r`, `--regex`: Custom regex pattern for searching specific keywords in PDF files (default: `'^(?s:(?=.*Revenue)|(?=.*Income))'`)\n\nFor example, to use a custom regex pattern and specify the number of parallel processes, run the script as follows:\n\n```\npython script.py -i input_directory -o output_directory -r 'your_custom_pattern' -p 4\n```\n\n## License 📄\n\nThis project is licensed under the MIT License. See the LICENSE file for details.\n\n## Contributing 🤝\n\nFeel free to open an issue or submit a pull request if you would like to contribute to the project or suggest improvements.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwhoiskatrin%2Ffinancial-statement-pdf-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwhoiskatrin%2Ffinancial-statement-pdf-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwhoiskatrin%2Ffinancial-statement-pdf-extractor/lists"}