Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/whoiskatrin/financial-statement-pdf-extractor

Python script to extract as much structured information as possible from annual/quarterly reports.
Host: GitHub
URL: https://github.com/whoiskatrin/financial-statement-pdf-extractor
Owner: whoiskatrin
Created: 2020-03-30T18:05:55.000Z (almost 5 years ago)
Default Branch: master
Last Pushed: 2023-06-15T17:10:37.000Z (over 1 year ago)
Last Synced: 2023-06-15T18:37:31.597Z (over 1 year ago)
Topics: balance-sheet, cash-flow, cash-flow-statement, data-processing, extract, financial-analysis, financial-statements, pdf, quarterly-reports
Language: Python
Homepage:
Size: 16.6 KB
Stars: 50
Watchers: 5
Forks: 14
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# PDF Financial Statement Extractor 📚🔍

This Python script extracts tables containing specific keywords, such as "Revenue" and "Income," from a collection of PDF files in the specified input directory and saves the extracted tables as Excel files in the specified output directory.

## Features ✨

- Extract tables with specific keywords from PDF files

- Parallel processing for faster extraction

- Customizable regex pattern for keyword search

- Error handling and logging for better traceability

- Supports specifying input and output directories
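The parallel-processing feature can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the repository's code: `plan_jobs`, `process_pdf`, and `run_parallel` are hypothetical names, and the worker is a placeholder that only reports which file it would handle instead of extracting tables.

```python
from multiprocessing import Pool
from pathlib import Path


def plan_jobs(input_dir, output_dir):
    """Pair every PDF in input_dir with an .xlsx path in output_dir."""
    in_dir, out_dir = Path(input_dir), Path(output_dir)
    return [
        (pdf, out_dir / (pdf.stem + ".xlsx"))
        for pdf in sorted(in_dir.glob("*.pdf"))
    ]


def process_pdf(job):
    """Placeholder worker (assumption): real table extraction would go here."""
    pdf_path, xlsx_path = job
    return f"{pdf_path.name} -> {xlsx_path.name}"


def run_parallel(jobs, processes=None):
    """Run process_pdf over all jobs; processes=None uses every CPU core."""
    with Pool(processes=processes) as pool:
        return pool.map(process_pdf, jobs)
```

Call `run_parallel(plan_jobs("input_directory", "output_directory"), processes=4)` from inside an `if __name__ == "__main__":` guard, which `multiprocessing` requires on platforms that spawn worker processes.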

## Installation 🛠️

### Dependencies

- Python 3.7 or higher

- [pdfgrep](https://pdfgrep.org/) (system package)

### Steps

1. Clone the repository or download the script:

```
git clone https://github.com/whoiskatrin/financial-statement-pdf-extractor.git
```

2. Install the Python dependencies using pip:

```
pip install -r requirements.txt
```

3. Install the pdfgrep package using your system's package manager.

For Ubuntu:

```
sudo apt-get install pdfgrep
```

For macOS:

```
brew install pdfgrep
```
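To confirm that pdfgrep is reachable before running the script, a quick check along these lines may help. It is only a sketch: `pdfgrep_path` is a hypothetical helper, and the script itself may or may not perform a similar check.

```python
import shutil


def pdfgrep_path():
    """Return the full path of the pdfgrep executable, or None if it is not on PATH."""
    return shutil.which("pdfgrep")


location = pdfgrep_path()
if location is None:
    print("pdfgrep not found; install it with your system's package manager")
else:
    print("pdfgrep found at", location)
```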

## Usage

Run the script with `-i input_directory -o output_directory`. Replace `input_directory` with the path to the directory containing the PDF files you want to process, and `output_directory` with the path to the directory where you want to save the extracted tables.

### Optional Arguments

- `-p`, `--processes`: Number of parallel processes (default: number of CPU cores)

- `-r`, `--regex`: Custom regex pattern for searching specific keywords in PDF files (default: `'^(?s:(?=.*Revenue)|(?=.*Income))'`)
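The default pattern is a pair of lookaheads: it matches at the start of the text whenever "Revenue" or "Income" appears anywhere after it, with `(?s:...)` letting `.` cross line breaks. The sketch below tries it with Python's `re` module purely for illustration. `mentions_keyword` is a hypothetical helper, and the script may hand the pattern to pdfgrep instead, whose regex dialect can differ.

```python
import re

# Default pattern quoted above; (?s:...) lets "." span newlines.
DEFAULT_PATTERN = r"^(?s:(?=.*Revenue)|(?=.*Income))"


def mentions_keyword(text, pattern=DEFAULT_PATTERN):
    """Return True if re.search finds a match for pattern in text."""
    return re.search(pattern, text) is not None


print(mentions_keyword("Consolidated statement\nTotal Revenue 1,200"))  # True
print(mentions_keyword("Notes to the accounts"))  # False
print(mentions_keyword("total revenue 1,200"))  # False: the match is case-sensitive
```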

For example, to use a custom regex pattern and specify the number of parallel processes, run the script as follows:

```
python script.py -i input_directory -o output_directory -r 'your_custom_pattern' -p 4
```
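A command line like the one above could be parsed with `argparse` roughly as follows. This is a sketch under assumptions rather than the script's actual parser: the long option names `--input` and `--output` are hypothetical (only `-i` and `-o` appear in this README), and `build_parser` is an illustrative name.

```python
import argparse
import os


def build_parser():
    """Build a parser mirroring the options described in this README."""
    parser = argparse.ArgumentParser(description="Extract keyword tables from PDF files.")
    # --input / --output long names are assumptions; the README shows only -i / -o.
    parser.add_argument("-i", "--input", required=True, help="directory containing PDF files")
    parser.add_argument("-o", "--output", required=True, help="directory for extracted tables")
    parser.add_argument("-p", "--processes", type=int, default=os.cpu_count(),
                        help="number of parallel processes")
    parser.add_argument("-r", "--regex", default=r"^(?s:(?=.*Revenue)|(?=.*Income))",
                        help="regex pattern for keyword search")
    return parser


args = build_parser().parse_args(
    ["-i", "input_directory", "-o", "output_directory", "-r", "Revenue", "-p", "4"]
)
print(args.input, args.output, args.regex, args.processes)
```

Omitting `-p` and `-r` falls back to the documented defaults: the CPU core count and the Revenue/Income pattern.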

## License 📄

This project is licensed under the MIT License. See the LICENSE file for details.

## Contributing 🤝

Please feel free to open an issue or submit a pull request if you would like to contribute to the project or have any suggestions for improvements.