https://github.com/eli64s/pypdf
Common Python PDF parsing utilities
- Host: GitHub
- URL: https://github.com/eli64s/pypdf
- Owner: eli64s
- License: apache-2.0
- Created: 2020-12-16T09:49:47.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2023-06-29T16:19:30.000Z (over 1 year ago)
- Last Synced: 2024-11-12T09:07:10.039Z (about 1 month ago)
- Topics: pdf, pdf-document, pdf-generation, pdf-python, pdfplumber, pdfreader, pypdf2, python, python-pdf, python-pdfkit
- Language: Python
- Homepage:
- Size: 308 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
README
# pypdf
Empower your PDFs with pypdf!
---
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Overview](#overview)
- [Features](#features)
- [Project Structure](#project-structure)
- [Modules](#modules)
- [Getting Started](#getting-started)
  - [Prerequisites](#prerequisites)
  - [Installation](#installation)
  - [Using pypdf](#using-pypdf)
  - [Running Tests](#running-tests)
- [Roadmap](#roadmap)
- [Contributing](#contributing)
- [License](#license)
- [Acknowledgments](#acknowledgments)

---
## Overview
The pypdf project provides a set of Python scripts for working with PDF documents. The scripts extract data using regular expressions, search for and replace specific values, and generate test PDFs containing random dates or invoices. A Makefile additionally applies formatting and linting to the codebase. The project aims to save time on routine PDF processing by automating these operations with small, easy-to-use scripts.

---
## Features
| Feature | Description |
|-----|-----|
| **Architecture** | The codebase follows a modular architecture with separate files for different functionalities, such as PDF parsing, searching, and creating. It also uses a configuration file to define the application's settings, enhancing flexibility and maintainability. |
| **Documentation** | The codebase lacks comprehensive documentation. While some functions and classes have inline comments, there is no overall documentation explaining the codebase's purpose, usage, or high-level architecture. Improved documentation would enhance understandability and ease of maintenance. |
| **Dependencies** | The codebase relies on several external libraries, such as pdfplumber, fitz, ReportLab, and PyPDF. These libraries provide powerful PDF processing features and save development effort. However, the codebase does not include a detailed explanation of their usage or the reasons behind their selection. |
| **Modularity** | The codebase demonstrates good modularity by separating functionality into different files. Each file handles a specific aspect of PDF processing, such as parsing, searching, or creating. However, there could be room for further modularization, such as extracting common utility functions into a shared module. |
| **Testing** | The codebase lacks comprehensive unit tests. While it includes some test files, their coverage is limited. Further testing, including unit tests for individual functions and integration tests for complete scenarios, would help ensure code correctness and maintainability. |
| **Performance** | It is difficult to assess performance without specific requirements or benchmarks. However, the codebase makes use of efficient libraries for PDF processing, such as pdfplumber and fitz, which are known for their performance. The codebase would benefit from performance profiling and optimization if performance issues arise. |
| **Security** | There are no specific security measures mentioned in the codebase. It is important to handle user input, particularly regular expressions and file paths, with caution to mitigate potential security vulnerabilities like path traversal or code injection attacks. |
| **Version Control** | The codebase is hosted on GitHub, utilizing the Git version control system. This enables collaboration among developers, code version management, and the ability to roll back changes if necessary. The repository contains multiple commits, indicating ongoing development and iterative improvements. |
| **Integrations** | There are no explicit integrations mentioned in the codebase. However, the codebase could be integrated with other systems or APIs to enhance functionality, such as fetching PDFs from external sources or integrating with document management systems. |
| **Scalability** | The codebase does not exhibit explicit scalability features, such as distributed processing or load balancing. However, its modular architecture allows for adding new functionality or extending existing features without significant code changes. It could benefit from scalability considerations if the application's requirements demand it in the future. |

---
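The Security row in the Features table names regular expressions and file paths as the inputs to handle with caution. One way such checks might look, using only the standard library, is sketched below. The helper names `resolve_inside` and `compile_user_pattern` and the `max_length` limit are assumptions made for illustration; they are not part of the repository.

```python
import re
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path relative to base_dir and reject paths that escape it."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    try:
        # relative_to raises ValueError when candidate is not located under base.
        candidate.relative_to(base)
    except ValueError:
        raise ValueError(f"path escapes {base}: {user_path!r}") from None
    return candidate


def compile_user_pattern(pattern: str, max_length: int = 200) -> "re.Pattern[str]":
    """Compile a user-supplied regex, rejecting overlong or malformed patterns."""
    if len(pattern) > max_length:
        raise ValueError("pattern too long")
    try:
        return re.compile(pattern)
    except re.error as exc:
        raise ValueError(f"invalid regular expression: {exc}") from None
```

The length cap only bounds the size of the pattern. It does not protect against patterns that backtrack catastrophically, so untrusted patterns may still need a timeout or a restricted syntax.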
## Project Structure
```bash
repo
├── Makefile
├── README.md
├── conf
│   └── conf.toml
├── docs
│   ├── example.pdf
│   ├── pdf_input.pdf
│   ├── pdf_updated.pdf
│   └── test_invoice.pdf
├── requirements.txt
├── scripts
│   └── clean.sh
└── src
    ├── conf.py
    ├── create_pdf_test_dates.py
    ├── create_pdf_test_invoice.py
    ├── pdf_parse_by_regex.py
    └── pdf_search_and_replace.py

5 directories, 14 files
```

---
## Modules
Root

| File | Summary | Module |
|:---|:---|:---|
| Makefile | The Makefile provides several targets. The `help` target displays a list of commands and their descriptions. The `style` target applies formatting and linting to the code using tools like autoflake, autopep8, black, flake8, isort, and yapf. The `clean` target calls the `style` target and then executes the clean.sh script to remove unnecessary files. The `conda` target creates a conda environment named `pypdf` with Python 3.9 and installs the dependencies specified in requirements.txt. The `venv` target creates a virtual environment named `pypdf`, activates it, and installs the dependencies specified in requirements.txt. | Makefile |

Scripts

| File | Summary | Module |
|:---|:---|:---|
| clean.sh | This code snippet is a bash script that performs various clean-up tasks. It removes backup files, Python cache files, cache directories, VS Code settings, build artifacts, pytest cache, benchmarks, and specific files. This script helps maintain a clean working environment by removing unnecessary files and folders. | scripts/clean.sh |

Src

| File | Summary | Module |
|:---|:---|:---|
| pdf_parse_by_regex.py | The provided code snippet extracts data from a PDF file using regular expressions. It takes in a PDF file, name pattern, and amount pattern as input, and returns a dictionary mapping names to their corresponding amounts. It uses the pdfplumber library to open the PDF file, and then applies the given patterns to extract the relevant data. Finally, it prints the parsed data in a formatted manner. | src/pdf_parse_by_regex.py |
| conf.py | This code snippet defines a configuration file for an application. It uses the `dataclasses` module to define three data classes: `PathsConfig` for paths configuration, `RegexConfig` for regex configuration, and `AppConfig` for overall application configuration. The `read_config_file` function reads the configuration file in TOML format and returns a populated `AppConfig` object. | src/conf.py |
| pdf_search_and_replace.py | The provided code is a Python script that searches for a specific value in a PDF document, identified by a regular expression pattern, and replaces it with a new value. It utilizes the `fitz` library to open and manipulate PDF files, specifically applying redactions to remove the old value and inserting the new value at a specific location on the PDF page. The script reads the configuration from a TOML file and performs the replacement on the specified input PDF, saving the modified PDF to the output path. | src/pdf_search_and_replace.py |
| create_pdf_test_dates.py | This code snippet generates a PDF document with random dates displayed on each page. It uses the ReportLab library to create the PDF and the datetime module to generate random dates. The add_random_dates_to_page() function is called twice to add dates to the first and second pages of the PDF. The resulting PDF is saved as "docs/example.pdf". | src/create_pdf_test_dates.py |
| create_pdf_test_invoice.py | The provided code snippet creates a test PDF document with a random invoice. It uses the PyPDF class, which is a subclass of the FPDF library's FPDF class. The PyPDF class includes methods for setting up the header and footer of the PDF document, generating the invoice content, and saving the PDF to the specified output path. The generated invoice includes random names and amounts, which are added to a table in the PDF document. | src/create_pdf_test_invoice.py |

---
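The summaries of `conf.py` and `pdf_parse_by_regex.py` in the Modules tables can be illustrated with a small standard-library sketch. The data class names come from the summary. Their fields, the helper names `build_config` and `extract_amounts`, the sample patterns, and the strategy of pairing names with amounts in order of appearance are all assumptions made for illustration. The sketch starts from an already-parsed TOML mapping and from plain text, whereas the repository is described as reading a TOML file and opening the PDF with pdfplumber.

```python
import re
from dataclasses import dataclass
from typing import Any, Dict, Mapping


@dataclass
class PathsConfig:
    input_pdf: str   # hypothetical field name
    output_pdf: str  # hypothetical field name


@dataclass
class RegexConfig:
    name_pattern: str    # hypothetical field name
    amount_pattern: str  # hypothetical field name


@dataclass
class AppConfig:
    paths: PathsConfig
    regex: RegexConfig


def build_config(raw: Mapping[str, Any]) -> AppConfig:
    """Populate AppConfig from a parsed TOML mapping with [paths] and [regex] tables."""
    return AppConfig(
        paths=PathsConfig(**raw["paths"]),
        regex=RegexConfig(**raw["regex"]),
    )


def extract_amounts(text: str, name_pattern: str, amount_pattern: str) -> Dict[str, str]:
    """Map names to amounts by pairing the matches in order of appearance."""
    names = re.findall(name_pattern, text)
    amounts = re.findall(amount_pattern, text)
    # zip stops at the shorter list, so unpaired trailing matches are dropped.
    return dict(zip(names, amounts))


# Illustrative values only; the real conf.toml contents are not shown in this README.
raw = {
    "paths": {"input_pdf": "docs/test_invoice.pdf", "output_pdf": "docs/pdf_updated.pdf"},
    "regex": {"name_pattern": r"Name: (\w+ \w+)", "amount_pattern": r"Amount: \$(\d+\.\d{2})"},
}
config = build_config(raw)

# Stand-in for text that would be extracted from a PDF page.
sample_text = "Name: Jane Doe\nAmount: $120.50\nName: John Smith\nAmount: $75.00\n"
parsed = extract_amounts(sample_text, config.regex.name_pattern, config.regex.amount_pattern)
for name, amount in parsed.items():
    print(f"{name:<20}{amount:>10}")
```

Keeping the patterns in configuration means the same extraction function can be pointed at a differently formatted document without code changes, which matches the role the Features table assigns to the configuration file.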
## Getting Started
### Prerequisites
Before you begin, ensure that you have the following prerequisites installed:
- [Python 3.7+](https://www.python.org/downloads/)
- [pdfplumber](https://github.com/jsvine/pdfplumber)

### Installation
1. Clone the pypdf repository:
```sh
git clone https://github.com/eli64s/pypdf
```
2. Change to the project directory:
```sh
cd pypdf
```
3. Install the dependencies:
```sh
pip install -r requirements.txt
```

### Using pypdf
```sh
python3 src/pdf_parse_by_regex.py
```

### Running Tests
```sh
pytest
```

---
## Roadmap
- [ ] Implement more PDF parsing functionalities.
- [ ] Add unit tests for each module.

---
## Contributing
[Contributing Guidelines](./CONTRIBUTING.md)
---
## License
[Apache-2.0](./LICENSE)
---
## Acknowledgments
- [pdfplumber](https://github.com/jsvine/pdfplumber)
---