https://github.com/sanafagal/pdf-processor-for-eps-files
A tool designed to process and rename PDF files based on specific EPS configurations, utilizing exact and fuzzy matching techniques to identify file types efficiently.
https://github.com/sanafagal/pdf-processor-for-eps-files
eps file-rename fuzzy-matching ocrmypdf pdf-processing pypdf2 text-extraction
Last synced: about 2 months ago
JSON representation
A tool designed to process and rename PDF files based on specific EPS configurations, utilizing exact and fuzzy matching techniques to identify file types efficiently.
- Host: GitHub
- URL: https://github.com/sanafagal/pdf-processor-for-eps-files
- Owner: SanAfaGal
- Created: 2024-12-04T23:15:46.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-01-03T19:42:59.000Z (5 months ago)
- Last Synced: 2025-04-03T09:15:00.102Z (about 2 months ago)
- Topics: eps, file-rename, fuzzy-matching, ocrmypdf, pdf-processing, pypdf2, text-extraction
- Language: Python
- Homepage:
- Size: 46.9 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# PDF Processor for EPS Files
A Python-based application designed to process and rename PDF files based on EPS-specific configurations. The tool leverages exact and fuzzy matching techniques to identify file types, apply OCR for text extraction, and supports splitting and combining PDFs.
## Features 🚀
- **Automatic PDF Renaming**: Renames files using EPS-specific naming conventions.
- **Text Extraction**: Extracts text from PDFs using PyPDF2 and OCR.
- **File Type Identification**: Uses exact and fuzzy matching for robust keyword detection.
- **PDF Splitting and Combining**: Handles multi-page PDFs by splitting and recombining files as needed.## Technologies 🛠️
- **Python 3.x**
- **PyPDF2**: For PDF text extraction and manipulation.
- **ocrmypdf**: Adds OCR to non-searchable PDFs.
- **rapidfuzz**: Implements fuzzy matching for text processing.
- **logging**: For structured application logging.## Prerequisites 🔧
### 1. Install Tesseract
Download and install Tesseract OCR from [UB-Mannheim's Tesseract repository](https://github.com/UB-Mannheim/tesseract/wiki).
After installation, add the following to your system's **Environment Variables** (PATH):
```
C:\Program Files\Tesseract-OCR
```### 2. Install Ghostscript
Download and install Ghostscript from the [official Ghostscript website](https://ghostscript.com/releases/gsdnld.html).
After installation, add the `bin` directory to your system's **Environment Variables** (PATH):
```
C:\Program Files\gs\gs10.04.0\bin
```### 3. Verify Installations
Run the following commands in a terminal to verify installations:
```bash
tesseract --version
gswin64c --version
```## Installation 📦
1. **Clone the repository**:
```bash
git clone https://github.com/your-username/your-repo.git
cd your-repo
```2. **Set up a virtual environment** (optional but recommended):
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```3. **Install dependencies**:
```bash
pip install -r requirements.txt
```## Usage 📋
### CLI Arguments
- `eps_name`: The EPS name to process the files for (e.g., "NUEVA EPS").
- `input_path`: Path to the folder or file to process.### Example Command
```bash
python main.py "NUEVA EPS" "/path/to/input/folder"
```## EPS Configuration 🏥
The application supports multiple EPS configurations, such as:
- **NUEVA EPS**
- **SALUD TOTAL**Each configuration includes:
- Keywords for file type identification.
- Customizable naming conventions.## Logging 📄
Logs are generated in the following files:
- `info.log`: General process information.
- `error.log`: Errors encountered during processing.