https://github.com/faerque/pdf_scraper

PDF Scraper with Automation - A CLI tool for extracting text from PDFs and storing it in an SQLite database for structured querying. Supports digitally generated PDFs and enables efficient document processing.
https://github.com/faerque/pdf_scraper

automation cli-tool document-management document-management-system natural-language-processing pdf-processing sqlite text-extraction

Last synced: 10 months ago
JSON representation

Host: GitHub
URL: https://github.com/faerque/pdf_scraper
Owner: Faerque
Created: 2025-02-11T06:26:22.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-02-20T18:34:42.000Z (12 months ago)
Last Synced: 2025-02-20T19:37:40.721Z (12 months ago)
Topics: automation, cli-tool, document-management, document-management-system, natural-language-processing, pdf-processing, sqlite, text-extraction
Language: Python
Homepage:
Size: 544 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# PDF Scraper with Automation

## 📌 Project Overview

This project automates the extraction of text from PDFs in a given directory and stores the extracted content for future use. The extracted text is saved in an SQLite database for structured storage, enabling efficient querying and retrieval.

## 🎯 Features

- Extracts text from PDFs efficiently.
- Processes all PDFs in a directory automatically.
- Stores extracted text in an SQLite database for structured access.
- Provides a CLI-based execution for ease of use.
- Ensures modular and scalable code architecture.
- Implements **logging** to track processing steps and errors.

## 🛠️ Why PyMuPDF?

This project utilizes **PyMuPDF** (also known as Fitz) over other PDF libraries like **PDFMiner** or **PyPDF2** due to:

- **Speed & Efficiency:** PyMuPDF is significantly faster in extracting text from PDFs.
- **Accuracy:** It retains the document structure better compared to other parsers.
- **Lightweight:** Consumes less memory and provides efficient text extraction.
- **Support for Complex PDFs:** Handles embedded fonts and complex document layouts effectively.

## 📄 Supported PDF Types

This tool is best suited for extracting text from:

- **Digitally Generated PDFs:** PDFs created directly from software like Microsoft Word, LaTeX, or InDesign.
- **Machine-Readable PDFs:** Documents where the text layer is selectable and extractable.

### ❌ Limitations

- **Scanned PDFs & Image-Based PDFs:** Lacks built-in OCR functionality; cannot extract text from scanned images without external OCR tools (e.g., Tesseract or Adobe OCR).
- **Encrypted or Restricted PDFs:** May not extract text from protected PDFs unless permissions allow it.
- **Poorly Formatted PDFs:** May struggle with extracting correctly structured text from heavily formatted PDFs with complex layouts.

## 📑 Logging System

The project includes a **logging system** to track operations in real-time and store them in `scraper.log`.

### 📌 Why Logging?

- ✅ Tracks each step of execution (PDF scanning, extraction, database storage).
- ✅ Records errors and warnings for debugging.
- ✅ Provides timestamps for process tracking.

### 📄 Logging Implementation

- **Log File**: All logs are stored in `scraper.log`.
- **Logging Levels**:
- `INFO` → Tracks normal operations.
- `WARNING` → Logs non-critical issues (e.g., duplicate PDFs).
- `ERROR` → Captures failures (e.g., file read errors).

## 🤖 Why PDF Text Extraction is Important?

PDF text extraction is crucial for:

- **Data Mining & Research:** Extracting insights from large volumes of documents.
- **Automated Report Analysis:** Processing business reports, invoices, and financial statements.
- **Natural Language Processing (NLP):** Analyzing and processing text for sentiment analysis, keyword extraction, and entity recognition.
- **Searchable Document Archives:** Converting unstructured PDF content into structured databases for easy retrieval and indexing.

## 📂 Project Structure

```
📁 pdf_scraper
│── extractor.py # Extracts text from PDFs
│── processor.py # Scans directory and processes PDFs
│── database.py # Handles SQLite database interactions
│── logger.py # Manages logging system
│── main.py # CLI entry point for execution
│── requirements.txt # Dependencies
│── README.md # Project documentation
```

## 🚀 Installation & Usage

### 1️⃣ Setup Environment

```sh
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
pip install -r requirements.txt
```

### 2️⃣ Run the Script

```sh
python main.py --directory /path/to/pdfs
```

### 3️⃣ Query Extracted Text (Example SQLite Query)

```sql
SELECT * FROM pdf_text WHERE filename = 'example.pdf';
```

---

This project provides a scalable and efficient solution for automated PDF text extraction and storage, enabling powerful document processing capabilities.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/faerque/pdf_scraper

Awesome Lists containing this project

README