{"id":25403107,"url":"https://github.com/faerque/pdf_scraper","last_synced_at":"2026-04-29T01:32:52.827Z","repository":{"id":276926702,"uuid":"930748388","full_name":"Faerque/PDF_scraper","owner":"Faerque","description":"PDF Scraper with Automation - A CLI tool for extracting text from PDFs and storing it in an SQLite database for structured querying. Supports digitally generated PDFs and enables efficient document processing.","archived":false,"fork":false,"pushed_at":"2025-02-20T18:34:42.000Z","size":557,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-20T19:37:40.721Z","etag":null,"topics":["automation","cli-tool","document-management","document-management-system","natural-language-processing","pdf-processing","sqlite","text-extraction"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Faerque.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-11T06:26:22.000Z","updated_at":"2025-02-20T18:34:46.000Z","dependencies_parsed_at":"2025-02-11T07:42:13.671Z","dependency_job_id":null,"html_url":"https://github.com/Faerque/PDF_scraper","commit_stats":null,"previous_names":["faerque/pdf_scraper"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Faerque%2FPDF_scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Faerque%2FPDF_scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/repositories/Faerque%2FPDF_scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Faerque%2FPDF_scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Faerque","download_url":"https://codeload.github.com/Faerque/PDF_scraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248574894,"owners_count":21127085,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automation","cli-tool","document-management","document-management-system","natural-language-processing","pdf-processing","sqlite","text-extraction"],"created_at":"2025-02-16T02:27:55.479Z","updated_at":"2026-04-29T01:32:52.782Z","avatar_url":"https://github.com/Faerque.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PDF Scraper with Automation\n\n## 📌 Project Overview\n\nThis project automates the extraction of text from PDFs in a given directory and stores the extracted content for future use. 
The extracted text is saved in an SQLite database for structured storage, enabling efficient querying and retrieval.\n\n## 🎯 Features\n\n- Extracts text from PDFs efficiently.\n- Processes all PDFs in a directory automatically.\n- Stores extracted text in an SQLite database for structured access.\n- Provides a CLI-based execution for ease of use.\n- Ensures modular and scalable code architecture.\n- Implements **logging** to track processing steps and errors.\n\n## 🛠️ Why PyMuPDF?\n\nThis project utilizes **PyMuPDF** (also known as Fitz) over other PDF libraries like **PDFMiner** or **PyPDF2** due to:\n\n- **Speed \u0026 Efficiency:** PyMuPDF is significantly faster in extracting text from PDFs.\n- **Accuracy:** It retains the document structure better compared to other parsers.\n- **Lightweight:** Consumes less memory and provides efficient text extraction.\n- **Support for Complex PDFs:** Handles embedded fonts and complex document layouts effectively.\n\n## 📄 Supported PDF Types\n\nThis tool is best suited for extracting text from:\n\n- **Digitally Generated PDFs:** PDFs created directly from software like Microsoft Word, LaTeX, or InDesign.\n- **Machine-Readable PDFs:** Documents where the text layer is selectable and extractable.\n\n### ❌ Limitations\n\n- **Scanned PDFs \u0026 Image-Based PDFs:** Lacks built-in OCR functionality; cannot extract text from scanned images without external OCR tools (e.g., Tesseract or Adobe OCR).\n- **Encrypted or Restricted PDFs:** May not extract text from protected PDFs unless permissions allow it.\n- **Poorly Formatted PDFs:** May struggle with extracting correctly structured text from heavily formatted PDFs with complex layouts.\n\n## 📑 Logging System\n\nThe project includes a **logging system** to track operations in real-time and store them in `scraper.log`.\n\n### 📌 Why Logging?\n\n- ✅ Tracks each step of execution (PDF scanning, extraction, database storage).\n- ✅ Records errors and warnings for debugging.\n- ✅ Provides 
timestamps for process tracking.\n\n### 📄 Logging Implementation\n\n- **Log File**: All logs are stored in `scraper.log`.\n- **Logging Levels**:\n  - `INFO` → Tracks normal operations.\n  - `WARNING` → Logs non-critical issues (e.g., duplicate PDFs).\n  - `ERROR` → Captures failures (e.g., file read errors).\n\n## 🤖 Why Is PDF Text Extraction Important?\n\nPDF text extraction is crucial for:\n\n- **Data Mining \u0026 Research:** Extracting insights from large volumes of documents.\n- **Automated Report Analysis:** Processing business reports, invoices, and financial statements.\n- **Natural Language Processing (NLP):** Analyzing and processing text for sentiment analysis, keyword extraction, and entity recognition.\n- **Searchable Document Archives:** Converting unstructured PDF content into structured databases for easy retrieval and indexing.\n\n## 📂 Project Structure\n\n```\n📁 pdf_scraper\n│── extractor.py       # Extracts text from PDFs\n│── processor.py       # Scans directory and processes PDFs\n│── database.py        # Handles SQLite database interactions\n│── logger.py          # Manages logging system\n│── main.py            # CLI entry point for execution\n│── requirements.txt   # Dependencies\n│── README.md          # Project documentation\n```\n\n## 🚀 Installation \u0026 Usage\n\n### 1️⃣ Setup Environment\n\n```sh\npython -m venv venv\nsource venv/bin/activate  # On Windows use `venv\\Scripts\\activate`\npip install -r requirements.txt\n```\n\n### 2️⃣ Run the Script\n\n```sh\npython main.py --directory /path/to/pdfs\n```\n\n### 3️⃣ Query Extracted Text (Example SQLite Query)\n\n```sql\nSELECT * FROM pdf_text WHERE filename = 'example.pdf';\n```\n\n---\n\nThis project provides a scalable and efficient solution for automated PDF text extraction and storage, enabling powerful document processing 
capabilities.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffaerque%2Fpdf_scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffaerque%2Fpdf_scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffaerque%2Fpdf_scraper/lists"}