{"id":49393250,"url":"https://github.com/joshuaevan/page-forge","last_synced_at":"2026-04-28T14:35:11.375Z","repository":{"id":343378090,"uuid":"1177431354","full_name":"joshuaevan/page-forge","owner":"joshuaevan","description":"Convert multi-page PDFs into clean, OCR-ready images — CLI, watch folder, web UI, and optional Google Drive sync.","archived":false,"fork":false,"pushed_at":"2026-03-10T05:22:54.000Z","size":49,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-10T11:34:02.673Z","etag":null,"topics":["cli","docker","document-processing","fastapi","google-drive","image-processing","ocr","ocr-preprocessing","pdf","pdf-converter","pdf-to-image","pillow","pymupdf","python","self-hosted"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/joshuaevan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-10T02:42:27.000Z","updated_at":"2026-03-10T04:10:56.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/joshuaevan/page-forge","commit_stats":null,"previous_names":["joshuaevan/page-forge"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/joshuaevan/page-forge","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joshuaevan%2Fpage-forge","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joshu
aevan%2Fpage-forge/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joshuaevan%2Fpage-forge/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joshuaevan%2Fpage-forge/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/joshuaevan","download_url":"https://codeload.github.com/joshuaevan/page-forge/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joshuaevan%2Fpage-forge/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32385308,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-28T14:34:11.604Z","status":"ssl_error","status_checked_at":"2026-04-28T14:32:37.009Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","docker","document-processing","fastapi","google-drive","image-processing","ocr","ocr-preprocessing","pdf","pdf-converter","pdf-to-image","pillow","pymupdf","python","self-hosted"],"created_at":"2026-04-28T14:35:10.375Z","updated_at":"2026-04-28T14:35:11.368Z","avatar_url":"https://github.com/joshuaevan.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PageForge\n\n\u003e Break multi-page PDFs into clean, OCR-ready images — from the command line or a self-hosted web UI.\n\n![Python 
3.12+](https://img.shields.io/badge/python-3.12%2B-blue)\n![Docker](https://img.shields.io/badge/docker-ready-2496ED)\n![License MIT](https://img.shields.io/badge/license-MIT-green)\n\n---\n\n## Features\n\n- **CLI-first** — run `python main.py document.pdf` with no server required\n- **Watch folder** — drop PDFs into an inbox directory; PageForge processes them automatically on each sync cycle\n- **Web UI** with drag-and-drop upload, file browser, coverage calendar, and per-file ZIP download\n- **Google Drive optional** — connect a Drive folder to pull PDFs automatically (requires a GCP service account)\n- **High-contrast B\u0026W output** — autocontrast, double sharpen, and binarize for maximum OCR accuracy\n- **300 DPI by default** — configurable from 72 to 600 DPI\n- **Configurable** — all settings adjustable via environment variables, a JSON config file, or the web Settings tab\n- **Self-hosted** — runs entirely on your machine; no external services required unless you enable Drive sync\n\n---\n\n## Quick Start\n\n### CLI\n\n```bash\n# Single file — output next to the source\npython main.py document.pdf\n\n# Glob — batch convert\npython main.py *.pdf\n\n# Custom output directory\npython main.py scan.pdf --output /path/to/images/\n\n# JPEG output with quality control (much smaller files)\npython main.py scan.pdf --format jpeg --quality 85\n\n# Grayscale — no binarization, good for photos or mixed documents\npython main.py scan.pdf --mode grayscale\n\n# Full color output\npython main.py scan.pdf --mode color --format jpeg\n\n# Override DPI and threshold inline\npython main.py scan.pdf --dpi 150 --threshold 180\n```\n\nOutput images are written to `{stem}_pages/` by default, or the directory specified with `--output`.\n\n### Docker\n\n```bash\ndocker run --rm -v $(pwd)/data:/data -p 8000:8000 pageforge\n```\n\nPlace PDFs in `./data/inbox/` and open `http://localhost:8000` in your browser.\n\n### Docker Compose\n\n```bash\ncp docker-compose.example.yml 
docker-compose.yml\ndocker compose up -d\n```\n\nEdit `docker-compose.yml` to uncomment and set environment variables as needed.\n\n---\n\n## How It Works\n\nEach PDF page passes through a five-step image processing pipeline:\n\n1. **Grayscale render** — PyMuPDF renders the page at the configured DPI directly into a grayscale pixel buffer, skipping color conversion overhead.\n2. **Autocontrast** — Pillow's `ImageOps.autocontrast` stretches the histogram with a 2% cutoff, compensating for yellowed paper or uneven scan exposure.\n3. **Double sharpen** — Two passes of `ImageFilter.SHARPEN` enhance edge definition so character strokes are crisp.\n4. **Binarize** — A point transform converts every pixel below the threshold to pure black and every pixel at or above it to pure white, producing a true 1-bit-style image in an 8-bit container.\n5. **PNG optimize** — The result is saved with `optimize=True`, letting the PNG encoder find the smallest lossless encoding of the high-contrast data.\n\nThe output is a sequence of small, high-contrast PNG files that OCR engines and AI vision models handle reliably.\n\n---\n\n## Configuration\n\nAll settings are optional. 
PageForge works with zero configuration.\n\n| Name | Default | Description |\n|---|---|---|\n| `PAGEFORGE_INBOX` | `/data/inbox` | Local folder watched for incoming PDFs |\n| `PAGEFORGE_OUTPUT` | `/data/output` | Root directory where page images are saved |\n| `PAGEFORGE_DPI` | `300` | Render resolution; higher values increase file size and quality |\n| `PAGEFORGE_THRESHOLD` | `160` | Binarize cutoff (0–255); higher values turn more pixels black |\n| `PAGEFORGE_MODE` | `bw` | `bw` (binarized, best for OCR), `grayscale` (no binarize), or `color` (full color) |\n| `PAGEFORGE_FORMAT` | `png` | Output format: `png` (lossless) or `jpeg` (smaller files) |\n| `PAGEFORGE_JPEG_QUALITY` | `85` | JPEG quality 1–95; only applies when format is `jpeg` |\n| `PAGEFORGE_ENABLE_UPLOAD` | `true` | Set to `false` to disable the web upload endpoint |\n| `SYNC_INTERVAL_MINUTES` | `30` | How often (in minutes) to poll the inbox and Drive folder |\n| `DRIVE_FOLDER_ID` | _(empty)_ | Google Drive folder ID; leave blank to disable Drive sync |\n| `PAGEFORGE_CONFIG` | `/data/config.json` | Path to the JSON config file (overrides defaults, overridden by env vars) |\n\nSettings can also be changed at runtime via the web Settings tab, which writes to `config.json`.\n\n---\n\n## Google Drive Setup\n\n\u003e **You supply your own credentials.** PageForge includes no Google auth of any kind. You create a Google Cloud project, generate a service account key, and share whichever Drive folder you want with that account. Your credentials stay on your machine.\n\nSee **[docs/google-drive-setup.md](docs/google-drive-setup.md)** for the full step-by-step walkthrough, including screenshot guidance, troubleshooting, and how to disable Drive sync without removing the key file.\n\n**Short version:**\n\n1. Create a GCP project and enable the **Google Drive API**.\n2. Create a **Service Account** and download its JSON key.\n3. 
Place the key at `./data/credentials.json` (mounted as `/data/credentials.json` in the container).\n4. Share your Drive folder with the service account email as **Editor**.\n5. Set `DRIVE_FOLDER_ID` in your environment or via the web Settings tab.\n\nPageForge will poll the folder on each sync cycle, download new PDFs, convert them, and delete the originals from Drive.\n\n---\n\n## Web UI\n\nThe web interface is available at `http://localhost:8000` when running as a server.\n\n- **Files tab** — shows summary statistics (total files, pages, days, years), an interactive monthly coverage calendar, and a filterable table of all processed documents. Each row has a ZIP download button.\n- **Upload tab** — drag-and-drop or click-to-browse PDF upload. Files are processed in the background; the UI polls for completion and updates the status in real time. Hidden when `PAGEFORGE_ENABLE_UPLOAD` is `false`.\n- **Settings tab** — live configuration editor for all settings. Changes are saved to `config.json` and take effect immediately without restarting the server.\n\n---\n\n## Output\n\nFor a PDF named `2024-03-15_invoice.pdf`, PageForge creates:\n\n```\n2024-03-15_invoice_pages/\n  2024-03-15_invoice_p001.png\n  2024-03-15_invoice_p002.png\n  2024-03-15_invoice_p003.png\n  ...\n```\n\n- Images are named `{stem}_p{page:03d}.png` (zero-padded to three digits).\n- Each directory is self-contained and can be zipped for download via the API or web UI.\n- When files are named with a `YYYY-MM-DD` prefix, the web UI groups and filters them by date, month, and year.\n\n---\n\n## Use Cases\n\n- **OCR pipelines** — feed the output PNGs directly into Tesseract, EasyOCR, or a cloud OCR API\n- **AI document ingestion** — send high-contrast page images to vision-capable language models for extraction or summarization\n- **Digitizing paper records** — scan documents to PDF, drop them in the inbox, and get clean per-page images automatically\n- **Batch archival** — process hundreds of PDFs 
overnight using the CLI glob mode or the watch-folder scheduler\n\n---\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoshuaevan%2Fpage-forge","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjoshuaevan%2Fpage-forge","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoshuaevan%2Fpage-forge/lists"}