{"id":49070118,"url":"https://github.com/chen3feng/scan2pdf","last_synced_at":"2026-04-20T07:04:09.663Z","repository":{"id":350279379,"uuid":"1206017249","full_name":"chen3feng/scan2pdf","owner":"chen3feng","description":null,"archived":false,"fork":false,"pushed_at":"2026-04-09T17:59:12.000Z","size":85,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-04-09T18:12:56.078Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chen3feng.png","metadata":{"files":{"readme":"README-zh.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-09T13:49:58.000Z","updated_at":"2026-04-09T17:59:13.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/chen3feng/scan2pdf","commit_stats":null,"previous_names":["chen3feng/scan2pdf"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/chen3feng/scan2pdf","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chen3feng%2Fscan2pdf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chen3feng%2Fscan2pdf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chen3feng%2Fscan2pdf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chen3feng%2Fscan2pdf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owner
s/chen3feng","download_url":"https://codeload.github.com/chen3feng/scan2pdf/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chen3feng%2Fscan2pdf/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32036803,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-20T00:18:06.643Z","status":"online","status_checked_at":"2026-04-20T02:00:06.527Z","response_time":94,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-20T07:04:08.544Z","updated_at":"2026-04-20T07:04:09.655Z","avatar_url":"https://github.com/chen3feng.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[English](README.md) | 简体中文\n\n# scan2pdf\n\nConverts scanned PDF books into compact text PDFs using OCR.\n\nGiven a large scanned (image-based) PDF, scan2pdf runs OCR on every page and outputs a lightweight text PDF with clean, readable typesetting, typically achieving a **size reduction of more than 99%**.\n\n## Features\n\n- **OCR-driven conversion**: extracts text from scanned pages with Tesseract\n- **Font-size-aware rendering**: detects heading and body font sizes from the OCR data and preserves their relative proportions\n- **One-page guarantee**: each scanned page maps to exactly one output page (the font size is reduced automatically when content overflows)\n- **Even vertical distribution**: text is spread evenly over the full page height instead of piling up at the top\n- **Cover page preservation**: cover pages are kept as compressed images\n- **Parallel processing**: multi-threaded OCR for faster conversion\n- **Printer-style page selection**: select the pages to process with syntax such as `1,3-5,10-20`\n\n## scan2pdf vs ocrmypdf\n\nBoth use Tesseract OCR, but their **goals are entirely different**:\n\n| | **ocrmypdf** | **scan2pdf** |\n|---|---|---|\n| **Goal** | Adds a hidden text layer to a scanned PDF (searchable, copyable) | **Regenerates** the scanned PDF as a text-only PDF |\n| **Output** | Original images plus a transparent text overlay | Text-only typesetting; the original images are discarded |\n| **Size** | Close to the original file size (the images are kept) | **Size reduction of more than 99%** (only the text is kept) |\n| **Appearance** 
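| Looks identical to the original | Re-typeset text pages |\n\n**What makes scan2pdf different:**\n\n- **Extreme compression**: a 190 MB scanned book can shrink to a few hundred KB\n- **Smart layout reconstruction**: infers font sizes from hOCR bounding boxes, distinguishes headings from body text, and distributes text evenly to fill the page\n- **Strict one-to-one page mapping**: the content of each scanned page appears only on its corresponding output page; the font size is reduced automatically on overflow (minimum 6pt)\n- **Special handling of cover pages**: cover pages are kept as compressed images, while text pages are converted to plain text\n- **OCR text-cleaning pipeline**: repairs OCR artifacts, merges broken lines, and filters out headers and footers to produce clean, readable text\n\n**When to choose which:**\n\n| Scenario | Recommendation |\n|------|------|\n| You need to keep the original scanned appearance and only want to search or copy text | **ocrmypdf** |\n| You need extreme compression for reading on a phone or Kindle | **scan2pdf** |\n| You are archiving scanned documents and must preserve their legal validity | **ocrmypdf** |\n| You have many scanned novels or textbooks and only care about the text | **scan2pdf** |\n\n\u003e **In one sentence**: ocrmypdf \"adds invisible subtitles\" to a scan, while scan2pdf \"translates\" the scan \"into an e-book\".\n\n## Prerequisites\n\n- **Python** ≥ 3.10\n- **uv**: [installation guide](https://docs.astral.sh/uv/getting-started/installation/) (`curl -LsSf https://astral.sh/uv/install.sh | sh`)\n- **Tesseract OCR**: [installation guide](https://github.com/tesseract-ocr/tesseract)\n- **Poppler** (provides `pdftoppm`): [Windows](https://github.com/oschwartz10612/poppler-windows/releases), [macOS](https://formulae.brew.sh/formula/poppler) (`brew install poppler`), [Linux](https://poppler.freedesktop.org/) (`apt install poppler-utils`)\n\n## Installation\n\n```bash\n# Clone the repository\ngit clone \u003crepo-url\u003e\ncd scan2pdf\n\n# Install dependencies (including development tools)\nuv sync\n\n# Or install with the optional fast-rendering dependencies\nuv sync --extra fast\n```\n\n## Usage\n\n### Basic usage\n\n```bash\n# Convert an entire book (output: book-text.pdf)\nuv run scan2pdf book.pdf\n\n# Specify the output file\nuv run scan2pdf book.pdf -o output.pdf\n```\n\n### Quick test\n\n```bash\n# Convert only pages 3 through 10\nuv run scan2pdf book.pdf -n 3-10\n\n# Convert specific pages\nuv run scan2pdf book.pdf -n 1,5,10-20\n```\n\n### Advanced options\n\n```bash\n# Customize cover pages, languages, and worker count\nuv run scan2pdf book.pdf --image-pages 1-3 --lang eng+chi_sim --workers 8\n\n# Lower the DPI for faster processing\nuv run scan2pdf book.pdf --dpi 200\n\n# Verbose output\nuv run scan2pdf book.pdf -v      # INFO level\nuv run scan2pdf book.pdf -vv     # DEBUG level\n\n# Keep temporary files for debugging\nuv run scan2pdf book.pdf --keep-temp\n```\n\n\u003e **Note:** The `--lang` option accepts multiple languages separated by `+` (e.g. `eng+chi_sim`).\n\u003e You may need to install additional Tesseract language packs; for the list of available languages, see the\n\u003e [Tesseract language data](https://ocrmypdf.readthedocs.io/en/latest/languages.html) page.\n\u003e On macOS: `brew 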
install tesseract-lang`; on Ubuntu/Debian: `apt install tesseract-ocr-\u003clang\u003e`.\n\n### All options\n\n| Option | Default | Description |\n|------|--------|------|\n| `input` | *(required)* | Input scanned PDF file |\n| `-o, --output` | `\u003cinput\u003e-text.pdf` | Output PDF file |\n| `-n, --pages` | all | Page range (e.g. `1-10`, `1,3,5-20`) |\n| `--image-pages` | *(none)* | Pages to treat as image pages (e.g. `1`, `1-3`, `1,2,5-10`) |\n| `--lang` | `eng` | OCR language |\n| `--dpi` | `300` | Page rendering DPI |\n| `--image-quality` | `60` | JPEG quality for image pages |\n| `--image-max-width` | `1200` | Maximum width of image pages (pixels) |\n| `--workers` | `4` | Number of parallel OCR worker threads |\n| `--tesseract` | `tesseract` | Path to the Tesseract executable |\n| `--keep-temp` | off | Keep temporary files for debugging |\n| `-v, --verbose` | off | Increase output verbosity (`-v` INFO, `-vv` DEBUG) |\n\n## Development and testing\n\n### Environment setup\n\n```bash\n# Clone and install all dependencies (including development tools)\ngit clone https://github.com/chen3feng/scan2pdf.git\ncd scan2pdf\nuv sync\n```\n\n### Code style\n\nThis project uses [Ruff](https://docs.astral.sh/ruff/) for linting and formatting:\n\n```bash\n# Check for lint errors\nuv run ruff check .\n\n# Auto-fix lint errors\nuv run ruff check . --fix\n\n# Check code formatting\nuv run ruff format --check .\n\n# Auto-format the code\nuv run ruff format .\n```\n\n### Running tests\n\n```bash\n# Run all tests\nuv run pytest tests/ -v\n\n# Run a specific test file\nuv run pytest tests/test_text_cleaner.py -v\n\n# Short traceback output\nuv run pytest tests/ --tb=short\n```\n\n### Continuous integration\n\nEvery push to the `master` branch and every pull request automatically triggers [GitHub Actions](.github/workflows/ci.yml):\n\n1. **Lint**: `ruff check` + `ruff format --check`\n2. 
**Test**: runs `pytest` on Python 3.10 / 3.11 / 3.12 / 3.13\n\n## Architecture\n\n```\nscan2pdf/\n├── cli.py            # Command-line interface and argument parsing\n├── pipeline.py       # Orchestrates the full conversion pipeline\n├── pdf_splitter.py   # Extracts pages as images (pikepdf + poppler)\n├── ocr_engine.py     # Runs Tesseract OCR and produces hOCR\n├── hocr_parser.py    # Parses hOCR output, extracting text and font sizes\n├── text_cleaner.py   # Cleans OCR artifacts and merges lines into paragraphs\n├── pdf_generator.py  # Generates the typeset text PDF (ReportLab)\n└── pdf_merger.py     # Merges per-page PDFs into the final output\n```\n\n### Processing flow\n\n```\nScanned PDF\n    │\n    ├─ Cover pages ──→ Render to image ──→ JPEG compression ──→ Wrap as PDF\n    │\n    └─ Text pages ───→ Render to image ──→ Tesseract OCR ──→ Parse hOCR\n                                                                 │\n                   Merge into final PDF ←── Generate PDF ←── Clean text\n```\n\n## Dependencies\n\n| Package | Purpose |\n|----|------|\n| `pikepdf` | PDF page extraction |\n| `pypdf` | PDF reading and merging |\n| `reportlab` | Text PDF generation |\n| `lxml` | hOCR (HTML/XML) parsing |\n| `Pillow` | Image processing |\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchen3feng%2Fscan2pdf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchen3feng%2Fscan2pdf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchen3feng%2Fscan2pdf/lists"}