{"id":50494230,"url":"https://github.com/bcblr1993/pdf-diff-system","last_synced_at":"2026-06-02T05:31:10.417Z","repository":{"id":360691555,"uuid":"1251112279","full_name":"bcblr1993/pdf-diff-system","owner":"bcblr1993","description":"PDF diff comparison system - a contract/document review workbench. PyMuPDF + RapidOCR + character-stream diff, React dual-PDF highlighted review, one-command Docker deployment.","archived":false,"fork":false,"pushed_at":"2026-05-27T13:29:39.000Z","size":608,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-27T14:22:55.603Z","etag":null,"topics":["chinese","contract-review","diff","docker","fastapi","ocr","pdf","pymupdf","rapidocr","react"],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bcblr1993.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-27T08:54:55.000Z","updated_at":"2026-05-27T13:29:44.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/bcblr1993/pdf-diff-system","commit_stats":null,"previous_names":["bcblr1993/pdf-diff-system"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/bcblr1993/pdf-diff-system","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bcblr1993%2Fpdf-diff-system","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bcblr1993%2Fpdf-diff-system/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/repositories/bcblr1993%2Fpdf-diff-system/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bcblr1993%2Fpdf-diff-system/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bcblr1993","download_url":"https://codeload.github.com/bcblr1993/pdf-diff-system/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bcblr1993%2Fpdf-diff-system/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33808702,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-02T02:00:07.132Z","response_time":109,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chinese","contract-review","diff","docker","fastapi","ocr","pdf","pymupdf","rapidocr","react"],"created_at":"2026-06-02T05:31:09.564Z","updated_at":"2026-06-02T05:31:10.393Z","avatar_url":"https://github.com/bcblr1993.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PDF / Word Diff Comparison System\n\nA review workbench for contracts and documents. Upload the **original** (a vector PDF or a Word file) and the **counterparty's version** (a stamped, scanned PDF or a revised Word file); the system automatically runs OCR, a character-stream diff, and dual-view highlighting. A reviewer then confirms, ignores, or annotates each difference, and finally exports a report for archiving.\n\n## ⚡ 30-Second Quick Start\n\n```bash\ngit clone https://github.com/bcblr1993/pdf-diff-system.git\ncd pdf-diff-system\ncp .env.example .env\ndocker compose up -d                  # starts 5 containers: postgres+redis+api+worker+frontend\n```\n\nOpen 
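http://localhost:8080 → log in with `admin / admin123` → drag in two documents → automatic comparison → review → export.\n\n**For production deployment, see [DEPLOYMENT.md](./DEPLOYMENT.md).**\n\n---\n\n## Core Capabilities\n\n| Dimension | Capability |\n|---|---|\n| **Input formats** | PDF (digital + scanned) and Word (.docx), in any combination |\n| **Algorithm** | Whole-document character-stream diff + block-move detection + OCR noise normalization + stamp-occlusion detection (evolved through v10) |\n| **Accuracy** | On a 14-page vs. 15-page sample contract: 23 real differences, 0 false positives, all key fields caught |\n| **Performance** | First run 35 s (including OCR); 4 s on a cache hit (second run on the same files) |\n| **Review** | Per-item ✓/✗ + annotations + keyboard shortcuts + archiving on review completion |\n| **Batch** | 1 original × N scans, sharing the OCR cache |\n| **Export** | Three archive formats: Excel / HTML snapshot / PDF |\n| **Integration** | API key authentication + webhook push + full OpenAPI documentation |\n| **Multi-user** | Built-in login + two roles (admin / regular user) + full audit log |\n\n---\n\n## Screenshot Overview\n\n```\n┌─ Task list (home)──────────────────────────────────────┐\n│ #5  Contract v2   done   real 23  ★7   reviewed 12/23 │\n│ #6  Batch x3      running                              │\n│ ...                                                    │\n└────────────────────────────────────────────────────────┘\n\n┌─ Detail page (PDF / Word side by side)─────────────────┐\n│ Original P5        │ Scan P6          │ Diff sidebar   │\n│ ┌──────────────┐   │ ┌──────────────┐ │ #2★ XX-C-...  │\n│ │ Contract #...│   │ │ XX-C-260520  │ │ #38★ 仟→任    │\n│ │ ...          │   │ │ ...          │ │ ...           │\n│ │ [仟]→[任]🟡  │   │ │ ...          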
│ │               │\n│ └──────────────┘   │ └──────────────┘ │ ↑↓ Y N U keys  │\n└────────────────────────────────────────────────────────┘\n```\n\nColors: 🟢 added / 🔴 deleted / 🟡 modified / ⚪ obscured by stamp / 🔵 moved / ★ key field\n\n---\n\n## Architecture\n\n```\n                  ┌──────────────┐\n                  │   Browser    │  React + pdf.js + Tailwind\n                  └──────┬───────┘\n                         │\n                  ┌──────▼───────┐\n                  │   Nginx :80  │  ← frontend static files + API/WS reverse proxy\n                  └──────┬───────┘\n                         │\n              ┌──────────▼───────┐    ┌─────────────┐\n              │ FastAPI :8000    │◄──►│  Postgres   │\n              │  • Auth (JWT)    │    │  16-alpine  │\n              │  • Tasks         │    └─────────────┘\n              │  • Diffs+review  │\n              │  • WebSocket     │    ┌─────────────┐\n              │  • Webhook       │    │   Redis 7   │\n              │  • OpenAPI       │◄──►│             │\n              └──────────┬───────┘    └─────┬───────┘\n                         │ enqueue          │\n                         ▼                  │ pub/sub progress\n              ┌──────────────────┐          │\n              │  Worker (RQ)     │◄─────────┘\n              │  Pipeline:       │\n              │   ┌─ PDF: extract→ocr→stamp→diff\n              │   └─ Word: extract→diff\n              └────────┬─────────┘\n                       │\n                       ▼\n              ┌──────────────────┐\n              │ vol: storage/    │  PDF / Word files, deduplicated by SHA1\n              │ vol: cache/      │  OCR result cache\n              └──────────────────┘\n```\n\n---\n\n## Algorithm Results (Evolution on One Sample Contract)\n\nSample: a 14-page electronic contract vs. a 15-page stamped scan, containing handwritten fill-ins, red stamps, misaligned table content, and characters that OCR easily confuses.\n\n| Iteration | Real differences / noise | Key improvement |\n|---|---|---|\n| v1 line-level diff | 662, all noise | Basic pipeline working |\n| v3 page-level diff | 81, but all of P9 was wrong | Introduced page alignment |\n| v4 whole-document diff | 96 | **Eliminated page-number misalignment** |\n| v7 split replace + move | 39 | **Table displacement detection** |\n| **v10 final** | **23 real differences** | Single-character noise folding + ignoring underlines |\n\nAll 23 items reported by v10 correspond to real issues: filled-in contract number / wrong characters (仟/任, ‰/%, 甲/申) / missing clauses / newly added contact phone number / stamp-obscured regions / signature-block layout changes 
(**zero false positives**).\n\nSee the comments in `backend/pipeline/diff.py` for details.\n\n---\n\n## Repository Structure\n\n```\npdf-diff-system/\n├── README.md                    this document\n├── DEPLOYMENT.md                intranet deployment + operations manual\n├── CLAUDE.md                    development conventions for AI assistants + algorithm notes\n├── LICENSE                      MIT\n│\n├── docker-compose.yml           compose file for development\n├── docker-compose.prod.yml      production overlay (resource limits + log rotation)\n├── .env.example                 environment variable template\n│\n├── scripts/                     operations scripts\n│   ├── init-admin.sh            create the first admin (interactive)\n│   ├── backup.sh                back up DB + file storage\n│   ├── restore.sh               restore from backup\n│   └── health-check.sh          health-check script (cron-friendly)\n│\n├── backend/                     Python backend\n│   ├── pipeline/                ← core diff algorithm\n│   │   ├── extract.py           PDF vector text extraction (PyMuPDF)\n│   │   ├── ocr.py               OCR for scanned PDFs (RapidOCR)\n│   │   ├── word.py              Word text extraction (python-docx)\n│   │   ├── stamp_mask.py        red stamp detection\n│   │   ├── stream.py            character stream construction\n│   │   ├── normalize.py         normalization (OCR look-alike characters)\n│   │   ├── diff.py              whole-document diff + move detection\n│   │   └── cache.py             SHA1-keyed cache\n│   ├── app/\n│   │   ├── main.py              FastAPI entry point\n│   │   ├── cli.py               retained command-line entry point\n│   │   ├── core/                config / security / deps / logging\n│   │   ├── db/models/           7 ORM tables\n│   │   ├── schemas/             Pydantic\n│   │   ├── api/                 internal API (auth/comparisons/diffs/batches/...)\n│   │   │   └── v1/              external API (X-API-Key auth)\n│   │   ├── services/            business logic (webhook / api_key / file_storage / ...)\n│   │   ├── workers/             RQ Worker + compare_job\n│   │   ├── exporters/           xlsx / pdf_report / html_report / batch_xlsx\n│   │   └── ws/                  WebSocket progress\n│   ├── alembic/                 database migrations (4 revisions)\n│   └── Dockerfile\n│\n└── frontend/                    React + TypeScript\n    ├── src/\n    │   ├── pages/               Login / List / New / Detail / BatchList /\n    │   │  
                      BatchDetail / Integrations\n    │   ├── components/          AppShell / PdfDocument / DocxViewer /\n    │   │                        DiffSidebar / ProgressPanel / ...\n    │   ├── api/                 axios client + endpoints\n    │   ├── stores/              Zustand\n    │   └── App.tsx\n    ├── Dockerfile               Nginx static serving + API reverse proxy\n    └── nginx.conf\n```\n\n---\n\n## Feature Quick Reference\n\n| Task | How |\n|---|---|\n| Upload two documents to compare | `POST /api/comparisons` (multipart: orig, scan, title) |\n| List tasks | `GET /api/comparisons` |\n| View diff details | `GET /api/comparisons/{id}` + `GET /api/comparisons/{id}/diffs` |\n| Review a single diff | `PATCH /api/diffs/{id}` with body `{review_action, review_note}` |\n| Complete the review of a whole task | `POST /api/comparisons/{id}/review/complete` |\n| Export a report | `GET /api/comparisons/{id}/export.xlsx` (or .html / .pdf) |\n| Batch comparison | `POST /api/batches` (1 original + N scans) |\n| Create an external API key | `POST /api/api-keys` (admin only) |\n| Register a webhook | `POST /api/webhooks` (admin only) |\n| Call with an API key | Header `X-API-Key: \u003ckey\u003e`, endpoint `/api/v1/comparisons` |\n| Real-time progress | WebSocket `/ws/comparisons/{id}/progress` |\n\nFull OpenAPI docs: http://localhost:8080/docs\n\n---\n\n## Development Mode\n\n```bash\n# start the backend services\ndocker compose up -d postgres redis api worker\n\n# frontend dev server (hot reload)\ncd frontend\nexport PATH=\"/opt/homebrew/opt/node/bin:$PATH\"  # macOS brew node\nnpm install\nnpm run dev    # → http://localhost:5173\n```\n\nVite is already configured to proxy `/api` → `:8000` and `/ws` → `:8000`.\n\n---\n\n## Roadmap (Completed)\n\n| Stage | Scope | Status |\n|---|---|---|\n| MVP | Command-line PDF comparison + HTML report | ✅ |\n| M1 | FastAPI backend + DB + Worker + 15 endpoints | ✅ |\n| M2 | React frontend + dual PDF viewer + review interactions | ✅ |\n| M3 | Report export in three formats: Excel / PDF / HTML | ✅ |\n| M4 | Batch comparison (1 original × N scans) | ✅ |\n| M5 | API Key + Webhook + external API v1 | ✅ |\n| M7 | Word (.docx) document comparison support | ✅ |\n| M6 | Production deployment + backup/restore/health-check scripts + deployment docs | ✅ |\n\n---\n\n## License\n\nMIT. Note for commercial use: the code in this repository may be used freely, but the PP-OCRv4 models bundled with RapidOCR are licensed under Apache 2.0, while PyMuPDF is 
AGPL (evaluate before commercial use).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbcblr1993%2Fpdf-diff-system","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbcblr1993%2Fpdf-diff-system","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbcblr1993%2Fpdf-diff-system/lists"}