{"id":48219030,"url":"https://github.com/notoriouslab/doc-cleaner","last_synced_at":"2026-04-04T19:03:31.468Z","repository":{"id":344129716,"uuid":"1176969982","full_name":"notoriouslab/doc-cleaner","owner":"notoriouslab","description":"doc-cleaner：一個為繁體中文金融文件設計的開源文件清洗工具，支援完全離線運行，你的文件，不該為了整理而離開你的電腦 :)","archived":false,"fork":false,"pushed_at":"2026-04-02T05:36:45.000Z","size":180,"stargazers_count":159,"open_issues_count":1,"forks_count":16,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-02T18:58:06.862Z","etag":null,"topics":["bank-statement","pdf","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/notoriouslab.png","metadata":{"files":{"readme":"README.en.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-09T15:05:03.000Z","updated_at":"2026-04-02T05:36:49.000Z","dependencies_parsed_at":"2026-03-13T18:03:16.145Z","dependency_job_id":null,"html_url":"https://github.com/notoriouslab/doc-cleaner","commit_stats":null,"previous_names":["notoriouslab/doc-cleaner"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/notoriouslab/doc-cleaner","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/notoriouslab%2Fdoc-cleaner","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/notoriouslab%2Fdoc-cleaner/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/notoriouslab%2Fdoc-c
leaner/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/notoriouslab%2Fdoc-cleaner/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/notoriouslab","download_url":"https://codeload.github.com/notoriouslab/doc-cleaner/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/notoriouslab%2Fdoc-cleaner/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31409471,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-04T10:20:44.708Z","status":"ssl_error","status_checked_at":"2026-04-04T10:20:06.846Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bank-statement","pdf","python"],"created_at":"2026-04-04T19:03:31.410Z","updated_at":"2026-04-04T19:03:31.459Z","avatar_url":"https://github.com/notoriouslab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# doc-cleaner\n\nConvert PDF, DOCX, XLSX, and text files to clean, structured Markdown — CJK-friendly, table-friendly, privacy-first.\n\n**Requires Python 3.9+** · Part of the [notoriouslab](https://github.com/notoriouslab) open-source toolkit.\n\n\u003e [中文 README](README.md)\n\n---\n\n## Why This Tool\n\nMost document-to-Markdown tools either drop tables, butcher CJK text, or require cloud uploads. 
doc-cleaner was built for Traditional Chinese financial documents from day one, but works great with any language. It also integrates with AI agent frameworks (OpenClaw, etc.) — any agent can call it via shell, and it ships with a `SKILL.md` for direct use with [OpenClaw](https://openclaw.ai/).\n\n| Feature | |\n|---|---|\n| CJK-first | Big5, CP950, UTF-16 auto-detection — covers all Taiwan bank statements |\n| Table preservation | DOCX + XLSX → Markdown pipe tables |\n| High-quality PDF extraction | Optional opendataloader-pdf produces pipe tables directly from PDFs |\n| Smart PDF triage | Auto-classifies: native text / layout-broken / scanned |\n| AI structuring | Gemini (cloud), Groq (cloud), or Ollama (local) |\n| No-AI mode | `--ai none` — pure extraction, zero API keys, zero cloud |\n| PDF decryption | Optional pikepdf |\n| Ad cleaning | Tail truncation + inline removal with configurable regex |\n| Privacy-first | Local Ollama option — documents never leave your machine |\n| Atomic writes | Temp file + `os.replace()` — no partial output |\n| Dry-run preview | `--dry-run` before committing |\n\n---\n\n## Quick Start\n\n```bash\n# 1. Clone\ngit clone https://github.com/notoriouslab/doc-cleaner.git\ncd doc-cleaner\n\n# 2. Install core dependencies\npip install -r requirements.txt\n\n# 3. (Optional) High-quality PDF extraction (recommended)\npip install opendataloader-pdf # Requires Java 11+ (brew install openjdk@21)\n\n# 4. (Optional) Install AI backend\npip install google-genai python-dotenv # for Gemini\n# or\n# Groq uses its OpenAI-compatible API directly; just set GROQ_API_KEY\n# or\npip install ollama # for local Ollama\n\n# 5. (Optional) Install PDF extras\npip install pikepdf # PDF decryption\npip install pdf2image # PDF vision mode (requires poppler)\n\n# 6. Configure\ncp config.example.json config.json\ncp .env.example .env\n# Edit .env — set GEMINI_API_KEY or GROQ_API_KEY if using a cloud backend\n\n# 7. 
Run\npython cleaner.py --input statement.pdf\n# Output: ./output/statement.md\n```\n\n### No-AI Mode (Simplest)\n\nNo API keys, no cloud — just text and table extraction:\n\n```bash\npip install -r requirements.txt\npython cleaner.py --input ./downloads/ --ai none\n```\n\n### Dry Run\n\nPreview which files would be processed without writing anything:\n\n```bash\npython cleaner.py --input ./downloads/ --dry-run --verbose\n```\n\n---\n\n## CLI Options\n\n```\npython cleaner.py [options]\n\n --input, -i File or directory to process (required, non-recursive)\n --output-dir, -o Output directory (default: ./output)\n --config Path to config JSON (default: \u003cscript-dir\u003e/config.json)\n --ai gemini | groq | ollama | none (default: from config or gemini)\n --password PDF decryption password (overrides .env and config)\n --summary Print JSON summary to stdout after processing (for scripts and AI agents)\n --dry-run Preview without writing files\n --verbose Enable debug logging\n --version Print version and exit\n```\n\n### Exit Codes\n\n| Code | Meaning |\n|---|---|\n| 0 | All files processed successfully |\n| 1 | Some files failed (partial success) |\n| 2 | No processable files found or config error |\n\n---\n\n## Configuration\n\nThe main config file is `config.json` (copy from `config.example.json`):\n\n```jsonc\n{\n \"ai\": {\n \"backend\": \"gemini\", // default AI backend\n \"prompt_template\": \"prompts/default.txt\", // prompt template path\n \"gemini\": {\n \"model\": \"gemini-2.5-pro\"\n },\n \"groq\": {\n \"model\": \"meta-llama/llama-4-scout-17b-16e-instruct\",\n \"base_url\": \"https://api.groq.com/openai/v1\",\n \"timeout\": 120\n },\n \"ollama\": {\n \"model\": \"qwen3.5:9b\",\n \"host\": \"http://localhost:11434\"\n }\n },\n \"pdf\": {\n \"dpi\": 200, // vision mode resolution\n \"max_pages\": 15 // vision mode page cap (OOM protection)\n },\n \"output\": {\n \"frontmatter\": true // include YAML frontmatter in output\n },\n 
\"ad_truncation_patterns\": [ // ad truncation regex (see below)\n \"\u003c投資人權益通知訊息[ \u003e]\",\n \"謹慎理財.{0,20}信用至上\"\n ]\n}\n```\n\n### Secret Management\n\n- **API keys and passwords** belong in `.env` only — **never** in `config.json`\n- Both `config.json` and `.env` are excluded via `.gitignore`\n- doc-cleaner warns at startup if it detects secret-like fields in `config.json`\n\n```\n# .env example\nGEMINI_API_KEY=your-key-here\nGROQ_API_KEY=your-key-here\nPDF_PASSWORD=your-pdf-password\n```\n\nPassword priority: `--password` CLI arg \u003e `.env` (`PDF_PASSWORD`) \u003e `config.json`\n\n---\n\n## Ad Cleaning\n\nTaiwan bank statement PDFs often contain investment risk notices, legal disclaimers, and promotional content. doc-cleaner provides two cleaning mechanisms:\n\n### Tail Truncation (`ad_truncation_patterns`)\n\nEverything after the first match is **removed entirely**. Best for legal disclaimers at the end of documents.\n\n### Inline Removal (`ad_strip_patterns`, v1.1)\n\nEach matched paragraph is **removed individually** without affecting surrounding content. 
Best for promotional blocks embedded between useful data.\n\nConfigure in `config.json`:\n\n```json\n{\n \"ad_truncation_patterns\": [\n \"謹慎理財.{0,20}信用至上\",\n \"your tail-truncation pattern here\"\n ],\n \"ad_strip_patterns\": [\n \"※運動賺回饋\",\n \"your inline-removal pattern here\"\n ]\n}\n```\n\n| Setting | Behavior | Use case |\n|---|---|---|\n| `ad_truncation_patterns` | Truncate everything after first match | End-of-document disclaimers |\n| `ad_strip_patterns` | Remove each matched paragraph | Inline promotional blocks |\n\nSafety: if tail truncation would remove more than 70% of content, it's skipped with a warning.\n\nAll regex patterns are validated at startup — invalid syntax causes an immediate error, not a mid-processing crash.\n\n---\n\n## Custom AI Prompt Templates\n\ndoc-cleaner ships with two prompt templates:\n\n| File | Purpose |\n|---|---|\n| `prompts/default.txt` | General-purpose document cleaning |\n| `prompts/finance.txt` | Bank statements and financial reports (preserves transactions, amounts) |\n\nSwitch in `config.json`:\n\n```json\n\"ai\": { \"prompt_template\": \"prompts/finance.txt\" }\n```\n\n**Write your own**: create a `.txt` file in `prompts/`. The AI must output JSON with these fields:\n\n```json\n{\n \"title\": \"Short descriptive title\",\n \"summary\": \"1-2 sentence summary\",\n \"refined_markdown\": \"Full cleaned Markdown content\",\n \"tags\": [\"tag1\", \"tag2\"]\n}\n```\n\nExample: for medical documents, create `prompts/medical.txt` and emphasize preserving patient IDs, dates, and diagnosis codes.\n\n---\n\n## Smart PDF Triage\n\nNot all PDFs are equal. doc-cleaner classifies each PDF before processing.\n\n### With opendataloader-pdf (v1.1, recommended)\n\nWhen `opendataloader-pdf` + Java 11+ are installed, doc-cleaner automatically uses it for PDF extraction. 
opendataloader-pdf produces **proper Markdown pipe tables** directly, dramatically reducing the number of files that need AI processing.\n\n```\nPDF input\n ↓\nopendataloader-pdf (Fast mode) ← tables → pipe tables automatically\n ↓\nQuality check\n ├─ Good (structured content) → Output Markdown directly ✓\n └─ Bad (scanned / empty) → Send to AI\n```\n\nWithout opendataloader-pdf, it falls back to PyMuPDF — same behavior as before.\n\n### Classification Logic\n\n| Type | Detection | Strategy |\n|---|---|---|\n| Native text | char density ≥8, garbage \u003c5%, short lines ≤70% | Direct text extraction (fast, free) |\n| Layout-broken | \u003e70% short lines (tables crushed) | AI vision + text fallback |\n| Scanned | char density \u003c8 | AI vision + text fallback |\n\n\u003e With opendataloader-pdf, many PDFs previously classified as \"layout-broken\" get upgraded to \"native text\" because opendataloader-pdf (ODL) successfully extracts the tables — skipping AI entirely.\n\n### Hybrid Strategy (Recommended)\n\nThe most cost-effective workflow:\n\n```bash\n# Step 1: Extract everything in raw mode (fast, free, private)\npython cleaner.py --input ./downloads/ --ai none --output-dir ./output/raw\n\n# Step 2: Re-process only the files that extracted poorly (layout-broken or scanned) with AI\npython cleaner.py --input problem_file.pdf --ai gemini --output-dir ./output/ai\n```\n\n---\n\n## Table Preservation\n\nTables are first-class citizens:\n\n- **DOCX**: `python-docx` extracts tables → Markdown pipe tables (`|` delimiters)\n- **XLSX/CSV**: `pandas.DataFrame.to_markdown()` — all sheets, empty cells filled, capped at 8000 chars/sheet\n- **AI prompt**: explicitly instructs \"keep existing pipe tables EXACTLY as-is\"\n\n---\n\n## Ollama Model Recommendations\n\nTable reconstruction from layout-broken PDFs is demanding. Smaller models will struggle. 
Tested on a MacBook Air M2 (8GB) and a 2019 iMac; neither performed well with local Ollama. If your machine has more RAM, the qwen3.5 series supports vision (image input) natively, which is ideal for scanned PDFs:\n\n| Model | Size | Vision | Table reconstruction | CJK quality | Notes |\n|---|---|---|---|---|---|\n| `qwen3.5:27b` | 17 GB | Yes | Good | Excellent | **Recommended** — native vision, 256K context |\n| `qwen3.5:9b` | 6.6 GB | Yes | Fair | Good | **Default** — runs on most machines, handles scanned PDFs |\n| `qwen3.5:4b` | 3.4 GB | Yes | Poor | Fair | Text OK, tables marginal |\n| `qwen3:30b` | 19 GB | No | Good | Excellent | MoE, fast inference, but no vision |\n\n\u003e **Recommendation**: prefer the `qwen3.5` series. With native vision, scanned PDF pages can be sent to the model directly as images, without extra OCR. `qwen3.5:27b` gives the best results; `qwen3.5:9b` (6.6GB) is the default, balancing quality and resource requirements.\n\u003e\n\u003e If you don't need to process scanned PDFs (only native-text PDFs, DOCX, XLSX), `qwen3:30b` with its MoE architecture offers faster inference.\n\u003e\n\u003e **8GB RAM users**: Ollama will be slow. 
Use `--ai gemini` or `--ai none` instead.\n\n---\n\n## Supported Formats\n\n| Format | Parser | Tables | Notes |\n|---|---|---|---|\n| **PDF** (native text) | opendataloader-pdf / PyMuPDF | pipe tables / AI rebuild | ODL produces tables directly; falls back to PyMuPDF |\n| **PDF** (scanned) | pdf2image → AI vision | AI rebuild | Requires poppler |\n| **PDF** (encrypted) | pikepdf → above | pipe tables / AI rebuild | Optional pikepdf |\n| **DOCX** | python-docx | pipe tables | Cross-platform; textutil fallback on macOS only |\n| **XLSX / XLS** | pandas + openpyxl | pipe tables | All sheets |\n| **CSV** | pandas | pipe tables | Auto-detected |\n| **TXT / MD** | stdlib | — | Multi-encoding (Big5, CP950, UTF-16) |\n\n### Installing opendataloader-pdf (recommended)\n\nHigh-quality PDF extraction with proper table support:\n\n```bash\n# Install Java 11+\nbrew install openjdk@21 # macOS\n# sudo apt install openjdk-21-jre # Ubuntu\n\n# Install Python package\npip install opendataloader-pdf\n```\n\nWhen installed, doc-cleaner auto-detects and uses it. 
Without it, PyMuPDF is used as a fallback.\n\n### Installing poppler\n\nPDF vision mode (converting scanned PDF pages to images) requires the poppler system package:\n\n```bash\n# macOS\nbrew install poppler\n\n# Ubuntu / Debian\nsudo apt-get install poppler-utils\n```\n\nIf you don't need vision mode, use `--ai none` to skip it entirely.\n\n---\n\n## Security\n\n- **No cloud required**: `--ai ollama` or `--ai none` keeps everything local\n- **Atomic writes**: temp file + `os.replace()` prevents partial output\n- **Secret isolation**: API keys in `.env` only (never `config.json`), startup validation\n- **OOM protection**: PDF vision capped at 15 pages by default (configurable)\n- **Ad truncation guard**: truncation skipped if it would remove \u003e70% of content\n- **JSON graceful degradation**: if AI returns unparseable JSON, falls back to raw text mode\n\nSee [SECURITY.md](SECURITY.md) for the full security policy.\n\n---\n\n## AI Agent Integration (OpenClaw, etc.)\n\ndoc-cleaner is a standard CLI tool — any AI agent framework can call it via shell. It ships with a `SKILL.md` for direct use with [OpenClaw](https://openclaw.ai/).\n\n```bash\n# Agent usage: process file + get machine-readable summary\npython cleaner.py --input document.pdf --ai none --summary\n```\n\n`--summary` output example:\n```json\n{\"version\":\"1.0.0\",\"total\":1,\"success\":1,\"failed\":0,\"files\":[{\"file\":\"document.pdf\",\"output\":\"./output/document.md\",\"status\":\"ok\"}]}\n```\n\nAgents can use exit codes to determine success (0=all OK, 1=partial failure, 2=no processable files or config error) and parse the `--summary` JSON for per-file results.\n\n---\n\n## Part of the notoriouslab Pipeline\n\n```\ngmail-statement-fetcher → Auto-download PDF statements from Gmail\n ↓\n doc-cleaner → PDF/DOCX/XLSX → structured Markdown\n ↓\n personal-cfo → Monthly audit + retirement glide path (in development)\n```\n\nEach tool works standalone. 
Together they form a full personal finance automation pipeline.\n\n---\n\n## Contributing\n\nThe easiest contributions:\n\n1. **Add ad truncation patterns** for your bank — add regex to `config.example.json`\n2. **Add prompt templates** for your document type — create a `.txt` in `prompts/`\n3. **Report encoding issues** with anonymized samples and logs\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md).\n\n---\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnotoriouslab%2Fdoc-cleaner","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnotoriouslab%2Fdoc-cleaner","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnotoriouslab%2Fdoc-cleaner/lists"}