{"id":51026920,"url":"https://github.com/gitstq/docflow","last_synced_at":"2026-06-21T20:02:32.812Z","repository":{"id":351727562,"uuid":"1212226379","full_name":"gitstq/docflow","owner":"gitstq","description":"High-performance document conversion and intelligent processing engine - Convert PDF, Word, PowerPoint, Excel, HTML, Images to Markdown","archived":false,"fork":false,"pushed_at":"2026-04-16T07:19:10.000Z","size":92,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-16T09:24:16.869Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gitstq.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-16T07:18:23.000Z","updated_at":"2026-04-16T07:18:58.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/gitstq/docflow","commit_stats":null,"previous_names":["gitstq/docflow"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/gitstq/docflow","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitstq%2Fdocflow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitstq%2Fdocflow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitstq%2Fdocflow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositor
ies/gitstq%2Fdocflow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gitstq","download_url":"https://codeload.github.com/gitstq/docflow/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitstq%2Fdocflow/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34623906,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-21T02:00:05.568Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-21T20:02:32.063Z","updated_at":"2026-06-21T20:02:32.800Z","avatar_url":"https://github.com/gitstq.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DocFlow\n\n\u003cdiv align=\"center\"\u003e\n\n![DocFlow Logo](docs/images/logo.png)\n\n**High-performance document conversion and intelligent processing engine**\n\n[![PyPI version](https://badge.fury.io/py/docflow.svg)](https://badge.fury.io/py/docflow)\n[![Python Support](https://img.shields.io/pypi/pyversions/docflow.svg)](https://pypi.org/project/docflow/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n\n[English](#english) | [简体中文](#简体中文) | 
[繁體中文](#繁體中文)\n\n\u003c/div\u003e\n\n---\n\n\u003ca name=\"english\"\u003e\u003c/a\u003e\n\n## 🎉 Introduction\n\n**DocFlow** is a powerful command-line tool for converting various document formats to Markdown. It supports batch processing, OCR, metadata extraction, and AI-powered enhancements.\n\n### ✨ Key Features\n\n- **Multi-format Support**: PDF, Word, PowerPoint, Excel, HTML, Images, CSV, JSON, XML\n- **Batch Processing**: Convert entire directories with parallel processing\n- **OCR Support**: Extract text from images and scanned PDFs\n- **Metadata Extraction**: Preserve document metadata during conversion\n- **Image Extraction**: Extract and reference embedded images\n- **AI Enhancement**: Optional AI-powered summarization and keyword extraction\n- **Table Support**: Convert tables to Markdown format\n- **Quality Reports**: Generate conversion quality assessments\n\n### 🚀 Quick Start\n\n#### Installation\n\n```bash\n# Using pip\npip install docflow\n\n# With OCR support\npip install docflow[ocr]\n\n# With AI features\npip install docflow[ai]\n\n# Full installation\npip install docflow[all]\n```\n\n#### Basic Usage\n\n```bash\n# Convert a single file\ndocflow convert document.pdf\n\n# Convert with custom output\ndocflow convert document.docx -o output.md\n\n# Batch convert a directory\ndocflow convert ./documents -o ./markdown\n\n# Enable OCR for scanned documents\ndocflow convert scan.pdf --enable-ocr --ocr-language eng+chi_sim\n\n# Convert recursively\ndocflow batch ./docs -r -o ./output\n```\n\n### 📖 Detailed Usage Guide\n\n#### Convert Command\n\n```bash\ndocflow convert \u003csource\u003e [options]\n```\n\n| Option | Description |\n|--------|-------------|\n| `-o, --output` | Output file or directory |\n| `--extract-images` | Extract images from documents |\n| `--enable-ocr` | Enable OCR for images |\n| `--ocr-language` | OCR language (default: eng) |\n| `--include-metadata` | Include metadata in output |\n| `--overwrite` | Overwrite existing files 
|\n\n#### Batch Command\n\n```bash\ndocflow batch \u003csource\u003e [options]\n```\n\n| Option | Description |\n|--------|-------------|\n| `-r, --recursive` | Process directories recursively |\n| `-p, --pattern` | File pattern to match (default: *) |\n| `-o, --output-dir` | Output directory |\n| `-w, --workers` | Number of parallel workers |\n| `--enable-ocr` | Enable OCR |\n| `--overwrite` | Overwrite existing files |\n\n#### Other Commands\n\n```bash\n# List supported formats\ndocflow formats\n\n# Display document information\ndocflow info document.pdf\n```\n\n### 💡 Design Philosophy\n\nDocFlow is designed with the following principles:\n\n1. **Zero-dependency Core**: Minimal dependencies for basic functionality\n2. **Extensible Architecture**: Easy to add new converters\n3. **Quality First**: Accurate conversion over speed\n4. **Developer Friendly**: Clean API for programmatic use\n\n### 📦 Deployment\n\n#### Docker\n\n```dockerfile\nFROM python:3.11-slim\nRUN pip install docflow\nENTRYPOINT [\"docflow\"]\n```\n\n```bash\ndocker build -t docflow .\ndocker run -v $(pwd)/docs:/docs docflow convert /docs/input.pdf\n```\n\n#### PyInstaller (Standalone Executable)\n\n```bash\npip install pyinstaller\npyinstaller --onefile --name docflow docflow/cli/main.py\n```\n\n### 🤝 Contributing\n\nWe welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for details.\n\n1. Fork the repository\n2. Create a feature branch (`git checkout -b feature/amazing-feature`)\n3. Commit changes (`git commit -m 'feat: add amazing feature'`)\n4. Push to branch (`git push origin feature/amazing-feature`)\n5. 
Open a Pull Request\n\n### 📄 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n---\n\n\u003ca name=\"简体中文\"\u003e\u003c/a\u003e\n\n## 🎉 项目介绍\n\n**DocFlow** 是一个强大的命令行工具，用于将各种文档格式转换为 Markdown。支持批量处理、OCR、元数据提取和 AI 增强功能。\n\n### ✨ 核心特性\n\n- **多格式支持**：PDF、Word、PowerPoint、Excel、HTML、图片、CSV、JSON、XML\n- **批量处理**：并行处理整个目录\n- **OCR 支持**：从图片和扫描 PDF 中提取文字\n- **元数据提取**：保留文档元数据\n- **图片提取**：提取并引用嵌入的图片\n- **AI 增强**：可选的 AI 摘要和关键词提取\n- **表格支持**：将表格转换为 Markdown 格式\n- **质量报告**：生成转换质量评估\n\n### 🚀 快速开始\n\n#### 安装\n\n```bash\n# 使用 pip\npip install docflow\n\n# 带 OCR 支持\npip install docflow[ocr]\n\n# 带 AI 功能\npip install docflow[ai]\n\n# 完整安装\npip install docflow[all]\n```\n\n#### 基本用法\n\n```bash\n# 转换单个文件\ndocflow convert document.pdf\n\n# 指定输出路径\ndocflow convert document.docx -o output.md\n\n# 批量转换目录\ndocflow convert ./documents -o ./markdown\n\n# 启用 OCR（扫描文档）\ndocflow convert scan.pdf --enable-ocr --ocr-language eng+chi_sim\n\n# 递归转换\ndocflow batch ./docs -r -o ./output\n```\n\n### 📖 详细使用指南\n\n#### convert 命令\n\n```bash\ndocflow convert \u003csource\u003e [options]\n```\n\n| 选项 | 说明 |\n|------|------|\n| `-o, --output` | 输出文件或目录 |\n| `--extract-images` | 从文档中提取图片 |\n| `--enable-ocr` | 启用图片 OCR |\n| `--ocr-language` | OCR 语言（默认：eng）|\n| `--include-metadata` | 在输出中包含元数据 |\n| `--overwrite` | 覆盖已存在的文件 |\n\n#### batch 命令\n\n```bash\ndocflow batch \u003csource\u003e [options]\n```\n\n| 选项 | 说明 |\n|------|------|\n| `-r, --recursive` | 递归处理目录 |\n| `-p, --pattern` | 文件匹配模式（默认：*）|\n| `-o, --output-dir` | 输出目录 |\n| `-w, --workers` | 并行工作进程数 |\n| `--enable-ocr` | 启用 OCR |\n| `--overwrite` | 覆盖已存在的文件 |\n\n#### 其他命令\n\n```bash\n# 列出支持的格式\ndocflow formats\n\n# 显示文档信息\ndocflow info document.pdf\n```\n\n### 💡 设计思路\n\nDocFlow 的设计原则：\n\n1. **核心零依赖**：基本功能无需额外依赖\n2. **可扩展架构**：易于添加新的转换器\n3. **质量优先**：准确性优于速度\n4. 
**开发者友好**：清晰的 API 便于编程使用\n\n### 📦 打包与部署\n\n#### Docker\n\n```dockerfile\nFROM python:3.11-slim\nRUN pip install docflow\nENTRYPOINT [\"docflow\"]\n```\n\n```bash\ndocker build -t docflow .\ndocker run -v $(pwd)/docs:/docs docflow convert /docs/input.pdf\n```\n\n#### PyInstaller（独立可执行文件）\n\n```bash\npip install pyinstaller\npyinstaller --onefile --name docflow docflow/cli/main.py\n```\n\n### 🤝 贡献指南\n\n欢迎参与贡献！详情请参阅 [CONTRIBUTING.md](CONTRIBUTING.md)。\n\n1. Fork 本仓库\n2. 创建特性分支 (`git checkout -b feature/amazing-feature`)\n3. 提交更改 (`git commit -m 'feat: add amazing feature'`)\n4. 推送到分支 (`git push origin feature/amazing-feature`)\n5. 提交 Pull Request\n\n### 📄 开源协议\n\n本项目采用 MIT 协议开源 - 详见 [LICENSE](LICENSE) 文件。\n\n---\n\n\u003ca name=\"繁體中文\"\u003e\u003c/a\u003e\n\n## 🎉 專案介紹\n\n**DocFlow** 是一個強大的命令列工具，用於將各種文件格式轉換為 Markdown。支援批次處理、OCR、元資料提取和 AI 增強功能。\n\n### ✨ 核心特性\n\n- **多格式支援**：PDF、Word、PowerPoint、Excel、HTML、圖片、CSV、JSON、XML\n- **批次處理**：平行處理整個目錄\n- **OCR 支援**：從圖片和掃描 PDF 中提取文字\n- **元資料提取**：保留文件元資料\n- **圖片提取**：提取並引用嵌入的圖片\n- **AI 增強**：可選的 AI 摘要和關鍵字提取\n- **表格支援**：將表格轉換為 Markdown 格式\n- **品質報告**：產生轉換品質評估\n\n### 🚀 快速開始\n\n#### 安裝\n\n```bash\n# 使用 pip\npip install docflow\n\n# 完整安裝\npip install docflow[all]\n```\n\n#### 基本用法\n\n```bash\n# 轉換單一檔案\ndocflow convert document.pdf\n\n# 批次轉換目錄\ndocflow convert ./documents -o ./markdown\n```\n\n### 📄 授權條款\n\n本專案採用 MIT 授權條款 - 詳見 [LICENSE](LICENSE) 檔案。\n\n---\n\n## 📊 Supported Formats\n\n| Format | Extension | Features |\n|--------|-----------|----------|\n| PDF | `.pdf` | Text, Tables, Images, OCR |\n| Word | `.docx`, `.doc` | Text, Tables, Images |\n| PowerPoint | `.pptx`, `.ppt` | Slides, Tables, Text |\n| Excel | `.xlsx`, `.xls` | Sheets, Tables |\n| HTML | `.html`, `.htm` | Full content |\n| Text | `.txt`, `.md` | Direct conversion |\n| CSV/TSV | `.csv`, `.tsv` | Table conversion |\n| JSON/XML | `.json`, `.xml` | Code blocks |\n| Images | `.png`, `.jpg`, etc. 
| OCR, Metadata |\n\n## 🗺️ Roadmap\n\n- [ ] Web UI\n- [ ] Cloud storage integration\n- [ ] Support for more AI providers\n- [ ] Custom template system\n- [ ] Real-time collaboration\n\n---\n\n\u003cdiv align=\"center\"\u003e\n\n**Made with ❤️ by DocFlow Team**\n\n\u003c/div\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgitstq%2Fdocflow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgitstq%2Fdocflow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgitstq%2Fdocflow/lists"}