https://github.com/gitstq/docflow
High-performance document conversion and intelligent processing engine - Convert PDF, Word, PowerPoint, Excel, HTML, Images to Markdown
- Host: GitHub
- URL: https://github.com/gitstq/docflow
- Owner: gitstq
- License: mit
- Created: 2026-04-16T07:18:23.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-16T07:19:10.000Z (2 months ago)
- Last Synced: 2026-04-16T09:24:16.869Z (2 months ago)
- Language: Python
- Size: 89.8 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# DocFlow

**High-performance document conversion and intelligent processing engine**
[English](#english) | [简体中文](#简体中文) | [繁體中文](#繁體中文)
---
## 🎉 Introduction
**DocFlow** is a command-line tool that converts documents (PDF, Word, PowerPoint, Excel, HTML, images, and more) to Markdown. It supports batch processing, OCR, metadata extraction, and optional AI-powered enhancements.
### ✨ Key Features
- **Multi-format Support**: PDF, Word, PowerPoint, Excel, HTML, Images, CSV, JSON, XML
- **Batch Processing**: Convert entire directories with parallel processing
- **OCR Support**: Extract text from images and scanned PDFs
- **Metadata Extraction**: Preserve document metadata during conversion
- **Image Extraction**: Extract and reference embedded images
- **AI Enhancement**: Optional AI-powered summarization and keyword extraction
- **Table Support**: Convert tables to Markdown format
- **Quality Reports**: Generate conversion quality assessments
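To make the **Table Support** item concrete, here is a minimal sketch of turning CSV text into a Markdown pipe table. It illustrates the general technique only and is not DocFlow's implementation; the function name `csv_to_markdown` is hypothetical, and the first row is assumed to be the header.

```python
import csv
import io


def csv_to_markdown(text: str) -> str:
    """Render CSV text as a Markdown pipe table (first row is the header)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape pipes so a cell cannot break the table, then pad short rows.
        cells = [cell.strip().replace("|", "\\|") for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)


print(csv_to_markdown("name,pages\nreport.pdf,12\nnotes.docx,3"))
```

Padding short rows keeps every line at the same column count, which Markdown renderers generally expect.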
### 🚀 Quick Start
#### Installation
```bash
# Using pip
pip install docflow
# With OCR support (quotes keep shells such as zsh from expanding the brackets)
pip install "docflow[ocr]"
# With AI features
pip install "docflow[ai]"
# Full installation
pip install "docflow[all]"
```
#### Basic Usage
```bash
# Convert a single file
docflow convert document.pdf
# Convert with custom output
docflow convert document.docx -o output.md
# Batch convert a directory
docflow convert ./documents -o ./markdown
# Enable OCR for scanned documents
docflow convert scan.pdf --enable-ocr --ocr-language eng+chi_sim
# Convert recursively
docflow batch ./docs -r -o ./output
```
### 📖 Detailed Usage Guide
#### Convert Command
```bash
docflow convert <input> [options]
```
| Option | Description |
|--------|-------------|
| `-o, --output` | Output file or directory |
| `--extract-images` | Extract images from documents |
| `--enable-ocr` | Enable OCR for images |
| `--ocr-language` | OCR language (default: eng) |
| `--include-metadata` | Include metadata in output |
| `--overwrite` | Overwrite existing files |
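When calling the CLI from a script, it can help to build the argument list from the options in the table rather than concatenating strings. The sketch below assumes only that `docflow` is on `PATH` and uses the flags listed above; `ConvertOptions` and `build_convert_command` are hypothetical helper names, not part of DocFlow.

```python
import shlex
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ConvertOptions:
    """Mirrors the flags in the convert options table; nothing else is assumed."""
    output: Optional[str] = None
    extract_images: bool = False
    enable_ocr: bool = False
    ocr_language: Optional[str] = None
    include_metadata: bool = False
    overwrite: bool = False


def build_convert_command(source: str, opts: ConvertOptions) -> List[str]:
    """Return an argv list suitable for subprocess.run (no shell involved)."""
    cmd = ["docflow", "convert", source]
    if opts.output:
        cmd += ["--output", opts.output]
    if opts.extract_images:
        cmd.append("--extract-images")
    if opts.enable_ocr:
        cmd.append("--enable-ocr")
    if opts.ocr_language:
        cmd += ["--ocr-language", opts.ocr_language]
    if opts.include_metadata:
        cmd.append("--include-metadata")
    if opts.overwrite:
        cmd.append("--overwrite")
    return cmd


cmd = build_convert_command(
    "scan.pdf",
    ConvertOptions(output="scan.md", enable_ocr=True, ocr_language="eng+chi_sim"),
)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.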
#### Batch Command
```bash
docflow batch <directory> [options]
```
| Option | Description |
|--------|-------------|
| `-r, --recursive` | Process directories recursively |
| `-p, --pattern` | File pattern to match (default: `*`) |
| `-o, --output-dir` | Output directory |
| `-w, --workers` | Number of parallel workers |
| `--enable-ocr` | Enable OCR |
| `--overwrite` | Overwrite existing files |
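The `-r`, `-p`, and `-w` options describe a common pattern: collect files with a glob, then fan the work out over a pool. The sketch below shows that pattern under stated assumptions and says nothing about DocFlow's internals. `collect_files` and `convert_all` are hypothetical names, the converter is a stand-in function, and a thread pool is used purely for simplicity.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List


def collect_files(root: Path, pattern: str = "*", recursive: bool = False) -> List[Path]:
    """Return files under root matching a glob pattern, sorted for stable order."""
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(p for p in matches if p.is_file())


def convert_all(
    files: List[Path], convert: Callable[[Path], str], workers: int = 4
) -> Dict[Path, str]:
    """Apply convert to every file using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each file with its result.
        return dict(zip(files, pool.map(convert, files)))


# Demo on a throwaway directory; upper-casing stands in for a real conversion.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").write_text("alpha")
    (root / "sub" / "b.txt").write_text("beta")
    files = collect_files(root, "*.txt", recursive=True)
    results = convert_all(files, lambda p: p.read_text().upper(), workers=2)
    print({p.name: text for p, text in results.items()})
```

For CPU-bound conversions, a process pool may be the better fit; the structure stays the same.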
#### Other Commands
```bash
# List supported formats
docflow formats
# Display document information
docflow info document.pdf
```
### 💡 Design Philosophy
DocFlow is designed with the following principles:
1. **Zero-dependency Core**: Basic functionality requires no extra dependencies
2. **Extensible Architecture**: Easy to add new converters
3. **Quality First**: Accurate conversion over speed
4. **Developer Friendly**: Clean API for programmatic use
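One way an "easy to add new converters" design is often realized is a registry keyed by file extension. The following is a hypothetical sketch of that idea, not DocFlow's actual API: `register`, `convert_text`, and the two sample converters are invented for illustration.

```python
from pathlib import Path
from typing import Callable, Dict

Converter = Callable[[str], str]
_REGISTRY: Dict[str, Converter] = {}


def register(*extensions: str) -> Callable[[Converter], Converter]:
    """Decorator that maps one or more file extensions to a converter."""
    def decorator(func: Converter) -> Converter:
        for ext in extensions:
            _REGISTRY[ext.lower()] = func
        return func
    return decorator


@register(".txt", ".md")
def passthrough(text: str) -> str:
    # Plain text is already valid Markdown.
    return text


@register(".json", ".xml")
def as_code_block(text: str) -> str:
    # Wrap structured text in a fenced code block.
    fence = "`" * 3
    return f"{fence}\n{text.strip()}\n{fence}"


def convert_text(filename: str, text: str) -> str:
    """Dispatch to the converter registered for the file's extension."""
    ext = Path(filename).suffix.lower()
    converter = _REGISTRY.get(ext)
    if converter is None:
        raise ValueError(f"no converter registered for {ext!r}")
    return converter(text)


print(convert_text("config.json", '{"ocr": true}'))
```

Adding a format then means writing one decorated function, with no changes to the dispatch code.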
### 📦 Deployment
#### Docker
```dockerfile
FROM python:3.11-slim
# --no-cache-dir keeps pip's download cache out of the image layer
RUN pip install --no-cache-dir docflow
ENTRYPOINT ["docflow"]
```
```bash
docker build -t docflow .
docker run --rm -v "$(pwd)/docs:/docs" docflow convert /docs/input.pdf
```
#### PyInstaller (Standalone Executable)
```bash
pip install pyinstaller
pyinstaller --onefile --name docflow docflow/cli/main.py
```
### 🤝 Contributing
We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for details.
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit changes (`git commit -m 'feat: add amazing feature'`)
4. Push to branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
### 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
---
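## 🎉 Project Introduction
**DocFlow** is a powerful command-line tool for converting various document formats to Markdown. It supports batch processing, OCR, metadata extraction, and AI enhancements.
### ✨ Core Features
- **Multi-format Support**: PDF, Word, PowerPoint, Excel, HTML, images, CSV, JSON, XML
- **Batch Processing**: Process entire directories in parallel
- **OCR Support**: Extract text from images and scanned PDFs
- **Metadata Extraction**: Preserve document metadata
- **Image Extraction**: Extract and reference embedded images
- **AI Enhancement**: Optional AI summarization and keyword extraction
- **Table Support**: Convert tables to Markdown format
- **Quality Reports**: Generate conversion quality assessments
### 🚀 Quick Start
#### Installation
```bash
# Using pip
pip install docflow
# With OCR support
pip install "docflow[ocr]"
# With AI features
pip install "docflow[ai]"
# Full installation
pip install "docflow[all]"
```
#### Basic Usage
```bash
# Convert a single file
docflow convert document.pdf
# Specify the output path
docflow convert document.docx -o output.md
# Batch convert a directory
docflow convert ./documents -o ./markdown
# Enable OCR (scanned documents)
docflow convert scan.pdf --enable-ocr --ocr-language eng+chi_sim
# Convert recursively
docflow batch ./docs -r -o ./output
```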
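### 📖 Detailed Usage Guide
#### convert Command
```bash
docflow convert <input> [options]
```
| Option | Description |
|------|------|
| `-o, --output` | Output file or directory |
| `--extract-images` | Extract images from documents |
| `--enable-ocr` | Enable OCR for images |
| `--ocr-language` | OCR language (default: eng) |
| `--include-metadata` | Include metadata in the output |
| `--overwrite` | Overwrite existing files |
#### batch Command
```bash
docflow batch <directory> [options]
```
| Option | Description |
|------|------|
| `-r, --recursive` | Process directories recursively |
| `-p, --pattern` | File matching pattern (default: `*`) |
| `-o, --output-dir` | Output directory |
| `-w, --workers` | Number of parallel worker processes |
| `--enable-ocr` | Enable OCR |
| `--overwrite` | Overwrite existing files |
#### Other Commands
```bash
# List supported formats
docflow formats
# Display document information
docflow info document.pdf
```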
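### 💡 Design Approach
DocFlow's design principles:
1. **Zero-dependency Core**: Basic functionality requires no extra dependencies
2. **Extensible Architecture**: Easy to add new converters
3. **Quality First**: Accuracy over speed
4. **Developer Friendly**: A clear API for programmatic use
### 📦 Packaging and Deployment
#### Docker
```dockerfile
FROM python:3.11-slim
RUN pip install docflow
ENTRYPOINT ["docflow"]
```
```bash
docker build -t docflow .
docker run --rm -v "$(pwd)/docs:/docs" docflow convert /docs/input.pdf
```
#### PyInstaller (Standalone Executable)
```bash
pip install pyinstaller
pyinstaller --onefile --name docflow docflow/cli/main.py
```
### 🤝 Contribution Guide
Contributions are welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for details.
1. Fork this repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'feat: add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
### 📄 Open Source License
This project is open-sourced under the MIT License; see the [LICENSE](LICENSE) file for details.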
---
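## 🎉 Project Overview
**DocFlow** is a powerful command-line tool for converting various document formats to Markdown. It supports batch processing, OCR, metadata extraction, and AI enhancements.
### ✨ Core Features
- **Multi-format Support**: PDF, Word, PowerPoint, Excel, HTML, images, CSV, JSON, XML
- **Batch Processing**: Process entire directories in parallel
- **OCR Support**: Extract text from images and scanned PDFs
- **Metadata Extraction**: Preserve document metadata
- **Image Extraction**: Extract and reference embedded images
- **AI Enhancement**: Optional AI summarization and keyword extraction
- **Table Support**: Convert tables to Markdown format
- **Quality Reports**: Generate conversion quality assessments
### 🚀 Quick Start
#### Installation
```bash
# Using pip
pip install docflow
# Full installation
pip install "docflow[all]"
```
#### Basic Usage
```bash
# Convert a single file
docflow convert document.pdf
# Batch convert a directory
docflow convert ./documents -o ./markdown
```
### 📄 License Terms
This project is released under the MIT License; see the [LICENSE](LICENSE) file for details.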
---
## 📊 Supported Formats
| Format | Extension | Features |
|--------|-----------|----------|
| PDF | `.pdf` | Text, Tables, Images, OCR |
| Word | `.docx`, `.doc` | Text, Tables, Images |
| PowerPoint | `.pptx`, `.ppt` | Slides, Tables, Text |
| Excel | `.xlsx`, `.xls` | Sheets, Tables |
| HTML | `.html`, `.htm` | Full content |
| Text | `.txt`, `.md` | Direct conversion |
| CSV/TSV | `.csv`, `.tsv` | Table conversion |
| JSON/XML | `.json`, `.xml` | Code blocks |
| Images | `.png`, `.jpg`, etc. | OCR, Metadata |
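
For scripts that pre-filter files before handing them to the CLI, the table above can be encoded as a simple lookup. This is a sketch built only from the extensions listed in the table; `FORMATS` and `detect_format` are hypothetical names, and the "etc." in the Images row is deliberately left unexpanded.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the Supported Formats table; nothing beyond it is assumed.
FORMATS = {
    "PDF": (".pdf",),
    "Word": (".docx", ".doc"),
    "PowerPoint": (".pptx", ".ppt"),
    "Excel": (".xlsx", ".xls"),
    "HTML": (".html", ".htm"),
    "Text": (".txt", ".md"),
    "CSV/TSV": (".csv", ".tsv"),
    "JSON/XML": (".json", ".xml"),
    "Images": (".png", ".jpg"),
}

# Invert the table so a lookup by extension is a single dictionary access.
EXTENSION_TO_FORMAT = {ext: name for name, exts in FORMATS.items() for ext in exts}


def detect_format(path: str) -> Optional[str]:
    """Return the format name for a path, or None if the extension is not listed."""
    return EXTENSION_TO_FORMAT.get(Path(path).suffix.lower())


for name in ("Report.PDF", "data.tsv", "archive.zip"):
    print(name, "->", detect_format(name))
```

The authoritative list for an installed version is whatever `docflow formats` prints.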
## 🗺️ Roadmap
- [ ] Web UI interface
- [ ] Cloud storage integration
- [ ] More AI providers support
- [ ] Custom template system
- [ ] Real-time collaboration
---
**Made with ❤️ by DocFlow Team**