{"id":51026861,"url":"https://github.com/gitstq/documark","last_synced_at":"2026-06-21T20:02:27.218Z","repository":{"id":361768254,"uuid":"1255742535","full_name":"gitstq/documark","owner":"gitstq","description":"智能文档转换器 - 将PDF/Word/Excel/PowerPoint/HTML/图片等多种格式转换为Markdown","archived":false,"fork":false,"pushed_at":"2026-06-01T06:16:16.000Z","size":24,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-01T08:14:42.824Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gitstq.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-01T06:12:29.000Z","updated_at":"2026-06-01T06:16:20.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/gitstq/documark","commit_stats":null,"previous_names":["gitstq/documark"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/gitstq/documark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitstq%2Fdocumark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitstq%2Fdocumark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitstq%2Fdocumark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitstq%2Fdocumark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gitstq","download_url":"https://codeload.github.com/gitstq/documark/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitstq%2Fdocumark/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34623906,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-21T02:00:05.568Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-21T20:02:26.483Z","updated_at":"2026-06-21T20:02:27.213Z","avatar_url":"https://github.com/gitstq.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# 📄 DocuMark\n\n**智能文档转换器 - 将各种文件格式一键转换为Markdown**\n\n[![Python](https://img.shields.io/badge/Python-3.8%2B-blue)](https://www.python.org/)\n[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)\n[![Version](https://img.shields.io/badge/Version-1.0.0-orange)](https://github.com/gitstq/documark/releases)\n\n[简体中文](#简体中文) | [繁體中文](#繁體中文) | [English](#english)\n\n\u003c/div\u003e\n\n---\n\n## 简体中文\n\n### 🎉 项目介绍\n\n**DocuMark** 是一款强大的智能文档转换工具，能够将各种常见文件格式（PDF、Word、Excel、PowerPoint、HTML、图片等）快速转换为结构化的Markdown格式。\n\n**灵感来源**: 受到Microsoft MarkItDown项目的启发，DocuMark在此基础上进行了全面升级，支持更多文件格式、更美观的CLI界面、批量转换功能以及图片OCR识别。\n\n**核心价值**:\n- 🚀 一站式解决文档格式转换需求\n- 🎨 保留原文档结构和格式\n- 🤖 支持图片文字识别（OCR）\n- ⚡ 批量处理与并行转换\n\n### ✨ 核心特性\n\n| 特性 | 描述 |\n|------|------|\n| 📑 **多格式支持** | PDF、Word、Excel、PowerPoint、HTML、图片、文本等 |\n| 🎨 **美观CLI** | 基于Rich库的彩色终端界面与进度显示 |\n| ⚡ **批量转换** | 支持目录递归处理与多线程并行转换 |\n| 🤖 **OCR识别** | 图片文字智能识别（支持中英文）|\n| 📊 **表格提取** | 自动识别并转换文档中的表格 |\n| 🔗 **链接保留** | 保留原文档中的超链接信息 |\n| 🛠️ **易于扩展** | 模块化设计，轻松添加新格式支持 |\n\n### 🚀 快速开始\n\n#### 环境要求\n\n- **Python**: 3.8 或更高版本\n- **操作系统**: Windows、macOS、Linux\n\n#### 安装步骤\n\n```bash\n# 克隆仓库\ngit clone https://github.com/gitstq/documark.git\ncd documark\n\n# 安装依赖\npip install -r requirements.txt\n\n# 安装包\npip install -e .\n```\n\n#### 快速使用\n\n```bash\n# 转换单个文件\ndocumark convert document.pdf\ndocumark convert document.docx -o output.md\n\n# 批量转换目录（递归）\ndocumark convert ./documents -r -o ./output\n\n# 启用OCR识别图片\ndocumark convert image.png --ocr\n\n# 查看支持的格式\ndocumark formats\n```\n\n### 📖 详细使用指南\n\n#### 命令行参数\n\n```bash\n# 基础转换\ndocumark convert \u003c输入路径\u003e [选项]\n\n# 选项说明\n-o, --output PATH       # 指定输出路径\n-r, --recursive         # 递归处理子目录\n-w, --workers INTEGER   # 并行线程数（默认：4）\n--ocr                   # 启用OCR识别\n--ocr-lang TEXT         # OCR语言（默认：chi_sim+eng）\n--no-tables             # 不提取表格\n--no-links              # 不提取链接\n```\n\n#### Python API 使用\n\n```python\nfrom documark import DocuMarkConverter\n\n# 创建转换器\nconverter = DocuMarkConverter(output_dir=\"./output\")\n\n# 转换单个文件\ncontent = converter.convert(\"document.pdf\")\n\n# 批量转换\nconverter.convert_batch([\"file1.pdf\", \"file2.docx\"])\n\n# 转换整个目录\nconverter.convert_directory(\"./documents\", recursive=True)\n```\n\n### 💡 设计思路与迭代规划\n\n#### 技术选型\n\n- **pdfplumber/PyPDF2**: PDF解析与文本提取\n- **python-docx**: Word文档处理\n- **openpyxl**: Excel表格处理\n- **python-pptx**: PowerPoint处理\n- **BeautifulSoup**: HTML解析\n- **pytesseract**: OCR文字识别\n- **Rich/Typer**: 现代化CLI界面\n\n#### 后续迭代计划\n\n- [ ] 支持更多格式（EPUB、RTF等）\n- [ ] Web API服务模式\n- [ ] 目录监控自动转换\n- [ ] AI增强内容结构化\n- [ ] 图形界面（GUI）版本\n\n### 📦 打包与部署\n\n```bash\n# 构建分发包\nmake build\n\n# 安装到本地\npip install dist/documark-1.0.0-py3-none-any.whl\n```\n\n### 🤝 贡献指南\n\n欢迎提交Issue和Pull Request！\n\n1. Fork 本仓库\n2. 创建特性分支 (`git checkout -b feature/amazing-feature`)\n3. 提交更改 (`git commit -m 'feat: 添加新特性'`)\n4. 推送分支 (`git push origin feature/amazing-feature`)\n5. 创建 Pull Request\n\n### 📄 开源协议\n\n本项目采用 [MIT](LICENSE) 协议开源。\n\n---\n\n## 繁體中文\n\n### 🎉 專案介紹\n\n**DocuMark** 是一款強大的智慧文件轉換工具，能夠將各種常見文件格式（PDF、Word、Excel、PowerPoint、HTML、圖片等）快速轉換為結構化的Markdown格式。\n\n**核心價值**:\n- 🚀 一站式解決文件格式轉換需求\n- 🎨 保留原始文件結構和格式\n- 🤖 支援圖片文字識別（OCR）\n- ⚡ 批次處理與平行轉換\n\n### ✨ 核心特性\n\n| 特性 | 描述 |\n|------|------|\n| 📑 **多格式支援** | PDF、Word、Excel、PowerPoint、HTML、圖片、文字等 |\n| 🎨 **美觀CLI** | 基於Rich函式庫的彩色終端介面 |\n| ⚡ **批次轉換** | 支援目錄遞迴處理與多執行緒 |\n| 🤖 **OCR識別** | 圖片文字智慧識別（支援中英文）|\n| 📊 **表格提取** | 自動識別並轉換文件中的表格 |\n\n### 🚀 快速開始\n\n```bash\n# 安裝\npip install -r requirements.txt\npip install -e .\n\n# 轉換檔案\ndocumark convert document.pdf\ndocumark convert ./documents -r -o ./output\n```\n\n### 📄 開源協議\n\n[MIT](LICENSE) License\n\n---\n\n## English\n\n### 🎉 Introduction\n\n**DocuMark** is a powerful intelligent document converter that transforms various file formats (PDF, Word, Excel, PowerPoint, HTML, Images, etc.) into structured Markdown format.\n\n**Core Values**:\n- 🚀 One-stop solution for document format conversion\n- 🎨 Preserves original document structure and formatting\n- 🤖 Image text recognition (OCR) support\n- ⚡ Batch processing with parallel conversion\n\n### ✨ Features\n\n| Feature | Description |\n|---------|-------------|\n| 📑 **Multi-format** | PDF, Word, Excel, PowerPoint, HTML, Images, Text |\n| 🎨 **Beautiful CLI** | Colorful terminal UI based on Rich library |\n| ⚡ **Batch Convert** | Directory recursion \u0026 multi-threading |\n| 🤖 **OCR Support** | Intelligent image text recognition |\n| 📊 **Table Extraction** | Auto-detect and convert tables |\n\n### 🚀 Quick Start\n\n```bash\n# Install\npip install -r requirements.txt\npip install -e .\n\n# Convert files\ndocumark convert document.pdf\ndocumark convert ./documents -r -o ./output\n```\n\n### 📄 License\n\n[MIT](LICENSE) License\n\n---\n\n\u003cdiv align=\"center\"\u003e\n\n**Made with ❤️ by gitstq**\n\n[GitHub](https://github.com/gitstq) • [Issues](https://github.com/gitstq/documark/issues)\n\n\u003c/div\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgitstq%2Fdocumark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgitstq%2Fdocumark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgitstq%2Fdocumark/lists"}