{"id":51026962,"url":"https://github.com/gitstq/docconvert-ai","last_synced_at":"2026-06-21T20:02:47.955Z","repository":{"id":358172911,"uuid":"1240343365","full_name":"gitstq/docconvert-ai","owner":"gitstq","description":"📄 DocConvert-AI - Lightweight AI Document Intelligence Conversion \u0026 Knowledge Extraction Engine | 轻量级AI文档智能转换与知识提取引擎 - Zero Dependencies, Multi-format Support, Markdown/JSON/HTML Export","archived":false,"fork":false,"pushed_at":"2026-05-16T03:10:44.000Z","size":14,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-16T05:19:25.118Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gitstq.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-16T03:07:19.000Z","updated_at":"2026-05-16T03:10:30.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/gitstq/docconvert-ai","commit_stats":null,"previous_names":["gitstq/docconvert-ai"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/gitstq/docconvert-ai","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitstq%2Fdocconvert-ai","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitstq%2Fdocconvert-ai/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitstq%2Fdocconvert-ai/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitstq%2Fdocconvert-ai/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gitstq","download_url":"https://codeload.github.com/gitstq/docconvert-ai/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitstq%2Fdocconvert-ai/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34623906,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-21T02:00:05.568Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-21T20:02:47.864Z","updated_at":"2026-06-21T20:02:47.938Z","avatar_url":"https://github.com/gitstq.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# 📄 DocConvert-AI\n\n**Lightweight AI Document Intelligence Conversion \u0026 Knowledge Extraction Engine**\n\n**轻量级AI文档智能转换与知识提取引擎**\n\n[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Zero Dependencies](https://img.shields.io/badge/dependencies-zero-brightgreen.svg)]()\n[![Code Style](https://img.shields.io/badge/code%20style-black-000000.svg)]()\n\n[English](#english) | [简体中文](#简体中文) | [繁體中文](#繁體中文)\n\n\u003c/div\u003e\n\n---\n\n\u003ca name=\"english\"\u003e\u003c/a\u003e\n## 🎉 Introduction\n\n**DocConvert-AI** is a lightweight, zero-dependency document conversion engine that transforms complex documents into structured, AI-ready formats. Built with pure Python and utilizing only the standard library, it requires no external dependencies while supporting multiple input and output formats.\n\n### 💡 Why DocConvert-AI?\n\n- **🚀 Zero Dependencies**: Pure Python implementation, no pip install hell\n- **📦 Lightweight**: Single file, easy to integrate and deploy\n- **🔄 Multi-Format Support**: PDF, DOCX, XLSX, PPTX, HTML, Markdown, TXT\n- **📤 Flexible Export**: Markdown, JSON, HTML, Plain Text\n- **⚡ Fast Processing**: Optimized for speed and efficiency\n- **🤖 AI-Ready**: Structured output perfect for LLM consumption\n\n---\n\n## ✨ Core Features\n\n| Feature | Description | Status |\n|---------|-------------|--------|\n| 📄 **PDF Support** | Extract text from PDF documents | ✅ Supported |\n| 📝 **Word Documents** | Parse DOCX files with formatting | ✅ Supported |\n| 📊 **Excel Sheets** | Convert XLSX to structured data | ✅ Supported |\n| 🎨 **PowerPoint** | Extract text from PPTX slides | ✅ Supported |\n| 🌐 **HTML** | Clean extraction from web pages | ✅ Supported |\n| 📑 **Markdown** | Native markdown processing | ✅ Supported |\n| 📃 **Plain Text** | Basic text file handling | ✅ Supported |\n| 🔄 **Batch Processing** | Convert entire directories | ✅ Supported |\n| 📤 **Multi-Format Export** | MD, JSON, HTML, TXT output | ✅ Supported |\n\n---\n\n## 🚀 Quick Start\n\n### Requirements\n\n- Python 3.8 or higher\n- No external dependencies required!\n\n### Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/gitstq/docconvert-ai.git\ncd docconvert-ai\n\n# Or install via pip (when published)\npip install docconvert-ai\n```\n\n### Basic Usage\n\n```bash\n# Convert a single file to Markdown (default)\npython docconvert.py document.docx\n\n# Convert to specific format\npython docconvert.py document.pdf -f json\npython docconvert.py document.html -f html\npython docconvert.py document.pptx -f text\n\n# Specify output file\npython docconvert.py input.docx -o output.md\n\n# Batch convert entire directory\npython docconvert.py ./input_dir -b -o ./output_dir -f markdown\n```\n\n---\n\n## 📖 Detailed Usage Guide\n\n### Command Line Options\n\n```\nusage: docconvert.py [-h] [-o OUTPUT] [-f {markdown,json,html,text}] [-b] [-v]\n                     input\n\nDocConvert-AI: Lightweight AI Document Intelligence Conversion Engine\n\npositional arguments:\n  input                 Input file or directory path\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -o, --output          Output file or directory path\n  -f, --format          Output format: markdown, json, html, text (default: markdown)\n  -b, --batch           Batch convert all files in directory\n  -v, --version         show program's version number and exit\n```\n\n### Python API\n\n```python\nfrom docconvert import DocConvertAI, OutputFormat\n\n# Initialize converter\nconverter = DocConvertAI()\n\n# Convert single file\noutput = converter.convert_file(\n    input_path=\"document.docx\",\n    output_path=\"output.md\",\n    output_format=OutputFormat.MARKDOWN\n)\n\n# Batch convert\nfiles = converter.batch_convert(\n    input_dir=\"./documents\",\n    output_dir=\"./converted\",\n    output_format=OutputFormat.JSON\n)\n```\n\n### Output Formats\n\n#### Markdown Output\n```markdown\n# Document Title\n\n**Author:** John Doe  \n**Created:** 2026-05-16T10:30:00\n\n---\n\n## Section 1\n\nContent here...\n\n- List item 1\n- List item 2\n```\n\n#### JSON Output\n```json\n{\n  \"title\": \"Document Title\",\n  \"author\": \"John Doe\",\n  \"created_date\": \"2026-05-16T10:30:00\",\n  \"metadata\": {\n    \"source_file\": \"document.docx\",\n    \"file_type\": \"docx\",\n    \"element_count\": 42\n  },\n  \"elements\": [\n    {\n      \"type\": \"heading\",\n      \"content\": \"Introduction\",\n      \"level\": 1\n    }\n  ]\n}\n```\n\n---\n\n## 💡 Design Philosophy\n\n### Zero-Dependency Approach\n\nUnlike other document conversion tools that require dozens of dependencies (PyPDF2, python-docx, pandas, etc.), DocConvert-AI leverages Python's powerful standard library:\n\n- **`zipfile`**: For DOCX, XLSX, PPTX (all are ZIP archives)\n- **`xml.etree.ElementTree`**: For parsing XML content\n- **`re`**: For pattern matching and text extraction\n- **`argparse`**: For CLI interface\n- **`dataclasses`**: For structured data representation\n\n### Performance Optimizations\n\n- Streaming file processing for large documents\n- Efficient regex patterns for text extraction\n- Minimal memory footprint\n- Fast batch processing\n\n---\n\n## 📦 Packaging \u0026 Deployment\n\n### Standalone Script\n\nThe entire tool is contained in a single `docconvert.py` file:\n\n```bash\n# Just download and run\ncurl -O https://raw.githubusercontent.com/gitstq/docconvert-ai/main/docconvert.py\npython docconvert.py your_document.docx\n```\n\n### Python Package\n\n```bash\n# Install from source\npip install -e .\n\n# Use as module\npython -m docconvert your_document.docx\n```\n\n---\n\n## 🤝 Contributing\n\nWe welcome contributions! Please follow these guidelines:\n\n1. Fork the repository\n2. Create a feature branch (`git checkout -b feature/amazing-feature`)\n3. Commit your changes (`git commit -m 'feat: add amazing feature'`)\n4. Push to the branch (`git push origin feature/amazing-feature`)\n5. Open a Pull Request\n\n### Commit Message Convention\n\n- `feat:` New feature\n- `fix:` Bug fix\n- `docs:` Documentation changes\n- `refactor:` Code refactoring\n- `test:` Adding tests\n- `chore:` Maintenance tasks\n\n---\n\n## 📄 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n---\n\n\u003ca name=\"简体中文\"\u003e\u003c/a\u003e\n## 🎉 项目介绍\n\n**DocConvert-AI** 是一款轻量级、零依赖的文档转换引擎，能够将复杂文档转换为结构化的AI友好格式。采用纯Python实现，仅使用标准库，无需外部依赖，同时支持多种输入和输出格式。\n\n### 💡 为什么选择 DocConvert-AI？\n\n- **🚀 零依赖**: 纯Python实现，无需pip安装地狱\n- **📦 轻量级**: 单文件设计，易于集成和部署\n- **🔄 多格式支持**: PDF、DOCX、XLSX、PPTX、HTML、Markdown、TXT\n- **📤 灵活导出**: Markdown、JSON、HTML、纯文本\n- **⚡ 快速处理**: 针对速度和效率优化\n- **🤖 AI就绪**: 结构化输出，完美适配大语言模型\n\n---\n\n## ✨ 核心特性\n\n| 特性 | 描述 | 状态 |\n|------|------|------|\n| 📄 **PDF支持** | 从PDF文档提取文本 | ✅ 已支持 |\n| 📝 **Word文档** | 解析带格式的DOCX文件 | ✅ 已支持 |\n| 📊 **Excel表格** | 将XLSX转换为结构化数据 | ✅ 已支持 |\n| 🎨 **PowerPoint** | 从PPTX幻灯片提取文本 | ✅ 已支持 |\n| 🌐 **HTML** | 从网页清理提取内容 | ✅ 已支持 |\n| 📑 **Markdown** | 原生Markdown处理 | ✅ 已支持 |\n| 📃 **纯文本** | 基础文本文件处理 | ✅ 已支持 |\n| 🔄 **批处理** | 转换整个目录 | ✅ 已支持 |\n| 📤 **多格式导出** | MD、JSON、HTML、TXT输出 | ✅ 已支持 |\n\n---\n\n## 🚀 快速开始\n\n### 环境要求\n\n- Python 3.8 或更高版本\n- 无需外部依赖！\n\n### 安装\n\n```bash\n# 克隆仓库\ngit clone https://github.com/gitstq/docconvert-ai.git\ncd docconvert-ai\n\n# 或通过pip安装（发布后）\npip install docconvert-ai\n```\n\n### 基本用法\n\n```bash\n# 将单个文件转换为Markdown（默认）\npython docconvert.py document.docx\n\n# 转换为特定格式\npython docconvert.py document.pdf -f json\npython docconvert.py document.html -f html\npython docconvert.py document.pptx -f text\n\n# 指定输出文件\npython docconvert.py input.docx -o output.md\n\n# 批量转换整个目录\npython docconvert.py ./input_dir -b -o ./output_dir -f markdown\n```\n\n---\n\n## 📖 详细使用指南\n\n### 命令行选项\n\n```\n用法: docconvert.py [-h] [-o OUTPUT] [-f {markdown,json,html,text}] [-b] [-v]\n                     input\n\nDocConvert-AI: 轻量级AI文档智能转换引擎\n\n位置参数:\n  input                 输入文件或目录路径\n\n可选参数:\n  -h, --help            显示帮助信息并退出\n  -o, --output          输出文件或目录路径\n  -f, --format          输出格式: markdown, json, html, text (默认: markdown)\n  -b, --batch           批量转换目录中的所有文件\n  -v, --version         显示程序版本号并退出\n```\n\n### Python API\n\n```python\nfrom docconvert import DocConvertAI, OutputFormat\n\n# 初始化转换器\nconverter = DocConvertAI()\n\n# 转换单个文件\noutput = converter.convert_file(\n    input_path=\"document.docx\",\n    output_path=\"output.md\",\n    output_format=OutputFormat.MARKDOWN\n)\n\n# 批量转换\nfiles = converter.batch_convert(\n    input_dir=\"./documents\",\n    output_dir=\"./converted\",\n    output_format=OutputFormat.JSON\n)\n```\n\n---\n\n## 💡 设计理念\n\n### 零依赖方案\n\n与其他需要几十个依赖项（PyPDF2、python-docx、pandas等）的文档转换工具不同，DocConvert-AI利用Python强大的标准库：\n\n- **`zipfile`**: 用于DOCX、XLSX、PPTX（都是ZIP压缩包）\n- **`xml.etree.ElementTree`**: 用于解析XML内容\n- **`re`**: 用于模式匹配和文本提取\n- **`argparse`**: 用于命令行界面\n- **`dataclasses`**: 用于结构化数据表示\n\n### 性能优化\n\n- 大文档的流式文件处理\n- 高效的正则表达式模式\n- 最小内存占用\n- 快速批处理\n\n---\n\n## 📦 打包与部署\n\n### 独立脚本\n\n整个工具包含在单个 `docconvert.py` 文件中：\n\n```bash\n# 下载并运行\ncurl -O https://raw.githubusercontent.com/gitstq/docconvert-ai/main/docconvert.py\npython docconvert.py your_document.docx\n```\n\n### Python包\n\n```bash\n# 从源码安装\npip install -e .\n\n# 作为模块使用\npython -m docconvert your_document.docx\n```\n\n---\n\n## 🤝 贡献指南\n\n我们欢迎贡献！请遵循以下准则：\n\n1. Fork 仓库\n2. 创建功能分支 (`git checkout -b feature/amazing-feature`)\n3. 提交更改 (`git commit -m 'feat: add amazing feature'`)\n4. 推送到分支 (`git push origin feature/amazing-feature`)\n5. 发起 Pull Request\n\n### 提交信息规范\n\n- `feat:` 新功能\n- `fix:` Bug修复\n- `docs:` 文档更改\n- `refactor:` 代码重构\n- `test:` 添加测试\n- `chore:` 维护任务\n\n---\n\n## 📄 开源协议\n\n本项目采用 MIT 协议 - 查看 [LICENSE](LICENSE) 文件了解详情。\n\n---\n\n\u003ca name=\"繁體中文\"\u003e\u003c/a\u003e\n## 🎉 專案介紹\n\n**DocConvert-AI** 是一款輕量級、零依賴的文件轉換引擎，能夠將複雜文件轉換為結構化的AI友好格式。採用純Python實現，僅使用標準庫，無需外部依賴，同時支援多種輸入和輸出格式。\n\n### 💡 為什麼選擇 DocConvert-AI？\n\n- **🚀 零依賴**: 純Python實現，無需pip安裝地獄\n- **📦 輕量級**: 單檔案設計，易於整合和部署\n- **🔄 多格式支援**: PDF、DOCX、XLSX、PPTX、HTML、Markdown、TXT\n- **📤 靈活匯出**: Markdown、JSON、HTML、純文字\n- **⚡ 快速處理**: 針對速度和效率最佳化\n- **🤖 AI就緒**: 結構化輸出，完美適配大語言模型\n\n---\n\n## ✨ 核心特性\n\n| 特性 | 描述 | 狀態 |\n|------|------|------|\n| 📄 **PDF支援** | 從PDF文件提取文字 | ✅ 已支援 |\n| 📝 **Word文件** | 解析帶格式的DOCX檔案 | ✅ 已支援 |\n| 📊 **Excel表格** | 將XLSX轉換為結構化資料 | ✅ 已支援 |\n| 🎨 **PowerPoint** | 從PPTX投影片提取文字 | ✅ 已支援 |\n| 🌐 **HTML** | 從網頁清理提取內容 | ✅ 已支援 |\n| 📑 **Markdown** | 原生Markdown處理 | ✅ 已支援 |\n| 📃 **純文字** | 基礎文字檔案處理 | ✅ 已支援 |\n| 🔄 **批次處理** | 轉換整個目錄 | ✅ 已支援 |\n| 📤 **多格式匯出** | MD、JSON、HTML、TXT輸出 | ✅ 已支援 |\n\n---\n\n## 🚀 快速開始\n\n### 環境要求\n\n- Python 3.8 或更高版本\n- 無需外部依賴！\n\n### 安裝\n\n```bash\n# 克隆倉庫\ngit clone https://github.com/gitstq/docconvert-ai.git\ncd docconvert-ai\n\n# 或透過pip安裝（釋出後）\npip install docconvert-ai\n```\n\n### 基本用法\n\n```bash\n# 將單個檔案轉換為Markdown（預設）\npython docconvert.py document.docx\n\n# 轉換為特定格式\npython docconvert.py document.pdf -f json\npython docconvert.py document.html -f html\npython docconvert.py document.pptx -f text\n\n# 指定輸出檔案\npython docconvert.py input.docx -o output.md\n\n# 批次轉換整個目錄\npython docconvert.py ./input_dir -b -o ./output_dir -f markdown\n```\n\n---\n\n## 📖 詳細使用指南\n\n### 命令列選項\n\n```\n用法: docconvert.py [-h] [-o OUTPUT] [-f {markdown,json,html,text}] [-b] [-v]\n                     input\n\nDocConvert-AI: 輕量級AI文件智慧轉換引擎\n\n位置參數:\n  input                 輸入檔案或目錄路徑\n\n可選參數:\n  -h, --help            顯示幫助資訊並退出\n  -o, --output          輸出檔案或目錄路徑\n  -f, --format          輸出格式: markdown, json, html, text (預設: markdown)\n  -b, --batch           批次轉換目錄中的所有檔案\n  -v, --version         顯示程式版本號並退出\n```\n\n### Python API\n\n```python\nfrom docconvert import DocConvertAI, OutputFormat\n\n# 初始化轉換器\nconverter = DocConvertAI()\n\n# 轉換單個檔案\noutput = converter.convert_file(\n    input_path=\"document.docx\",\n    output_path=\"output.md\",\n    output_format=OutputFormat.MARKDOWN\n)\n\n# 批次轉換\nfiles = converter.batch_convert(\n    input_dir=\"./documents\",\n    output_dir=\"./converted\",\n    output_format=OutputFormat.JSON\n)\n```\n\n---\n\n## 💡 設計理念\n\n### 零依賴方案\n\n與其他需要數十個依賴項（PyPDF2、python-docx、pandas等）的文件轉換工具不同，DocConvert-AI利用Python強大的標準庫：\n\n- **`zipfile`**: 用於DOCX、XLSX、PPTX（都是ZIP壓縮包）\n- **`xml.etree.ElementTree`**: 用於解析XML內容\n- **`re`**: 用於模式匹配和文字提取\n- **`argparse`**: 用於命令列介面\n- **`dataclasses`**: 用於結構化資料表示\n\n### 效能最佳化\n\n- 大文件的流式檔案處理\n- 高效的正規表示式模式\n- 最小記憶體佔用\n- 快速批次處理\n\n---\n\n## 📦 打包與部署\n\n### 獨立指令碼\n\n整個工具包含在單個 `docconvert.py` 檔案中：\n\n```bash\n# 下載並執行\ncurl -O https://raw.githubusercontent.com/gitstq/docconvert-ai/main/docconvert.py\npython docconvert.py your_document.docx\n```\n\n### Python包\n\n```bash\n# 從原始碼安裝\npip install -e .\n\n# 作為模組使用\npython -m docconvert your_document.docx\n```\n\n---\n\n## 🤝 貢獻指南\n\n我們歡迎貢獻！請遵循以下準則：\n\n1. Fork 倉庫\n2. 建立功能分支 (`git checkout -b feature/amazing-feature`)\n3. 提交更改 (`git commit -m 'feat: add amazing feature'`)\n4. 推送到分支 (`git push origin feature/amazing-feature`)\n5. 發起 Pull Request\n\n### 提交資訊規範\n\n- `feat:` 新功能\n- `fix:` Bug修復\n- `docs:` 文件更改\n- `refactor:` 程式碼重構\n- `test:` 新增測試\n- `chore:` 維護任務\n\n---\n\n## 📄 開源協議\n\n本專案採用 MIT 協議 - 檢視 [LICENSE](LICENSE) 檔案瞭解詳情。\n\n---\n\n\u003cdiv align=\"center\"\u003e\n\n**Made with ❤️ by GitStq**\n\n⭐ Star us on GitHub — it motivates us a lot!\n\n\u003c/div\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgitstq%2Fdocconvert-ai","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgitstq%2Fdocconvert-ai","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgitstq%2Fdocconvert-ai/lists"}