{"id":30612661,"url":"https://github.com/lianjiatech/bella-domify","last_synced_at":"2025-10-04T19:36:24.024Z","repository":{"id":310982105,"uuid":"1039825343","full_name":"LianjiaTech/bella-domify","owner":"LianjiaTech","description":"Document Parser: supports PDF, TXT, DOC, DOCX, Markdown, and other file formats; efficiently extracts and parses content and produces a standard document tree structure. Includes a built-in PDF Parser, Text Parser, and Word Parser to support RAG, knowledge bases, full-text search, and other intelligent applications.","archived":false,"fork":false,"pushed_at":"2025-08-28T07:30:01.000Z","size":33626,"stargazers_count":28,"open_issues_count":0,"forks_count":4,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-28T14:30:51.592Z","etag":null,"topics":["document-parser","parser","pdf-parser"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LianjiaTech.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-08-18T03:27:00.000Z","updated_at":"2025-08-28T07:30:05.000Z","dependencies_parsed_at":"2025-08-21T12:41:25.977Z","dependency_job_id":null,"html_url":"https://github.com/LianjiaTech/bella-domify","commit_stats":null,"previous_names":["lianjiatech/bella-domify"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/LianjiaTech/bella-domify","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LianjiaTech%2Fbella-domify","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LianjiaTech%2Fbella-domify/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LianjiaTech%2Fbella-domify/releases","manifests_ur
l":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LianjiaTech%2Fbella-domify/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LianjiaTech","download_url":"https://codeload.github.com/LianjiaTech/bella-domify/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LianjiaTech%2Fbella-domify/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272808936,"owners_count":24996603,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-30T02:00:09.474Z","response_time":77,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["document-parser","parser","pdf-parser"],"created_at":"2025-08-30T05:34:50.308Z","updated_at":"2025-10-04T19:36:24.018Z","avatar_url":"https://github.com/LianjiaTech.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# document_parser\n\n![python-version](https://img.shields.io/badge/python-\u003e=3.6-green.svg)\n[中文](README.md) | [English Version](README_EN.md)\n\nAn open-source Python library for document parsing from Beike. It can be imported as a Python package or run as a service, and it supports parsing and converting a variety of document formats.\n\n## Features\n\n### Supported file formats\n- PDF\n- Word documents (DOCX/DOC)\n- Excel spreadsheets (XLSX/XLS)\n- CSV files\n- PowerPoint presentations (PPTX)\n- Text files\n- Image files\n\n### Parsing\n- **Layout Parse**: extracts the document's basic layout structure, including text blocks and image blocks\n- **DomTree Parse**: builds a detailed document object model for further processing and analysis\n- **Markdown conversion**: converts the parse result to Markdown\n\n### Advanced features\n- 
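**Image processing**: built-in extraction of image information using the OCR capability of large models\n- **Table processing**: parses table structure and content\n- **Header and footer detection**: automatically identifies and filters out headers and footers\n- **Multi-process parsing**: parses in parallel across multiple processes to improve throughput\n- **Evaluation and annotation**: includes an evaluation module that can annotate PDFs with parsing details\n![pdf_marked](./assets/pdf_marked.png)\n\n## System requirements\n- Python \u003e= 3.9\n- Other dependencies (see requirements.txt)\n\nRunning as a service does not depend on Beike's open-source OpenAI ecosystem, but the document parsing pipeline does depend on Beike's open-source components (uploaded files are the data source). File data flows as follows:\n\n![pipline](./assets/pipline.png)\n\nBella-Rag: https://github.com/LianjiaTech/bella-rag\n\nBella-Knowledge: https://github.com/LianjiaTech/bella-knowledge\n\n\n\n## Environment configuration\n\nSet the following environment variables:\n- OPENAI_API_KEY: the key used to call the OpenAI API\n- OPENAI_BASE_URL: the base URL of the OpenAI API\n- OPENAPI_CONSOLE_KEY: the default global key used when calling OpenAI console endpoints to fetch metadata; currently it is mainly used to fetch the list of vision models. You can instead implement your own `VisionModelProvider` that returns the list of vision-capable models\n\n## Quick start\n\n### Use as a library\n\n1. Install dependencies\n\n   ```shell\n   pip install document_parser\n   ```\n\n2. Configure\n\n   ```python\n   parser_config = ParserConfig(image_provider=ImageStorageProvider(),\n                                ocr_model_name=\"gpt-4o\",\n                                # Whether to enable OCR\n                                # If disabled, vision_model_provider or vision_model_list need not be implemented or configured\n                                ocr_enable=True,\n                                vision_model_provider=OpenAIVisionModelProvider())\n   parser_context.register_all_config(parser_config)\n   parser_context.register_user(\"userId\") # User ID sent with model requests; leaving it unset affects OCR\n   ```\n\n3. Run the parser\n   ```python\n   converter = Converter(stream=stream) # Pass the file in as a stream\n   dom_tree = converter.dom_tree_parse(\n       remove_watermark=True,   # Whether to remove watermarks\n       parse_stream_table=False # Whether to parse stream tables\n   )\n   ```\n\n### Run as a service\n\n1. Download the code from Git\n\n2. 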
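Start the server\n\n   ```bash\n   uvicorn server.app:app --port 8080 --host 0.0.0.0\n   ```\n\n*You can also package it as a Docker image to suit your needs.\n\n## Advantages\nThe evaluation data in the figure below indicates that Beike's in-house parser performs strongly, with higher accuracy (based on Beike's limited evaluation set).\n\n![image2](./assets/evaluation.png)\n\n\n\n## Acknowledgements\n\nThis project is a secondary development based on [pdf2docx](https://github.com/dothinking/pdf2docx). We thank the original author and team for their outstanding contribution. pdf2docx uses PyMuPDF to extract raw data such as text, images, and vector graphics, then applies rules to parse the layout and styles of sections, paragraphs, tables, images, and text; see its GitHub page for details. It provides an important technical foundation for our document parsing.\n\n## Further reading\n\n[PDF Parsing: A Journey of Reconstruction from Visual Layout to Structure](./assets/share.pdf)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flianjiatech%2Fbella-domify","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flianjiatech%2Fbella-domify","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flianjiatech%2Fbella-domify/lists"}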