{"id":16881423,"url":"https://github.com/swhl/extractofficecontent","last_synced_at":"2025-12-13T17:13:58.052Z","repository":{"id":170613946,"uuid":"646628886","full_name":"SWHL/ExtractOfficeContent","owner":"SWHL","description":"Extract content (include text, table, image) from the office files (Word, Excel, PPT).","archived":false,"fork":false,"pushed_at":"2023-07-16T14:18:33.000Z","size":3948,"stargazers_count":7,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-07T15:51:46.632Z","etag":null,"topics":["docs","excel","extract-content","lxml","office","openpyxl","ppt","python-docx","python-pptx"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SWHL.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-29T01:25:25.000Z","updated_at":"2025-03-31T18:26:55.000Z","dependencies_parsed_at":null,"dependency_job_id":"3bea8b37-22da-4a4e-91c8-71e00c0287dd","html_url":"https://github.com/SWHL/ExtractOfficeContent","commit_stats":null,"previous_names":["swhl/extractofficetext","swhl/extractofficecontent"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SWHL%2FExtractOfficeContent","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SWHL%2FExtractOfficeContent/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SWHL%2FExtractOfficeContent/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/repositories/SWHL%2FExtractOfficeContent/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SWHL","download_url":"https://codeload.github.com/SWHL/ExtractOfficeContent/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248397254,"owners_count":21097079,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docs","excel","extract-content","lxml","office","openpyxl","ppt","python-docx","python-pptx"],"created_at":"2024-10-13T16:02:18.634Z","updated_at":"2025-12-13T17:13:52.684Z","avatar_url":"https://github.com/SWHL.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Extract Office Content\n\u003cp\u003e\n    \u003ca href=\"https://swhl-extractofficecontentdemo.hf.space\" target=\"_blank\"\u003e\u003cimg src=\"https://img.shields.io/badge/%F0%9F%A4%97-Online Demo-blue\"\u003e\u003c/a\u003e\n    \u003ca href=\"\"\u003e\u003cimg src=\"https://img.shields.io/badge/Python-\u003e=3.6,\u003c3.12-aff.svg\"\u003e\u003c/a\u003e\n    \u003ca href=\"\"\u003e\u003cimg src=\"https://img.shields.io/badge/OS-Linux%2C%20Win%2C%20Mac-pink.svg\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://pypi.org/project/extract_office_content/\"\u003e\u003cimg alt=\"PyPI\" src=\"https://img.shields.io/pypi/v/extract_office_content\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://pepy.tech/project/extract_office_content\"\u003e\u003cimg 
src=\"https://static.pepy.tech/personalized-badge/extract_office_content?period=total\u0026units=abbreviation\u0026left_color=grey\u0026right_color=blue\u0026left_text=Downloads\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://semver.org/\"\u003e\u003cimg alt=\"SemVer2.0\" src=\"https://img.shields.io/badge/SemVer-2.0-brightgreen\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/psf/black\"\u003e\u003cimg src=\"https://img.shields.io/badge/code%20style-black-000000.svg\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n### Known issues\n- Extracting PPT:\n  - Text boxes that contain formulas are lost when extracting content from a PPT file.\n  - Extracted tables lose part of their formatting.\n  - Tables in a PPT are extracted to corresponding Excel files. Is there a better way?\n- Extracting Word:\n  - Table positions do not map one-to-one to their positions in the original document.\n\n### Use\n1. Install `extract_office_content`\n   ```bash\n   $ pip install extract_office_content\n   ```\n2. Run from the CLI.\n    - Extract the content of all office files.\n        ```bash\n        $ extract_office_content -h\n        usage: extract_office_content [-h] [-img_dir SAVE_IMG_DIR] file_path\n\n        positional arguments:\n        file_path\n\n        optional arguments:\n        -h, --help            show this help message and exit\n        -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR\n\n        $ extract_office_content tests/test_files\n        ```\n    - Extract Word.\n        ```bash\n        $ extract_word -h\n        usage: extract_word [-h] [-img_dir SAVE_IMG_DIR] word_path\n\n        positional arguments:\n        word_path\n\n        optional arguments:\n        -h, --help            show this help message and exit\n        -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR\n\n        $ extract_word tests/test_files/word_example.docx\n        ```\n    - Extract PPT.\n        ```bash\n        $ extract_ppt -h\n        usage: extract_ppt [-h] [-img_dir SAVE_IMG_DIR] ppt_path\n\n        positional arguments:\n        ppt_path\n\n        optional arguments:\n        -h, --help            show this help message and exit\n        -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR\n\n        $ 
extract_ppt tests/test_files/ppt_example.pptx\n        ```\n    - Extract Excel.\n        ```bash\n        $ extract_excel -h\n        usage: extract_excel [-h] [-f {markdown,html,latex,string}] [-o SAVE_IMG_DIR]\n                            excel_path\n\n        positional arguments:\n        excel_path\n\n        optional arguments:\n        -h, --help            show this help message and exit\n        -f {markdown,html,latex,string}, --output_format {markdown,html,latex,string}\n        -o SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR\n\n        $ extract_excel tests/test_files/excel_example.xlsx\n        ```\n3. Run from a Python script.\n    - Extract all office files.\n        ```python\n        from pathlib import Path\n\n        from extract_office_content import ExtractOfficeContent\n\n        extracter = ExtractOfficeContent()\n        file_list = list(Path('tests/test_files').iterdir())\n\n        for file_path in file_list:\n            res = extracter(file_path)\n            print(res)\n        ```\n    - Extract Word.\n        ```python\n        from extract_office_content import ExtractWord\n\n        word_extract = ExtractWord()\n        word_path = 'tests/test_files/word_example.docx'\n        text = word_extract(word_path, \"outputs/word\")\n\n        # or pass the file content as bytes\n        with open(word_path, 'rb') as f:\n            word_content = f.read()\n        text = word_extract(word_content, \"outputs/word\")\n        print(text)\n        ```\n    - Extract PPT.\n        ```python\n        from pathlib import Path\n\n        from extract_office_content import ExtractPPT\n\n        ppt_extracter = ExtractPPT()\n\n        ppt_path = 'tests/test_files/ppt_example.pptx'\n        save_dir = 'outputs'\n        save_img_dir = Path(save_dir) / Path(ppt_path).stem\n        res = ppt_extracter(ppt_path, save_img_dir=str(save_img_dir))\n\n        # or pass the file content as bytes\n        with open(ppt_path, 'rb') as f:\n            ppt_content = f.read()\n        res = ppt_extracter(ppt_content, 
save_img_dir=str(save_img_dir))\n        print(res)\n        ```\n    - Extract Excel.\n        ```python\n        from extract_office_content import ExtractExcel\n\n        excel_extract = ExtractExcel()\n\n        excel_path = 'tests/test_files/excel_with_image.xlsx'\n        res = excel_extract(excel_path, out_format='markdown', save_img_dir='1')\n\n        # or pass the file content as bytes\n        with open(excel_path, 'rb') as f:\n            excel_content = f.read()\n        res = excel_extract(excel_content, out_format='markdown', save_img_dir='1')\n        print(res)\n        ```\n\n\n### Changelog\n- 2023-07-02 v0.0.6 update:\n  - The Word extraction interface now returns a List, consistent with the other extractors.\n- 2023-06-17 v0.0.4 update:\n  - Support `file-like object` input.\n\n### Reference\n- [Pandas读取excel合并单元格的正确姿势（openpyxl合并单元格拆分并填充内容）](https://blog.51cto.com/u_11466419/6100833) (in Chinese: the right way to read merged Excel cells with Pandas, splitting merged cells with openpyxl and filling in their content)\n- [python-docx2txt](https://github.com/ankushshah89/python-docx2txt)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fswhl%2Fextractofficecontent","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fswhl%2Fextractofficecontent","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fswhl%2Fextractofficecontent/lists"}