https://github.com/swhl/extractofficecontent
Extract content (text, tables, and images) from Office files (Word, Excel, PPT).
- Host: GitHub
- URL: https://github.com/swhl/extractofficecontent
- Owner: SWHL
- Created: 2023-05-29T01:25:25.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-07-16T14:18:33.000Z (about 2 years ago)
- Last Synced: 2025-04-07T15:51:46.632Z (6 months ago)
- Topics: docs, excel, extract-content, lxml, office, openpyxl, ppt, python-docx, python-pptx
- Language: Python
- Homepage:
- Size: 3.77 MB
- Stars: 7
- Watchers: 1
- Forks: 2
- Open Issues: 0
## Extract Office Content
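### Known Issues
- Extracting PPT:
  - Text boxes that contain formulas are lost when extracting content from a PPT.
  - Extracted tables lose some of their formatting.
  - Tables in a PPT are extracted to corresponding Excel files. Is there a better way?
- Extracting Word:
  - Table positions cannot be matched one-to-one with their positions in the original document.
### Use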
1. Install `extract_office_content`.
```bash
$ pip install extract_office_content
```
2. Run from the CLI.
- Extract the content of all Office files.
```bash
$ extract_office_content -h
usage: extract_office_content [-h] [-img_dir SAVE_IMG_DIR] file_path

positional arguments:
  file_path

optional arguments:
  -h, --help            show this help message and exit
  -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

$ extract_office_content tests/test_files
```
- Extract Word.
```bash
$ extract_word -h
usage: extract_word [-h] [-img_dir SAVE_IMG_DIR] word_path

positional arguments:
  word_path

optional arguments:
  -h, --help            show this help message and exit
  -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

$ extract_word tests/test_files/word_example.docx
```
- Extract PPT.
```bash
$ extract_ppt -h
usage: extract_ppt [-h] [-img_dir SAVE_IMG_DIR] ppt_path

positional arguments:
  ppt_path

optional arguments:
  -h, --help            show this help message and exit
  -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

$ extract_ppt tests/test_files/ppt_example.pptx
```
- Extract Excel.
```bash
$ extract_excel -h
usage: extract_excel [-h] [-f {markdown,html,latex,string}] [-o SAVE_IMG_DIR]
                     excel_path

positional arguments:
  excel_path

optional arguments:
  -h, --help            show this help message and exit
  -f {markdown,html,latex,string}, --output_format {markdown,html,latex,string}
  -o SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

$ extract_excel tests/test_files/excel_example.xlsx
```
3. Run from a Python script.
- Extract All.
```python
from pathlib import Path

from extract_office_content import ExtractOfficeContent

extracter = ExtractOfficeContent()

file_list = list(Path('tests/test_files').iterdir())
for file_path in file_list:
    res = extracter(file_path)
    print(res)
```
- Extract Word.
```python
from extract_office_content import ExtractWord

word_extract = ExtractWord()

word_path = 'tests/test_files/word_example.docx'
text = word_extract(word_path, "outputs/word")

# or bytes
with open(word_path, 'rb') as f:
    word_content = f.read()
text = word_extract(word_content, "outputs/word")
print(text)
```
- Extract PPT.
```python
from pathlib import Path

from extract_office_content import ExtractPPT

ppt_extracter = ExtractPPT()

ppt_path = 'tests/test_files/ppt_example.pptx'
save_dir = 'outputs'
save_img_dir = Path(save_dir) / Path(ppt_path).stem
res = ppt_extracter(ppt_path, save_img_dir=str(save_img_dir))

# or bytes
with open(ppt_path, 'rb') as f:
    ppt_content = f.read()
res = ppt_extracter(ppt_content, save_img_dir=str(save_img_dir))
print(res)
```
- Extract Excel.
```python
from extract_office_content import ExtractExcel

excel_extract = ExtractExcel()

excel_path = 'tests/test_files/excel_with_image.xlsx'
res = excel_extract(excel_path, out_format='markdown', save_img_dir='1')

# or bytes
with open(excel_path, 'rb') as f:
    excel_content = f.read()
res = excel_extract(excel_content, out_format='markdown', save_img_dir='1')
print(res)
```
### Changelog
- 2023-07-02 v0.0.6 update:
  - The Word extraction interface now returns a List, consistent with the other extractors.
- 2023-06-17 v0.0.4 update:
  - Support `file-like object` input.
### Reference
- [The right way to read merged Excel cells with pandas (split merged cells with openpyxl and fill in their content)](https://blog.51cto.com/u_11466419/6100833)
- [python-docx2txt](https://github.com/ankushshah89/python-docx2txt)
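
The first reference above concerns splitting merged Excel cells and filling each resulting cell with the merged value. The sketch below illustrates that idea on a plain list-of-lists grid using only the standard library, then renders the result as a Markdown table. It is not taken from this package's source: `fill_merged_cells`, `to_markdown`, the `(first_row, first_col, last_row, last_col)` range format, and the sample data are all hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Cell = Optional[str]
# Hypothetical range format: zero-based, inclusive (first_row, first_col, last_row, last_col).
MergedRange = Tuple[int, int, int, int]


def fill_merged_cells(
    grid: Sequence[Sequence[Cell]], merged: Sequence[MergedRange]
) -> List[List[Cell]]:
    """Return a copy of `grid` in which every cell of each merged range
    holds the value found in that range's top-left cell."""
    filled = [list(row) for row in grid]  # copy so the input is not mutated
    for first_row, first_col, last_row, last_col in merged:
        value = filled[first_row][first_col]
        for r in range(first_row, last_row + 1):
            for c in range(first_col, last_col + 1):
                filled[r][c] = value
    return filled


def to_markdown(grid: Sequence[Sequence[Cell]]) -> str:
    """Render a grid as a Markdown table, treating the first row as the header."""
    header, *body = grid

    def fmt(row: Sequence[Cell]) -> str:
        return "| " + " | ".join("" if c is None else str(c) for c in row) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)


# Made-up sample: "North" spans rows 1-2 of column 0, so only its
# top-left cell carries a value before filling.
grid = [
    ["Region", "Quarter", "Sales"],
    ["North", "Q1", "10"],
    [None, "Q2", "12"],
    ["South", "Q1", "7"],
]
filled = fill_merged_cells(grid, merged=[(1, 0, 2, 0)])
print(to_markdown(filled))
```

Copying the grid before filling leaves the caller's data unchanged, which makes it easy to compare the table before and after the merged ranges are expanded.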