https://github.com/opendatalab/magic-doc
- Host: GitHub
- URL: https://github.com/opendatalab/magic-doc
- Owner: opendatalab
- License: apache-2.0
- Created: 2024-06-13T11:18:37.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-07-26T12:50:22.000Z (4 months ago)
- Last Synced: 2024-11-04T04:36:51.096Z (12 days ago)
- Language: Python
- Size: 4.04 MB
- Stars: 346
- Watchers: 6
- Forks: 26
- Open Issues: 17
Metadata Files:
- Readme: README.md
- License: LICENSE
[![license](https://img.shields.io/github/license/InternLM/magic-doc.svg)](https://github.com/InternLM/magic-doc/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/InternLM/magic-doc)](https://github.com/InternLM/magic-doc/issues)
[![open issues](https://img.shields.io/github/issues-raw/InternLM/magic-doc)](https://github.com/InternLM/magic-doc/issues)
Join us on Discord and WeChat

[English](README.md) | [简体中文](README_zh-CN.md)
### Install
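Prerequisites: Python 3.10

Install dependencies

**linux/osx**
```bash
# Run the one command that matches your package manager
apt-get install libreoffice   # or: yum install libreoffice / brew install libreoffice
```

**windows**
```text
Install LibreOffice
Append "install_dir\LibreOffice\program" to the PATH environment variable
```

Install Magic-Doc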
```bash
pip install "fairy-doc[cpu]"  # CPU version
# or
pip install "fairy-doc[gpu]"  # GPU version
```

## Introduction
Magic-Doc is a lightweight open-source tool that converts multiple file types (PPT/PPTX/DOC/DOCX/PDF) to Markdown. It supports both local files and files stored in S3.
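
The supported inputs described above can be sketched as a small pre-check that runs before any conversion. This is only an illustration under stated assumptions: `describe_input` and `SUPPORTED_EXTENSIONS` are hypothetical names, not part of `magic_doc`, and the check looks only at the file extension and the `s3://` prefix, not at the file contents.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# File types listed in the introduction (assumed to be matched by extension).
SUPPORTED_EXTENSIONS = {".ppt", ".pptx", ".doc", ".docx", ".pdf"}


def describe_input(path: str) -> tuple[str, bool]:
    """Return (location, supported) for a candidate input path.

    location is "s3" for s3:// URIs and "local" otherwise; supported is True
    when the extension is one of the types listed in the introduction.
    Hypothetical helper for illustration; it is not part of magic_doc.
    """
    parsed = urlparse(path)
    location = "s3" if parsed.scheme == "s3" else "local"
    # For s3://bucket/key URIs the object key is in parsed.path.
    name = parsed.path if location == "s3" else path
    suffix = PurePosixPath(name).suffix.lower()
    return location, suffix in SUPPORTED_EXTENSIONS


candidates = ["some_doc.pptx", "s3://some_bucket/some_doc.pptx", "notes.txt"]
# Keep only the paths whose extension is in the supported set.
convertible = [p for p in candidates if describe_input(p)[1]]
```

Filtering like this keeps unsupported files away from the converter; whether Magic-Doc performs a similar check internally is not stated in this README.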
## Example
```python
# for local file
from magic_doc.docconv import DocConverter, S3Config
converter = DocConverter(s3_config=None)
markdown_content, time_cost = converter.convert("some_doc.pptx", conv_timeout=300)
```

```python
# for remote file located in aws s3
from magic_doc.docconv import DocConverter, S3Config

s3_config = S3Config(ak='${ak}', sk='${sk}', endpoint='${endpoint}')
converter = DocConverter(s3_config=s3_config)
markdown_content, time_cost = converter.convert("s3://some_bucket/some_doc.pptx", conv_timeout=300)
```

## Performance
Environment: AMD EPYC 7742 64-Core Processor, NVIDIA A100, CentOS 7

| File Type     | Speed (pages/s) |
| ------------- | --------------- |
| PDF (digital) | 347             |
| PDF (ocr)     | 2.7             |
| PPT           | 20              |
| PPTX          | 149             |
| DOC           | 600             |
| DOCX          | 1482            |

## All Thanks To Our Contributors:
![image](https://github.com/InternLM/magic-doc/blob/main/assets/contributor.png)
## Acknowledgments
- [Antiword](https://github.com/rsdoiel/antiword)
- [LibreOffice](https://www.libreoffice.org/)
- [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/)
- [paddleocr](https://github.com/PaddlePaddle/PaddleOCR)

## Citation
```bibtex
@misc{2024magic-doc,
title={Magic-Doc: A Toolkit that Converts Multiple File Types to Markdown},
author={Magic-Doc Contributors},
howpublished = {\url{https://github.com/InternLM/magic-doc}},
year={2024}
}
```

## License
This project is released under the [Apache 2.0 license](LICENSE).