Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/QuivrHQ/MegaParse

File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.

docx llm parser pdf powerpoint

Last synced: 29 days ago


Host: GitHub
URL: https://github.com/QuivrHQ/MegaParse
Owner: QuivrHQ
License: apache-2.0
Created: 2024-05-29T08:40:29.000Z (about 1 month ago)
Default Branch: main
Last Pushed: 2024-06-11T09:58:20.000Z (about 1 month ago)
Last Synced: 2024-06-11T11:22:40.894Z (about 1 month ago)
Topics: docx, llm, parser, pdf, powerpoint
Language: Python
Homepage: https://pypi.org/project/megaparse/
Size: 2.43 MB
Stars: 204
Watchers: 2
Forks: 11
Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

Lists

awesome-stars - QuivrHQ/MegaParse - File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs. (Python)

README

# MegaParse - Your Mega Parser for Every Type of Document

MegaParse is a powerful and versatile parser that can handle various types of documents with ease. Whether you're dealing with text, PDFs, PowerPoint presentations, or Word documents, MegaParse has you covered. Its focus is on parsing with no information loss.

## Key Features 🎯

- **Versatile Parser**: MegaParse is a powerful and versatile parser that can handle various types of documents with ease.

- **No Information Loss**: The parser is built to preserve all of a document's information during parsing.

- **Fast and Efficient**: Designed with speed and efficiency at its core.

- **Wide File Compatibility**: Supports text, PDF, PowerPoint presentations, Excel, CSV, and Word documents.

- **Open Source**: Freedom is beautiful, and so is MegaParse. Open source and free to use.

## Support

- Files: ✅ PDF ✅ PowerPoint ✅ Word

- Content: ✅ Tables ✅ TOC ✅ Headers ✅ Footers ✅ Images

### Example

https://github.com/QuivrHQ/MegaParse/assets/19614572/1b4cdb73-8dc2-44ef-b8b4-a7509bc8d4f3

## Installation

```bash
pip install megaparse
```

## Usage

1. Add your OpenAI API key to the `.env` file.

2. Install poppler on your computer (required for images and PDFs).

3. Install tesseract on your computer (required for images and PDFs).

```python
from megaparse.Converter import MegaParse

# Parse the document at the given path
megaparse = MegaParse(file_path="./test.pdf")
content = megaparse.convert()
print(content)

# Save the parsed content to a Markdown file
megaparse.save_md(content, "./test.md")
```
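The three setup steps above can be checked before parsing. The following is a minimal sketch, not part of MegaParse: it assumes that poppler exposes the `pdftoppm` command, that Tesseract exposes the `tesseract` command, and that the OpenAI key is provided under the conventional name `OPENAI_API_KEY`. The function `missing_prerequisites` is a hypothetical helper introduced here for illustration.

```python
import os
import shutil

# Assumed names: `pdftoppm` is one of the command-line tools shipped with
# poppler, `tesseract` is the Tesseract OCR binary, and OPENAI_API_KEY is the
# variable the OpenAI client conventionally reads. MegaParse may differ.
REQUIRED_BINARIES = ("pdftoppm", "tesseract")
REQUIRED_ENV_VARS = ("OPENAI_API_KEY",)


def missing_prerequisites(env=None, which=shutil.which):
    """Return human-readable descriptions of the prerequisites that are missing."""
    env = os.environ if env is None else env
    missing = []
    for binary in REQUIRED_BINARIES:
        # shutil.which returns None when the executable is not on PATH
        if which(binary) is None:
            missing.append(f"binary not on PATH: {binary}")
    for name in REQUIRED_ENV_VARS:
        if not env.get(name):
            missing.append(f"environment variable not set: {name}")
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    for problem in problems:
        print(problem)
    if not problems:
        print("All prerequisites found.")
```

The check reads the process environment, so a key that lives only in the `.env` file is reported as missing unless something has already loaded that file.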

### (Optional) Use LlamaParse for Improved Results

1. Create an account on [Llama Cloud](https://cloud.llamaindex.ai/) and get your API key.

2. Call MegaParse with the `llama_parse_api_key` parameter:

```python
from megaparse.Converter import MegaParse

# Pass the Llama Cloud API key to enable LlamaParse
megaparse = MegaParse(file_path="./test.pdf", llama_parse_api_key="llx-your_api_key")
content = megaparse.convert()
print(content)
```
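To keep the key out of source code, it can be read from the environment instead of being written inline. This is a sketch under assumptions: the variable name `LLAMA_CLOUD_API_KEY` is the one LlamaParse conventionally uses but is not confirmed by this README, and `megaparse_kwargs` is a hypothetical helper. Only the `file_path` and `llama_parse_api_key` parameter names come from the examples above.

```python
import os

# Assumed variable name; any name works as long as it matches what you export.
LLAMA_KEY_ENV_VAR = "LLAMA_CLOUD_API_KEY"


def megaparse_kwargs(file_path, env=None):
    """Build keyword arguments for MegaParse, adding the LlamaParse key only if set."""
    env = os.environ if env is None else env
    kwargs = {"file_path": file_path}
    api_key = env.get(LLAMA_KEY_ENV_VAR)
    if api_key:
        kwargs["llama_parse_api_key"] = api_key
    return kwargs


# Usage (requires `pip install megaparse`):
#   from megaparse.Converter import MegaParse
#   megaparse = MegaParse(**megaparse_kwargs("./test.pdf"))
#   content = megaparse.convert()
```

When the variable is unset, the helper returns only `file_path`, which matches the plain call shown in the Usage section.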

## Benchmark

| Parser | Diff |
|---|---|
| MegaParse with LlamaParse and GPTCleaner | 84 |
| **MegaParse** | 100 |
| MegaParse with LlamaParse | 104 |
| LlamaParse | 108 |

*Lower is better*
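To make the gaps easier to read, the scores can be expressed relative to plain MegaParse. The sketch below only does arithmetic on the four numbers in the table. It does not reproduce how the Diff metric itself is computed, which this README does not describe, and `relative_change` is a helper name introduced here.

```python
# Diff scores copied from the benchmark table (lower is better).
diff_scores = {
    "MegaParse with LlamaParse and GPTCleaner": 84,
    "MegaParse": 100,
    "MegaParse with LlamaParse": 104,
    "LlamaParse": 108,
}

baseline = diff_scores["MegaParse"]


def relative_change(score, baseline):
    """Percentage change of a score against the baseline, rounded to one decimal."""
    return round(100 * (score - baseline) / baseline, 1)


# Print the parsers from best (lowest Diff) to worst
for name, score in sorted(diff_scores.items(), key=lambda item: item[1]):
    print(f"{name}: {score} ({relative_change(score, baseline):+.1f}% vs. MegaParse)")
```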

## Next Steps

- [ ] Improve Table Parsing

- [ ] Improve Image Parsing and description

- [ ] Add TOC for Docx

- [ ] Add Hyperlinks for Docx

- [ ] Order Headers for Docx to Markdown

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=QuivrHQ/MegaParse&type=Date)](https://star-history.com/#QuivrHQ/MegaParse&Date)