https://github.com/QuivrHQ/MegaParse

File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.
https://github.com/QuivrHQ/MegaParse

docx llm parser pdf powerpoint

Last synced: about 2 months ago
JSON representation

File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.

Host: GitHub
URL: https://github.com/QuivrHQ/MegaParse
Owner: QuivrHQ
License: apache-2.0
Created: 2024-05-29T08:40:29.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-12-26T15:29:23.000Z (10 months ago)
Last Synced: 2024-12-26T16:26:24.709Z (10 months ago)
Topics: docx, llm, parser, pdf, powerpoint
Language: Python
Homepage: https://megaparse.com
Size: 28.9 MB
Stars: 4,679
Watchers: 23
Forks: 233
Open Issues: 26
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

Awesome Lists containing this project

Awesome-LLM-RAG-Application - MegaParse
awesome-LLM-resources - MegaParse
AiTreasureBox - QuivrHQ/MegaParse - 10-08_7193_1](https://img.shields.io/github/stars/QuivrHQ/MegaParse.svg)|File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.| (Repos)

README

# MegaParse - Your Parser for every type of documents

MegaParse is a powerful and versatile parser that can handle various types of documents with ease. Whether you're dealing with text, PDFs, Powerpoint presentations, Word documents MegaParse has got you covered. Focus on having no information loss during parsing.

## Key Features 🎯

- **Versatile Parser**: MegaParse is a powerful and versatile parser that can handle various types of documents with ease.
- **No Information Loss**: Focus on having no information loss during parsing.
- **Fast and Efficient**: Designed with speed and efficiency at its core.
- **Wide File Compatibility**: Supports Text, PDF, Powerpoint presentations, Excel, CSV, Word documents.
- **Open Source**: Freedom is beautiful, and so is MegaParse. Open source and free to use.

## Support

- Files: ✅ PDF ✅ Powerpoint ✅ Word
- Content: ✅ Tables ✅ TOC ✅ Headers ✅ Footers ✅ Images

### Example

https://github.com/QuivrHQ/MegaParse/assets/19614572/1b4cdb73-8dc2-44ef-b8b4-a7509bc8d4f3

## Installation

required python version >= 3.11

```bash
pip install megaparse
```

## Usage

1. Add your OpenAI or Anthropic API key to the .env file

2. Install poppler on your computer (images and PDFs)

3. Install tesseract on your computer (images and PDFs)

4. If you have a mac, you also need to install libmagic ```brew install libmagic```

Use MegaParse as it is :
```python
from megaparse import MegaParse
from langchain_openai import ChatOpenAI

megaparse = MegaParse()
response = megaparse.load("./test.pdf")
print(response)
```

### Use MegaParse Vision

```python
from megaparse.parser.megaparse_vision import MegaParseVision

model = ChatOpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY")) # type: ignore
parser = MegaParseVision(model=model)
response = parser.convert("./test.pdf")
print(response)

```
**Note**: The model supported by MegaParse Vision are the multimodal ones such as claude 3.5, claude 4, gpt-4o and gpt-4.

## Use as an API
There is a MakeFile for you, simply use :
```make dev```
at the root of the project and you are good to go.

See localhost:8000/docs for more info on the different endpoints !

## BenchMark

| Parser | similarity_ratio |
| ----------------------------- | ---------------- |
| megaparse_vision | 0.87 |
| unstructured_with_check_table | 0.77 |
| unstructured | 0.59 |
| llama_parser | 0.33 |

_Higher the better_

Note: Want to evaluate and compare your Megaparse module with ours ? Please add your config in ```evaluations/script.py``` and then run ```python evaluations/script.py```. If it is better, do a PR, I mean, let's go higher together .

## In Construction 🚧
- Improve table checker
- Create Checkers to add **modular postprocessing** ⚙️
- Add Structured output, **let's get computer talking** 🤖

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=QuivrHQ/MegaParse&type=Date)](https://star-history.com/#QuivrHQ/MegaParse&Date)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/QuivrHQ/MegaParse

Awesome Lists containing this project

README