https://github.com/mindsdb/aipdf
A tool to OCR PDFs using gen-AI models
- Host: GitHub
- URL: https://github.com/mindsdb/aipdf
- Owner: mindsdb
- License: mit
- Created: 2024-10-04T09:10:54.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-31T19:51:45.000Z (over 1 year ago)
- Last Synced: 2025-04-02T12:12:45.777Z (about 1 year ago)
- Topics: ai, gpt, groq, hacktoberfest, ocr, ollama, openai, pdf
- Language: Python
- Homepage:
- Size: 51.8 KB
- Stars: 37
- Watchers: 12
- Forks: 5
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# AIPDF: Minimalistic PDF to Markdown (and others), with GPT-like Multimodal Models
AIPDF is a stand-alone, minimalistic, yet powerful pure Python library that leverages multi-modal gen AI models (OpenAI, llama3 or compatible alternatives) to extract data from PDFs and convert it into various formats such as Markdown or JSON.
## Installation
```bash
pip install aipdf
```
On macOS, you will also need to install poppler:
```bash
brew install poppler
```
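pdf2image relies on poppler's command-line tools, so one quick way to confirm the install is to look for them on `PATH`. The following is a small sketch; the helper name `poppler_available` is hypothetical and not part of AIPDF.

```python
import shutil


def poppler_available() -> bool:
    """Return True if poppler's pdftoppm and pdfinfo binaries are on PATH."""
    return all(shutil.which(tool) is not None for tool in ("pdftoppm", "pdfinfo"))


if __name__ == "__main__":
    print("poppler found" if poppler_available() else "poppler not found on PATH")
```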
## Quick Start
```python
from aipdf import ocr

# Your OpenAI API key
api_key = 'your_openai_api_key'

# Open the PDF in binary mode; the context manager closes it when done
with open('somepdf.pdf', 'rb') as file:
    markdown_pages = ocr(file, api_key)
```
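The name `markdown_pages` suggests that `ocr` returns one Markdown string per page. Assuming that is the case, a minimal sketch for combining the pages into a single document might look like this (`join_pages` and its separator are hypothetical, not part of AIPDF):

```python
from typing import Iterable


def join_pages(pages: Iterable[str], separator: str = "\n\n---\n\n") -> str:
    """Join per-page Markdown strings into one document.

    Assumes each item is the Markdown text of one page, in page order.
    """
    # Strip stray whitespace around each page and skip empty pages
    cleaned = [page.strip() for page in pages if page and page.strip()]
    return separator.join(cleaned)


if __name__ == "__main__":
    # Stand-in data; in practice this would be the result of ocr(file, api_key)
    sample_pages = ["# Title\n\nFirst page.", "", "Second page.\n"]
    print(join_pages(sample_pages))
```

The result can then be written to a `.md` file with an ordinary `open(..., 'w')` call.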
## Ollama
You can use AIPDF with any Ollama multimodal model:
```python
ocr(pdf_file, api_key='ollama', model="llama3.2", base_url='http://localhost:11434/v1', prompt=...)
```
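Since the OpenAI and Ollama calls differ only in the `api_key`, `model`, and `base_url` arguments, one option is to build those keyword arguments in one place. The sketch below is only an illustration: `backend_kwargs` and the `AIPDF_*` environment variable names are hypothetical and not part of AIPDF; only the keyword names come from the call above.

```python
import os
from typing import Dict, Mapping, Optional


def backend_kwargs(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build keyword arguments for ocr() from environment-style settings."""
    env = os.environ if env is None else env
    if env.get("AIPDF_BACKEND", "openai") == "ollama":
        return {
            "api_key": "ollama",  # placeholder key, as in the call above
            "model": env.get("AIPDF_MODEL", "llama3.2"),
            "base_url": env.get("AIPDF_BASE_URL", "http://localhost:11434/v1"),
        }
    # Default: hosted OpenAI, leaving model and base_url to the library defaults
    return {"api_key": env.get("OPENAI_API_KEY", "")}
```

With this in place, a call could be written as `ocr(pdf_file, **backend_kwargs())`, switching backends without touching the calling code.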
## Any file system
`ocr` takes a file object rather than a path, so you can use it with any storage backend: local files, S3, URLs, and so on.
### From url
```python
import io
import requests
from aipdf import ocr

pdf_file = io.BytesIO(requests.get('https://arxiv.org/pdf/2410.02467').content)
# extract
pages = ocr(pdf_file, api_key, prompt="extract tables, return each table in json")
```
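When a prompt asks for JSON, models often wrap the answer in a Markdown code fence or add surrounding text. Here is a minimal sketch for recovering the JSON, assuming each page comes back as a string (`parse_json_page` is a hypothetical helper, not part of AIPDF):

```python
import json
import re
from typing import Any, Optional

# Three backticks, built programmatically so this example nests cleanly in Markdown
_TICKS = "`" * 3
# Matches a fenced block (optionally tagged "json") and captures its body
_FENCE = re.compile(_TICKS + r"(?:json)?\s*(.*?)" + _TICKS, re.DOTALL)


def parse_json_page(page_text: str) -> Optional[Any]:
    """Return the parsed JSON from one page of model output, or None."""
    match = _FENCE.search(page_text)
    # Fall back to the whole text when the model returned bare JSON
    candidate = match.group(1) if match else page_text
    try:
        return json.loads(candidate.strip())
    except json.JSONDecodeError:
        return None
```

Returning `None` instead of raising lets the caller decide whether to retry a page or skip it.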
### From S3
```python
import io
import boto3
from botocore.config import Config
from aipdf import ocr

# access_token, bucket_name, object_key and api_key are defined elsewhere
s3 = boto3.client('s3', config=Config(signature_version='s3v4'),
                  aws_access_key_id=access_token,
                  aws_secret_access_key='',  # Not needed for token-based auth
                  aws_session_token=access_token)
pdf_file = io.BytesIO(s3.get_object(Bucket=bucket_name, Key=object_key)['Body'].read())
# extract
pages = ocr(pdf_file, api_key, prompt="extract charts data, turn it into tables that represent the variables in the chart")
```
## Why AIPDF?
1. **Simplicity**: AIPDF provides a single, straightforward function and requires minimal setup, dependencies, and configuration.
2. **Flexibility**: Extract data into Markdown, JSON, HTML, YAML, or any other file format and schema.
3. **Power of AI**: Leverages state-of-the-art multimodal models (GPT, Llama, and others).
4. **Customizable**: Tailor the extraction process to your specific needs with custom prompts.
5. **Efficient**: Uses parallel processing for faster extraction of multi-page PDFs.
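
To illustrate the parallel-processing point, here is a minimal sketch of per-page fan-out with a thread pool. It is not AIPDF's actual implementation; `process_pages` and `fake_model_call` are hypothetical stand-ins, with the latter taking the place of a real request to a multimodal model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def process_pages(
    pages: Sequence[bytes],
    handler: Callable[[bytes], str],
    max_workers: int = 4,
) -> List[str]:
    """Run handler on every page concurrently, preserving page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, even if pages finish out of order
        return list(pool.map(handler, pages))


def fake_model_call(page_image: bytes) -> str:
    """Stand-in for a network call that turns a page image into Markdown."""
    return f"page with {len(page_image)} bytes"


if __name__ == "__main__":
    print(process_pages([b"abc", b"de"], fake_model_call))
```

Threads suit this kind of workload because each page spends most of its time waiting on a network response rather than using the CPU.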
## Requirements
- Python 3.7+
We keep dependencies to a minimum, with only three required libraries:
- openai (to talk to completion endpoints)
- pdf2image (for PDF to image conversion)
- Pillow (PIL)
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Support
If you encounter any problems or have any questions, please open an issue on the GitHub repository.
---
AIPDF makes PDF data extraction simple, flexible, and powerful. Try it out and simplify your PDF processing workflow today!