https://github.com/u-c4n/pdftollm


Host: GitHub
URL: https://github.com/u-c4n/pdftollm
Owner: U-C4N
Created: 2025-01-15T13:55:59.000Z (9 months ago)
Default Branch: main
Last Pushed: 2025-02-09T07:56:23.000Z (8 months ago)
Last Synced: 2025-02-09T08:25:18.043Z (8 months ago)
Language: Python
Size: 16.6 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md

# PDF to Markdown Converter with OCR & Image Extraction

![PDF to Markdown](screen1.jpg)
![PDF to Markdown](screen2.jpg)

A web-based PDF to Markdown converter that combines OCR with image extraction. It converts PDF files to Markdown while preserving embedded images and extracting text from scanned documents. It is aimed at researchers, developers, and content creators who need to turn PDFs containing images and tables into clean, structured Markdown.

## ✨ Features

- 📄 Smart Document Processing
  - Drag-and-drop file upload
  - Fast conversion process
  - Automatic table format conversion
  - Precise header and list detection

- 🔍 Advanced OCR & Image Handling
  - OCR support for scanned documents
  - Automatic image extraction
  - Image preview in grid layout
  - Bulk image download as ZIP

- 📊 Performance & Analytics
  - Smart token calculation
  - Real-time processing feedback
  - Conversion progress tracking

- 💫 User Experience
  - One-click copy functionality
  - Mobile-responsive interface
  - Clean, modern UI design
  - Secure file handling
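
The README does not show how the bulk image download is built. As a minimal sketch, assuming the extracted images are already available as raw bytes keyed by filename, a ZIP archive can be assembled in memory with the standard library. The function name `bundle_images` and the input shape are hypothetical, not taken from the project.

```python
import io
import zipfile
from typing import Dict


def bundle_images(images: Dict[str, bytes]) -> bytes:
    """Pack {filename: raw image bytes} into an in-memory ZIP archive.

    Hypothetical helper: the project's real implementation may differ.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in images.items():
            # writestr stores the bytes under the given archive member name
            archive.writestr(name, data)
    return buffer.getvalue()
```

Building the archive in memory avoids writing temporary files on the server, which may suit a small web app; for very large PDFs, streaming to disk could be the better choice.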

## 🛠️ Technologies

- Backend
  - Python Flask for web server
  - PyPDF2 for PDF processing
  - Tesseract OCR for image text extraction
  - Rich for terminal UI and logging

- Frontend
  - Modern HTML5/CSS3/JavaScript
  - Bootstrap 5 for responsive design
  - Base64 image encoding for secure display
  - Dynamic grid layout for images
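
To illustrate the Base64 image encoding mentioned above, here is a small sketch of turning image bytes into a data URI that an `<img>` tag can display without a separate file request. The helper name `to_data_uri` and the default MIME type are assumptions for illustration; the project's own code may be organized differently. Note that Base64 is an encoding, not encryption: it lets images be embedded inline rather than protecting their contents.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Return a data URI embedding the image bytes as Base64.

    Hypothetical helper, shown only to illustrate the technique.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The resulting string can be placed directly in an image element's `src` attribute in the frontend grid.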

## 🚀 Installation

1. Prerequisites:
   - Python 3.8 or higher
   - [Tesseract OCR](https://tesseract-ocr.github.io/tessdoc/Installation.html)
   - Git

2. Setup:
```bash
# Clone repository
git clone https://github.com/U-C4N/PDFtoLLM.git
cd PDFtoLLM

# Install dependencies
python -m pip install -r requirements.txt
```

3. Configure and run:
```bash
# Set environment variables (optional)
export FLASK_ENV=development
export PORT=8080 # or your preferred port

# Start application
python app.py
```

4. Access the application in your browser at the URL shown in the console output
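
The README does not show how `app.py` reads the `PORT` variable. One plausible pattern, sketched here under that assumption, is to read it from the environment and fall back to a default when it is missing or invalid. The function name `resolve_port` is hypothetical, and the default of 5000 is simply Flask's usual development port.

```python
import os


def resolve_port(default: int = 5000) -> int:
    """Read PORT from the environment, falling back to a default.

    Hypothetical helper: not taken from the project's app.py.
    """
    raw = os.environ.get("PORT", "")
    try:
        port = int(raw)
    except ValueError:
        # Unset or non-numeric values fall back to the default
        return default
    # Reject values outside the valid TCP port range
    return port if 0 < port < 65536 else default
```

A Flask app could then pass the result to `app.run(port=resolve_port())`.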

## 💡 Usage

1. Open the application in your browser
2. Drag and drop or select your PDF file
3. Click the "Convert" button to process the PDF:
   - Text is converted to Markdown format
   - Images are automatically extracted
   - OCR is applied to scanned text
4. View and manage the results:
   - Copy the generated Markdown content
   - Browse extracted images in the grid view
   - Download all images as a ZIP file
5. Track token usage and processing statistics
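
The README does not state how token counts are calculated. As a rough sketch only, a common rule of thumb for English text is about four characters per token; the function below applies that heuristic. The name `estimate_tokens` and the ratio are assumptions, and a real tokenizer for a specific model would give different numbers.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Estimate the token count of text with a characters-per-token heuristic.

    Hypothetical helper: an approximation, not the project's actual method.
    """
    if not text.strip():
        return 0
    return math.ceil(len(text) / chars_per_token)
```

An estimate like this is cheap enough to recompute on every conversion, at the cost of accuracy for code, tables, and non-English text.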

## 🔜 Upcoming Features

- [x] OCR support for scanned documents
- [x] Image extraction and management
- [ ] Batch file processing support
- [ ] Custom OCR language support
- [ ] Enhanced table detection with image tables
- [ ] Markdown preview with image rendering
- [ ] PDF annotation extraction
- [ ] Custom formatting templates

## Author

Umutcan Edizaslan

## 📝 License

This project is licensed under the [MIT](LICENSE) license.
