https://github.com/unitvectory-labs/gcspdf2mdapi
An API that converts PDFs stored in Google Cloud Storage to Markdown format using OCR or direct text extraction.
- Host: GitHub
- URL: https://github.com/unitvectory-labs/gcspdf2mdapi
- Owner: UnitVectorY-Labs
- License: mit
- Created: 2025-03-12T02:37:00.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-03T13:36:41.000Z (about 1 year ago)
- Last Synced: 2025-05-08T00:14:10.440Z (about 1 year ago)
- Language: Python
- Size: 37.1 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# gcspdf2mdapi
An API that converts PDFs stored in Google Cloud Storage to Markdown format using OCR or direct text extraction.
## Overview
**gcspdf2mdapi** is a Flask-based API service that converts PDF documents stored in Google Cloud Storage to Markdown format. It offers two conversion methods:
1. **OCR-based conversion**: Uses Tesseract OCR via pytesseract to extract text from PDF pages rendered as images. This method is helpful for scanned documents or PDFs with text embedded in images.
2. **Direct text extraction**: Leverages PyMuPDF (fitz) and pymupdf4llm to extract text content directly from PDF documents while preserving structure.
Key technologies used:
- **Flask**: Web framework for the API endpoints
- **PyMuPDF**: PDF parsing and rendering
- **pymupdf4llm**: Converts PDF content to structured markdown
- **pytesseract & Pillow**: OCR processing
- **Google Cloud Storage**: For accessing PDF documents
The API is containerized with Docker and can be deployed to any environment that runs containers.
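The two conversion methods described above can be sketched as a small dispatch table. This is only an illustration under stated assumptions, not the project's actual code: the function names, the `CONVERTERS` registry, and the 300 DPI render resolution are hypothetical, and the third-party imports are deferred so the module loads even where the OCR stack is not installed.

```python
import io
from typing import Callable, Dict


def convert_with_ocr(pdf_bytes: bytes) -> str:
    """Render each page to an image and run Tesseract on it (hypothetical helper)."""
    # Lazy imports: PyMuPDF, pytesseract and Pillow are only needed when called.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    pages = []
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=300)  # assumed resolution, not from the README
            image = Image.open(io.BytesIO(pix.tobytes("png")))
            pages.append(pytesseract.image_to_string(image).strip())
    return "\n\n".join(pages)


def convert_direct(pdf_bytes: bytes) -> str:
    """Extract the embedded text layer as Markdown (hypothetical helper)."""
    import fitz  # PyMuPDF
    import pymupdf4llm

    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        return pymupdf4llm.to_markdown(doc)


# Hypothetical registry mapping the API's mode names to converter functions.
CONVERTERS: Dict[str, Callable[[bytes], str]] = {
    "ocr": convert_with_ocr,
    "direct": convert_direct,
}


def convert(pdf_bytes: bytes, mode: str = "ocr") -> str:
    """Dispatch to the converter for `mode`; OCR is the documented default."""
    try:
        converter = CONVERTERS[mode]
    except KeyError:
        raise ValueError(f"unsupported mode: {mode!r}") from None
    return converter(pdf_bytes)
```

A registry keeps the mode check and the converter lookup in one place, so a new method would only need a new entry.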
## Usage
The API provides endpoints to convert PDF files stored in Google Cloud Storage to Markdown format.
### Endpoints
#### Convert PDF to Markdown
```
POST /convert
```
**Request body:**
```json
{
  "file": "gs://bucket-name/path/to/file.pdf",
  "mode": "ocr|direct"
}
```
Parameters:
- `file`: GCS path to the PDF file (must start with `gs://`)
- `mode`: (Optional) Conversion method
  - `ocr`: Uses Optical Character Recognition (default)
  - `direct`: Uses direct text extraction
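The parameter rules above could be enforced along these lines. This is a minimal sketch using only the standard library; `ConvertRequest`, `parse_convert_request`, and the error messages are hypothetical names chosen for illustration, and the service's real validation may differ.

```python
from typing import NamedTuple

VALID_MODES = ("ocr", "direct")


class ConvertRequest(NamedTuple):
    """Validated form of the /convert request body (hypothetical type)."""
    bucket: str
    object_name: str
    mode: str


def parse_convert_request(payload: dict) -> ConvertRequest:
    """Check the documented rules and split the gs:// path into its parts."""
    file_uri = payload.get("file")
    if not isinstance(file_uri, str) or not file_uri.startswith("gs://"):
        raise ValueError("'file' must be a string starting with gs://")

    # Everything up to the first slash is the bucket; the rest is the object.
    bucket, _, object_name = file_uri[len("gs://"):].partition("/")
    if not bucket or not object_name:
        raise ValueError("'file' must look like gs://bucket/path/to/file.pdf")

    # `mode` is optional and falls back to OCR, as documented above.
    mode = payload.get("mode", "ocr")
    if mode not in VALID_MODES:
        raise ValueError("'mode' must be one of: " + ", ".join(VALID_MODES))

    return ConvertRequest(bucket, object_name, mode)
```

For example, `parse_convert_request({"file": "gs://my-bucket/documents/report.pdf"})` yields the bucket `my-bucket`, the object `documents/report.pdf`, and the mode `ocr`.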
**Response:**
```json
{
  "markdown": "Extracted markdown content..."
}
```
#### Health Check
```
GET /
```
Returns API status:
```json
{
  "status": "ok"
}
```
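A deployment script might poll this endpoint before sending work. The sketch below assumes only what is documented here (a JSON body with `"status": "ok"`); the function names and the timeout value are hypothetical.

```python
import json
import urllib.request


def is_healthy(body: bytes) -> bool:
    """Return True only for a JSON object whose status field is "ok"."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def check_health(base_url: str, timeout: float = 5.0) -> bool:
    """GET the root path and interpret the response; False on any network error."""
    url = base_url.rstrip("/") + "/"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_healthy(response.read())
    except OSError:  # URLError, HTTPError and socket timeouts all derive from OSError
        return False
```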
### Examples
**Convert using OCR (default):**
```bash
curl -X POST https://your-api-endpoint/convert \
  -H "Content-Type: application/json" \
  -d '{"file": "gs://my-bucket/documents/report.pdf"}'
```
**Convert using direct text extraction:**
```bash
curl -X POST https://your-api-endpoint/convert \
  -H "Content-Type: application/json" \
  -d '{"file": "gs://my-bucket/documents/report.pdf", "mode": "direct"}'
```
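The same calls can be made from Python with the standard library. This is a rough equivalent of the curl commands above, not an official client: `build_convert_request`, `convert_pdf`, and the 300-second timeout are assumptions, and `https://your-api-endpoint` remains a placeholder.

```python
import json
import urllib.request
from typing import Optional


def build_convert_request(
    base_url: str, file_uri: str, mode: Optional[str] = None
) -> urllib.request.Request:
    """Build the POST /convert request; `mode` is omitted to use the server default."""
    payload = {"file": file_uri}
    if mode is not None:
        payload["mode"] = mode
    return urllib.request.Request(
        base_url.rstrip("/") + "/convert",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def convert_pdf(
    base_url: str, file_uri: str, mode: Optional[str] = None, timeout: float = 300.0
) -> str:
    """Send the request and return the `markdown` field of the JSON response."""
    request = build_convert_request(base_url, file_uri, mode)
    # The timeout is an assumed value; OCR on long documents may need more.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["markdown"]
```

Calling `convert_pdf("https://your-api-endpoint", "gs://my-bucket/documents/report.pdf", mode="direct")` mirrors the second curl example.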