https://github.com/arikusi/sahaf

Local PDF & EPUB to Markdown converter with OCR — runs on your hardware, no cloud APIs

converter epub fastapi markdown marker ocr pdf python surya


Host: GitHub
URL: https://github.com/arikusi/sahaf
Owner: arikusi
License: gpl-3.0
Created: 2026-03-10T08:30:14.000Z (5 months ago)
Default Branch: main
Last Pushed: 2026-04-10T03:06:31.000Z (4 months ago)
Last Synced: 2026-04-10T04:26:49.010Z (4 months ago)
Topics: converter, epub, fastapi, markdown, marker, ocr, pdf, python, surya
Language: Python
Size: 37.6 MB
Stars: 2
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE


# Sahaf

[![CI](https://github.com/arikusi/sahaf/actions/workflows/ci.yml/badge.svg)](https://github.com/arikusi/sahaf/actions/workflows/ci.yml)

[![PyPI](https://img.shields.io/pypi/v/sahaf)](https://pypi.org/project/sahaf/)

[![Downloads](https://img.shields.io/pypi/dm/sahaf)](https://pypi.org/project/sahaf/)

[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)

Local PDF & EPUB to Markdown converter with automatic digital/scanned detection, OCR support, smart splitting, and page-range selection. Converts books to clean, self-contained Markdown files with embedded images using Marker (95.67% benchmark accuracy) and Surya OCR (90+ languages). No cloud APIs: everything runs on your own hardware.

## Features

- **PDF & EPUB support** — handles both formats natively

- **Automatic PDF classification** — detects digital, scanned, or mixed PDFs via PyMuPDF

- **High-accuracy conversion** — Marker with 95.67% benchmark accuracy

- **Built-in OCR** — Surya OCR supports 90+ languages (Turkish, English, Arabic, etc.)

- **Page/chapter range selection** — convert only a specific section of the book (e.g. pages 19-88)

- **Smart splitting** — split output into N parts, cutting at heading/paragraph boundaries instead of mid-sentence

- **Self-contained output** — images embedded as base64 directly in Markdown, no separate files

- **Split preview** — see exactly how parts will be divided before downloading

- **Bilingual UI** — Turkish / English interface with one-click toggle

- **Dark/light theme** — lavender-toned design, persistent toggle

- **Drag & drop UI** — clean single-page web interface
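
The automatic PDF classification listed above could work along these lines. This is a minimal sketch, not Sahaf's actual code: it assumes a page counts as digital when its text layer holds at least `min_chars` non-whitespace characters. The function names and the threshold are hypothetical; only `fitz.open` and `page.get_text()` are real PyMuPDF calls.

```python
from typing import Iterable, List


def classify_pages(char_counts: Iterable[int], min_chars: int = 50) -> str:
    """Label a PDF from the amount of extractable text on each page.

    min_chars is an assumed threshold, not a value taken from Sahaf.
    """
    counts = list(char_counts)
    if not counts:
        raise ValueError("PDF has no pages")
    with_text = sum(1 for n in counts if n >= min_chars)
    if with_text == len(counts):
        return "digital"  # every page has a usable text layer
    if with_text == 0:
        return "scanned"  # no page has a text layer, so OCR is needed throughout
    return "mixed"        # some pages need OCR, others do not


def page_char_counts(pdf_path: str) -> List[int]:
    """Count non-whitespace text characters per page with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so classify_pages works without it

    with fitz.open(pdf_path) as doc:
        return [len("".join(page.get_text().split())) for page in doc]
```

With PyMuPDF installed, `classify_pages(page_char_counts("book.pdf"))` would return one of the three labels for a hypothetical `book.pdf`.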

## Install

```bash
pip install sahaf
```

Or from source:

```bash
git clone https://github.com/arikusi/sahaf.git
cd sahaf
pip install -e .
```

> Marker models (~2-3GB) are downloaded automatically on first conversion.

## Quick Start

```bash
sahaf
```

Open `http://localhost:8000` in your browser.

## How It Works

1. **Upload** — drag & drop a PDF or EPUB file

2. **Classify** — PyMuPDF analyzes PDF type; EPUB chapters are counted

3. **Select range** *(optional)* — pick specific pages or chapters to convert

4. **Convert** — Marker processes PDF; ebooklib + markdownify handles EPUB

5. **Split** *(optional)* — choose how many parts to split the output into

6. **Download** — get a single `.md` or a ZIP with split parts, all images embedded inline
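
To illustrate the "all images embedded inline" part of the last step, here is a rough sketch of turning image links into base64 data URIs. It is an assumption about the approach, not Sahaf's implementation: `embed_images`, the `images` mapping of file name to raw bytes, and the link pattern are all hypothetical.

```python
import base64
import mimetypes
import re
from typing import Dict

# Matches Markdown image links such as ![alt](figure.png)
IMAGE_LINK = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def embed_images(markdown: str, images: Dict[str, bytes]) -> str:
    """Replace image links whose target is a key in `images` with data URIs."""

    def to_data_uri(match: re.Match) -> str:
        alt, target = match.group(1), match.group(2)
        data = images.get(target)
        if data is None:
            return match.group(0)  # leave unknown or remote images untouched
        mime = mimetypes.guess_type(target)[0] or "application/octet-stream"
        encoded = base64.b64encode(data).decode("ascii")
        return f"![{alt}](data:{mime};base64,{encoded})"

    return IMAGE_LINK.sub(to_data_uri, markdown)
```

The result is a single `.md` file that renders without any companion image files, at the cost of roughly a third more bytes per image from base64 encoding.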

## API

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/upload` | Upload PDF/EPUB, returns `task_id` |
| `GET` | `/api/classify/{task_id}` | Detect PDF type + page count, or EPUB chapter count |
| `POST` | `/api/convert/{task_id}?page_from=&page_to=` | Start conversion (optional page range) |
| `GET` | `/api/status/{task_id}` | Poll conversion progress |
| `GET` | `/api/result/{task_id}` | Get Markdown + image list |
| `GET` | `/api/download/{task_id}` | Download `.md` with embedded images |
| `GET` | `/api/download/{task_id}/zip?parts=N` | Download ZIP with N split `.md` files |
| `GET` | `/api/split-preview/{task_id}?parts=N` | Preview split structure before download |
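
A script that drives these endpoints might start from helpers like the ones below. This is a sketch under assumptions: the paths and query parameter names come from the table, but the response bodies are not documented here, so the polling helper takes the fetch and completion check as caller-supplied functions. Leaving out an unset range bound is also an assumption. All helper names are hypothetical.

```python
import time
from typing import Callable, Optional
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # default address from Quick Start


def convert_url(task_id: str, page_from: Optional[int] = None,
                page_to: Optional[int] = None, base: str = BASE_URL) -> str:
    """Build the POST /api/convert URL, adding only the range bounds given."""
    params = {k: v for k, v in (("page_from", page_from), ("page_to", page_to))
              if v is not None}
    query = f"?{urlencode(params)}" if params else ""
    return f"{base}/api/convert/{task_id}{query}"


def zip_url(task_id: str, parts: int, base: str = BASE_URL) -> str:
    """Build the GET URL for a ZIP with `parts` split Markdown files."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    return f"{base}/api/download/{task_id}/zip?{urlencode({'parts': parts})}"


def poll(fetch_status: Callable[[], dict], is_done: Callable[[dict], bool],
         interval: float = 2.0, timeout: float = 3600.0) -> dict:
    """Call fetch_status until is_done accepts the result or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if is_done(status):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("conversion did not finish in time")
        time.sleep(interval)
```

In practice `fetch_status` would wrap a `GET` on `/api/status/{task_id}` and `is_done` would inspect whatever fields that response actually contains.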

## Tech Stack

- **Backend**: FastAPI + Uvicorn

- **PDF Classification**: PyMuPDF

- **PDF Conversion**: Marker (marker-pdf) + Surya OCR

- **EPUB Conversion**: ebooklib + markdownify

- **Smart Splitting**: Custom algorithm — heading/HR/paragraph boundary detection

- **Frontend**: Vanilla HTML/CSS/JS + marked.js

- **i18n**: TR/EN with client-side toggle
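
The heading/HR/paragraph boundary detection named under Smart Splitting can be sketched roughly as follows. This is not the project's algorithm, only one way to get the same effect: cuts fall between blank-line-separated blocks, near the equal-size targets, with an assumed bonus for cuts that land before a heading or horizontal rule. The function name and the 10% bonus are hypothetical.

```python
import re
from typing import List


def split_markdown(text: str, parts: int) -> List[str]:
    """Split Markdown into up to `parts` chunks, cutting only between blocks."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    blocks = [b for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]
    if parts == 1 or len(blocks) <= 1:
        return ["\n\n".join(blocks)]

    # offsets[i] = number of characters that precede block i
    offsets, total = [], 0
    for block in blocks:
        offsets.append(total)
        total += len(block)

    def cost(i: int, target: float) -> float:
        distance = abs(offsets[i] - target)
        # Assumed preference: a cut before a heading or horizontal rule is
        # worth up to 10% of the document length in extra distance.
        strong = blocks[i].startswith("#") or blocks[i].strip() in ("---", "***", "___")
        return distance - (0.1 * total if strong else 0.0)

    cuts: List[int] = []
    for k in range(1, parts):
        target = total * k / parts
        start = cuts[-1] + 1 if cuts else 1
        candidates = range(start, len(blocks))
        if not candidates:
            break  # fewer blocks than requested parts
        cuts.append(min(candidates, key=lambda i: cost(i, target)))

    bounds = [0] + cuts + [len(blocks)]
    return ["\n\n".join(blocks[a:b]) for a, b in zip(bounds, bounds[1:])]
```

One known limitation of this sketch: a fenced code block that contains blank lines is treated as several blocks, so a cut could land inside it.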

## Requirements

- Python 3.10+

- 4-6GB RAM (when Marker models are loaded)

- **GPU strongly recommended for PDF** — CPU-only is extremely slow (~1 hour for a 27-page mixed PDF on i5 + 40GB RAM). A CUDA-capable GPU converts the same file in minutes.

- EPUB conversion is lightweight — no GPU needed, runs instantly

## License

GPL-3.0
