{"id":49080144,"url":"https://github.com/arikusi/sahaf","last_synced_at":"2026-04-20T12:36:07.450Z","repository":{"id":350371214,"uuid":"1177651237","full_name":"arikusi/sahaf","owner":"arikusi","description":"Local PDF \u0026 EPUB to Markdown converter with OCR — runs on your hardware, no cloud APIs","archived":false,"fork":false,"pushed_at":"2026-04-10T03:06:31.000Z","size":39387,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-10T04:26:49.010Z","etag":null,"topics":["converter","epub","fastapi","markdown","marker","ocr","pdf","python","surya"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/arikusi.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-10T08:30:14.000Z","updated_at":"2026-04-10T03:06:35.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/arikusi/sahaf","commit_stats":null,"previous_names":["arikusi/sahaf"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/arikusi/sahaf","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arikusi%2Fsahaf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arikusi%2Fsahaf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arikusi%2Fsahaf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arikusi%2Fsahaf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/arikusi","download_url":"https://codeload.github.com/arikusi/sahaf/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arikusi%2Fsahaf/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32047521,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-20T11:35:06.609Z","status":"ssl_error","status_checked_at":"2026-04-20T11:34:48.899Z","response_time":94,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["converter","epub","fastapi","markdown","marker","ocr","pdf","python","surya"],"created_at":"2026-04-20T12:36:06.824Z","updated_at":"2026-04-20T12:36:07.435Z","avatar_url":"https://github.com/arikusi.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Sahaf\n\n[![CI](https://github.com/arikusi/sahaf/actions/workflows/ci.yml/badge.svg)](https://github.com/arikusi/sahaf/actions/workflows/ci.yml)\n[![PyPI](https://img.shields.io/pypi/v/sahaf)](https://pypi.org/project/sahaf/)\n[![Downloads](https://img.shields.io/pypi/dm/sahaf)](https://pypi.org/project/sahaf/)\n[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)\n\nLocal PDF \u0026 EPUB to Markdown converter with automatic digital/scanned detection, OCR support, smart splitting, and page-range selection. Converts books to clean, self-contained Markdown files with embedded images using Marker (95.67% accuracy) and Surya OCR (90+ languages). No cloud APIs — runs entirely on your hardware.\n\n\u003cp align=\"center\"\u003e\n  \u003cvideo src=\"https://github.com/user-attachments/assets/76b2484c-b69b-4cf2-8436-1ef2ae3cef20\" width=\"480\" autoplay loop muted\u003e\n  \u003c/video\u003e\n\u003c/p\u003e\n\n## Features\n\n- **PDF \u0026 EPUB support** — handles both formats natively\n- **Automatic PDF classification** — detects digital, scanned, or mixed PDFs via PyMuPDF\n- **High-accuracy conversion** — Marker with 95.67% benchmark accuracy\n- **Built-in OCR** — Surya OCR supports 90+ languages (Turkish, English, Arabic, etc.)\n- **Page/chapter range selection** — convert only a specific section of the book (e.g. pages 19-88)\n- **Smart splitting** — split output into N parts, cutting at heading/paragraph boundaries instead of mid-sentence\n- **Self-contained output** — images embedded as base64 directly in Markdown, no separate files\n- **Split preview** — see exactly how parts will be divided before downloading\n- **Bilingual UI** — Turkish / English interface with one-click toggle\n- **Dark/light theme** — lavender-toned design, persistent toggle\n- **Drag \u0026 drop UI** — clean single-page web interface\n\n## Install\n\n```bash\npip install sahaf\n```\n\nOr from source:\n\n```bash\ngit clone https://github.com/arikusi/sahaf.git\ncd sahaf\npip install -e .\n```\n\n\u003e Marker models (~2-3GB) are downloaded automatically on first conversion.\n\n## Quick Start\n\n```bash\nsahaf\n```\n\nOpen `http://localhost:8000` in your browser.\n\n## How It Works\n\n1. **Upload** — drag \u0026 drop a PDF or EPUB file\n2. **Classify** — PyMuPDF analyzes PDF type; EPUB chapters are counted\n3. **Select range** *(optional)* — pick specific pages or chapters to convert\n4. **Convert** — Marker processes PDF; ebooklib + markdownify handles EPUB\n5. **Split** *(optional)* — choose how many parts to split the output into\n6. **Download** — get a single `.md` or a ZIP with split parts, all images embedded inline\n\n## API\n\n| Method | Path | Description |\n|--------|------|-------------|\n| `POST` | `/api/upload` | Upload PDF/EPUB, returns `task_id` |\n| `GET` | `/api/classify/{task_id}` | Detect PDF type + page count, or EPUB chapter count |\n| `POST` | `/api/convert/{task_id}?page_from=\u0026page_to=` | Start conversion (optional page range) |\n| `GET` | `/api/status/{task_id}` | Poll conversion progress |\n| `GET` | `/api/result/{task_id}` | Get markdown + image list |\n| `GET` | `/api/download/{task_id}` | Download `.md` with embedded images |\n| `GET` | `/api/download/{task_id}/zip?parts=N` | Download ZIP with N split `.md` files |\n| `GET` | `/api/split-preview/{task_id}?parts=N` | Preview split structure before download |\n\n## Tech Stack\n\n- **Backend**: FastAPI + Uvicorn\n- **PDF Classification**: PyMuPDF\n- **PDF Conversion**: Marker (marker-pdf) + Surya OCR\n- **EPUB Conversion**: ebooklib + markdownify\n- **Smart Splitting**: Custom algorithm — heading/HR/paragraph boundary detection\n- **Frontend**: Vanilla HTML/CSS/JS + marked.js\n- **i18n**: TR/EN with client-side toggle\n\n## Requirements\n\n- Python 3.10+\n- 4-6GB RAM (when Marker models are loaded)\n- **GPU strongly recommended for PDF** — CPU-only is extremely slow (~1 hour for a 27-page mixed PDF on i5 + 40GB RAM). A CUDA-capable GPU converts the same file in minutes.\n- EPUB conversion is lightweight — no GPU needed, runs instantly\n\n## License\n\nGPL-3.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farikusi%2Fsahaf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farikusi%2Fsahaf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farikusi%2Fsahaf/lists"}