{"id":27974369,"url":"https://github.com/unitvectory-labs/gcspdf2mdapi","last_synced_at":"2025-06-21T10:43:19.010Z","repository":{"id":281958551,"uuid":"947010083","full_name":"UnitVectorY-Labs/gcspdf2mdapi","owner":"UnitVectorY-Labs","description":"An API that converts PDFs stored in Google Cloud Storage to Markdown format using OCR or direct text extraction.","archived":false,"fork":false,"pushed_at":"2025-05-03T13:36:41.000Z","size":38,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-08T00:14:10.440Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/UnitVectorY-Labs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-12T02:37:00.000Z","updated_at":"2025-05-03T13:36:44.000Z","dependencies_parsed_at":null,"dependency_job_id":"0d119b0c-f730-49d7-be14-62c94dcab77d","html_url":"https://github.com/UnitVectorY-Labs/gcspdf2mdapi","commit_stats":null,"previous_names":["unitvectory-labs/gcspdf2mdapi"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UnitVectorY-Labs%2Fgcspdf2mdapi","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UnitVectorY-Labs%2Fgcspdf2mdapi/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UnitVectorY-Labs%2Fgcspdf2mdapi/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UnitVectorY-Labs%2Fgcspdf2mdapi/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/UnitVectorY-Labs","download_url":"https://codeload.github.com/UnitVectorY-Labs/gcspdf2mdapi/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252973669,"owners_count":21834108,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-08T00:14:15.110Z","updated_at":"2025-05-08T00:14:15.733Z","avatar_url":"https://github.com/UnitVectorY-Labs.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# gcspdf2mdapi\n\nAn API that converts PDFs stored in Google Cloud Storage to Markdown format using OCR or direct text extraction.\n\n## Overview\n\n**gcspdf2mdapi** is a Flask-based API service that converts PDF documents stored in Google Cloud Storage to Markdown format. It offers two conversion methods:\n\n1. **OCR-based conversion**: Uses Tesseract OCR via pytesseract to extract text from PDF pages rendered as images. This method is helpful for scanned documents or PDFs with text embedded in images.\n\n2. **Direct text extraction**: Leverages PyMuPDF (fitz) and pymupdf4llm to extract text content directly from PDF documents while preserving structure.\n\nKey technologies used:\n- **Flask**: Web framework for the API endpoints\n- **PyMuPDF**: PDF parsing and rendering\n- **pymupdf4llm**: Converts PDF content to structured markdown\n- **pytesseract \u0026 Pillow**: OCR processing\n- **Google Cloud Storage**: For accessing PDF documents\n\nThe API is containerized using Docker and can be deployed to any container-supporting environment.\n\n## Usage\n\nThe API provides endpoints to convert PDF files stored in Google Cloud Storage to Markdown format.\n\n### Endpoints\n\n#### Convert PDF to Markdown\n```\nPOST /convert\n```\n\n**Request body:**\n```json\n{\n  \"file\": \"gs://bucket-name/path/to/file.pdf\",\n  \"mode\": \"ocr|direct\"\n}\n```\n\nParameters:\n- `file`: GCS path to the PDF file (must start with `gs://`)\n- `mode`: (Optional) Conversion method\n  - `ocr`: Uses Optical Character Recognition (default)\n  - `direct`: Uses direct text extraction\n\n**Response:**\n```json\n{\n  \"markdown\": \"Extracted markdown content...\"\n}\n```\n\n#### Health Check\n```\nGET /\n```\n\nReturns API status:\n```json\n{\n  \"status\": \"ok\"\n}\n```\n\n### Examples\n\n**Convert using OCR (default):**\n```bash\ncurl -X POST https://your-api-endpoint/convert \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"file\": \"gs://my-bucket/documents/report.pdf\"}'\n```\n\n**Convert using direct text extraction:**\n```bash\ncurl -X POST https://your-api-endpoint/convert \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"file\": \"gs://my-bucket/documents/report.pdf\", \"mode\": \"direct\"}'\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funitvectory-labs%2Fgcspdf2mdapi","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Funitvectory-labs%2Fgcspdf2mdapi","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funitvectory-labs%2Fgcspdf2mdapi/lists"}