
awesome-pdf

A curated list of amazing libraries, services and resources to work with PDF files
https://github.com/OneOffTech/awesome-pdf


  • Creation and production

    • shipsaas/docking - Shared microservice that handles document template management and PDF rendering/export.
    • WeasyPrint - Generate PDFs using HTML and CSS.
    • qpdf/qpdf - A content-preserving PDF document transformer.
    • Stirling-Tools/Stirling-PDF - A locally hosted web-based PDF manipulation tool using Docker that supports splitting, merging, converting, reorganizing, compressing, and more.
    • unjs/unpdf - Utilities to work with PDFs in Node.js, browser and workers.
    • PdfRest - PDF API to create, shrink and compress.
    • Gotenberg - A Docker-powered stateless API for creating PDF files from templates in various formats, e.g., HTML, Markdown, Word, Excel.
    • Smallpdf - Set of tools to extract and manipulate PDF content.
    • typst/typst - A new markup-based typesetting system that is powerful and easy to learn.
    • Vexlio - Tool to create diagrams and export in SVG or PDF.
    • renamed.to - AI-powered tool that renames files based on the content, accessible as a web app, command line, and for integration within your application.
    • veraPDF - Verify compliance with PDF/A and PDF/UA specification (via Open Preservation Foundation).
  • Datasets

    • tpn/pdfs - Technically-oriented PDF collection (papers, specs, decks, manuals, etc).
    • pdf-association/pdf-corpora - An index of PDF-centric corpora.
    • DS4SD/DocLayNet - A large human-annotated dataset for document-layout analysis.
    • gipplab/pdf-benchmark - A benchmark of PDF information extraction tools using a multi-task and multi-domain evaluation framework for academic documents.
    • DocBank Dataset - A large-scale dataset built with weak supervision, enabling models to integrate textual and layout information for downstream tasks.
  • Parsers, OCR and extraction

    • Parxy - A PDF parser gateway for using different parsers through a unified API.
    • SmolDocling - A multimodal Image-Text-to-Text model for efficient document conversion, compatible with Docling.
    • VikParuchuri/surya - OCR, layout analysis, reading order, table recognition in 90+ languages.
    • UniModal4Reasoning/StructEqTable-Deploy - A high-efficiency open-source toolkit for the table-to-LaTeX task.
    • huridocs/pdf-document-layout-analysis - A Docker-based service for analyzing PDF document layouts, enabling segmentation and classification of elements like text, titles, images, and tables.
    • Reducto - Document Ingestion API.
    • adithya-s-k/omniparse - A platform that ingests and parses unstructured data into structured data optimized for GenAI applications.
    • lumina-ai-inc/chunkr - Vision-model-based PDF chunking.
    • PaddlePaddle/PaddleOCR - Lightweight multilingual OCR toolkit supporting 80+ languages, built on PaddlePaddle.
    • allenai/olmocr - Toolkit for linearizing PDFs for LLM datasets/training.
    • opendatalab/PDF-Extract-Kit - A comprehensive toolkit for high-quality PDF content extraction.
    • smalot/pdfparser - A standalone PHP library that provides various tools to extract data from a PDF file.
    • Unstructured-IO/unstructured - Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
    • PyMuPDF4LLM - Aims to make it easier to extract PDF content in the format you need for LLM and RAG environments. It supports Markdown extraction as well as LlamaIndex document output.
    • CatchTheTornado/pdf-extract-api - Document (PDF) extraction and parsing API using state-of-the-art OCR and Ollama-supported models. Anonymizes documents, removes PII, and converts any document or picture to structured JSON or Markdown.
    • climatepolicyradar/navigator-document-parser - Parsing PDFs and websites containing laws and policies.
    • Iteration Layer - An AI-powered API that extracts structured data from PDFs, images, DOCX, and text files.
    • Docling - Simplifies document processing, parsing diverse formats (including advanced PDF understanding) and providing seamless integrations with the gen AI ecosystem.
    • Filimoa/open-parse - Improved file parsing for LLMs.
  • Readers and viewers