awesome-pdf

A curated list of amazing libraries, services and resources to work with PDF files.
https://github.com/OneOffTech/awesome-pdf

Creation and production
- shipsaas/docking - Shared microservice that manages document templates and renders/exports PDFs.
- WeasyPrint - Generates PDFs from HTML and CSS.
- qpdf/qpdf - A content-preserving PDF document transformer.
- Stirling-Tools/Stirling-PDF - A locally hosted web-based PDF manipulation tool using Docker that supports splitting, merging, converting, reorganizing, compressing, and more.
- unjs/unpdf - Utilities to work with PDFs in Node.js, browser and workers.
- PdfRest - PDF API to create, shrink and compress PDFs.
- Gotenberg - A Docker-powered stateless API for creating PDF files from templates in various formats, e.g., HTML, Markdown, Word, Excel.
- Smallpdf - Set of tools to extract and manipulate PDF content.
- typst/typst - A new markup-based typesetting system that is powerful and easy to learn.
- Vexlio - Tool to create diagrams and export in SVG or PDF.
- renamed.to - AI-powered tool that renames files based on their content, available as a web app, a command-line tool, and an integration for your application.
- veraPDF - Verify compliance with the PDF/A and PDF/UA specifications (via Open Preservation Foundation).
- BentoPDF - A privacy-first, self-hostable PDF toolkit that manipulates, edits, merges and processes files entirely in the browser, with no server-side processing.
- PDF 2 EPUB - Converts PDF to reflowable EPUB 3 compliant with EPUB Accessibility 1.1 and WCAG 2.2 AA.
Datasets
- tpn/pdfs - Technically-oriented PDF collection (papers, specs, decks, manuals, etc).
- pdf-association/pdf-corpora - An index of PDF-centric corpora.
- DS4SD/DocLayNet - A large human-annotated dataset for document-layout analysis.
- gipplab/pdf-benchmark - A benchmark of PDF information extraction tools using a multi-task and multi-domain evaluation framework for academic documents.
- DocBank Dataset - A large-scale dataset built with weak supervision, enabling models to integrate textual and layout information for downstream tasks.
Parsers, OCR and extraction
- Parxy - A PDF parser gateway that exposes different parsers through a unified API.
- SmolDocling - A multimodal Image-Text-to-Text model for efficient document conversion, compatible with Docling.
- VikParuchuri/surya - OCR, layout analysis, reading order, and table recognition in 90+ languages.
- UniModal4Reasoning/StructEqTable-Deploy - A high-efficiency open-source toolkit for the table-to-LaTeX task.
- huridocs/pdf-document-layout-analysis - A Docker-based service for analyzing PDF document layouts, enabling segmentation and classification of elements like text, titles, images, and tables.
- Reducto - Document Ingestion API.
- adithya-s-k/omniparse - A platform that ingests and parses unstructured data into structured data optimized for GenAI applications.
- lumina-ai-inc/chunkr - Vision-model-based PDF chunking.
- lumina-ai-inc/PaddleOCR - Lightweight multilingual OCR toolkit supporting 80+ languages, built on PaddlePaddle.
- allenai/olmocr - Toolkit for linearizing PDFs for LLM datasets/training.
- opendatalab/PDF-Extract-Kit - A comprehensive toolkit for high-quality PDF content extraction.
- smalot/pdfparser - A standalone PHP library that provides various tools to extract data from PDF files.
- Unstructured-IO/unstructured - Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
- PyMuPDF4LLM - Aims to make it easier to extract PDF content in the format you need for LLM and RAG environments. It supports Markdown extraction as well as LlamaIndex document output.
- CatchTheTornado/pdf-extract-api - Document (PDF) extraction and parsing API using state-of-the-art OCR and Ollama-supported models. Anonymizes documents, removes PII, and converts any document or picture to structured JSON or Markdown.
- climatepolicyradar/navigator-document-parser - Parsing PDFs and websites containing laws and policies.
- Iteration Layer - An AI-powered API that extracts structured data from PDFs, images, DOCX, and text files.
- LiteParse - An open-source standalone PDF parser that provides spatial text parsing with bounding boxes, without proprietary LLM features or cloud dependencies.
- PDF Oxide - Fast Rust PDF library and CLI for text and image extraction and markdown conversion, with bindings for Python, Go, JS, .NET, Java, PHP, Ruby, and WASM.
Readers and viewers
- mozilla/pdf.js - PDF reader in JavaScript.
- agentcooper/react-pdf-highlighter - Set of React components for PDF annotation.
- Sioyek - PDF viewer with a focus on technical books and research papers (desktop app).
Validation and compliance
- HTPBE? - A forensic tool that analyses the structural layer of a PDF to detect whether it has been modified since creation.

Programming Languages

Python: 7, TypeScript: 3, Rust: 2, PHP: 2, HTML: 2, Java: 1, C++: 1, JavaScript: 1