awesome-pdf
A curated list of amazing libraries, services, and resources for working with PDF files
https://github.com/OneOffTech/awesome-pdf
-
Creation and production
- shipsaas/docking - Shared microservice that takes over document template management and PDF rendering/export.
- WeasyPrint - Generate PDFs using HTML and CSS.
- qpdf/qpdf - A content-preserving PDF document transformer.
- Stirling-Tools/Stirling-PDF - A locally hosted web-based PDF manipulation tool using Docker that supports splitting, merging, converting, reorganizing, compressing, and more.
- unjs/unpdf - Utilities to work with PDFs in Node.js, browser and workers.
- PdfRest - PDF API to create, shrink, and compress PDFs.
- Gotenberg - A Docker-powered stateless API for creating PDF files from templates in various formats, e.g., HTML, Markdown, Word, Excel.
- Smallpdf - Set of tools to extract and manipulate PDF content.
- typst/typst - A new markup-based typesetting system that is powerful and easy to learn.
- Vexlio - Tool to create diagrams and export in SVG or PDF.
- renamed.to - AI-powered tool that renames files based on the content, accessible as a web app, command line, and for integration within your application.
- veraPDF - Verify compliance with the PDF/A and PDF/UA specifications (via Open Preservation Foundation).
-
Datasets
- tpn/pdfs - Technically-oriented PDF collection (papers, specs, decks, manuals, etc).
- pdf-association/pdf-corpora - An index of PDF-centric corpora.
- DS4SD/DocLayNet - A large human-annotated dataset for document-layout analysis.
- gipplab/pdf-benchmark - A benchmark of PDF information extraction tools using a multi-task and multi-domain evaluation framework for academic documents.
- DocBank Dataset - A large-scale dataset built with weak supervision, enabling models to integrate textual and layout information for downstream tasks.
-
Parsers, OCR and extraction
- Parxy - A PDF parser gateway for using different parsers through a unified API.
- SmolDocling - A multimodal Image-Text-to-Text model for efficient document conversion, compatible with Docling.
- VikParuchuri/surya - OCR, layout analysis, reading order, table recognition in 90+ languages.
- UniModal4Reasoning/StructEqTable-Deploy - A high-efficiency open-source toolkit for the table-to-LaTeX task.
- huridocs/pdf-document-layout-analysis - A Docker-based service for analyzing PDF document layouts, enabling segmentation and classification of elements like text, titles, images, and tables.
- Reducto - Document Ingestion API.
- adithya-s-k/omniparse - A platform that ingests and parses unstructured data into structured data optimized for GenAI applications.
- lumina-ai-inc/chunkr - Vision-model-based PDF chunking.
- PaddlePaddle/PaddleOCR - Lightweight multilingual OCR toolkit supporting 80+ languages, built on PaddlePaddle.
- allenai/olmocr - Toolkit for linearizing PDFs for LLM datasets/training.
- opendatalab/PDF-Extract-Kit - A comprehensive toolkit for high-quality PDF content extraction.
- smalot/pdfparser - A standalone PHP library that provides various tools to extract data from a PDF file.
- Unstructured-IO/unstructured - Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
- PyMuPDF4LLM - Aims to make it easier to extract PDF content in the format you need for LLM and RAG environments. It supports Markdown extraction as well as LlamaIndex document output.
- CatchTheTornado/pdf-extract-api - Document (PDF) extraction and parsing API using state-of-the-art OCR and Ollama-supported models. Anonymizes documents, removes PII, and converts any document or picture to structured JSON or Markdown.
- climatepolicyradar/navigator-document-parser - Parsing PDFs and websites containing laws and policies.
- Iteration Layer - An AI-powered API that extracts structured data from PDFs, images, DOCX, and text files.
- Docling - Simplifies document processing by parsing diverse formats, including advanced PDF understanding, and providing seamless integrations with the gen AI ecosystem.
- Filimoa/open-parse - Improved file parsing for LLMs.
-
Readers and viewers
- mozilla/pdf.js - PDF Reader in JavaScript.
- agentcooper/react-pdf-highlighter - Set of React components for PDF annotation.
- Sioyek - PDF viewer with a focus on technical books and research papers (desktop app).