https://github.com/explosion/prodigy-pdf
A Prodigy plugin for PDF annotation
- Host: GitHub
- URL: https://github.com/explosion/prodigy-pdf
- Owner: explosion
- License: MIT
- Created: 2023-09-29T11:07:52.000Z
- Default Branch: main
- Last Pushed: 2024-12-24T09:21:46.000Z
- Last Synced: 2025-01-29T18:38:17.583Z
- Language: Python
- Size: 2.59 MB
- Stars: 28
- Watchers: 8
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# 📄 Prodigy-PDF
This repository contains a [Prodigy](https://prodi.gy) plugin with recipes for image- and text-based annotation of PDF files, as well as recipes for OCR (Optical Character Recognition) to extract content from documents. The `pdf.spans.manual` recipe uses [`spacy-layout`](https://github.com/explosion/spacy-layout) and [Docling](https://ds4sd.github.io/docling/) to extract the text contents from PDFs and lets you annotate spans of text, with an optional side-by-side preview of the original document and pre-fetching for faster loading during annotation.


You can install this plugin via `pip`.
```bash
pip install "prodigy-pdf @ git+https://github.com/explosion/prodigy-pdf"
```
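To confirm that the install succeeded, one option is to look up the installed distribution from Python. This is a minimal sketch that assumes the distribution is registered under the name `prodigy-pdf`, as the `pip` command above suggests; the helper `installed_version` is hypothetical and not part of the plugin.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumption: the plugin is installed under the distribution name "prodigy-pdf".
    found = installed_version("prodigy-pdf")
    if found is None:
        print("prodigy-pdf is not installed in this environment")
    else:
        print(f"prodigy-pdf {found} is installed")
```

A `None` result usually means the package was installed into a different virtual environment than the one running the script.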
If you want to use the OCR recipes, you'll also need [Tesseract](https://github.com/tesseract-ocr/tesseract) installed.
```bash
# macOS (Homebrew)
brew install tesseract
# Ubuntu / Debian
sudo apt install tesseract-ocr
```
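Before running the OCR recipes, it can help to check that the `tesseract` binary is actually reachable on your `PATH`. The following is a small sketch using only the Python standard library; the helper `tesseract_version` is hypothetical and not part of the plugin.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version(binary: str = "tesseract") -> Optional[str]:
    """Return the first line of `<binary> --version`, or None if it is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some older Tesseract builds print the version banner to stderr.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    banner = tesseract_version()
    if banner is None:
        print("tesseract was not found on PATH; install it before using the OCR recipes")
    else:
        print(f"found: {banner}")
```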
To learn more about this plugin, you can check the [Prodigy docs](https://prodi.gy/docs/plugins/#pdf).
## Issues?
Are you having trouble with this plugin? Let us know on our [support forum](https://support.prodi.gy/) and we'll get back to you!