https://github.com/explosion/prodigy-pdf

A Prodigy plugin for PDF annotation

Host: GitHub
URL: https://github.com/explosion/prodigy-pdf
Owner: explosion
License: MIT
Created: 2023-09-29T11:07:52.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-12-24T09:21:46.000Z (over 1 year ago)
Last Synced: 2025-01-29T18:38:17.583Z (over 1 year ago)
Language: Python
Size: 2.59 MB
Stars: 28
Watchers: 8
Forks: 5
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# 📄 Prodigy-PDF

This repository contains a [Prodigy](https://prodi.gy) plugin with recipes for image- and text-based annotation of PDF files, as well as recipes for OCR (Optical Character Recognition) to extract content from documents. The `pdf.spans.manual` recipe uses [`spacy-layout`](https://github.com/explosion/spacy-layout) and [Docling](https://ds4sd.github.io/docling/) to extract the text contents from PDFs and lets you annotate spans of text, with an optional side-by-side preview of the original document and pre-fetching for faster loading during annotation.

![pdf.image.manual recipe](https://github.com/user-attachments/assets/da40ee4c-369d-407d-a412-fdb8d341aee8)

![pdf_spans_manual](https://github.com/user-attachments/assets/bc0a5fe8-1995-4ff8-8766-7dc4a03a52be)

You can install this plugin via `pip`.

```bash
pip install "prodigy-pdf @ git+https://github.com/explosion/prodigy-pdf"
```
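To confirm the install worked, you can ask Python's packaging metadata for the installed version. This is a minimal sketch, not part of the plugin: the helper name `installed_version` is hypothetical, and it assumes the distribution is registered under the name `prodigy-pdf`, as in the `pip` command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "prodigy-pdf") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing.

    The default name "prodigy-pdf" is an assumption taken from the pip command.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("prodigy-pdf does not appear to be installed in this environment")
    else:
        print(f"prodigy-pdf {found} is installed")
```

Run it with the same interpreter you use for Prodigy, since the lookup only sees packages in the active environment.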

If you want to use the OCR recipes, make sure Tesseract is installed as well.

```bash
# for mac
brew install tesseract

# for ubuntu
sudo apt install tesseract-ocr
```
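Before starting an OCR recipe, it can help to check that the `tesseract` executable is actually on your `PATH`. The snippet below is a rough sketch using only the standard library; the function names are hypothetical, and it assumes the binary is called `tesseract`, as installed by the commands above.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract(binary: str = "tesseract") -> Optional[str]:
    """Return the full path to the executable, or None if it is not on PATH."""
    return shutil.which(binary)


def tesseract_version(path: str) -> str:
    """Return the first line printed by `<path> --version`."""
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some builds print the version to stderr, so fall back to it if stdout is empty.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else "unknown"


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found on PATH; the OCR recipes will likely fail")
    else:
        print(f"found {path}: {tesseract_version(path)}")
```

If the check reports that nothing was found right after installing, open a new shell so the updated `PATH` is picked up.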

To learn more about this plugin, you can check the [Prodigy docs](https://prodi.gy/docs/plugins/#pdf).

## Issues?

Are you having trouble with this plugin? Let us know on our [support forum](https://support.prodi.gy/) and we'll get back to you!
