https://github.com/ljvmiranda921/prodigy-pdf-custom-recipe
Custom recipe and utilities for document processing
- Host: GitHub
- URL: https://github.com/ljvmiranda921/prodigy-pdf-custom-recipe
- Owner: ljvmiranda921
- Created: 2022-05-02T02:16:37.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2022-06-19T03:55:37.000Z (over 2 years ago)
- Last Synced: 2024-11-26T05:52:09.568Z (18 days ago)
- Language: Python
- Homepage:
- Size: 722 KB
- Stars: 198
- Watchers: 5
- Forks: 20
- Open Issues: 2
Metadata Files:
- Readme: README.md
# 🪐 spaCy Project: Prodigy recipes for document processing and layout understanding
This repository contains recipes on how to use [Prodigy](https://prodi.gy) and
[Hugging Face](https://huggingface.co) for annotating, training, and reviewing
document layout datasets. We'll be finetuning a
[LayoutLMv3](https://arxiv.org/abs/2204.08387) model using
[FUNSD](https://guillaumejaume.github.io/FUNSD/), a dataset of noisy scanned
documents.

![](docs/prodigy_annotation.gif)

This also serves as an illustration of how to design document processing
solutions. I attempted to generalize this approach into a framework, which you
can read more about [on my
blog](https://ljvmiranda921.github.io/notebook/2022/06/19/document-processing-framework/).

![](docs/design_principles.png)

## 📋 project.yml
The [`project.yml`](project.yml) defines the data assets required by the
project, as well as the available commands and workflows. For details, see the
[spaCy projects documentation](https://spacy.io/usage/projects).

### ⏯ Commands

The following commands are defined by the project. They
can be executed using [`spacy project run [name]`](https://spacy.io/api/cli#project-run).
Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| `install` | Install dependencies |
| `hydrate-db` | Hydrate the Prodigy database with annotated data from FUNSD |
| `review` | Review hydrated annotations |
| `train` | Train FUNSD model |
| `qa` | Perform QA for the test dataset using a trained model |
| `clean-db` | Drop all generated Prodigy datasets |
| `clean-files` | Clean all intermediary files |

### ⏭ Workflows

The following workflows are defined by the project. They
can be executed using [`spacy project run [name]`](https://spacy.io/api/cli#project-run)
and will run the specified commands in order. Commands are only re-run if their
inputs have changed.

| Workflow | Steps |
| --- | --- |
| `all` | `install` → `hydrate-db` → `train` |
| `clean-all` | `clean-db` → `clean-files` |

### 🗂 Assets

The following assets are defined by the project. They can
be fetched by running [`spacy project assets`](https://spacy.io/api/cli#project-assets)
in the project directory.

| File | Source | Description |
| --- | --- | --- |
| `assets/funsd.zip` | URL | FUNSD dataset - noisy scanned documents for layout understanding |
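
To tie the assets, commands, and workflows above together, here is a minimal Python sketch that builds the `spacy project assets` and `spacy project run` invocations for a workflow and prints them. The helper names (`assets_command`, `run_command`, `plan`, `execute`) and the `WORKFLOWS` mapping are hypothetical and not part of this repository; the steps are copied from the Workflows table. Running the steps one by one should be roughly equivalent to `spacy project run all`, but treat this as an illustration rather than the project's own tooling.

```python
import subprocess
import sys
from typing import Dict, List

# Workflow steps copied from the Workflows table (hypothetical mapping,
# the real definitions live in project.yml).
WORKFLOWS: Dict[str, List[str]] = {
    "all": ["install", "hydrate-db", "train"],
    "clean-all": ["clean-db", "clean-files"],
}


def assets_command(project_dir: str = ".") -> List[str]:
    """Build the CLI call that fetches the assets declared in project.yml."""
    return [sys.executable, "-m", "spacy", "project", "assets", project_dir]


def run_command(name: str, project_dir: str = ".") -> List[str]:
    """Build the CLI call that runs a single project command or workflow."""
    return [sys.executable, "-m", "spacy", "project", "run", name, project_dir]


def plan(workflow: str, project_dir: str = ".") -> List[List[str]]:
    """Fetch assets first, then run each step of the workflow in order."""
    steps = WORKFLOWS[workflow]
    return [assets_command(project_dir)] + [
        run_command(step, project_dir) for step in steps
    ]


def execute(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; only invoke spaCy when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default: shows what would be executed without calling spaCy.
    execute(plan("all"))
```

Passing `dry_run=False` to `execute` would actually invoke spaCy, which assumes spaCy is installed and that the script is run from the project directory.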