https://github.com/deepdoctection/notebooks
Repository for deepdoctection tutorial notebooks
- Host: GitHub
- URL: https://github.com/deepdoctection/notebooks
- Owner: deepdoctection
- License: apache-2.0
- Created: 2022-12-04T13:20:56.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-07-21T21:14:45.000Z (9 months ago)
- Last Synced: 2024-07-21T22:37:36.122Z (9 months ago)
- Language: Jupyter Notebook
- Homepage:
- Size: 22.4 MB
- Stars: 33
- Watchers: 2
- Forks: 15
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
A Document AI Package - Jupyter notebook tutorials
# Breaking changes
With the latest release of **deep**doctection v.0.33.0 the package has been refactored and is no longer compatible with previous releases. If you are on an earlier version, please update to the latest release or check out the repository at the tag v.0.32.0.

# Jupyter Notebooks for **deep**doctection
In this repo you will find Jupyter notebooks that used to be in the main repo [**deep**doctection](https://github.com/deepdoctection/deepdoctection). If you encounter problems, feel free to open an issue in the **deep**doctection repository.
In addition, the repo contains a folder with example files that are used in the notebooks.
[Get_Started.ipynb](Get_Started.ipynb):
- Introduction to **deep**doctection
- Analyzer
- Output structure: Page, Layouts, Tables
- Saving and reading a parsed document

[Pipelines.ipynb](Pipelines.ipynb):
- Pipelines
- Analyzer configuration
- Pipeline components
- Layout detection models
- OCR matching and reading order

[Analyzer_Configuration.ipynb](Analyzer_Configuration.ipynb):
- Analyzer Configuration
- How to change configuration
- High level Configuration
- Layout models
- Table transformer
- Custom model
- Table segmentation
- Text extraction
- PDFPlumber
- Tesseract
- DocTr
- AWS Textract
- Word matching
- Text ordering

[Analyzer_with_Table_Transformer.ipynb](Analyzer_with_Table_Transformer.ipynb):
- Analyzer configuration for running Table Transformer
- General configuration
- Table segmentation

[Doclaynet_with_YOLO.ipynb](Doclaynet_with_YOLO.ipynb):
- Writing a predictor from a third party library
- Adding the model wrapper for YOLO
- Adding the model to the `ModelCatalog`
- Modifying the factory class to build the Analyzer
- Running the Analyzer with the YoloDetector

[Doclaynet_Analyzer_Config.ipynb](Doclaynet_Analyzer_Config.ipynb):
- Advanced Analyzer Configuration
- Adding the model wrapper for YOLO
- Configuration to parse the page with respect to granular layout segments
- Extracting figures
- Relating captions to figures and tables

[Custom_Pipeline.ipynb](Custom_Pipeline.ipynb):
- Model catalog and registries
- Predictors
- Instantiating Pipeline backbones
- Instantiating Pipelines

[Datasets_and_Eval.ipynb](Datasets_and_Eval.ipynb):
- Creation of custom datasets
- Evaluation
- Fine tuning models

[Data_structure.ipynb](Data_structure.ipynb):
- Diving deeper into the data structure
- Page and Image
- `ObjectTypes`
- `ImageAnnotation` and sub categories
- Adding an `ImageAnnotation`
- Adding a `ContainerAnnotation` to an `ImageAnnotation`
- Sub images from given `ImageAnnotation`

[Using_LayoutLM_for_sequence_classification.ipynb](Using_LayoutLM_for_sequence_classification.ipynb):
- Fine tuning LayoutLM for sequence classification on a custom dataset
- Evaluation
- Building and running a production pipeline

[Running_pre_trained_models_from_other_libraries.ipynb](Running_pre_trained_models_from_other_libraries.ipynb):
- Installing and running pre-trained models provided by Layout-Parser
- Adding new categories

The next three notebooks are experiments on a custom dataset for token classification that has been made available through [Huggingface](https://huggingface.co/datasets/deepdoctection/FRFPE). They show how to train and evaluate each model of the LayoutLM family and how to track experiments with W&B.

[Layoutlm_v1_on_custom_token_classification.ipynb](Layoutlm_v1_on_custom_token_classification.ipynb):
- LayoutLMv1 for financial report NER
- Defining object types
- Visualization and display of ground truth
- Defining Dataflow and Dataset
- Defining a split and saving the split distribution as W&B artifact
- LayoutLMv1 training
- Further exploration of evaluation
- Evaluation with confusion matrix
- Visualizing predictions and ground truth
- Evaluation on test set
- Changing training parameters and settings

[Layoutlm_v2_on_custom_token_classification.ipynb](Layoutlm_v2_on_custom_token_classification.ipynb):
- LayoutLMv2 for financial report NER
- Defining `ObjectTypes`, Dataset and Dataflow
- Loading W&B artifact and building dataset split
- Exploring the language distribution across the split
- Evaluation
- LayoutXLM for financial report NER
- Training XLM models on separate languages

[Layoutlm_v3_on_custom_token_classification.ipynb](Layoutlm_v3_on_custom_token_classification.ipynb):
- LayoutLMv3 for financial report NER
- Evaluation
- Conclusion

To use the notebooks, **deep**doctection must be installed.