https://github.com/deepdoctection/deepdoctection
A Repo For Document AI
document-ai document-image-analysis document-layout-analysis document-parser document-understanding layoutlm nlp ocr publaynet pubtabnet python pytorch table-detection table-recognition tensorflow
- Host: GitHub
- URL: https://github.com/deepdoctection/deepdoctection
- Owner: deepdoctection
- License: apache-2.0
- Created: 2021-12-09T06:43:29.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2025-04-10T10:59:42.000Z (9 months ago)
- Last Synced: 2025-04-23T21:02:12.392Z (8 months ago)
- Topics: document-ai, document-image-analysis, document-layout-analysis, document-parser, document-understanding, layoutlm, nlp, ocr, publaynet, pubtabnet, python, pytorch, table-detection, table-recognition, tensorflow
- Language: Python
- Homepage: https://deepdoctection.readthedocs.io/
- Size: 21.8 MB
- Stars: 2,796
- Watchers: 20
- Forks: 154
- Open Issues: 19
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE



------------------------------------------------------------------------------------------------------------------------
# NEW
Version `v.1.0` includes a major refactoring. Key changes include:
* PyTorch-only support for all deep learning models.
* Support for many more fine-tuned models from the Hugging Face Hub (BERT, RoBERTa, LayoutLM, LiLT, ...)
* Decomposition into small sub-packages: dd-core, dd-datasets and deepdoctection
* Type validations of core data structures
* New test suite
------------------------------------------------------------------------------------------------------------------------
A Package for Document Understanding
**deep**doctection is a Python library that orchestrates layout analysis, OCR, and document and token
classification for scanned and PDF documents. Build and run a pipeline for your document extraction tasks,
develop your own document extraction workflow, fine-tune pre-trained models, and use them seamlessly for inference.
# Overview
- Document layout analysis and table recognition in PyTorch with
[**Detectron2**](https://github.com/facebookresearch/detectron2/tree/main/detectron2) and
[**Transformers**](https://github.com/huggingface/transformers),
- OCR with support for [**Tesseract**](https://github.com/tesseract-ocr/tesseract), [**DocTr**](https://github.com/mindee/doctr) and
[**AWS Textract**](https://aws.amazon.com/textract/),
- Document and token classification with the [**LayoutLM**](https://github.com/microsoft/unilm) family,
[**LiLT**](https://github.com/jpWang/LiLT) and many
[**BERT**](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)-style models, including features like sliding windows.
- Text mining for native PDFs with [**pdfplumber**](https://github.com/jsvine/pdfplumber),
- Language detection with the transformer-based `papluca/xlm-roberta-base-language-detection`.
- Deskewing and rotating images with [**jdeskew**](https://github.com/phamquiluan/jdeskew) or [**Tesseract**](https://github.com/tesseract-ocr/tesseract).
- Fine-tuning object detection, document or token classification models and evaluating whole pipelines.
- Lots of [tutorials](https://github.com/deepdoctection/notebooks)
Have a look at the [**introduction notebook**](https://github.com/deepdoctection/notebooks/blob/main/Analyzer_Get_Started.ipynb) for an easy start.
Check the [**release notes**](https://github.com/deepdoctection/deepdoctection/releases) for recent updates.
----------------------------------------------------------------------------------------
# Hugging Face Space Demo
Check the demo of a document layout analysis pipeline with OCR on 🤗
[**Hugging Face spaces**](https://huggingface.co/spaces/deepdoctection/deepdoctection).
--------------------------------------------------------------------------------------------------------
# Example
The following example shows how to use the built-in analyzer to decompose a PDF document into its layout structures.
```python
import deepdoctection as dd
from IPython.display import HTML
from matplotlib import pyplot as plt

# Instantiate the built-in analyzer, similar to the Hugging Face space demo
analyzer = dd.get_dd_analyzer()

# Set up the pipeline for the given document
df = analyzer.analyze(path="/path/to/your/doc.pdf")
df.reset_state()  # trigger some initialization

# Fetch the first page of the document
doc = iter(df)
page = next(doc)

# Render the page with its detected layout structures
image = page.viz(show_figures=True, show_residual_layouts=True)
plt.figure(figsize=(25, 17))
plt.axis("off")
plt.imshow(image)
```
```python
HTML(page.tables[0].html)
```
```python
print(page.text)
```
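The example above inspects only the first page. As a minimal sketch, assuming that iterating over the dataflow yields one page object per page (as `iter(df)` above suggests), the text and table HTML of every page could be collected as follows. The names `PageSummary`, `summarize_pages` and `analyze_pdf` are hypothetical helpers for illustration and are not part of deepdoctection; only `get_dd_analyzer`, `analyze`, `reset_state`, `page.text`, `page.tables` and `table.html` are taken from the example above.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class PageSummary:
    """Hypothetical container for the text and table HTML of one page."""
    text: str
    tables_html: List[str] = field(default_factory=list)


def summarize_pages(pages: Iterable) -> List[PageSummary]:
    """Collect text and table HTML from objects exposing `.text` and `.tables`."""
    summaries = []
    for page in pages:
        tables_html = [table.html for table in page.tables]
        summaries.append(PageSummary(text=page.text, tables_html=tables_html))
    return summaries


def analyze_pdf(path: str) -> List[PageSummary]:
    """Run the built-in analyzer on a PDF and summarize every page."""
    # Imported lazily so that summarize_pages stays usable without the library.
    import deepdoctection as dd

    analyzer = dd.get_dd_analyzer()
    df = analyzer.analyze(path=path)
    df.reset_state()
    # Assumption: iterating the dataflow yields page objects, as iter(df) does above.
    return summarize_pages(df)
```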
-----------------------------------------------------------------------------------------
# Requirements

- Python >= 3.10
- PyTorch >= 2.6
- To fine-tune models, a GPU is recommended.
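
The version requirements above can be checked before installing. The following is a minimal sketch, not part of deepdoctection: `major_minor`, `requirement_problems` and `installed_torch_version` are hypothetical helper names, and the check compares only the major and minor version numbers. Running the script prints one line per unmet requirement and nothing if both are satisfied.

```python
import re
import sys
from importlib import metadata
from typing import List, Optional, Tuple

MIN_PYTHON = (3, 10)  # from the requirements list above
MIN_TORCH = (2, 6)    # from the requirements list above


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.6.0+cu124'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def requirement_problems(python_version: Tuple[int, int],
                         torch_version: Optional[str]) -> List[str]:
    """Return a human-readable message for every unmet requirement."""
    problems = []
    if tuple(python_version) < MIN_PYTHON:
        problems.append(f"Python {python_version[0]}.{python_version[1]} is older than 3.10")
    if torch_version is None:
        problems.append("PyTorch is not installed")
    elif major_minor(torch_version) < MIN_TORCH:
        problems.append(f"PyTorch {torch_version} is older than 2.6")
    return problems


def installed_torch_version() -> Optional[str]:
    """Look up the installed torch distribution without importing it."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for problem in requirement_problems(sys.version_info[:2], installed_torch_version()):
        print(problem)
```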
------------------------------------------------------------------------------------------
# Installation
We recommend using a virtual environment.
## Get started installation
For a simple setup that is sufficient to parse documents with the default settings, install the following:
```
uv pip install timm # needed for the default setup
uv pip install transformers
uv pip install python-doctr
uv pip install deepdoctection
```
This setup is sufficient to run the [**introduction notebook**](https://github.com/deepdoctection/notebooks/blob/main/Get_Started.ipynb).
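
To confirm that the packages from the commands above are visible to the current interpreter, a small check along the following lines may help. This is a sketch under stated assumptions: `missing_modules` and `QUICK_START_MODULES` are hypothetical names, and the import name `doctr` for the `python-doctr` distribution is assumed rather than taken from this README.

```python
import importlib.util
from typing import Iterable, List

# Import names assumed for the distributions installed above;
# `python-doctr` is assumed to be importable as `doctr`.
QUICK_START_MODULES = ("timm", "transformers", "doctr", "deepdoctection")


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(QUICK_START_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```

The check only locates the modules without importing them, so it stays fast even when the deep learning packages are large.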
## Full installation
The following installation will give you a general setup so that you can experiment with various configurations.
Remember that you always have to install PyTorch separately.
First install **Detectron2** separately, as it is not distributed via PyPI. Check the instructions
[here](https://detectron2.readthedocs.io/en/latest/tutorials/install.html) or try:
```
uv pip install --no-build-isolation detectron2@git+https://github.com/deepdoctection/detectron2.git
```
Then install **deep**doctection with all its dependencies:
```
uv pip install "deepdoctection[full]"
```
For further information, please consult the [**full installation instructions**](https://deepdoctection.readthedocs.io/en/latest/install/).
## Installation from source
Download the repository or clone it via:
```
git clone https://github.com/deepdoctection/deepdoctection.git
```
The easiest way to install is with make. A virtual environment is required:
```bash
make install-dd
```
## Running a Docker container from Docker Hub
Prebuilt Docker images can be downloaded from [Docker Hub](https://hub.docker.com/r/deepdoctection/deepdoctection).
Additionally, specify a working directory to mount files to be processed into the container. Then start the container with:
```
docker compose up -d
```
Note that the container does not expose an endpoint.
-----------------------------------------------------------------------------------------------
# Credits
We thank all libraries that provide high-quality code and pre-trained models. Without them, it would have been
impossible to develop this framework.
# If you like **deep**doctection ...
...you can easily support the project by making it more visible. Leaving a star or a recommendation will help.
# License
Distributed under the Apache 2.0 License. Check [LICENSE](https://github.com/deepdoctection/deepdoctection/blob/master/LICENSE) for additional information.