{"id":13579633,"url":"https://github.com/deepdoctection/deepdoctection","last_synced_at":"2026-01-04T12:12:14.008Z","repository":{"id":37687274,"uuid":"436510608","full_name":"deepdoctection/deepdoctection","owner":"deepdoctection","description":"A Repo For Document AI","archived":false,"fork":false,"pushed_at":"2025-04-10T10:59:42.000Z","size":22907,"stargazers_count":2796,"open_issues_count":19,"forks_count":154,"subscribers_count":20,"default_branch":"master","last_synced_at":"2025-04-23T21:02:12.392Z","etag":null,"topics":["document-ai","document-image-analysis","document-layout-analysis","document-parser","document-understanding","layoutlm","nlp","ocr","publaynet","pubtabnet","python","pytorch","table-detection","table-recognition","tensorflow"],"latest_commit_sha":null,"homepage":"https://deepdoctection.readthedocs.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/deepdoctection.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-12-09T06:43:29.000Z","updated_at":"2025-04-23T06:43:47.000Z","dependencies_parsed_at":"2023-10-15T16:25:13.764Z","dependency_job_id":"2faa7ca6-5f5c-4169-afd7-c1045dd9443e","html_url":"https://github.com/deepdoctection/deepdoctection","commit_stats":{"total_commits":1146,"total_committers":10,"mean_commits":114.6,"dds":0.03403141361256545,"last_synced_commit":"d8bb95d3b1aa72723c56d48cc58e05cc7bbede33"},"previous_names":[],"tags_count":40,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/de
epdoctection%2Fdeepdoctection","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deepdoctection%2Fdeepdoctection/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deepdoctection%2Fdeepdoctection/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deepdoctection%2Fdeepdoctection/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/deepdoctection","download_url":"https://codeload.github.com/deepdoctection/deepdoctection/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250514780,"owners_count":21443209,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["document-ai","document-image-analysis","document-layout-analysis","document-parser","document-understanding","layoutlm","nlp","ocr","publaynet","pubtabnet","python","pytorch","table-detection","table-recognition","tensorflow"],"created_at":"2024-08-01T15:01:41.448Z","updated_at":"2025-12-30T02:25:04.926Z","avatar_url":"https://github.com/deepdoctection.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/deepdoctection/deepdoctection/raw/master/docs/_imgs/dd_logo.png\" alt=\"Deep Doctection Logo\" width=\"60%\"\u003e\n\u003c/p\u003e\n\n![GitHub Repo stars](https://img.shields.io/github/stars/deepdoctection/deepdoctection)\n![PyPI - Version](https://img.shields.io/pypi/v/deepdoctection)\n![PyPI - 
License](https://img.shields.io/pypi/l/deepdoctection)\n\n\n------------------------------------------------------------------------------------------------------------------------\n# NEW\n\nVersion `v.1.0` includes a major refactoring. Key changes include:\n\n* PyTorch-only support for all deep learning models.\n* Support for many more fine-tuned models from the Hugging Face Hub (BERT, RoBERTa, LayoutLM, LiLT, ...).\n* Decomposition into small sub-packages: dd-core, dd-datasets and deepdoctection.\n* Type validation of core data structures.\n* A new test suite.\n\n------------------------------------------------------------------------------------------------------------------------\n\n\u003cp align=\"center\"\u003e\n  \u003ch1 align=\"center\"\u003e\n  A Package for Document Understanding\n  \u003c/h1\u003e\n\u003c/p\u003e\n\n\n**deep**doctection is a Python library that orchestrates layout analysis, OCR, and document and token classification for scans and native PDF documents. Build and run a pipeline for your document extraction tasks, develop your own document extraction workflow, and fine-tune pre-trained models to use them seamlessly for inference.\n\n# Overview\n\n- Document layout analysis and table recognition in PyTorch with [**Detectron2**](https://github.com/facebookresearch/detectron2/tree/main/detectron2) and [**Transformers**](https://github.com/huggingface/transformers).\n- OCR with support for [**Tesseract**](https://github.com/tesseract-ocr/tesseract), [**DocTr**](https://github.com/mindee/doctr) and [**AWS Textract**](https://aws.amazon.com/textract/).\n- Document and token classification with the [**LayoutLM**](https://github.com/microsoft/unilm) family, [**LiLT**](https://github.com/jpWang/LiLT) and many [**BERT**](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)-style models, including features like sliding windows.\n- Text mining for native PDFs with [**pdfplumber**](https://github.com/jsvine/pdfplumber).\n- Language detection 
with the transformer-based model `papluca/xlm-roberta-base-language-detection`.\n- Deskewing and rotating images with [**jdeskew**](https://github.com/phamquiluan/jdeskew) or [**Tesseract**](https://github.com/tesseract-ocr/tesseract).\n- Fine-tuning object detection, document or token classification models and evaluating whole pipelines.\n- Lots of [tutorials](https://github.com/deepdoctection/notebooks).\n\nHave a look at the [**introduction notebook**](https://github.com/deepdoctection/notebooks/blob/main/Analyzer_Get_Started.ipynb) for an easy start.\n\nCheck the [**release notes**](https://github.com/deepdoctection/deepdoctection/releases) for recent updates.\n\n----------------------------------------------------------------------------------------\n\n# Hugging Face Space Demo\n\nCheck out the demo of a document layout analysis pipeline with OCR on 🤗\n[**Hugging Face Spaces**](https://huggingface.co/spaces/deepdoctection/deepdoctection).\n\n--------------------------------------------------------------------------------------------------------\n\n# Example\n\nThe following example shows how to use the built-in analyzer to decompose a PDF document into its layout structures.\n\n```python\nimport deepdoctection as dd\nfrom IPython.core.display import HTML\nfrom matplotlib import pyplot as plt\n\nanalyzer = dd.get_dd_analyzer()  # instantiate the built-in analyzer, similar to the Hugging Face space demo\n\ndf = analyzer.analyze(path=\"/path/to/your/doc.pdf\")  # set up the pipeline; processing is lazy\ndf.reset_state()  # trigger initialization\n\ndoc = iter(df)\npage = next(doc)\n\nimage = page.viz(show_figures=True, show_residual_layouts=True)\nplt.figure(figsize=(25, 17))\nplt.axis('off')\nplt.imshow(image)\n```\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/deepdoctection/deepdoctection/raw/master/docs/_imgs/dd_rm_sample.png\" \nalt=\"sample\" width=\"40%\"\u003e\n\u003c/p\u003e\n\n```python\nHTML(page.tables[0].html)\n```\n\n\u003cp 
align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/deepdoctection/deepdoctection/raw/master/docs/_imgs/dd_rm_table.png\" \nalt=\"table\" width=\"40%\"\u003e\n\u003c/p\u003e\n\n```\nprint(page.text)\n```\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/deepdoctection/deepdoctection/raw/master/docs/_imgs/dd_rm_text.png\" \nalt=\"text\" width=\"40%\"\u003e\n\u003c/p\u003e\n\n\n\n-----------------------------------------------------------------------------------------\n\n# Requirements\n\n![requirements](https://github.com/deepdoctection/deepdoctection/raw/master/docs/_imgs/install_01.png)\n\n- Python \u003e= 3.10\n- PyTorch \u003e= 2.6\n- To fine-tune models, a GPU is recommended.\n\n------------------------------------------------------------------------------------------\n\n# Installation\n\nWe recommend using a virtual environment.\n\n## Get started installation\n\nFor a simple setup which is enough to parse documents with the default setting, install the following\n\n```\nuv pip install timm  # needed for the default setup\nuv pip install transformers\nuv pip install python-doctr\nuv pip install deepdoctection\n```\n\nThis setup is sufficient to run the [**introduction notebook**](https://github.com/deepdoctection/notebooks/blob/main/Get_Started.ipynb).\n\n### Full installation\n\nThe following installation will give you a general setup so that you can experiment with various configurations.\nRemember, that you always have to install PyTorch separately.\n\nFirst install **Detectron2** separately as it is not distributed via PyPi. 
Check the instructions\n[here](https://detectron2.readthedocs.io/en/latest/tutorials/install.html) or try:\n\n```\nuv pip install --no-build-isolation detectron2@git+https://github.com/deepdoctection/detectron2.git\n```\n\nThen install **deep**doctection with all its dependencies:\n\n```\nuv pip install deepdoctection[full]\n```\n\n\nFor further information, please consult the [**full installation instructions**](https://deepdoctection.readthedocs.io/en/latest/install/).\n\n\n## Installation from source\n\nDownload the repository or clone it via\n\n```\ngit clone https://github.com/deepdoctection/deepdoctection.git\n```\n\nThe easiest way to install is with make. A virtual environment is required:\n\n```bash\nmake install-dd\n```\n\n\n## Running a Docker container from Docker Hub\n\nPre-built Docker images can be downloaded from [Docker Hub](https://hub.docker.com/r/deepdoctection/deepdoctection).\n\nAdditionally, specify a working directory to mount files to be processed into the container. Running\n\n```\ndocker compose up -d\n```\n\nwill start the container. Note that no endpoint is exposed.\n\n-----------------------------------------------------------------------------------------------\n\n# Credits\n\nWe thank all the libraries that provide high-quality code and pre-trained models. Without them, it would have been impossible to develop this framework.\n\n\n# If you like **deep**doctection ...\n\n...you can easily support the project by making it more visible. Leaving a star or a recommendation will help.\n\n# License\n\nDistributed under the Apache 2.0 License. 
Check [LICENSE](https://github.com/deepdoctection/deepdoctection/blob/master/LICENSE) for additional information.\n","funding_links":[],"categories":["Python","Resources"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeepdoctection%2Fdeepdoctection","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeepdoctection%2Fdeepdoctection","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeepdoctection%2Fdeepdoctection/lists"}