<p align="center">
  <img src="https://github.com/deepdoctection/deepdoctection/blob/master/docs/tutorials/_imgs/dd_logo.png" alt="Deep Doctection Logo" width="60%">
  <h3 align="center">
  A Document AI Package - Jupyter notebook tutorials
  </h3>
</p>

# Breaking changes

With the latest release of **deep**doctection v.0.33.0 the package has been refactored and is no longer compatible with
previous releases. If you are on an earlier version, please update to the latest release or use the repo
version tagged v.0.32.0.

# Jupyter Notebooks for **deep**doctection

In this repo you will find Jupyter notebooks that used to be in the main repo [**deep**doctection](https://github.com/deepdoctection/deepdoctection).
If you encounter problems, feel free to open an issue in the **deep**doctection repository.

In addition, the repo contains a folder with examples that are used in the notebooks.

[Get_Started.ipynb](Get_Started.ipynb):
- Introduction to **deep**doctection
- Analyzer
- Output structure: Page, Layouts, Tables
- Saving and reading a parsed document

[Pipelines.ipynb](Pipelines.ipynb):
- Pipelines
- Analyzer configuration
- Pipeline components
- Layout detection models
- OCR matching and reading order

[Analyzer_Configuration.ipynb](Analyzer_Configuration.ipynb):
- Analyzer configuration
- How to change the configuration
- High-level configuration
- Layout models
- Table Transformer
- Custom model
- Table segmentation
- Text extraction
- PDFPlumber
- Tesseract
- DocTr
- AWS Textract
- Word matching
- Text ordering

[Analyzer_with_Table_Transformer.ipynb](Analyzer_with_Table_Transformer.ipynb):
- Analyzer configuration for running Table Transformer
- General configuration
- Table segmentation

[Doclaynet_with_YOLO.ipynb](Doclaynet_with_YOLO.ipynb):
- Writing a predictor from a third-party library
- Adding the model wrapper for YOLO
- Adding the model to the `ModelCatalog`
- Modifying the factory class to build the Analyzer
- Running the Analyzer with the YoloDetector

[Doclaynet_Analyzer_Config.ipynb](Doclaynet_Analyzer_Config.ipynb):
- Advanced Analyzer configuration
- Adding the model wrapper for YOLO
- Configuration to parse the page with respect to granular layout segments
- Extracting figures
- Relating captions to figures and tables

[Custom_Pipeline.ipynb](Custom_Pipeline.ipynb):
- Model catalog and registries
- Predictors
- Instantiating pipeline backbones
- Instantiating pipelines

[Datasets_and_Eval.ipynb](Datasets_and_Eval.ipynb):
- Creation of custom datasets
- Evaluation
- Fine-tuning models

[Data_structure.ipynb](Data_structure.ipynb):
- Diving deeper into the data structure
- Page and Image
- `ObjectTypes`
- `ImageAnnotation` and sub-categories
- Adding an `ImageAnnotation`
- Adding a `ContainerAnnotation` to an `ImageAnnotation`
- Sub-images from a given `ImageAnnotation`

[Using_LayoutLM_for_sequence_classification.ipynb](Using_LayoutLM_for_sequence_classification.ipynb):
- Fine-tuning LayoutLM for sequence classification on a custom dataset
- Evaluation
- Building and running a production pipeline

[Running_pre_trained_models_from_other_libraries.ipynb](Running_pre_trained_models_from_other_libraries.ipynb):
- Installing and running pre-trained models provided by Layout-Parser
- Adding new categories

The next three notebooks are experiments on a custom dataset for token classification that has been made available
through [Huggingface](https://huggingface.co/datasets/deepdoctection/FRFPE). They show how to train and evaluate each
model of the LayoutLM family and how to track experiments with W&B.

[Layoutlm_v1_on_custom_token_classification.ipynb](Layoutlm_v1_on_custom_token_classification.ipynb):
- LayoutLMv1 for financial report NER
- Defining object types
- Visualization and display of ground truth
- Defining Dataflow and Dataset
- Defining a split and saving the split distribution as a W&B artifact
- LayoutLMv1 training
- Further exploration of evaluation
- Evaluation with a confusion matrix
- Visualizing predictions and ground truth
- Evaluation on the test set
- Changing training parameters and settings

[Layoutlm_v2_on_custom_token_classification.ipynb](Layoutlm_v2_on_custom_token_classification.ipynb):
- LayoutLMv2 for financial report NER
- Defining `ObjectTypes`, Dataset and Dataflow
- Loading a W&B artifact and building the dataset split
- Exploring the language distribution across the split
- Evaluation
- LayoutXLM for financial report NER
- Training XLM models on separate languages

[Layoutlm_v3_on_custom_token_classification.ipynb](Layoutlm_v3_on_custom_token_classification.ipynb):
- LayoutLMv3 for financial report NER
- Evaluation
- Conclusion

To use the notebooks, **deep**doctection must be installed.
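A minimal setup sketch, assuming a standard `pip` install: the base package is published as `deepdoctection` on PyPI, and the main project additionally documents framework-specific extras (e.g. a PyTorch or TensorFlow stack) — check the current **deep**doctection installation guide for the extras that match your environment.

```shell
# Optional: work in an isolated virtual environment
python -m venv venv && source venv/bin/activate

# Base install; the main project also documents framework-specific extras
# (e.g. a PyTorch- or TensorFlow-backed setup) -- see its installation guide
pip install deepdoctection

# Fetch the tutorial notebooks and start Jupyter
git clone https://github.com/deepdoctection/notebooks.git
cd notebooks
jupyter notebook
```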