{"id":25486798,"url":"https://github.com/lahter/document_segmentation","last_synced_at":"2025-10-27T19:42:15.244Z","repository":{"id":216307440,"uuid":"727721609","full_name":"LAHTeR/document_segmentation","owner":"LAHTeR","description":"Tool for segmenting and classifying document boundaries.","archived":false,"fork":false,"pushed_at":"2024-04-12T15:36:07.000Z","size":50974,"stargazers_count":1,"open_issues_count":14,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-04-13T08:14:06.144Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LAHTeR.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-12-05T12:49:07.000Z","updated_at":"2024-04-15T10:27:27.213Z","dependencies_parsed_at":"2024-04-15T10:38:25.663Z","dependency_job_id":null,"html_url":"https://github.com/LAHTeR/document_segmentation","commit_stats":null,"previous_names":["lahter/document_segmentation"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LAHTeR%2Fdocument_segmentation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LAHTeR%2Fdocument_segmentation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LAHTeR%2Fdocument_segmentation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LAHTeR%2Fdocument_segmentation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LAHTeR","download_url"
:"https://codeload.github.com/LAHTeR/document_segmentation/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239535843,"owners_count":19655119,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-02-18T19:31:08.084Z","updated_at":"2025-10-27T19:42:10.190Z","avatar_url":"https://github.com/LAHTeR.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Document Segmentation\n\n[![python](https://img.shields.io/badge/Python-3.9-3776AB.svg?style=flat\u0026logo=python\u0026logoColor=white)](https://www.python.org)\n[![python](https://img.shields.io/badge/Python-3.10-3776AB.svg?style=flat\u0026logo=python\u0026logoColor=white)](https://www.python.org)\n[![python](https://img.shields.io/badge/Python-3.11-3776AB.svg?style=flat\u0026logo=python\u0026logoColor=white)](https://www.python.org)\n[![python](https://img.shields.io/badge/Python-3.12-3776AB.svg?style=flat\u0026logo=python\u0026logoColor=white)](https://www.python.org)\n\n[![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)\n[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit\u0026logoColor=white)](https://github.com/pre-commit/pre-commit)\n[![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)\n\n![Build \u0026 
Test](https://github.com/LAHTeR/document_segmentation/actions/workflows/build-test.yaml/badge.svg?branch=feature/badges)\n![Ruff](https://github.com/LAHTeR/document_segmentation/actions/workflows/ruff.yaml/badge.svg?branch=feature/badges)\n\n## Overview\n\nThis repository provides tooling for processing VOC inventories to\n\n1. extract document boundaries and\n1. classify documents\n\nFor each of these two tasks, there are two kinds of scripts:\n\n1. train a new model\n2. apply a model\n\n### Definitions and Workflow\n\n- An _inventory_ is a collection of pages.\n- A _document_ is a subset of such pages; a document can start and end on the same page, or stretch over hundreds of pages.\n- A document falls into a `TANAP category`, as defined in the `Tanap` class in [label.py](document_segmentation/pagexml/datamodel/label.py).\n\nThere are two separate tasks defined in this repository:\n\n- segmenting inventories: identify the boundaries of individual documents inside an inventory\n- classifying documents: identify the category of a document\n\n### Training and Applying Models\n\nFor each task, there is a script to train a model:\n\n- [train_segmentation_model.py](scripts/train_segmentation_model.py)\n- [train_classifier.py](scripts/train_classifier.py)\n\nSee below for instructions on installing prerequisites and running the scripts.\n\nBoth produce a model file; run either script with the `--help` argument to list its arguments.\n\nTo apply a model produced by the respective training script, call\n\n- [extract_docs.py](scripts/extract_docs.py) for extracting documents from an inventory\n- [predict_inventories.py](scripts/predict_inventories.py), a variation of `extract_docs.py`, for applying the document segmentation model to multiple inventories, optionally generating a CSV file with thumbnails for human evaluation\n- TODO: [classify_documents.py](scripts/classify_documents.py) for classifying documents\n\nAs above, run any of the scripts with the `--help` argument to get the specific 
usage.\n\n## Prerequisites\n\n### Install Poetry\n\n```console\ncurl -sSL https://install.python-poetry.org | python3 -\n```\n\nOr:\n\n```console\npipx install poetry\n```\n\nAlso see the [Poetry documentation](https://python-poetry.org/docs/#installation).\n\n### Install the dependencies\n\n```console\npoetry install\n```\n\n## Usage\n\nTo _train_ a model, run the [`scripts/train_segmentation_model.py`](scripts/train_segmentation_model.py) script.\nIt downloads the necessary data from the HUC server into the local temporary directory.\n\nSet your HUC credentials in the `HUC_USER` and `HUC_PASSWORD` environment variables or in [`settings.py`](document_segmentation/settings.py), and run the script.\n\n```console\nHUC_USER=... HUC_PASSWORD=... poetry run python scripts/train_segmentation_model.py\n```\n\nWithout the credentials, the script cannot download the inventories, but can proceed with previously downloaded ones.\nAdd the `--help` flag to see all available options.\n\nTo extract the documents of one or more inventories using a previously trained model, use the [`scripts/predict_inventories.py`](scripts/predict_inventories.py) script, for instance:\n\n```console\npoetry run python scripts/predict_inventories.py --model model.pt --inventory 1547,1548 --output 1547_1548.csv\n```\n\nMissing inventories are downloaded from the HUC server if the `HUC_USER` and `HUC_PASSWORD` environment variables are provided.\n\nAdd the `--help` flag to see all available options.\n\n## Development Instructions\n\nThis project uses\n\n- Python version \u003e= 3.9 and \u003c= 3.12\n- [Poetry](https://python-poetry.org/) for package management\n- [PyTest](https://docs.pytest.org) for unit testing\n- [Ruff](https://github.com/astral-sh/ruff) for linting and formatting\n- [Pre-commit](https://pre-commit.com/) for managing pre-commit hooks\n\n### Install Development Dependencies\n\n```console\npoetry install --with=dev\n```\n\n### Set up pre-commit hooks\n\n```console\npoetry run pre-commit install\n```\n\n### 
Run Tests\n\n```console\npoetry run pytest\n```\n\n### Architecture\n\nBoth document segmentation and classification are based on page embeddings (defined in the `PageEmbedding` class) and region embeddings (defined in the `RegionEmbedding` class).\nThe models are implemented in the `PageSequenceTagger` and `DocumentClassifier` classes respectively; both are sub-classes of the `AbstractPageLearner` class (see diagram below).\nThese classes are used for document boundary detection and document type classification respectively.\n\nThe `Inventory` class is the main data class.\nIt holds sequences of pages and labels; the `Document` class inherits from it and uses different labels.\n\nThe `Sheet` class and its sub-classes are used for reading and processing the annotated data from CSV/Excel sheets as stored in the [annotations](document_segmentation/data/annotations/) directory.\n\n(Hyper-)parameters like layer sizes and language model are defined in [settings.py](document_segmentation/settings.py).\n\n#### Classes Diagram\n\n![classes](classes.svg)\n\nRun this command to update the classes diagram:\n\n```console\npoetry run pyreverse --output svg --colorized document_segmentation\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flahter%2Fdocument_segmentation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flahter%2Fdocument_segmentation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flahter%2Fdocument_segmentation/lists"}