{"id":49527323,"url":"https://github.com/microsoft/comphrdoc","last_synced_at":"2026-05-02T04:00:56.731Z","repository":{"id":230196648,"uuid":"763901325","full_name":"microsoft/CompHRDoc","owner":"microsoft","description":"Datasets and Evaluation Scripts for CompHRDoc","archived":false,"fork":false,"pushed_at":"2025-02-25T02:55:28.000Z","size":1461,"stargazers_count":58,"open_issues_count":1,"forks_count":9,"subscribers_count":4,"default_branch":"main","last_synced_at":"2026-05-02T03:31:06.904Z","etag":null,"topics":["document-structure-analysis","document-understanding","rag-related"],"latest_commit_sha":null,"homepage":"https://github.com/microsoft/CompHRDoc","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-02-27T05:45:59.000Z","updated_at":"2026-04-15T07:37:03.000Z","dependencies_parsed_at":"2026-02-26T05:02:44.544Z","dependency_job_id":null,"html_url":"https://github.com/microsoft/CompHRDoc","commit_stats":null,"previous_names":["microsoft/comphrdoc"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/microsoft/CompHRDoc","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FCompHRDoc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FCompHRDoc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FCompHRDoc/releases","
manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FCompHRDoc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/CompHRDoc/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FCompHRDoc/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32522252,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-02T01:12:54.858Z","status":"online","status_checked_at":"2026-05-02T02:00:05.923Z","response_time":132,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["document-structure-analysis","document-understanding","rag-related"],"created_at":"2026-05-02T04:00:47.815Z","updated_at":"2026-05-02T04:00:56.683Z","avatar_url":"https://github.com/microsoft.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CompHRDoc\n\nComp-HRDoc is the first comprehensive benchmark, specifically designed for hierarchical document structure analysis. It encompasses tasks such as page object detection, reading order prediction, table of contents extraction, and hierarchical structure reconstruction. Comp-HRDoc is built upon the [HRDoc-Hard dataset](https://github.com/jfma-USTC/HRDoc), which comprises 1,000 documents for training and 500 documents for testing. 
We retain all original images without modification and extend the original annotations to support evaluation of these tasks. The dataset is intended for model training and testing: use it to train a model or to evaluate a model's performance on hierarchical document structure analysis.\n\n## News\n\n- **We released the annotations of the Comp-HRDoc benchmark; please refer to [`CompHRDoc.zip`](./CompHRDoc.zip).**\n- **We released the evaluation tool of the Comp-HRDoc benchmark; please refer to the [`evaluation`](evaluation/) folder.**\n- **We released the original paper, [Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis](https://arxiv.org/pdf/2401.11874.pdf), on arXiv.**\n\n## Introduction\n\nDocument Structure Analysis (DSA) is a comprehensive process that identifies the fundamental components within a document, encompassing headings, paragraphs, lists, tables, and figures, and subsequently establishes the logical relationships and structures of these components. This process results in a structured representation of the document’s physical layout that accurately mirrors its logical structure, thereby enhancing the effectiveness and accessibility of information retrieval and processing. In the contemporary digital landscape, the majority of mainstream documents are structured creations, crafted using hierarchical-schema authoring software such as LaTeX, Microsoft Word, and HTML. Consequently, Hierarchical Document Structure Analysis (HDSA), which focuses on extracting and reconstructing the inherent hierarchical structures within these document layouts, has gained significant attention. Previous datasets primarily focus on specific sub-tasks of DSA, such as Page Object Detection, Reading Order Prediction, and Table of Contents (TOC) Extraction. 
Despite the substantial progress achieved in these individual sub-tasks, there remains a gap in the research community for a comprehensive end-to-end system or benchmark that addresses all aspects of document structure analysis concurrently. Leveraging the HRDoc dataset, we establish a comprehensive benchmark, Comp-HRDoc, aimed at evaluating page object detection, reading order prediction, table of contents extraction, and hierarchical structure reconstruction concurrently.\n\n\u003c!-- ![](assets/example.png) --\u003e\n\u003cimg src=\"assets/example.png\" height=\"500\" alt=\"\"\u003e\n\n### Data Directory Structure\n\n```plaintext\nComp-HRDoc/\n├── HRDH_MSRA_POD_TRAIN/\n│   ├── Images/ # put the document images of HRDoc-Hard training set into this folder\n│   │   ├── 1401.6399_0.png\n│   │   ├── 1401.6399_1.png\n│   │   └── ...\n│   ├── hdsa_train.json\n│   ├── coco_train.json\n│   ├── README.md # a detailed explanation of each file and folder\n│   └── ...\n└── HRDH_MSRA_POD_TEST/\n    ├── Images/ # put the document images of HRDoc-Hard test set into this folder\n    │   ├── 1401.3699_0.png\n    │   ├── 1401.3699_1.png\n    │   └── ...\n    ├── test_eval/ # hierarchical document structure for evaluation\n    │   ├── 1401.3699.json\n    │   ├── 1402.2741.json\n    │   └── ...\n    ├── test_eval_toc/ # table of contents structure for evaluation\n    │   ├── 1401.3699.json\n    │   ├── 1402.2741.json\n    │   └── ...\n    ├── hdsa_test.json\n    ├── coco_test.json\n    ├── README.md # a detailed explanation of each file and folder\n    └── ...\n```\n\n**For a detailed explanation of each file and folder, please refer to `datasets/Comp-HRDoc/HRDH_MSRA_POD_TRAIN/README.md` and `datasets/Comp-HRDoc/HRDH_MSRA_POD_TEST/README.md`.**\n\n**Due to license restrictions, please go to [HRDoc-Hard dataset](https://github.com/jfma-USTC/HRDoc) to download the images of HRDoc-Hard and put them into the corresponding folders.**\n\n### Evaluation Tool\n\nTo utilize the evaluation 
tool for assessing your model's performance on the Comp-HRDoc dataset, please consult the script located at [`evaluation/unified_layout_evaluation.py`](evaluation/unified_layout_evaluation.py).\n\nBelow is an example illustrating how to conduct an evaluation for the task of reconstructing the hierarchical document structure:\n```bash\n# Ground-truth hierarchical structures shipped with the test split\nhds_gt=\"datasets/Comp-HRDoc/HRDH_MSRA_POD_TEST/test_eval/\"\n# Folder holding your model's predicted hierarchical structures\nhds_pred=\"path_to_your_predicted_hierarchical_structure/\"\npython evaluation/hrdoc_tool/teds_eval.py --gt_anno \"${hds_gt}\" --pred_folder \"${hds_pred}\"\n```\n\nWe also provide some examples in [`evaluation/examples/`](evaluation/examples/) to demonstrate the format of predicted files required by the evaluation tool.\n\n### Detect-Order-Construct\n\nWe propose a comprehensive, tree construction based approach to analyzing hierarchical document structures. This method decomposes tree construction into three distinct stages, namely Detect, Order, and Construct. Initially, given a set of document images, the Detect stage is dedicated to identifying all page objects and assigning a logical role to each object, thereby forming the nodes of the hierarchical document structure tree. Following this, the Order stage establishes the reading order relationships among these nodes, which corresponds to a pre-order traversal of the hierarchical document structure tree. Finally, the Construct stage identifies hierarchical relationships (e.g., Table of Contents) between semantic units to construct an abstract hierarchical document structure tree. 
By integrating the results of all three stages, we can effectively construct a complete hierarchical document structure tree, facilitating a more comprehensive understanding of complex documents.\n\n\u003cimg src=\"assets/pipeline.png\"\u003e\n\n## Results\n\n### Hierarchical Document Structure Reconstruction on HRDoc\n\u003cimg src=\"assets/hrdoc_results.png\"\u003e\n\n### End-to-End Evaluation on Comp-HRDoc\n\u003cimg src=\"assets/results.png\"\u003e\n\n## Contributing\n\nThis project welcomes contributions and suggestions.  Most contributions require you to agree to a\nContributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us\nthe rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.\n\nWhen you submit a pull request, a CLA bot will automatically determine whether you need to provide\na CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions\nprovided by the bot. You will only need to do this once across all repos using our CLA.\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).\nFor more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or\ncontact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.\n\n## Privacy Statement\n\nPlease refer to [Microsoft Privacy Statement](https://go.microsoft.com/fwlink/?LinkId=521839).\n\n## 📝Citing\n\nIf you find this code useful, please consider citing our work.\n\n```\n@article{wang2024detect,\n  title={Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis},\n  author={Wang, Jiawei and Hu, Kai and Zhong, Zhuoyao and Sun, Lei and Huo, Qiang},\n  journal={arXiv preprint arXiv:2401.11874},\n  
year={2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fcomphrdoc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2Fcomphrdoc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fcomphrdoc/lists"}