{"id":23178848,"url":"https://github.com/Alpha-Innovator/DocGenome","last_synced_at":"2025-09-14T18:32:40.955Z","repository":{"id":242887933,"uuid":"805240851","full_name":"UniModal4Reasoning/DocGenome","owner":"UniModal4Reasoning","description":"DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Models","archived":false,"fork":false,"pushed_at":"2024-09-06T06:51:09.000Z","size":15972,"stargazers_count":146,"open_issues_count":6,"forks_count":5,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-01-01T09:04:33.683Z","etag":null,"topics":["document-understanding","paper-annotation","question-answering"],"latest_commit_sha":null,"homepage":"https://unimodal4reasoning.github.io/DocGenome_page/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc-by-4.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/UniModal4Reasoning.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-24T07:04:35.000Z","updated_at":"2024-12-23T07:31:51.000Z","dependencies_parsed_at":"2024-06-05T15:12:40.094Z","dependency_job_id":"ed8b8403-6fbd-47d3-a12d-b315ebebe946","html_url":"https://github.com/UniModal4Reasoning/DocGenome","commit_stats":null,"previous_names":["unimodal4reasoning/docgenome"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UniModal4Reasoning%2FDocGenome","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UniModal4Reasoning%2FDocGenome/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/repositories/UniModal4Reasoning%2FDocGenome/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UniModal4Reasoning%2FDocGenome/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/UniModal4Reasoning","download_url":"https://codeload.github.com/UniModal4Reasoning/DocGenome/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":233011235,"owners_count":18611078,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["document-understanding","paper-annotation","question-answering"],"created_at":"2024-12-18T07:12:58.127Z","updated_at":"2025-09-14T18:32:40.923Z","avatar_url":"https://github.com/UniModal4Reasoning.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![arXiv](https://img.shields.io/badge/arXiv-2406.11633-b31b1b.svg)](https://arxiv.org/abs/2406.11633)\n[![GitHub issues](https://img.shields.io/github/issues/UniModal4Reasoning/DocGenome)](https://github.com/UniModal4Reasoning/DocGenome/issues)\n[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](https://github.com/UniModal4Reasoning/DocGenome/pulls)\n\n# DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models\n\nWe present DocGenome, a structured document dataset constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community, using our custom auto-labeling pipeline DocParser. 
DocGenome features four characteristics:\n\n- 1) Completeness: It is the first dataset to structure data from all modalities, including 13 layout attributes along with their LaTeX source code.\n- 2) Logicality: It provides 6 logical relationships between different entities within each scientific document.\n- 3) Diversity: It covers various document-oriented tasks, including document classification, visual grounding, document layout detection, document transformation, open-ended single-page QA, and multi-page QA.\n- 4) Correctness: It undergoes rigorous quality control checks conducted by a specialized team.\n\n\n## Release\n- [2025/1/13] Release the document parsing tool [DocParser](https://github.com/Alpha-Innovator/DocParser).\n- [2024/9/5] 🔥 Add the **data quality rating** for each structured document in DocGenome [here](https://huggingface.co/datasets/U4R/DocGenome/blob/main/tire_classification_train.json).\n- [2024/8/27] Add a tutorial on how to use the [DocGenome dataset](https://github.com/UniModal4Reasoning/DocGenome/blob/main/tutorials/tutorial.ipynb).\n- [2024/8/7] Add a detailed explanation of the different file structures in DocGenome: [Dataset_Details_README](Dataset_Details_README.md).\n- [2024/7/23] We now support **TestSet** downloads from [Hugging Face](https://huggingface.co/datasets/U4R/DocGenome-Testset-DocQA/tree/main). 
If you want to evaluate your model on TestSet, please refer to [Evaluation](docs/Evaluation_README.md).\n- [2024/7/12] We now support dataset downloads from [Hugging Face](https://huggingface.co/datasets/U4R/DocGenome/tree/main).\n- [2024/6/15] 🔥 Our paper entitled \"DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Models\" has been released on arXiv [Link]()\n- [2024/6/6] 🔥 We have released the DocGenome benchmark, which includes 8 subsets as follows:\n    - [docgenome-train-000.tar.gz](https://drive.google.com/file/d/1p6o6naxPgWLvfmBIvmVionBv4ygSV5Sy/view?usp=drive_link)\n    - [docgenome-train-001.tar.gz](https://drive.google.com/file/d/16xTMiZb7E-VPUdIU32mA3qNeuZIoIJ-9/view?usp=drive_link)\n    - [docgenome-train-002.tar.gz](https://drive.google.com/file/d/1qW64JRqlFzkx1wwMwo8vdsyTUcetH_H2/view?usp=drive_link)\n    - [docgenome-train-003.tar.gz](https://drive.google.com/file/d/1JlgHou0JchCn8F4Dspb22YMQRkgHm1dG/view?usp=drive_link)\n    - [docgenome-train-004.tar.gz](https://drive.google.com/file/d/1XEuAz1tlo1jzBYk7scSBPbVsf1KqjN6c/view?usp=drive_link)\n    - [docgenome-train-005.tar.gz](https://drive.google.com/file/d/1Fz4f9YBRG7Ro7b1uovEKl2pfyizFIq-z/view?usp=drive_link)\n    - [docgenome-train-006.tar.gz]()\n    - [docgenome-train-007.tar.gz]()\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"assets/motivation.png\" height=\"95%\"\u003e\n\u003c/div\u003e\n\n## File Structure\nPlease refer to [Dataset_Details_README](Dataset_Details_README.md) for a detailed explanation of the different file structures in DocGenome.\n\n## DocGenome Benchmark Introduction\n\n| Datasets                | \\# Disciplines | \\# Categories of Units  | \\# Pages in Train-set       | \\# Pages in Test-set | \\# Tasks    | \\# Metrics Used | Publication Period | Entity Relations          
|\n|------------------------------------------|--------------------------------|-----------------|--------------------|--------------|------------|--------------------|-------------|-----------------|\n| DocVQA         | -                              | N/A             | 11K                | 1K           | 1          | 2                  | 1960-2000   | ❎     |\n| DocLayNet | -                              | 11              | 80K                | 8K           | 1          | 1                  | -           | ❎     |\n| DocBank            | -                              | 13              | 0.45M              | **50K** | 3          | 1                  | 2014-2018   | ❎     |\n| PubLayNet   | -                              | 5               | 0.34M              | 12K          | 1          | 1                  | -           | ❎     |\n| VRDU               | -                              | 10              | 7K                 | 3K           | 3          | 1                  | -           | ❎     |\n| DUDE             | -                              | N/A             | 20K                | 6K           | 3          | 3                  | 1860-2022   | ❎     |\n| D^4LA             | -                              | **27**    | 8K                 | 2K           | 1          | 3                  | -           | ❎     |\n| Fox Benchmark       | -                              | 5               | N/A (No train-set) | 0.2K         | 3          | 5                  | -           | ❎     |\n| ArXivCap        | 32                             | N/A             | 6.4M*           | N/A          | 4          | 3                  | -           | ❎    |\n| DocGenome (ours)                | **153**                   | 13              | **6.8M**      | 9K           | **7** | **7**         | 2007-2022   | ✅     |\n\n\n\u0026ensp;\n------------------------\n\n### 👇🏻 DocGenome-train Download\n\nWe provide 8 subsets of 
DocGenome-train for downloading:\n\n\u003cdetails\u003e\n\u003csummary\u003e Data Download\u003c/summary\u003e\n\n- [docgenome-train-000.tar.gz](https://drive.google.com/file/d/1p6o6naxPgWLvfmBIvmVionBv4ygSV5Sy/view?usp=drive_link)\n- [docgenome-train-001.tar.gz](https://drive.google.com/file/d/16xTMiZb7E-VPUdIU32mA3qNeuZIoIJ-9/view?usp=drive_link)\n- [docgenome-train-002.tar.gz](https://drive.google.com/file/d/1qW64JRqlFzkx1wwMwo8vdsyTUcetH_H2/view?usp=drive_link)\n- [docgenome-train-003.tar.gz](https://drive.google.com/file/d/1JlgHou0JchCn8F4Dspb22YMQRkgHm1dG/view?usp=drive_link)\n- [docgenome-train-004.tar.gz](https://drive.google.com/file/d/1XEuAz1tlo1jzBYk7scSBPbVsf1KqjN6c/view?usp=drive_link)\n- [docgenome-train-005.tar.gz](https://drive.google.com/file/d/1Fz4f9YBRG7Ro7b1uovEKl2pfyizFIq-z/view?usp=drive_link)\n- [docgenome-train-006.tar.gz]()\n- [docgenome-train-007.tar.gz]()\n\u003c/details\u003e\n\n\n### Definition of relationships between component units\nDocGenome contains 4 level relation types and 2 cite relation types, as shown in the following table:\n\n| **Name**       | Description         | Example                 |\n|--------------------------------------|---------------------------------------------------------|----------------------------------------------------------------------------|\n| Identical         | Two blocks share the same source code.                           | Cross-column text; Cross-page text.                                        |\n| Title adjacent      | The two titles are adjacent.                                     | (\\section\\{introduction\\}, \\section\\{method\\}) |\n| Subordinate        | One block is a subclass of another block.                        | (\\section\\{introduction\\}, paragraph within Introduction)    |\n| Non-title adjacent  | The two text or equation blocks are adjacent.                    
| (Paragraph 1, Paragraph 2)                                                 |\n| Explicitly-referred | One block refers to another block via footnote, reference, etc.  | (As shown in \\ref\\{Fig: 5\\} ..., Figure 5)                   |\n| Implicitly-referred | The caption block refers to the corresponding float environment. | (Table Caption 1, Table 1)           |\n\n### Attributes of component units\nDocGenome has 13 attributes of component units, which can be categorized into two classes:\n- **1) Fixed-form units**, including Text, Title, Abstract, etc., which are characterized by sequential reading and hierarchical relationships readily discernible from the list obtained in Stage-two of the designed DocParser.\n- **2) Floating-form units**, including Table, Figure, etc., which establish directional references to fixed-form units through commands like `\\ref` and `\\label`.\n\n| **Index**  | **Category** | **Notes**                           |\n|----------------|-------------------|------------------------------------------|\n| 0              | Algorithm         |                                          |\n| 1              | Caption           | Titles of Images, Tables, and Algorithms |\n| 2              | Equation          |                                          |\n| 3              | Figure            |                                          |\n| 4              | Footnote          |                                          |\n| 5              | List              |                                          |\n| 7              | Table             |                                          |\n| 8              | Text              |                                          |\n| 9              | Text-EQ           | Text block with inline equations         |\n| 10             | Title             | Section titles                           |\n| 12             | PaperTitle        | 
|\n| 13             | Code              |                                          |\n| 14             | Abstract          |                                          |\n\n\n\n## Types of disciplines\n\nPage distribution of DocGenome. 20% of documents are five pages or fewer, 50% are ten pages or fewer, and 80% are nineteen pages or fewer.\n\u003cdetails\u003e\n\u003csummary\u003e Page Distribution\u003c/summary\u003e\n\u003cdiv align=center\u003e\n\u003cimg src=\"assets/page_distribution.png\" height=\"500\"\u003e\n\u003c/div\u003e\n\n\u003c/details\u003e\n\n\u0026ensp;\n\nDistribution of secondary disciplines in DocGenome. The count on the x-axis represents the number of documents, and documents from the same primary discipline are marked with the same color.\n\n\u003cdetails\u003e\n\u003csummary\u003e Discipline Distribution\u003c/summary\u003e\n\u003cdiv align=center\u003e\n\u003cimg src=\"assets/second_discipline.png\" height=\"1000\"\u003e\n\u003c/div\u003e\n\n\u003c/details\u003e\n\n\n\n\n\n\u0026ensp;\n------------------------\n## DocParser: A Cutting-edge Auto-labeling Pipeline\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"assets/auto_label_pipeline.png\" height=\"85%\"\u003e\n\u003c/div\u003e\n\n\n\n## Visualizations\n\n\u003cdetails\u003e\n\u003csummary\u003e Visual Example One of annotations in DocGenome\u003c/summary\u003e\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"assets/docgenome_label_examples_1.png\" height=\"900\"\u003e\n\u003c/div\u003e\n\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003e Visual Example Two of annotations in DocGenome\u003c/summary\u003e\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"assets/docgenome_label_examples_2.png\" height=\"900\"\u003e\n\u003c/div\u003e\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e Visual examples of document-oriented tasks in DocGenome\u003c/summary\u003e\n\n\u003cdiv align=center\u003e\n\u003cimg 
src=\"assets/docgenome_task_examples.png\" height=\"980\"\u003e\n\u003c/div\u003e\n\n\u003c/details\u003e\n\n## Citation\nIf you find our work useful in your research, please consider citing DocGenome:\n```bibtex\n@article{xia2024docgenome,\n  title={DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models},\n  author={Xia, Renqiu and Mao, Song and Yan, Xiangchao and Zhou, Hongbin and Zhang, Bo and Peng, Haoyang and Pi, Jiahao and Fu, Daocheng and Wu, Wenjie and Ye, Hancheng and others},\n  journal={arXiv preprint arXiv:2406.11633},\n  year={2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAlpha-Innovator%2FDocGenome","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FAlpha-Innovator%2FDocGenome","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAlpha-Innovator%2FDocGenome/lists"}