https://github.com/unimodal4reasoning/docgenome
DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Models
document-understanding paper-annotation question-answering
- Host: GitHub
- URL: https://github.com/unimodal4reasoning/docgenome
- Owner: UniModal4Reasoning
- License: cc-by-4.0
- Created: 2024-05-24T07:04:35.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-09-06T06:51:09.000Z (5 months ago)
- Last Synced: 2025-01-01T09:04:33.683Z (19 days ago)
- Topics: document-understanding, paper-annotation, question-answering
- Language: Jupyter Notebook
- Homepage: https://unimodal4reasoning.github.io/DocGenome_page/
- Size: 15.2 MB
- Stars: 146
- Watchers: 5
- Forks: 5
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
[![arXiv](https://img.shields.io/badge/arXiv-2406.11633-b31b1b.svg)](https://arxiv.org/abs/2406.11633)
[![GitHub issues](https://img.shields.io/github/issues/UniModal4Reasoning/DocGenome)](https://github.com/UniModal4Reasoning/DocGenome/issues)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](https://github.com/UniModal4Reasoning/DocGenome/pulls)

# DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models
We present DocGenome, a structured document dataset constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community, using our custom auto-labeling pipeline DocParser. DocGenome features four characteristics:
- 1) Completeness: It is the first dataset to structure data from all modalities, including 13 layout attributes along with their LaTeX source code.
- 2) Logicality: It provides 6 logical relationships between different entities within each scientific document.
- 3) Diversity: It covers various document-oriented tasks, including document classification, visual grounding, document layout detection, document transformation, open-ended single-page QA and multi-page QA.
- 4) Correctness: It undergoes rigorous quality control checks conducted by a specialized team.

## Release
- [2024/9/5] 🔥 Add the **data quality rating** for each structured document in DocGenome [here](https://huggingface.co/datasets/U4R/DocGenome/blob/main/tire_classification_train.json)
- [2024/8/27] Add the tutorials on how to use the [DocGenome dataset](https://github.com/UniModal4Reasoning/DocGenome/blob/main/tutorials/tutorial.ipynb).
- [2024/8/7] Add a detailed explanation of the different file structures in DocGenome: [Dataset_Details_README](Dataset_Details_README.md).
- [2024/7/23] We now support **TestSet** downloads from [Huggingface](https://huggingface.co/datasets/U4R/DocGenome-Testset-DocQA/tree/main). If you want to evaluate your model on TestSet, please refer to [Evaluation](docs/Evaluation_README.md).
- [2024/7/12] We now support dataset downloads from [Huggingface](https://huggingface.co/datasets/U4R/DocGenome/tree/main).
- [2024/6/15] 🔥 Our paper entitled "DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Models" has been released on arXiv [Link]()
- [2024/6/6] 🔥 We have released the DocGenome benchmark, which includes 8 subsets as follows:
- [docgenome-train-000.tar.gz](https://drive.google.com/file/d/1p6o6naxPgWLvfmBIvmVionBv4ygSV5Sy/view?usp=drive_link)
- [docgenome-train-001.tar.gz](https://drive.google.com/file/d/16xTMiZb7E-VPUdIU32mA3qNeuZIoIJ-9/view?usp=drive_link)
- [docgenome-train-002.tar.gz](https://drive.google.com/file/d/1qW64JRqlFzkx1wwMwo8vdsyTUcetH_H2/view?usp=drive_link)
- [docgenome-train-003.tar.gz](https://drive.google.com/file/d/1JlgHou0JchCn8F4Dspb22YMQRkgHm1dG/view?usp=drive_link)
- [docgenome-train-004.tar.gz](https://drive.google.com/file/d/1XEuAz1tlo1jzBYk7scSBPbVsf1KqjN6c/view?usp=drive_link)
- [docgenome-train-005.tar.gz](https://drive.google.com/file/d/1Fz4f9YBRG7Ro7b1uovEKl2pfyizFIq-z/view?usp=drive_link)
- [docgenome-train-006.tar.gz]()
- [docgenome-train-007.tar.gz]()
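
Once one of the subset archives listed above has been downloaded, it can be unpacked with Python's standard library. The following is a minimal sketch, not part of the DocGenome tooling: `extract_subset` is a hypothetical helper name, the archive path is whatever location you saved the download to, and nothing is assumed about the directory layout inside the archive (see Dataset_Details_README for that).

```python
import tarfile
from pathlib import Path


def extract_subset(archive_path: str, dest_dir: str) -> list[str]:
    """Extract one downloaded subset archive and return its member names.

    Hypothetical helper for illustration; it is not shipped with DocGenome.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        members = tar.getmembers()
        for member in members:
            # Basic safety check: refuse entries whose path would land
            # outside the destination directory.
            target = (dest / member.name).resolve()
            if not target.is_relative_to(dest):
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest, members=members)
    return [member.name for member in members]


if __name__ == "__main__":
    # Assumed local file name, matching the first subset in the list above.
    archive = Path("docgenome-train-000.tar.gz")
    if archive.exists():
        names = extract_subset(str(archive), "docgenome-train")
        print(f"extracted {len(names)} entries")
```

The path check only guards against entries that resolve outside the destination directory; for archives from untrusted sources, a stricter extraction policy may be appropriate.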
## File Structure
Please refer to [Dataset_Details_README](Dataset_Details_README.md) for a detailed explanation of the different file structures in DocGenome.

## DocGenome Benchmark Introduction

| Datasets | \# Disciplines | \# Categories of Units | \# Pages in Train-set | \# Pages in Test-set | \# Tasks | \# Metrics Used | Publication Period | Entity Relations |
|------------------|---------|--------|--------------------|---------|-------|-------|-----------|---|
| DocVQA           | -       | N/A    | 11K                | 1K      | 1     | 2     | 1960-2000 | ✗ |
| DocLayNet        | -       | 11     | 80K                | 8K      | 1     | 1     | -         | ✗ |
| DocBank          | -       | 13     | 0.45M              | **50K** | 3     | 1     | 2014-2018 | ✗ |
| PubLayNet        | -       | 5      | 0.34M              | 12K     | 1     | 1     | -         | ✗ |
| VRDU             | -       | 10     | 7K                 | 3K      | 3     | 1     | -         | ✗ |
| DUDE             | -       | N/A    | 20K                | 6K      | 3     | 3     | 1860-2022 | ✗ |
| D^4LA            | -       | **27** | 8K                 | 2K      | 1     | 3     | -         | ✗ |
| Fox Benchmark    | -       | 5      | N/A (No train-set) | 0.2K    | 3     | 5     | -         | ✗ |
| ArXivCap         | 32      | N/A    | 6.4M*              | N/A     | 4     | 3     | -         | ✗ |
| DocGenome (ours) | **153** | 13     | **6.8M**           | 9K      | **7** | **7** | 2007-2022 | ✓ |

------------------------

### DocGenome-train Download
We provide 8 subsets of DocGenome-train for downloading:
- [docgenome-train-000.tar.gz](https://drive.google.com/file/d/1p6o6naxPgWLvfmBIvmVionBv4ygSV5Sy/view?usp=drive_link)
- [docgenome-train-001.tar.gz](https://drive.google.com/file/d/16xTMiZb7E-VPUdIU32mA3qNeuZIoIJ-9/view?usp=drive_link)
- [docgenome-train-002.tar.gz](https://drive.google.com/file/d/1qW64JRqlFzkx1wwMwo8vdsyTUcetH_H2/view?usp=drive_link)
- [docgenome-train-003.tar.gz](https://drive.google.com/file/d/1JlgHou0JchCn8F4Dspb22YMQRkgHm1dG/view?usp=drive_link)
- [docgenome-train-004.tar.gz](https://drive.google.com/file/d/1XEuAz1tlo1jzBYk7scSBPbVsf1KqjN6c/view?usp=drive_link)
- [docgenome-train-005.tar.gz](https://drive.google.com/file/d/1Fz4f9YBRG7Ro7b1uovEKl2pfyizFIq-z/view?usp=drive_link)
- [docgenome-train-006.tar.gz]()
- [docgenome-train-007.tar.gz]()

### Definition of relationships between component units

DocGenome contains 4 level relation types and 2 citation relation types, as shown in the following table:

| **Name** | Description | Example |
|---------------------|------------------------------------------------------------------|------------------------------------------------------------|
| Identical           | Two blocks share the same source code.                           | Cross-column text; Cross-page text.                        |
| Title adjacent      | The two titles are adjacent.                                     | (`\section{introduction}`, `\section{method}`)             |
| Subordinate         | One block is a subclass of another block.                        | (`\section{introduction}`, paragraph within Introduction)  |
| Non-title adjacent  | The two text or equation blocks are adjacent.                    | (Paragraph 1, Paragraph 2)                                 |
| Explicitly-referred | One block refers to another block via footnote, reference, etc.  | (As shown in `\ref{Fig: 5}` ..., Figure 5)                 |
| Implicitly-referred | The caption block refers to the corresponding float environment. | (Table Caption 1, Table 1)                                 |

### Attributes of component units
DocGenome has 13 attributes of component units, which can be categorized into two classes:
- **1) Fixed-form units**, including Text, Title, Abstract, etc., which are characterized by sequential reading and hierarchical relationships readily discernible from the list obtained in Stage-two of the designed DocParser.
- **2) Floating-form units**, including Table, Figure, etc., which establish directional references to fixed-form units through commands like `\ref` and `\label`.

| **Index** | **Category** | **Notes** |
|----------------|-------------------|------------------------------------------|
| 0 | Algorithm | |
| 1 | Caption | Titles of Images, Tables, and Algorithms |
| 2 | Equation | |
| 3 | Figure | |
| 4 | Footnote | |
| 5 | List | |
| 7 | Table | |
| 8 | Text | |
| 9 | Text-EQ | Text block with inline equations |
| 10 | Title | Section titles |
| 12 | PaperTitle | |
| 13 | Code | |
| 14 | Abstract | |

## Types of disciplines
Figure (Page Distribution): Page distribution of DocGenome. 20% of documents are five pages or fewer, 50% are ten pages or fewer, and 80% are nineteen pages or fewer.

Figure (Discipline Distribution): Distribution of secondary disciplines in DocGenome. The count on the x-axis represents the number of documents, and documents from the same primary discipline are marked with the same color.
------------------------
## DocParser: A Cutting-edge Auto-labeling Pipeline
## Visualizations
Visual Example One of annotations in DocGenome
Visual examples of document-oriented tasks in DocGenome
## Citation
If you find our work useful in your research, please consider citing DocGenome:
```bibtex
@article{xia2024docgenome,
title={DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models},
author={Xia, Renqiu and Mao, Song and Yan, Xiangchao and Zhou, Hongbin and Zhang, Bo and Peng, Haoyang and Pi, Jiahao and Fu, Daocheng and Wu, Wenjie and Ye, Hancheng and others},
journal={arXiv preprint arXiv:2406.11633},
year={2024}
}
```