https://github.com/alpha-innovator/docgenome_page
Project page of DocGenome Dataset
- Host: GitHub
- URL: https://github.com/alpha-innovator/docgenome_page
- Owner: Alpha-Innovator
- Created: 2024-06-06T09:38:18.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-06-18T10:32:50.000Z (almost 2 years ago)
- Last Synced: 2025-04-05T02:16:37.522Z (about 1 year ago)
- Language: JavaScript
- Homepage: https://unimodal4reasoning.github.io/DocGenome_page/
- Size: 34.8 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# DocGenome: An Open Large-scale Scientific Document Benchmark for Training Next-generation Large Models
Scientific documents record research findings and valuable human knowledge, and together they form a vast corpus of high-quality data. Leveraging the multi-modal data extracted from these documents, and assessing how well large models handle scientific document-oriented tasks, is therefore worthwhile. Despite promising advances, large models still perform poorly on multi-page scientific document extraction and understanding tasks, and their capacity to process within-document data formats such as charts and equations remains under-explored. To address these issues, we present DocGenome, a structured document dataset constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community with our custom auto-labeling pipeline. DocGenome features four key characteristics:
- 1) Completeness: It is the first dataset to structure data from all modalities, including 13 layout attributes along with their LaTeX source code.
- 2) Logicality: It provides 6 logical relationships between different entities within each scientific document.
- 3) Diversity: It covers various document-oriented tasks, including document classification, visual grounding, document layout detection, document transformation, open-ended single-page QA and multi-page QA.
- 4) Correctness: It undergoes rigorous quality control checks conducted by a specialized team.
In addition, we conduct extensive experiments based on DocGenome to demonstrate its advantages and to objectively evaluate the performance of current large models on our benchmark.
## This is the [Official Page](https://unimodal4reasoning.github.io/DocGenome_page/) of the DocGenome Benchmark.