{"id":13589396,"url":"https://github.com/Academic-Hammer/SciTSR","last_synced_at":"2025-04-08T09:32:40.384Z","repository":{"id":44476991,"uuid":"204964525","full_name":"Academic-Hammer/SciTSR","owner":"Academic-Hammer","description":"Table structure recognition dataset of the paper: Complicated Table Structure Recognition","archived":false,"fork":false,"pushed_at":"2020-07-07T01:03:17.000Z","size":49,"stargazers_count":350,"open_issues_count":32,"forks_count":57,"subscribers_count":14,"default_branch":"master","last_synced_at":"2024-11-06T09:39:38.695Z","etag":null,"topics":["pdf-to-text","pdf2txt","table-structure-recognition"],"latest_commit_sha":null,"homepage":"https://arxiv.org/pdf/1908.04729.pdf","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Academic-Hammer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-08-28T15:29:31.000Z","updated_at":"2024-11-04T11:02:30.000Z","dependencies_parsed_at":"2022-08-12T11:11:26.213Z","dependency_job_id":null,"html_url":"https://github.com/Academic-Hammer/SciTSR","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Academic-Hammer%2FSciTSR","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Academic-Hammer%2FSciTSR/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Academic-Hammer%2FSciTSR/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Academic-Hammer%2FSciTSR/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Academic-Hammer",
"download_url":"https://codeload.github.com/Academic-Hammer/SciTSR/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247814170,"owners_count":21000514,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pdf-to-text","pdf2txt","table-structure-recognition"],"created_at":"2024-08-01T16:00:29.603Z","updated_at":"2025-04-08T09:32:40.127Z","avatar_url":"https://github.com/Academic-Hammer.png","language":"Python","funding_links":[],"categories":["Popular Datasets","2. Datasets"],"sub_categories":["2.1 Introduction"],"readme":"# SciTSR\n\n## Introduction\n\nSciTSR is a large-scale table structure recognition dataset, which contains 15,000 tables in PDF format and their corresponding structure labels obtained from LaTeX source files.\n\n**Download link** is [here](https://drive.google.com/file/d/1qXaJblBg9sbPN0xknWsYls1aGGtlp4ZN/view?usp=sharing).\n\nThere are 15,000 examples in total, and we split 12,000 for training and 3,000 for test. We also provide the test set that only contains complicated tables, called SciTSR-COMP. 
The indices of SciTSR-COMP are stored in `SciTSR-COMP.list`.\n\nThe statistics of the SciTSR dataset are as follows:\n\n|                             |  Train |  Test |\n| --------------------------- | -----: | ----: |\n| \\# Tables                   | 12,000 | 3,000 |\n| \\# Complicated tables       |  2,885 |   716 |\n\n## Format and Example\n\nThe directory tree structure is as follows:\n\n```\nSciTSR\n├── SciTSR-COMP.list\n├── test\n│   ├── chunk\n│   ├── img\n│   ├── pdf\n│   └── structure\n└── train\n    ├── chunk\n    ├── img\n    ├── pdf\n    ├── rel\n    └── structure\n```\n\nThe input PDF files are stored in `pdf`, and the structure labels are stored in the `structure` directory.\n\nFor convenience, we also provide the input as images in `img`, converted from the PDFs by `pdfcairo`.\n\nWe also provide the extracted chunks in `chunk`, pre-processed by [Tabby](https://github.com/cellsrg/tabbypdf/).\n\nFor the training data, we provide the relation labels we constructed for our GraphTSR model, which are generated by matching chunks against the text of the structure labels.\n\n**Note that our pre-processed chunk and relation data may contain noise. 
The original input files are in PDF.**\n\n### Text Chunks\n\nFile: chunk/[ID].chunk\n\nThe `pos` array contains the `x1`, `x2`, `y1` and `y2` coordinates of the chunk (in PDF coordinates).\n\n```json\n{\"chunks\": [\n  {\n    \"pos\": [\n      147.96600341796875,\n      205.49998474121094,\n      475.7929992675781,\n      480.4206237792969\n    ],\n    \"text\": \"Probability\"\n  },\n  {\n    \"pos\": [\n      217.45510864257812,\n      290.6802673339844,\n      475.7929992675781,\n      480.4206237792969\n    ],\n    \"text\": \"Generated Text\"\n  },\n  ...\n ]}\n```\n\n### Relations\n\nFile: rel/[ID].rel\n\nA line of the form `CHUNK_ID_1 CHUNK_ID_2 RELATION_ID:NUM_BLANK` states that the relation between the CHUNK_ID_1-th chunk and the CHUNK_ID_2-th chunk is RELATION_ID, with NUM_BLANK blank cells between them.\nFor RELATION_ID, 1 and 2 represent horizontal and vertical, respectively.\n\n```\n0 1 1:0\n1 2 1:0\n0 9 2:0\n...\n```\n\n### Structure Labels\n\nFile: structure/[ID].json\n\nA table is stored as a list of cells. 
For each cell, we provide its original TeX code, its content (split on spaces), and its position in the table (start/end row and column numbers, counted from 0).\n\n```json\n{\"cells\": [\n  {\n    \"id\": 21,\n    \"tex\": \"959\",\n    \"content\": [\n      \"959\"\n    ],\n    \"start_row\": 5,\n    \"end_row\": 5,\n    \"start_col\": 1,\n    \"end_col\": 1\n  },\n  {\n    \"id\": 1,\n    \"tex\": \"Training set\",\n    \"content\": [\n      \"Training\",\n      \"set\"\n    ],\n    \"start_row\": 0,\n    \"end_row\": 0,\n    \"start_col\": 1,\n    \"end_col\": 1\n  },\n  ...\n]}\n```\n\n## Implementation Details\n\n### Features\n\nThe code for vertex and edge features is in `./scitsr/graph.py`.\n\nYou can get vertex features by `Vertex(vid, chunk, tab_h, tab_w).features` and edge features by `Edge(vertex1, vertex2).features`.\n\n`tab_h` and `tab_w` denote the height (y-axis) and width (x-axis) of the table.\n\nSee `./scitsr/graph.py` for more details.\n\n### Evaluation\n\nIn the evaluation procedure, a table is first converted to a list of horizontally/vertically adjacent relations. The output relations are then compared against the ground-truth relations.\n\nWe release the evaluation scripts for comparing horizontally and vertically adjacent relations. The following example (`./examples/eval.py`) shows how to use the scripts to calculate precision/recall/F1 for an output table.\n\n```python\nimport json\n# json2Relations and eval_relations are assumed to live in scitsr.eval (see ./examples/eval.py)\nfrom scitsr.eval import json2Relations, eval_relations\n\n# json_path: path to a structure label file (structure/[ID].json)\nwith open(json_path) as fp:\n  json_obj = json.load(fp)\n# convert the structure labels (a table in json format) to a list of relations\nground_truth_relations = json2Relations(json_obj, splitted_content=True)\n# your_relations should be a List of Relation.\n# Here we directly use the ground truth relations in the example.\nyour_relations = ground_truth_relations\nprecision, recall = eval_relations(\n  gt=[ground_truth_relations], res=[your_relations], cmp_blank=True)\n# F1 is the harmonic mean of precision and recall\nf1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0\n```\n\nNote: Your output tables should be represented as `List[Relation]`. 
You can also store a table as a `Table` object and then convert it to `List[Relation]` with `scitsr.eval.Table2Relations`.\n\n## Citation\n\nPlease cite the paper if you find the resources useful.\n\n```\n@article{chi2019complicated,\n  title={Complicated Table Structure Recognition},\n  author={Chi, Zewen and Huang, Heyan and Xu, Heng-Da and Yu, Houjin and Yin, Wanxuan and Mao, Xian-Ling},\n  journal={arXiv preprint arXiv:1908.04729},\n  year={2019}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAcademic-Hammer%2FSciTSR","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FAcademic-Hammer%2FSciTSR","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAcademic-Hammer%2FSciTSR/lists"}