{"id":13589414,"url":"https://github.com/IBM/SynthTabNet","last_synced_at":"2025-04-08T09:32:45.580Z","repository":{"id":39304005,"uuid":"463208313","full_name":"IBM/SynthTabNet","owner":"IBM","description":"Dataset of PNG images from synthetically generated table layouts with annotations in JSONL files","archived":false,"fork":false,"pushed_at":"2023-11-17T09:00:06.000Z","size":6674,"stargazers_count":129,"open_issues_count":3,"forks_count":11,"subscribers_count":9,"default_branch":"main","last_synced_at":"2024-12-14T11:16:24.221Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/IBM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-02-24T15:53:10.000Z","updated_at":"2024-11-15T10:31:51.000Z","dependencies_parsed_at":"2023-11-17T10:29:53.846Z","dependency_job_id":"8b4beccd-d6ab-445e-83b3-708d6b25b204","html_url":"https://github.com/IBM/SynthTabNet","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IBM%2FSynthTabNet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IBM%2FSynthTabNet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IBM%2FSynthTabNet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IBM%2FSynthTabNet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/IBM","download_url":"https://codeload.github.com/IBM/SynthTabNet/tar.gz/refs/heads/main","host":{"name":"
GitHub","url":"https://github.com","kind":"github","repositories_count":247814192,"owners_count":21000516,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T16:00:29.847Z","updated_at":"2025-04-08T09:32:45.189Z","avatar_url":"https://github.com/IBM.png","language":"Jupyter Notebook","funding_links":[],"categories":["Popular Datasets"],"sub_categories":[],"readme":"# SynthTabNet\n\nSynthTabNet is a dataset of 600k `png` images from synthetically generated table layouts with annotations in `jsonl` files.\n\n\n## Overview\n\nSynthTabNet is a synthetically generated dataset that contains annotated images of data in tabular layouts.\n\nIt [has been shown](https://arxiv.org/abs/2203.01017) that other non-synthetic datasets like [PubTabNet](https://developer.ibm.com/exchanges/data/all/pubtabnet/), [FinTabNet](https://developer.ibm.com/exchanges/data/all/fintabnet/) and [TableBank](https://doc-analysis.github.io/tablebank-page/index.html) suffer from many limitations:\n\n- Their table distributions are skewed towards simpler structures with fewer rows/columns.\n- There is very limited variance in appearance styles.\n- The content is sometimes restricted to certain domains.\n- The bounding boxes are omitted for non-empty cells or they are completely absent.\n\nSynthTabNet aims to overcome these limitations by providing:\n\n- A broad range of table sizes and richer combinations of row spans / column spans.\n- A variety of domain-specific styling appearances (e.g. 
financial data, marketing data, sparse tables, etc.)\n- Content generated out of the most frequently used terms appearing in non-synthetic datasets (e.g. PubTabNet, FinTabNet, etc.)\n- Bounding boxes for all table cells, including the empty ones.\n- Rectangular table structures. For each table, every row has the same number of columns after taking into account any row spans / column spans.\n\nSynthTabNet is organized into 4 parts of 150k tables (600k in total). Each part contains tables with different appearances in regard to their size, structure, style and content. All parts are divided into Train, Test and Val splits (80%, 10%, 10%). The tables are delivered as `png` images and the annotations are in `jsonl` format.\n\nA detailed description of the data synthesis process can be found in the [paper](https://arxiv.org/abs/2203.01017).\n\n\n## Download\n\nv2.0.0\n\n| Appearance style | Records | Size (GB) | URL v2.0.0  |\n|------------------|---------|-----------|-------------|\n| FinTabNet        | 150k    | 10        | [SynthTabNet-part1](https://ds4sd-public-artifacts.s3.eu-de.cloud-object-storage.appdomain.cloud/datasets/synthtabnet_public/v2.0.0/fintabnet.zip) |\n| Marketing        | 150k    | 8         | [SynthTabNet-part2](https://ds4sd-public-artifacts.s3.eu-de.cloud-object-storage.appdomain.cloud/datasets/synthtabnet_public/v2.0.0/marketing.zip) |\n| PubTabNet        | 150k    | 6         | [SynthTabNet-part3](https://ds4sd-public-artifacts.s3.eu-de.cloud-object-storage.appdomain.cloud/datasets/synthtabnet_public/v2.0.0/pubtabnet.zip) |\n| Sparse           | 150k    | 3         | [SynthTabNet-part4](https://ds4sd-public-artifacts.s3.eu-de.cloud-object-storage.appdomain.cloud/datasets/synthtabnet_public/v2.0.0/sparse.zip) |\n\n[v2.0.0 MD5 checksums](https://ds4sd-public-artifacts.s3.eu-de.cloud-object-storage.appdomain.cloud/datasets/synthtabnet_public/v2.0.0/md5sum.txt)\n\n[v2.0.0 SHA1 
checksums](https://ds4sd-public-artifacts.s3.eu-de.cloud-object-storage.appdomain.cloud/datasets/synthtabnet_public/v2.0.0/sha1sum.txt)\n\n\n\u003cdetails\u003e\n\u003csummary\u003ev1.0.0\u003c/summary\u003e\n\n| Appearance style | Records | Size (GB) | URL v1.0.0  |\n|------------------|---------|-----------|-------------|\n| FinTabNet        | 150k    | 10        | [SynthTabNet-part1](https://ds4sd-public-artifacts.s3.eu-de.cloud-object-storage.appdomain.cloud/datasets/synthtabnet_public/v1.0.0/fintabnet.zip) |\n| Marketing        | 150k    | 8         | [SynthTabNet-part2](https://ds4sd-public-artifacts.s3.eu-de.cloud-object-storage.appdomain.cloud/datasets/synthtabnet_public/v1.0.0/marketing.zip) |\n| PubTabNet        | 150k    | 6         | [SynthTabNet-part3](https://ds4sd-public-artifacts.s3.eu-de.cloud-object-storage.appdomain.cloud/datasets/synthtabnet_public/v1.0.0/pubtabnet.zip) |\n| Sparse           | 150k    | 3         | [SynthTabNet-part4](https://ds4sd-public-artifacts.s3.eu-de.cloud-object-storage.appdomain.cloud/datasets/synthtabnet_public/v1.0.0/sparse.zip) |\n\n[v1.0.0 MD5 checksums](https://ds4sd-public-artifacts.s3.eu-de.cloud-object-storage.appdomain.cloud/datasets/synthtabnet_public/v1.0.0/md5sum.txt)\n\n[v1.0.0 SHA1 checksums](https://ds4sd-public-artifacts.s3.eu-de.cloud-object-storage.appdomain.cloud/datasets/synthtabnet_public/v1.0.0/sha1sum.txt)\n\n\n\u003c/details\u003e\n\n## Data format\n\nEach part of the dataset corresponds to a top-level directory (`fintabnet`, `marketing`, `pubtabnet`, `sparse`) and has the following structure:\n\n```\n├── images\n│   ├── test\n│   ├── train\n│   └── val\n├── synthetic_data.jsonl\n```\n\nThe annotations for each part are in the `synthetic_data.jsonl` file. 
Each line is a `json` object that corresponds to a `png` image and has the following structure:\n\n```\n\"filename\": \"png image filename inside one of the 'test', 'train', 'val' directories\",\n\"split\": \"One of 'test', 'train', 'val'\",\n\"html\": \"Table structure and content\",\n    \"cells\": \"Array with all table cells\",\n        \"cell_id\": \"Zero based cell counter\",\n        \"is_header\": \"true if that cell is part of the table header\",\n        \"span\": \"In case there is a rowspan / columnspan\",\n            \"spantype\": \"One of 'rowspan', 'colspan', '2dspan'. The '2dspan' is used in case there is a rowspan and colspan in the same cell\",\n            \"rowspan\": \"Number of rowspans for this cell\",\n            \"colspan\": \"Number of colspans for this cell\"\n        \"tokens\": \"Array with the tokenized content of the cell\",\n        \"bbox\": \"The bounding box and the class of the cell in [x1, y1, x2, y2, class] format\"\n    \"structure\":\n        \"tokens\": \"Array with html tags that describe the table structure\"\n```\n\nRegarding the `bbox` parameter, note that:\n\n- The coordinate origin is the top left corner of the image.\n- Each bbox is described by its top left corner `(x1, y1)` and bottom right corner `(x2, y2)`.\n- The bbox `class` can have the values:\n  - `1`: An empty cell\n  - `2`: A non-empty cell\n\nThe `tokens` can be one of:\n\n```\n\" colspan=\\\"10\\\"\", \" colspan=\\\"2\\\"\", \" colspan=\\\"3\\\"\", \" colspan=\\\"4\\\"\", \" colspan=\\\"5\\\"\",\n\" colspan=\\\"6\\\"\", \" colspan=\\\"7\\\"\", \" colspan=\\\"8\\\"\", \" colspan=\\\"9\\\"\", \" rowspan=\\\"10\\\"\",\n\" rowspan=\\\"2\\\"\", \" rowspan=\\\"3\\\"\", \" rowspan=\\\"4\\\"\", \" rowspan=\\\"5\\\"\", \" rowspan=\\\"6\\\"\",\n\" rowspan=\\\"7\\\"\", \" rowspan=\\\"8\\\"\", \" rowspan=\\\"9\\\"\", \"\u003c/tbody\u003e\", \"\u003c/td\u003e\", \"\u003c/thead\u003e\",\n\"\u003c/tr\u003e\", \"\u003cend\u003e\", \"\u003cpad\u003e\", 
\"\u003cstart\u003e\", \"\u003ctbody\u003e\", \"\u003ctd\", \"\u003ctd\u003e\", \"\u003cthead\u003e\", \"\u003ctr\u003e\", \"\u003cunk\u003e\", \"\u003e\"\n```\n\n\n## Example data\n\n![pubtabnet](pics/image_000005_1634629104.274936.png)\n\n![sparse](pics/image_000005_1634629370.551275.png)\n\n![fintabnet](pics/image_000014_1634629328.541362.png)\n\n![marketing](pics/image_000024_1634629424.186544.png)\n\n\n## Jupyter notebook\n\nHere is a jupyter notebook that demonstrates how to download and use the dataset:\n\n[Demo Notebook](synthtabnet_demo.ipynb)\n\n\n## Paper\n\n**\"TableFormer: Table Structure Understanding with Transformers\"** (CVPR 2022).\n- Ahmed Nassar (ahn@zurich.ibm.com)\n- Nikolaos Livathinos (nli@zurich.ibm.com)\n- Maksym Lysak (mly@zurich.ibm.com)\n- Peter Staar (taa@zurich.ibm.com)\n\nArXiv link: https://arxiv.org/abs/2203.01017\n\n**Citation:**\n\n```\n@article{nassar2022tableformer,\n  title={TableFormer: Table Structure Understanding with Transformers},\n  author={Nassar, Ahmed and Livathinos, Nikolaos and Lysak, Maksym and Staar, Peter},\n  journal={arXiv preprint arXiv:2203.01017},\n  year={2022}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FIBM%2FSynthTabNet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FIBM%2FSynthTabNet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FIBM%2FSynthTabNet/lists"}