{"id":14974208,"url":"https://github.com/kasnerz/tabgenie","last_synced_at":"2026-03-03T21:01:17.980Z","repository":{"id":65558753,"uuid":"550176783","full_name":"kasnerz/tabgenie","owner":"kasnerz","description":"A multi-purpose toolkit for table-to-text generation: web interface, Python bindings, CLI commands.","archived":false,"fork":false,"pushed_at":"2024-04-30T07:38:19.000Z","size":9921,"stargazers_count":57,"open_issues_count":21,"forks_count":6,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-12-04T00:52:21.344Z","etag":null,"topics":["flask","python","table-to-text"],"latest_commit_sha":null,"homepage":"https://quest.ms.mff.cuni.cz/nlg/tabgenie","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kasnerz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-10-12T10:23:29.000Z","updated_at":"2025-11-13T11:22:06.000Z","dependencies_parsed_at":"2024-04-30T08:43:14.539Z","dependency_job_id":"1f431836-c929-4851-892d-b2812fa649b0","html_url":"https://github.com/kasnerz/tabgenie","commit_stats":{"total_commits":293,"total_committers":4,"mean_commits":73.25,"dds":0.3617747440273038,"last_synced_commit":"8205573933235ea7f40067c915c77a21d453004e"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/kasnerz/tabgenie","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kasnerz%2Ftabgenie","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kasnerz%2Ftabgenie/tags","releases_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/kasnerz%2Ftabgenie/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kasnerz%2Ftabgenie/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kasnerz","download_url":"https://codeload.github.com/kasnerz/tabgenie/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kasnerz%2Ftabgenie/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30060624,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-03T18:21:05.932Z","status":"ssl_error","status_checked_at":"2026-03-03T18:20:59.341Z","response_time":61,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["flask","python","table-to-text"],"created_at":"2024-09-24T13:50:09.600Z","updated_at":"2026-03-03T21:01:17.961Z","avatar_url":"https://github.com/kasnerz.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🧞 TabGenie: A Toolkit for Table-to-Text Generation \n\n**Demo 👉️ https://quest.ms.mff.cuni.cz/nlg/tabgenie**\n\nTabGenie provides tools for working with data-to-text generation datasets in a unified tabular format. 
\n\nTabGenie allows you to:\n  - **explore** the content of the datasets\n  - **interact** with table-to-text generation models \n  - **load and preprocess** the datasets in a unified format\n  - **prepare spreadsheets** for error analysis\n  - **export tables** to various file formats\n\nTabGenie is equipped with a user-friendly web interface, Python bindings and command-line processing tools.\n\n\n\n### Frontend Preview\n![](https://raw.githubusercontent.com/kasnerz/tabgenie/main/img/preview.png)\n\n\n## Quickstart\n```\npip install tabgenie\ntabgenie run --host=127.0.0.1 --port 8890\nxdg-open http://127.0.0.1:8890\n```\n\nOr try the demo at:\n\n**👉️ https://quest.ms.mff.cuni.cz/nlg/tabgenie**\n\n\n## Datasets\n\nThe datasets are loaded from the [HuggingFace datasets](https://huggingface.co/datasets).\n\nInput data in each dataset is preprocessed into a tabular format:\n- each table contains M rows and N columns,\n- cells may span multiple columns or rows,\n- cells may be marked as headings (indicated by bold font),\n- cells may be highlighted (indicated by yellow background).\n\nAdditionally, each example may contain metadata (such as title, url, etc.), which is displayed next to the main table as *properties*.\n\n| Dataset                                                                              | Source                                                                                                                                          | Data type      | # train | # dev  | # test | License     |\n| ------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------- | -------------- | ------- | ------ | ------ | ----------- |\n| **[CACAPO](https://huggingface.co/datasets/kasnerz/cacapo)**                         | [van der Lee et al. 
(2020)](https://aclanthology.org/2020.inlg-1.10.pdf)                                                                        | Key-value      | 15,290  | 1,831  | 3,028  | CC BY       |\n| **[DART](https://huggingface.co/datasets/GEM/dart)**                                 | [Nan et al. (2021)](https://aclanthology.org/2021.naacl-main.37/)                                                                               | Graph          | 62,659  | 2,768  | 5,097  | MIT         |\n| **[E2E](https://huggingface.co/datasets/GEM/e2e_nlg)**                               | [Dušek et al. (2019)](https://aclanthology.org/W19-8652/)                                                                                       | Key-value      | 33,525  | 1,484  | 1,847  | CC BY-SA    |\n| **[EventNarrative](https://huggingface.co/datasets/kasnerz/eventnarrative)**         | [Colas et al. (2021)](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/a3f390d88e4c41f2747bfa2f1b5f87db-Abstract-round1.html) | Graph          | 179,544 | 22,442 | 22,442 | CC BY       |\n| **[HiTab](https://huggingface.co/datasets/kasnerz/hitab)**                           | [Cheng et al. (2021)](https://aclanthology.org/2022.acl-long.78/)                                                                               | Table          | 7,417   | 1,671  | 1,584  | C-UDA       |\n| **[Chart-to-text](https://huggingface.co/datasets/kasnerz/charttotext-s)**           | [Kantharaj et al. (2022)](https://aclanthology.org/2022.acl-long.277/)                                                                          | Chart          | 24,368  | 5,221  | 5,222  | GNU GPL     |\n| **[Logic2Text](https://huggingface.co/datasets/kasnerz/logic2text)**                 | [Chen et al. 
(2020b)](https://aclanthology.org/2020.findings-emnlp.190/)                                                                        | Table  + Logic | 8,566   | 1,095  | 1,092  | MIT         |\n| **[LogicNLG](https://huggingface.co/datasets/kasnerz/logicnlg)**                     | [Chen et al. (2020a)](https://aclanthology.org/2020.acl-main.708/)                                                                              | Table          | 28,450  | 4,260  | 4,305  | MIT         |\n| **[NumericNLG](https://huggingface.co/datasets/kasnerz/numericnlg)**                 | [Suadaa et al. (2021)](https://aclanthology.org/2021.acl-long.115.pdf)                                                                          | Table          | 1,084   | 136    | 135    | CC BY-SA    |\n| **[SciGen](https://huggingface.co/datasets/kasnerz/scigen)**                         | [Moosavi et al. (2021)](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/149e9677a5989fd342ae44213df68868-Paper-round2.pdf)   | Table          | 13,607  | 3,452  | 492    | CC BY-NC-SA |\n| **[SportSett:Basketball](https://huggingface.co/datasets/GEM/sportsett_basketball)** | [Thomson et al. (2020)](https://aclanthology.org/2020.intellang-1.4)                                                                            | Table          | 3,690   | 1,230  | 1,230  | MIT         |\n| **[ToTTo](https://huggingface.co/datasets/totto)**                                   | [Parikh et al. (2020)](https://aclanthology.org/2020.emnlp-main.89.pdf)                                                                         | Table          | 121,153 | 7,700  | 7,700  | CC BY-SA    |\n| **[WebNLG](https://huggingface.co/datasets/GEM/web_nlg)**                            | [Ferreira et al. 
(2020)](https://aclanthology.org/2020.webnlg-1.7/)                                                                             | Graph          | 35,425  | 1,666  | 1,778  | CC BY-NC    |\n| **[WikiBio](https://huggingface.co/datasets/wiki_bio)**                              | [Lebret et al. (2016)](https://aclanthology.org/D16-1128/)                                                                                      | Key-value      | 582,659 | 72,831 | 72,831 | CC BY-SA    |\n| **[WikiSQL](https://huggingface.co/datasets/wikisql)**                               | [Zhong et al. (2017)](https://arxiv.org/abs/1709.00103)                                                                                         | Table + SQL    | 56,355  | 8,421  | 15,878 | BSD         |\n| **[WikiTableText](https://huggingface.co/datasets/kasnerz/wikitabletext)**           | [Bao et al. (2018)](https://aaai.org/papers/11944-table-to-text-describing-table-region-with-natural-language/)                                 | Key-value      | 10,000  | 1,318  | 2,000  | CC BY       |\n\nSee `loaders/data.py` for an up-to-date list of available datasets.\n\n## Requirements\n- Python 3\n- Flask\n- HuggingFace datasets\n\nSee `setup.py` for the full list of requirements.\n\n## Installation\n- **pip**: `pip install tabgenie`\n- **development**: `pip install -e .[dev]`\n- **deployment**: `pip install -e .[deploy]`\n\n## Web interface\n- **local development**: `tabgenie [app parameters] run [--port=PORT] [--host=HOSTNAME]`\n- **deployment**: `gunicorn \"tabgenie.cli:create_app([app parameters])\"`\n\n## Command-line Interface\n### Export data\nExports individual tables to file.\n\nUsage:\n```\ntabgenie export \\\n  --dataset DATASET_NAME \\\n  --split SPLIT \\\n  --out_dir OUT_DIR \\\n  --export_format EXPORT_FORMAT\n```\nSupported formats: `json`, `csv`, `xlsx`, `html`, `txt`.\n\n### Generate a spreadsheet for error analysis\nGenerates a spreadsheet with system outputs and randomly selected 
examples for manual error analysis.\n\nUsage:\n```\ntabgenie sheet \\\n  --dataset DATASET  \\\n  --split SPLIT \\\n  --in_file IN_FILE  \\\n  --out_file OUT_FILE \\\n  --count EXAMPLE_COUNT\n```\n\n### Show dataset details\nDisplays information about the dataset in YAML format (or the list of available datasets if no argument is provided).\n\nUsage:\n```\ntabgenie info [-d DATASET]\n```\n\n## Python\n\nTabGenie can preprocess the datasets without dataset-specific preprocessing methods.\n\nSee the [examples](./examples) directory for a tutorial on using TabGenie for finetuning sequence-to-sequence models.\n\n\n\n## HuggingFace Integration\nThe datasets are stored in the `HF_DATASETS_CACHE` directory, which defaults to `~/.cache/huggingface/`. \n\n**Set the `HF_DATASETS_CACHE` environment variable before launching `tabgenie` if you want to store the (potentially very large) datasets in a different directory.** \n\n\nThe datasets are all loaded from [HuggingFace datasets](https://huggingface.co/datasets) instead of their original repositories, which makes it possible to use preprocessed datasets and a single unified loader.\n\n**Note that preparing the datasets for the first time may take some time since the datasets have to be downloaded to cache and preprocessed.** This process may take several minutes, depending on the dataset size. However, it is only a one-time process (until the dataset is updated or the cache is deleted).\n\nAlso note that there may be some minor changes in the data with respect 
to the original datasets due to unification, such as adding \"subject\", \"predicate\" and \"object\" headings to RDF triple-to-text datasets.\n\n## Adding datasets\nTo add a new dataset:\n- prepare the dataset\n  - [add the dataset to HuggingFace Datasets](https://huggingface.co/docs/datasets/upload_dataset)\n  - OR download the dataset locally\n- create the dataset loader in `loaders`\n  - a subclass of `HFTabularDataset` for HF datasets\n  - a subclass of `TabularDataset` for local datasets\n- create a mapping between the dataset name and the class name in `loaders/__init__.py`\n- add the dataset name to `tabgenie/config.yml`.\n\nEach dataset should contain the `prepare_table(entry)` method, which instantiates a `Table` object from the original `entry`.\n\nThe `Table` object is automatically exported to HTML and other formats (the methods may be overridden).\n\nIf a dataset is an instance of `HFTabularDataset` (i.e. is loaded from HuggingFace Datasets), it should contain a `self.hf_id` attribute. The attribute is used to automatically load the dataset via the `datasets` package.\n\n## Interactive mode\nPipelines are used for processing the tables and producing outputs.\n\nSee `processing/processing.py` for an up-to-date list of available pipelines.\n\nCurrently integrated:\n- **model_api** - a pipeline which generates a textual description of a table by calling a table-to-text generation model through an API,\n- **graph** - a pipeline which creates a knowledge graph by extracting RDF triples from a table and visualizes the output using the D3.js library.\n\n### Adding pipelines\nTo add a new pipeline:\n- create a file in `processing/pipelines` containing the pipeline class,\n- create file(s) in `processing/processors` with processors needed for the pipeline,\n- add the mapping between pipeline name and class name to `get_pipeline_class_by_name()` in `processing/processing.py`. 
\n\nEach pipeline should define `self.processors` in the `__init__()` method, instantiating the processors needed for the pipeline.\n\nThe input to each pipeline is a `content` object containing several fields needed for table processing. This interface may be subject to change (see `__init__.py:run_pipeline()` for more details).\n\nThe processors serve as modules, i.e. existing processors can be combined to create new pipelines. The interface between the processors may vary; however, the last processor in the pipeline is expected to output HTML code, which is displayed on the page.\n\n\n### Pipeline config\nThis is an example pipeline configuration in `tabgenie/config.yml`:\n```\nrdf_triples:\n  pipeline: graph\n  interactive: true\n  datasets:\n    - webnlg\n    - dart\n    - e2e\n```\nThe key `rdf_triples` is the name of the pipeline which will be displayed in the web interface. It should contain only letters of the English alphabet, underscores `_` or dashes `-`.\n\nRequired arguments:\n- `pipeline` : `str` - the name of the pipeline as defined in `processing/processing.py`, which is mapped to the pipeline class\n- `interactive`: `bool` - whether the pipeline will be displayed in the interactive mode in the web interface\n\nOptional arguments:\n- `datasets` : `list` - the list of datasets for which the pipeline will be active in the web interface (all datasets by default)\n- any other argument will be passed to the pipeline in `pipeline_args`\n\n\n\n## Configuration\nThe global configuration is stored in the `tabgenie/config.yml` file.\n\n- `datasets` - datasets which will be available in the web interface,\n- `default_dataset` - the dataset which is loaded by default,\n- `host_prefix` - subdirectory on which the app is deployed (used for loading static files and sending POST requests),\n- `cache_dev_splits` - whether to preload all available dev sets after startup,\n- `generated_outputs_dir` - directory from which the generated outputs are loaded,\n- `pipelines` - 
pipelines which will be available in the web interface (see the *Interactive mode* section for more info).\n\n## Paper \u0026 Citation\nFor citing our work, please use the following:\n```\n@inproceedings{kasner-etal-2023-tabgenie,\n    title = \"{T}ab{G}enie: A Toolkit for Table-to-Text Generation\",\n    author = \"Kasner, Zden{\\v{e}}k  and\n      Garanina, Ekaterina  and\n      Platek, Ondrej  and\n      Dusek, Ondrej\",\n    booktitle = \"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)\",\n    year = \"2023\",\n    address = \"Toronto, Canada\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2023.acl-demo.42\",\n    pages = \"444--455\",\n}\n```\n- Link for the paper: https://aclanthology.org/2023.acl-demo.42/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkasnerz%2Ftabgenie","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkasnerz%2Ftabgenie","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkasnerz%2Ftabgenie/lists"}