{"id":16459816,"url":"https://nextplusplus.github.io/TAT-QA/","last_synced_at":"2025-10-27T09:31:23.851Z","repository":{"id":50281895,"uuid":"368023885","full_name":"NExTplusplus/TAT-QA","owner":"NExTplusplus","description":"TAT-QA (Tabular And Textual dataset for Question Answering) contains 16,552 questions associated with 2,757 hybrid contexts from real-world financial reports. ","archived":false,"fork":false,"pushed_at":"2024-12-09T11:06:29.000Z","size":4776,"stargazers_count":98,"open_issues_count":2,"forks_count":24,"subscribers_count":10,"default_branch":"master","last_synced_at":"2024-12-09T12:20:10.178Z","etag":null,"topics":["financial-reports","hybrid","tabular","tat-qa","textual"],"latest_commit_sha":null,"homepage":"https://nextplusplus.github.io/TAT-QA/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NExTplusplus.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-05-17T01:38:07.000Z","updated_at":"2024-12-09T11:06:34.000Z","dependencies_parsed_at":"2023-10-15T06:15:34.320Z","dependency_job_id":"4e1586ea-ea20-4cc2-90d7-ccb85e5e19f0","html_url":"https://github.com/NExTplusplus/TAT-QA","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NExTplusplus%2FTAT-QA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NExTplusplus%2FTAT-QA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NExTplusplus%2FTAT-QA/releases","manifests_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NExTplusplus%2FTAT-QA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NExTplusplus","download_url":"https://codeload.github.com/NExTplusplus/TAT-QA/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238471999,"owners_count":19478141,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["financial-reports","hybrid","tabular","tat-qa","textual"],"created_at":"2024-10-11T11:00:53.688Z","updated_at":"2025-10-27T09:31:17.411Z","avatar_url":"https://github.com/NExTplusplus.png","language":"Python","funding_links":[],"categories":["LLM Leaderboard"],"sub_categories":[],"readme":"\nTAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance\n====================\n\n**TAT-QA** (**T**abular **A**nd **T**extual dataset for **Q**uestion **A**nswering) contains 16,552 questions associated with 2,757 hybrid contexts \nfrom real-world financial reports. 
\n\nYou can download our TAT-QA dataset via [TAT-QA dataset](https://github.com/NExTplusplus/TAT-QA/tree/master/dataset_raw).\n\nFor more information, please refer to our [TAT-QA website](https://nextplusplus.github.io/TAT-QA/) or read our ACL2021 paper [PDF](https://aclanthology.org/2021.acl-long.254.pdf).\n\n## Updates \n\n**${\\color{red}Jan 2024:}$**  We release the ground truth for the TAT-QA Test set under the folder [TAT-QA dataset](https://github.com/NExTplusplus/TAT-QA/tree/master/dataset_raw), to facilitate future research on this task!\n\n**${\\color{red}May 2023:}$** **[TAT-DQA](https://nextplusplus.github.io/TAT-DQA/)** is released! TAT-DQA is a large-scale Document Visual QA (VQA) dataset constructed by extending TAT-QA. Please check it out if you are interested in the new task. \n\n\n\n## TagOp Model\n\n### Requirements\n\nCreate an environment with [MiniConda](https://docs.conda.io/en/latest/miniconda.html) and activate it:\n\n```bash\nconda create -n tat-qa python==3.7\nconda activate tat-qa\npip install -r requirement.txt\npip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.7.0+${CUDA}.html\n```\n\nWe adopt `RoBERTa` as the encoder for TagOp. Use the following commands to prepare the RoBERTa model:\n\n```bash\ncd dataset_tagop\nmkdir roberta.large \u0026\u0026 cd roberta.large\nwget -O pytorch_model.bin https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-pytorch_model.bin\nwget -O config.json https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-config.json\nwget -O vocab.json https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json\nwget -O merges.txt https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt\n```\n\n### Training \u0026 Testing\n\n#### Preprocessing dataset\n\nWe heuristically generate the \"facts\" and \"mapping\" fields from the raw dataset; the results are stored under the folder `dataset_tagop`.\n\n\n#### 
Prepare dataset\n\n```bash\nPYTHONPATH=$PYTHONPATH:$(pwd):$(pwd)/tag_op python tag_op/prepare_dataset.py --mode [train/dev/test]\n```\n\nNote: The result will be written into the folder `./tag_op/cache` by default.\n\n#### Train \u0026 Evaluation \n```bash\nCUDA_VISIBLE_DEVICES=2 PYTHONPATH=$PYTHONPATH:$(pwd) python tag_op/trainer.py --data_dir tag_op/cache/ \\\n--save_dir ./checkpoint --batch_size 48 --eval_batch_size 8 --max_epoch 50 --warmup 0.06 --optimizer adam --learning_rate 5e-4 \\\n--weight_decay 5e-5 --seed 123 --gradient_accumulation_steps 4 --bert_learning_rate 1.5e-5 --bert_weight_decay 0.01 \\\n--log_per_updates 50 --eps 1e-6 --encoder roberta\n```\n\n#### Testing\n```bash\nCUDA_VISIBLE_DEVICES=2 PYTHONPATH=$PYTHONPATH:$(pwd) python tag_op/predictor.py --data_dir tag_op/cache/ --test_data_dir tag_op/cache/ \\\n--save_dir tag_op/ --eval_batch_size 32 --model_path ./checkpoint --encoder roberta\n```\n\nNote: The training process may take around 2 days on a single 32GB V100 GPU.\n\n#### Checkpoint\nYou may download a checkpoint of the trained TagOp model via [TagOp Checkpoint](https://drive.google.com/file/d/1Ttyh1xyulsGcOt_JmFsAhPuxx7G3fyha/view?usp=share_link).\n\n\n### Citation\n\n__Please cite our work if you use our dataset or code. Thank you.__\n```bibtex\n@inproceedings{zhu-etal-2021-tat,\n    title = \"{TAT}-{QA}: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance\",\n    author = \"Zhu, Fengbin  and\n      Lei, Wenqiang  and\n      Huang, Youcheng  and\n      Wang, Chao  and\n      Zhang, Shuo  and\n      Lv, Jiancheng  and\n      Feng, Fuli  and\n      Chua, Tat-Seng\",\n    booktitle = \"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)\",\n    month = aug,\n    year = \"2021\",\n    address = \"Online\",\n    publisher = \"Association for Computational Linguistics\",\n 
   url = \"https://aclanthology.org/2021.acl-long.254\",\n    doi = \"10.18653/v1/2021.acl-long.254\",\n    pages = \"3277--3287\"\n}\n```\n### License\n\nThe TAT-QA dataset is licensed under [Creative Commons (CC BY) Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/).\n\n### Any Questions?\n\nFor any issues, please create an issue [here](https://github.com/nextplusplus/tat-qa/issues) or email the author, Fengbin Zhu: [zhfengbin@gmail.com](mailto:zhfengbin@gmail.com). Thank you.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/nextplusplus.github.io%2FTAT-QA%2F","html_url":"https://awesome.ecosyste.ms/projects/nextplusplus.github.io%2FTAT-QA%2F","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/nextplusplus.github.io%2FTAT-QA%2F/lists"}