{"id":14181793,"url":"https://github.com/facebookresearch/tabert","last_synced_at":"2025-08-07T14:31:17.923Z","repository":{"id":66082772,"uuid":"269777071","full_name":"facebookresearch/TaBERT","owner":"facebookresearch","description":"This repository contains source code for the TaBERT model, a pre-trained language model for learning joint representations of natural language utterances and (semi-)structured tables for semantic parsing. TaBERT is pre-trained on a massive corpus of 26M Web tables and their associated natural language context, and could be used as a drop-in replacement of a semantic parsers original encoder to compute representations for utterances and table schemas (columns).","archived":true,"fork":false,"pushed_at":"2021-08-26T01:24:53.000Z","size":2977,"stargazers_count":576,"open_issues_count":25,"forks_count":63,"subscribers_count":30,"default_branch":"main","last_synced_at":"2024-03-04T16:48:57.998Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/facebookresearch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2020-06-05T21:06:10.000Z","updated_at":"2024-02-28T07:27:23.000Z","dependencies_parsed_at":null,"dependency_job_id":"74ca278e-ba75-44fd-b571-49a42c712261","html_url":"https://github.com/facebookresearch/TaBERT","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FTaBERT","tags_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FTaBERT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FTaBERT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FTaBERT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/facebookresearch","download_url":"https://codeload.github.com/facebookresearch/TaBERT/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":215857679,"owners_count":15940684,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-18T11:04:13.547Z","updated_at":"2024-08-18T11:04:21.764Z","avatar_url":"https://github.com/facebookresearch.png","language":"Python","funding_links":[],"categories":["💬 Classic Model"],"sub_categories":[],"readme":"# TaBERT: Learning Contextual Representations for Natural Language Utterances and Structured Tables\n\nThis repository contains source code for the [`TaBERT` model](https://arxiv.org/abs/2005.08314), a pre-trained language model for learning joint representations of natural language utterances and (semi-)structured tables for semantic parsing. 
`TaBERT` is pre-trained on a massive corpus of 26M Web tables and their associated natural language context, and can be used as a drop-in replacement for a semantic parser's original encoder to compute representations for utterances and table schemas (columns).\n\n## Installation\n\nFirst, install the conda environment `tabert` with supporting libraries.\n\n```bash\nbash scripts/setup_env.sh\n```\n\nOnce the conda environment is created, install `TaBERT` using the following command:\n\n```bash\nconda activate tabert\npip install --editable .\n```\n\n**Integration with HuggingFace's pytorch-transformers Library** is still a work in progress. While all the pre-trained models were developed based on the old version of the library `pytorch-pretrained-bert`, they are compatible with the latest version `transformers`. The conda environment will install both versions of the transformers library, and `TaBERT` will use `pytorch-pretrained-bert` by default. You can uninstall the `pytorch-pretrained-bert` library if you prefer using `TaBERT` with the latest version of `transformers`.\n\n## Pre-trained Models\n\nPre-trained models can be downloaded from this [Google Drive shared folder](https://drive.google.com/drive/folders/1fDW9rLssgDAv19OMcFGgFJ5iyd9p7flg?usp=sharing).\nPlease uncompress the tarball files before use.\n\nThe models can also be downloaded from the command line as follows:\n```shell script\npip install gdown\n\n# TaBERT_Base_(k=1)\ngdown 'https://drive.google.com/uc?id=1-pdtksj9RzC4yEqdrJQaZu4-dIEXZbM9'\n\n# TaBERT_Base_(K=3)\ngdown 'https://drive.google.com/uc?id=1NPxbGhwJF1uU9EC18YFsEZYE-IQR7ZLj'\n\n# TaBERT_Large_(k=1)\ngdown 'https://drive.google.com/uc?id=1eLJFUWnrJRo6QpROYWKXlbSOjRDDZ3yZ'\n\n# TaBERT_Large_(K=3)\ngdown 'https://drive.google.com/uc?id=17NTNIqxqYexAzaH_TgEfK42-KmjIRC-g'\n```\n\n## Using a Pre-trained Model\n\nTo load a pre-trained model from a checkpoint file:\n\n```python\nfrom table_bert import TableBertModel\n\nmodel = 
TableBertModel.from_pretrained(\n    'path/to/pretrained/model/checkpoint.bin',\n)\n```\n\nTo produce representations of natural language text and its associated table:\n```python\nfrom table_bert import Table, Column\n\ntable = Table(\n    id='List of countries by GDP (PPP)',\n    header=[\n        Column('Nation', 'text', sample_value='United States'),\n        Column('Gross Domestic Product', 'real', sample_value='21,439,453')\n    ],\n    data=[\n        ['United States', '21,439,453'],\n        ['China', '27,308,857'],\n        ['European Union', '22,774,165'],\n    ]\n).tokenize(model.tokenizer)\n\n# To visualize table in an IPython notebook:\n# display(table.to_data_frame(), detokenize=True)\n\ncontext = 'show me countries ranked by GDP'\n\n# model takes batched, tokenized inputs\ncontext_encoding, column_encoding, info_dict = model.encode(\n    contexts=[model.tokenizer.tokenize(context)],\n    tables=[table]\n)\n```\n\nFor the returned tuple, `context_encoding` and `column_encoding` are PyTorch tensors \nrepresenting utterances and table columns, respectively. `info_dict` contains useful \nmeta information (e.g., context/table masks, the original input tensors to BERT) for \ndownstream applications.\n\n```python\ncontext_encoding.shape\n\u003e\u003e\u003e torch.Size([1, 7, 768])\n\ncolumn_encoding.shape\n\u003e\u003e\u003e torch.Size([1, 2, 768])\n```\n\n**Use Vanilla BERT** To initialize a TaBERT model from the parameters of BERT:\n\n```python\nfrom table_bert import TableBertModel\n\nmodel = TableBertModel.from_pretrained('bert-base-uncased')\n```\n\n## Example Applications\n\nTaBERT can be used as a general-purpose representation learning layer for semantic parsing tasks over database tables. 
\nExample applications can be found under the `examples` folder.\n\n## Extract/Preprocess Table Corpora from CommonCrawl and Wikipedia\n\n### Prerequisite\n\nThe following libraries are used for data extraction:\n\n* [`jnius`](https://pyjnius.readthedocs.io/en/stable/)\n* [`info.bliki.wiki`](https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Mediawiki2HTML)\n* wikitextparser\n* Beautiful Soup 4\n* Java Wikipedia code located at `contrib/wiki_extractor`\n    * It compiles to a `.jar` file using maven, which is also included in the folder\n* `jdk` 12+\n\n### Installation\nFirst, you need to install Java JDK. \nThen use the following command to install necessary Python libraries. \n\n```\npip install -r preprocess/requirements.txt\npython -m spacy download en_core_web_sm\n```\n\n### Training Table Corpora Extraction\n\n#### CommonCrawl WDC Web Table Corpus 2015\n\nDetails of the dataset can be found [here](http://webdatacommons.org/webtables/2015/downloadInstructions.html).\nWe used the English relational tables split, which can be downloaded [here](http://data.dws.informatik.uni-mannheim.de/webtables/2015-07/englishCorpus/compressed/).\n\nThe script to preprocess the data is at `scripts/preprocess_commoncrawl_tables.sh`.\nThe following command pre-processes [a sample](http://data.dws.informatik.uni-mannheim.de/webtables/2015-07/sample.gz) \nof the whole WDC dataset. 
To preprocess the whole dataset, simply replace \nthe `input_file` with the root folder of the downloaded tarball files.\n```shell script\nmkdir -p data/datasets\nwget http://data.dws.informatik.uni-mannheim.de/webtables/2015-07/sample.gz -P data/datasets\ngzip -d \u003c data/datasets/sample.gz \u003e data/datasets/commoncrawl.sample.jsonl\n\npython \\\n    -m preprocess.common_crawl \\\n    --worker_num 12 \\\n    --input_file data/datasets/commoncrawl.sample.jsonl \\\n    --output_file data/preprocessed_data/common_crawl.preprocessed.jsonl\n```\n\n#### Wikipedia Tables\n\nThe script to extract Wiki tables is at `scripts/extract_wiki_tables.sh`. It demonstrates\nextracting tables from a sampled Wikipedia dump. Again, you may need the full Wikipedia dump\nto perform data extraction.\n\n### Notes for Table Extraction\n\n**Extract Tables from Scraped HTML Pages** \nMost code in `preprocess.extract_wiki_data` is for extracting surrounding \nnatural language sentences around tables. If you are only interested in \nextracting tables (e.g., from scraped Wiki Web pages), you can just use \nthe `extract_table_from_html` function. See the comments for more details. \n\n## Training Data Generation\n\nThis section documents how to generate training data for masked language modeling training \nfrom extracted and preprocessed tables. \n\nThe scripts to generate training data for our vanilla `TaBERT(K=1)` and vertical attention\n`TaBERT(k=3)` models are `utils/generate_vanilla_tabert_training_data.py` and \n`utils/generate_vertical_tabert_training_data.py`. They are heavily optimized for generating \ndata in parallel in a distributed compute environment, but can still be used locally. 
\n\nThe following script assumes you have concatenated\nthe `.jsonl` files obtained from running the data extraction scripts on Wikipedia and CommonCrawl\ncorpora and saved to `data/preprocessed_data/tables.jsonl`\n\n```shell script\ncd data/preprocessed_data\ncat common_crawl.preprocessed.jsonl wiki_tables.jsonl \u003e tables.jsonl\n```\n\nThe following script generates training data for a vanilla `TaBERT(K=1)` model:\n```shell script\noutput_dir=data/train_data/vanilla_tabert\nmkdir -p ${output_dir}\n\npython -m utils.generate_vanilla_tabert_training_data \\\n    --output_dir ${output_dir} \\\n    --train_corpus data/preprocessed_data/tables.jsonl \\\n    --base_model_name bert-base-uncased \\\n    --do_lower_case \\\n    --epochs_to_generate 15 \\\n    --max_context_len 128 \\\n    --table_mask_strategy column \\\n    --context_sample_strategy concate_and_enumerate \\\n    --masked_column_prob 0.2 \\\n    --masked_context_prob 0.15 \\\n    --max_predictions_per_seq 200 \\\n    --cell_input_template 'column|type|value' \\\n    --column_delimiter \"[SEP]\"\n```\n\nThe following script generates training data for a `TaBERT(K=3)` model with \nvertical self-attention:\n```shell script\noutput_dir=data/train_data/vertical_tabert\nmkdir -p ${output_dir}\n\npython -m utils.generate_vertical_tabert_training_data \\\n    --output_dir ${output_dir} \\\n    --train_corpus data/preprocessed_data/tables.jsonl \\\n    --base_model_name bert-base-uncased \\\n    --do_lower_case \\\n    --epochs_to_generate 15 \\\n    --max_context_len 128 \\\n    --table_mask_strategy column \\\n    --context_sample_strategy concate_and_enumerate \\\n    --masked_column_prob 0.2 \\\n    --masked_context_prob 0.15 \\\n    --max_predictions_per_seq 200 \\\n    --cell_input_template 'column|type|value' \\\n    --column_delimiter \"[SEP]\"\n```\n\n**Parallel Data Generation** The script has two additional arguments, `--global_rank` and \n`--world_size`. 
To generate training data in parallel using `N` processes, just fire up \n`N` processes with the same set of arguments and `--world_size=N`. The argument `--global_rank` \nis set to `[1, 2, ..., N]` for each process.\n\n## Model Training\nOur models are trained on a cluster of 32GB Tesla V100 GPUs. The following script demonstrates \ntraining a vanilla `TaBERT(k=1)` model using a single GPU with gradient accumulation:\n```shell script\nmkdir -p data/runs/vanilla_tabert\n\npython train.py \\\n    --task vanilla \\\n    --data-dir data/train_data/vanilla_tabert \\\n    --output-dir data/runs/vanilla_tabert \\\n    --table-bert-extra-config '{}' \\\n    --train-batch-size 8 \\\n    --gradient-accumulation-steps 32 \\\n    --learning-rate 2e-5 \\\n    --max-epoch 10 \\\n    --adam-eps 1e-08 \\\n    --weight-decay 0.0 \\\n    --fp16 \\\n    --clip-norm 1.0 \\\n    --empty-cache-freq 128\n```\n\nThe following script shows training a `TaBERT(k=3)` model with vertical self-attention:\n```shell script\nmkdir -p data/runs/vertical_tabert\n\npython train.py \\\n    --task vertical_attention \\\n    --data-dir data/train_data/vertical_tabert \\\n    --output-dir data/runs/vertical_tabert \\\n    --table-bert-extra-config '{\"base_model_name\": \"bert-base-uncased\", \"num_vertical_attention_heads\": 6, \"num_vertical_layers\": 3, \"predict_cell_tokens\": true}' \\\n    --train-batch-size 8 \\\n    --gradient-accumulation-steps 64 \\\n    --learning-rate 4e-5 \\\n    --max-epoch 10 \\\n    --adam-eps 1e-08 \\\n    --weight-decay 0.01 \\\n    --fp16 \\\n    --clip-norm 1.0 \\\n    --empty-cache-freq 128\n```\n\nDistributed training with multiple GPUs is similar to [XLM](https://github.com/facebookresearch/XLM).\n\n## Reference\n\nIf you plan to use `TaBERT` in your project, please consider citing [our paper](https://arxiv.org/abs/2005.08314):\n```\n@inproceedings{yin20acl,\n    title = {Ta{BERT}: Pretraining for Joint Understanding of Textual and Tabular Data},\n    author = 
{Pengcheng Yin and Graham Neubig and Wen-tau Yih and Sebastian Riedel},\n    booktitle = {Annual Conference of the Association for Computational Linguistics (ACL)},\n    month = {July},\n    year = {2020}\n}\n```\n\n## License\n\nTaBERT is CC-BY-NC 4.0 licensed as of now.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2Ftabert","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffacebookresearch%2Ftabert","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2Ftabert/lists"}