{"id":17311180,"url":"https://github.com/guenthermi/table-embeddings","last_synced_at":"2025-04-14T14:12:34.539Z","repository":{"id":119971291,"uuid":"348638212","full_name":"guenthermi/table-embeddings","owner":"guenthermi","description":"Tools for training schema-aware Web table embedding for unsupervised and supervised machine learning on tabular data","archived":false,"fork":false,"pushed_at":"2024-04-14T10:07:46.000Z","size":69,"stargazers_count":19,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-28T03:12:18.436Z","etag":null,"topics":["embeddings","fasttext","ml","neural-network","schema","schema-data","tables","unsupervised-learning","web-table","word-embeddings"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/guenthermi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2021-03-17T08:46:28.000Z","updated_at":"2025-03-25T10:17:21.000Z","dependencies_parsed_at":"2024-04-14T11:24:05.274Z","dependency_job_id":"030047c6-44cb-4ff4-8c5a-0b05db3a068a","html_url":"https://github.com/guenthermi/table-embeddings","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guenthermi%2Ftable-embeddings","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guenthermi%2Ftable-embeddings/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guenthermi%2Ftable-embeddings/releases","manifests_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guenthermi%2Ftable-embeddings/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/guenthermi","download_url":"https://codeload.github.com/guenthermi/table-embeddings/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248894938,"owners_count":21179152,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["embeddings","fasttext","ml","neural-network","schema","schema-data","tables","unsupervised-learning","web-table","word-embeddings"],"created_at":"2024-10-15T12:39:44.767Z","updated_at":"2025-04-14T14:12:34.516Z","avatar_url":"https://github.com/guenthermi.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Pretrained Web Table Embeddings\n\nThis repository contains tools for training and evaluating Web table embedding with word embedding techniques.\nThose models can generate embeddings for schema terms and instance data terms making them especially useful for representing schema and class information as well as for ML tasks on tabular text data.\nFurthermore, this repository contains links to pre-trained web table models and the code for several tasks the models can be used for.\n\n## Install Package\n\nIf you want to install the package to encode text (from tables) into embedding representations, you can run\n\n```\npip install .\n```\n\nand load a pre-trained model as follows:\n\n```\nfrom table_embeddings import TableEmbeddingModel\nmodel = 
TableEmbeddingModel.load_model('ddrg/web_table_embeddings_combo64')\n\nembedding = model.get_header_vector('headline')\n```\n\nTo install all dependencies needed to run the evaluation tasks, run:\n\n```\npip install \".[full]\"\n```\n\n## Embedding Training\n\nThis repository provides tools for training four different types of Web table embedding models: *W-base*, *W-row*, *W-tax*, and *W-combo*.\nFor pre-training those embedding models the [DWTC Web Table Corpus](https://wwwdb.inf.tu-dresden.de/misc/dwtc/) can be used.\nAll modules required to run the Python scripts in this repository can be installed via pip.\n\nThe training data used to be available at https://wwwdb.inf.tu-dresden.de/research-projects/dresden-web-table-corpus/.\nIf you need the training data, contact the university using the contact information on that website.\n\n### Download DWTC Dump\n\nThe corpus can be downloaded as follows:\n```\nfor i in $(seq -w 0 500); do wget http://wwwdb.inf.tu-dresden.de/misc/dwtc/data_feb15/dwtc-$i.json.gz -P data/; done\n```\n\n### Filter Dump\n\nThe DWTC dump can be filtered with `embedding/filter_dump.py` and `embedding/filter_columns.py` to create a dump containing only columns of English tables with a table header.\nYou may adjust the path of the DWTC corpus in `config/dump_filter.json`.\n\n```\npython3 embedding/filter_dump.py -c config/dump_filter.json\npython3 embedding/filter_columns.py -c config/column_filter.json\n```\n\n\n### Construct Graph Representation\n\nTo train *W-tax* and *W-combo* embedding models, a header-data term graph needs to be constructed.\nFirst, an index file is constructed:\n\n```\npython3 embedding/build_index.py -i data/column_dump.json.gz -o data/indexes.json.gz\n```\n\nAfterward, the graph can be constructed:\n\n```\npython3 embedding/graph_generation.py -i data/indexes.json.gz -c config/header_data_graph_config.json\n```\n\n### Training of Embedding Models\n\nTo run the actual embedding training, one can execute 
`embedding/fasttext_web_table_embeddings.py` with one of the embedding configuration files in the `config` folder:\n\n```\npython3 embedding/fasttext_web_table_embeddings.py -c config/embedding_config_combo.json -o data/combo_model.bin -w\n```\n\n\n## Pre-Trained Models\n\nBelow you can find links to models trained on the DWTC corpus:\n\n| Model Type | Description | Download Links |\n| ---------- | ----------- | -------------- |\n| W-tax | Model of relations between table header and table body | ([64dim](https://huggingface.co/ddrg/web_table_embeddings_tax64), [150dim](https://huggingface.co/ddrg/web_table_embeddings_tax150)) |\n| W-row | Model of row-wise relations in tables | ([64dim](https://huggingface.co/ddrg/web_table_embeddings_row64), [150dim](https://huggingface.co/ddrg/web_table_embeddings_row150)) |\n| W-combo | Model of row-wise relations and relations between table header and table body | ([64dim](https://huggingface.co/ddrg/web_table_embeddings_combo64), [150dim](https://huggingface.co/ddrg/web_table_embeddings_combo150)) |\n| W-plain | Model of row-wise relations in tables without pre-processing | ([64dim](https://huggingface.co/ddrg/web_table_embeddings_plain64), [150dim](https://huggingface.co/ddrg/web_table_embeddings_plain150)) |\n\nTo load the models, use the `FastTextWebTableModel.load_model` function in `embedding/fasttext_web_table_embeddings.py`.\n\n## Evaluation\n\nBesides the embedding training, this repository contains the code of four evaluation tasks:\n\n* Representation of instance-of relations found in YAGO (`yago_class_evaluation/`)\n* Unionable Table Search (`unionability_search/`)\n* Table layout classification on Web tables (`table_layout_classification/`)\n* Spreadsheet cell classification (`deco_classifier/`)\n\nA detailed description of how to run each evaluation is provided in the respective folder.\n\n## References\n[Pre-Trained Web Table Embeddings for Table 
Discovery](https://dl.acm.org/doi/10.1145/3464509.3464892)\n```\n@inproceedings{gunther2021pre,\n  title={Pre-Trained Web Table Embeddings for Table Discovery},\n  author={G{\\\"u}nther, Michael and Thiele, Maik and Gonsior, Julius and Lehner, Wolfgang},\n  booktitle={Fourth Workshop in Exploiting AI Techniques for Data Management},\n  pages={24--31},\n  year={2021}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fguenthermi%2Ftable-embeddings","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fguenthermi%2Ftable-embeddings","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fguenthermi%2Ftable-embeddings/lists"}