{"id":13638841,"url":"https://github.com/malteos/pytorch-bert-document-classification","last_synced_at":"2025-12-30T00:26:26.565Z","repository":{"id":87910381,"uuid":"198645878","full_name":"malteos/pytorch-bert-document-classification","owner":"malteos","description":"Enriching BERT with Knowledge Graph Embedding for Document Classification (PyTorch)","archived":false,"fork":false,"pushed_at":"2019-10-15T13:42:23.000Z","size":6090,"stargazers_count":156,"open_issues_count":0,"forks_count":23,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-08-03T01:13:36.132Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/1909.08402","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/malteos.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-07-24T13:55:24.000Z","updated_at":"2024-06-07T16:05:22.000Z","dependencies_parsed_at":"2024-01-14T08:58:54.698Z","dependency_job_id":"67aebce4-af0c-48d7-88fa-8d1b4e95fc03","html_url":"https://github.com/malteos/pytorch-bert-document-classification","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/malteos%2Fpytorch-bert-document-classification","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/malteos%2Fpytorch-bert-document-classification/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/malteos%2Fpytorch-bert-document-classification/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/repositories/malteos%2Fpytorch-bert-document-classification/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/malteos","download_url":"https://codeload.github.com/malteos/pytorch-bert-document-classification/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223810213,"owners_count":17206716,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T01:00:54.455Z","updated_at":"2025-12-30T00:26:26.511Z","avatar_url":"https://github.com/malteos.png","language":"Jupyter Notebook","funding_links":[],"categories":["Tasks"],"sub_categories":["Classification"],"readme":"# PyTorch BERT Document Classification\n\nImplementation and pre-trained models of the paper *Enriching BERT with Knowledge Graph Embedding for Document Classification* ([PDF](https://arxiv.org/abs/1909.08402)).\nA submission to the [GermEval 2019 shared task](https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/germeval-2019-hmc.html) on hierarchical text classification.\nIf you encounter any problems, feel free to contact us or submit a GitHub issue.\n\n## Content\n\n- CLI script to run all experiments\n- WikiData author embeddings ([view on Tensorboard Projector](http://projector.tensorflow.org/?config=https://raw.githubusercontent.com/malteos/pytorch-bert-document-classification/master/extras/projector_config.json))\n- Data preparation\n- Requirements\n- Trained model weights as [release files](https://github.com/malteos/pytorch-bert-document-classification/releases)\n\n## Model 
architecture\n\n![BERT + Knowledge Graph Embeddings](https://github.com/malteos/pytorch-bert-document-classification/raw/master/images/architecture.png)\n\n\n## Installation\n\nRequirements:\n- Python 3.6\n- CUDA GPU\n- Jupyter Notebook\n\nInstall dependencies:\n```\npip install -r requirements.txt\n```\n\n## Prepare data\n\n### GermEval data\n\n- Download from shared-task website: [here](https://competitions.codalab.org/competitions/20139)\n- Run all steps in Jupyter Notebook: [germeval-data.ipynb](#)\n\n### Author Embeddings\n\n- [Download pre-trained Wikidata embedding (30GB): Facebook PyTorch-BigGraph](https://github.com/facebookresearch/PyTorch-BigGraph#pre-trained-embeddings)\n- [Download WikiMapper index files (de+en)](https://github.com/jcklie/wikimapper#precomputed-indices)\n\n```\npython wikidata_for_authors.py run ~/datasets/wikidata/index_enwiki-20190420.db \\\n    ~/datasets/wikidata/index_dewiki-20190420.db \\\n    ~/datasets/wikidata/torchbiggraph/wikidata_translation_v1.tsv.gz \\\n    ~/notebooks/bert-text-classification/authors.pickle \\\n    ~/notebooks/bert-text-classification/author2embedding.pickle\n\n# OPTIONAL: Projector format\npython wikidata_for_authors.py convert_for_projector \\\n    ~/notebooks/bert-text-classification/author2embedding.pickle \\\n    extras/author2embedding.projector.tsv \\\n    extras/author2embedding.projector_meta.tsv\n\n```\n\n\n## Reproduce paper results\n\n\nDownload pre-trained models: [GitHub releases](https://github.com/malteos/pytorch-bert-document-classification/releases)\n\n\n### Available experiment settings\n\nDetailed settings for each experiment can be found in 
`cli.py`.\n\n```\ntask-a__bert-german_full\ntask-a__bert-german_manual_no-embedding\ntask-a__bert-german_no-manual_embedding\ntask-a__bert-german_text-only\ntask-a__author-only\ntask-a__bert-multilingual_text-only\n\ntask-b__bert-german_full\ntask-b__bert-german_manual_no-embedding\ntask-b__bert-german_no-manual_embedding\ntask-b__bert-german_text-only\ntask-b__author-only\ntask-b__bert-multilingual_text-only\n```\n\n### Environment variables\n\n- `TRAIN_DF_PATH`: Path to pandas DataFrame (pickle)\n- `GPU_ID`: Run experiments on this GPU (used for `CUDA_VISIBLE_DEVICES`)\n- `OUTPUT_DIR`: Directory to store experiment output\n- `EXTRAS_DIR`: Directory where author embeddings and [gender data](https://data.world/howarder/gender-by-name) are located\n- `BERT_MODELS_DIR`: Directory where pre-trained BERT models are located \n\n### Validation set\n\n```\npython cli.py run_on_val \u003cname\u003e $GPU_ID $EXTRAS_DIR $TRAIN_DF_PATH $VAL_DF_PATH $OUTPUT_DIR --epochs 5\n```\n\n### Test set\n\n```\npython cli.py run_on_test \u003cname\u003e $GPU_ID $EXTRAS_DIR $FULL_DF_PATH $TEST_DF_PATH $OUTPUT_DIR --epochs 5\n```\n\n### Evaluation\n\nThe scores from the result table can be reproduced with the `evaluation.ipynb` notebook.\n\n## How to cite\n\nIf you are using our code, please cite [our paper](https://arxiv.org/abs/1909.08402):\n```\n@inproceedings{Ostendorff2019,\n    address = {Erlangen, Germany},\n    author = {Ostendorff, Malte and Bourgonje, Peter and Berger, Maria and Moreno-Schneider, Julian and Rehm, Georg},\n    booktitle = {Proceedings of the GermEval 2019 Workshop},\n    title = {{Enriching BERT with Knowledge Graph Embedding for Document Classification}},\n    year = {2019}\n}\n```\n\n## References\n\n- [GermEval 2019 Task 1 on Codalab](https://competitions.codalab.org/competitions/20139)\n- [Google BERT Tensorflow](https://github.com/google-research/bert)\n- [Huggingface PyTorch Transformer](https://github.com/huggingface/pytorch-transformers)\n- [Deepset AI - 
BERT-german](https://deepset.ai/german-bert)\n- [Facebook PyTorch BigGraph](https://github.com/facebookresearch/PyTorch-BigGraph)\n\n## License\n\nMIT\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmalteos%2Fpytorch-bert-document-classification","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmalteos%2Fpytorch-bert-document-classification","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmalteos%2Fpytorch-bert-document-classification/lists"}