{"id":26145668,"url":"https://github.com/timoschick/bertram","last_synced_at":"2026-03-04T07:32:39.496Z","repository":{"id":44027737,"uuid":"258862161","full_name":"timoschick/bertram","owner":"timoschick","description":"This repository contains the code for \"BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Representations\".","archived":false,"fork":false,"pushed_at":"2020-08-13T15:05:06.000Z","size":34,"stargazers_count":63,"open_issues_count":4,"forks_count":12,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-14T03:09:30.205Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/1910.07181","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/timoschick.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-04-25T19:57:32.000Z","updated_at":"2024-11-04T08:10:45.000Z","dependencies_parsed_at":"2022-07-09T14:30:56.937Z","dependency_job_id":null,"html_url":"https://github.com/timoschick/bertram","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/timoschick/bertram","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timoschick%2Fbertram","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timoschick%2Fbertram/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timoschick%2Fbertram/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timoschick%2Fbertram/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/timoschick","download_url":"http
s://codeload.github.com/timoschick/bertram/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timoschick%2Fbertram/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30075429,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-04T05:31:57.858Z","status":"ssl_error","status_checked_at":"2026-03-04T05:31:38.462Z","response_time":59,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-11T04:54:17.371Z","updated_at":"2026-03-04T07:32:39.473Z","avatar_url":"https://github.com/timoschick.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BERTRAM (BERT for Attentive Mimicking)\n\nThis repository contains the code for [BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Representations](https://arxiv.org/abs/1910.07181). The paper introduces **BERTRAM**, a powerful architecture based on BERT that is capable of inferring high-quality embeddings for rare words that are suitable as input representations for deep language models. 
This is achieved by enabling the surface form and contexts of a word to interact with each other in a deep architecture.\n\n## 📑 Contents\n\n**[⚙️ Setup](#%EF%B8%8F-setup)**\n\n**[💬 Usage](#-usage)**\n\n**[💡 Training BERTRAM from Scratch](#-training-bertram-from-scratch)**\n\n**[💾 Pre-Trained Models](#-pre-trained-models)**\n\n**[📕 Citation](#-citation)**\n\n## ⚙️ Setup\n\nBERTRAM requires `Python\u003e=3.7`, `jsonpickle`, `numpy`, `pytorch`, `torchvision`, `scipy`, `gensim`, `visdom` and `transformers==2.1`. If you use `conda`, you can simply create an environment with all required dependencies from the `environment.yml` file found in the root of this repository. \n\n## 💬 Usage\n\nTo use BERTRAM for downstream tasks, you can either [download a pretrained model](#-pre-trained-models) or [train your own instance of BERTRAM](#-training-bertram-from-scratch). Note that each instance of BERTRAM can only be used in combination with the pretrained transformer model for which it was trained.\n\nTo use a pretrained BERTRAM instance, first initialize a `BertramWrapper` object as follows:\n\n```python\nbertram = BertramWrapper('../models/bertram-add-for-bert-base-uncased', device='cpu')\n```\n\nYou can infer embeddings for words from their surface-form and a (possibly empty) list of contexts using BERTRAM as follows:\n\n```python\nword = 'kumquat'\ncontexts =  ['litchi, pineapple and kumquat is planned for the greenhouse.', 'kumquat and cranberry sherbet']\nbertram.infer_vector(word, contexts)\n```\n\nTo directly inject a BERTRAM vector into a language model, you can use the `add_word_vectors_to_model()` method:\n\n```python\nmodel = BertForMaskedLM.from_pretrained('bert-base-uncased')\ntokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n\nwords_with_contexts = {\n    'kumquat': ['litchi, pineapple and kumquat is planned for the greenhouse.', 'kumquat and cranberry sherbet'],\n    'resigntaion': []\n}\nbertram.add_word_vectors_to_model(words_with_contexts, 
tokenizer, model)\n```\n\nFor each word `w` in the `words_with_contexts` dictionary, this adds a new token `\u003cBERTRAM:w\u003e` to the `tokenizer`'s vocabulary and adds the corresponding BERTRAM vector to the `model`'s embedding matrix. This way, the language model's original representation of `w` does not get lost. You can now represent each word `w` in various ways:\n\n```python\ninput_standard = 'A kumquat is a [MASK]'                     # this uses the LM's default representation of 'kumquat'\ninput_bertram  = 'A \u003cBERTRAM:kumquat\u003e is a [MASK]'           # this uses only the BERTRAM vector for 'kumquat'\ninput_slash    = 'A kumquat / \u003cBERTRAM:kumquat\u003e is a [MASK]' # this uses both representations\n```\n\nIn our experiments, we found the last variant (also called `BERTRAM-slash` in the paper) to perform best. A more detailed example can be found in `examples/use_bertram_for_mlm.py`.\n\n## 💡 Training BERTRAM from Scratch\n\nAs described in the paper, training a new BERTRAM instance requires the following steps: (1) training a context-only model, (2) training a form-only model, (3) combining both models and training the combined model. \n\n### Preparing a Training Corpus\n\nBefore training a BERTRAM model, you need (1) a large plain-text file and (2) a set of target vectors that BERTRAM is trained to mimic. \n\n#### Handling the Plain-Text File\n\nThe plain-text file needs to be preprocessed using the script found [here](https://github.com/timoschick/form-context-model) as follows:\n```\npython3 fcm/preprocess.py train --input $PATH_TO_YOUR_TEXT_CORPUS --output $TRAIN_DIR\n```\nThis creates various files in `$TRAIN_DIR`; the important ones are `train.vwc100` and all files of the form `train.bucket\u003cX\u003e`. The former contains words and their number of occurrences and is used by BERTRAM to build an *n*-gram vocabulary. The latter are used to generate contexts for training. 
Move all `train.bucket\u003cX\u003e` files into a separate folder `/buckets` inside `$TRAIN_DIR`.\n\n#### Obtaining Target Vectors\n\nTraining BERTRAM requires a file `$EMBEDDING_FILE` where each line is of the form `\u003cword\u003e \u003cembedding\u003e`. You can initialize this file simply by iterating over the entire (uncontextualized) embedding matrix of a pretrained language model (an example for `bert-base-uncased` can be found [here](https://www.cis.uni-muenchen.de/~schickt/embeddings-bert-base-uncased.txt)). Note that the training procedure described in the paper makes use of [One-Token-Approximation](https://github.com/timoschick/one-token-approximation) to also obtain embeddings for frequent *multi-token* words; these embeddings are used as additional training targets.\n\n### Training a Context-Only Model\n\nUse the following command to train a context-only BERTRAM model:\n\n```\npython3 train_bertram.py \\\n   --model_cls $MODEL_CLS \\\n   --bert_model $MODEL_NAME \\\n   --output_dir $CONTEXT_OUTPUT_DIR \\\n   --train_dir $TRAIN_DIR/buckets/ \\\n   --vocab $TRAIN_DIR/train.vwc100 \\\n   --emb_file $EMBEDDING_FILE \\\n   --num_train_epochs 5 \\\n   --emb_dim $EMB_DIM \\\n   --max_seq_length $MAX_SEQ_LENGTH \\\n   --mode context \\\n   --train_batch_size $TRAIN_BATCH_SIZE \\\n   --no_finetuning \\\n   --smin 4 \\\n   --smax 32\n```\nwhere\n- `$MODEL_CLS` is the class of the underlying language model (either `bert` or `roberta`)\n- `$MODEL_NAME` is the name of the underlying language model (e.g., `bert-base-uncased`, `roberta-large`)\n- `$CONTEXT_OUTPUT_DIR` is the output directory for the context-only model\n- `$TRAIN_DIR` is the training dir from the previous step\n- `$EMBEDDING_FILE` is the embedding file from the previous step\n- `$EMB_DIM` is the word embedding dimension of the target vectors (e.g., `768` for `bert-base-uncased`)\n- `$MAX_SEQ_LENGTH` is the maximum token length for each context\n- `$TRAIN_BATCH_SIZE` is the batch size to be used 
during training\n\n### Training a Form-Only Model\n\nUse the following command to train a form-only BERTRAM model:\n\n```\npython3 train_bertram.py \\\n   --model_cls $MODEL_CLS \\\n   --bert_model $MODEL_NAME \\\n   --output_dir $FORM_OUTPUT_DIR \\\n   --train_dir $TRAIN_DIR/buckets/ \\\n   --vocab $TRAIN_DIR/train.vwc100 \\\n   --emb_file $EMBEDDING_FILE \\\n   --num_train_epochs 20 \\\n   --emb_dim $EMB_DIM \\\n   --train_batch_size $TRAIN_BATCH_SIZE \\\n   --smin 1 \\\n   --smax 1 \\\n   --max_seq_length 10 \\\n   --mode form \\\n   --learning_rate 0.01 \\\n   --dropout 0.1\n```\nwhere `$MODEL_CLS`, `$MODEL_NAME`, `$TRAIN_DIR`, `$EMBEDDING_FILE`, `$EMB_DIM` and `$TRAIN_BATCH_SIZE` are as for the context-only model and `$FORM_OUTPUT_DIR` is the output directory for the form-only model.\n\n### Combining Both Models\n\nFuse both models as follows:\n\n```\npython3 fuse_models.py \\\n   --form_model $FORM_OUTPUT_DIR \\\n   --context_model $CONTEXT_OUTPUT_DIR \\\n   --mode $MODE \\\n   --output $FUSED_DIR\n```\nwhere `$FORM_OUTPUT_DIR` and `$CONTEXT_OUTPUT_DIR` are as before, `$MODE` is the configuration for the fused model (either `add` or `replace`) and `$FUSED_DIR` is the output directory for the fused model.\n\nThe fused model can then be trained as follows:\n\n```\npython3 train_bertram.py \\\n   --model_cls $MODEL_CLS \\\n   --bert_model $FUSED_DIR \\\n   --output_dir $OUTPUT_DIR \\\n   --train_dir $TRAIN_DIR/buckets/ \\\n   --vocab $TRAIN_DIR/train.vwc100 \\\n   --emb_file $EMBEDDING_FILE \\\n   --emb_dim $EMB_DIM \\\n   --mode $MODE \\\n   --train_batch_size $TRAIN_BATCH_SIZE \\\n   --max_seq_length $MAX_SEQ_LENGTH \\\n   --num_train_epochs 3 \\\n   --smin 4 \\\n   --smax 32 \\\n   --optimize_only_combinator\n```\nwhere `$MODEL_CLS`, `$FUSED_DIR`, `$TRAIN_DIR`, `$EMBEDDING_FILE`, `$EMB_DIM`, `$MODE`, `$MAX_SEQ_LENGTH` and `$TRAIN_BATCH_SIZE` are as before and `$OUTPUT_DIR` is the output directory for the final model.\n\n## 💾 Pre-Trained Models\n\n🚨 All 
pre-trained BERTRAM models released here were trained on significantly less data than BERT/RoBERTa (6GB vs 16GB/160GB). To get better results for downstream task applications, consider [training your own instance of BERTRAM](#-training-bertram-from-scratch).\n\n| BERTRAM Model Name                  | Configuration | Corresponding LM    | Link |\n| :---------------------------------- | :------------ | :------------------ | :--- |\n| `bertram-add-for-bert-base-uncased` | `ADD`         | `bert-base-uncased` | [📥 Download](https://www.cis.uni-muenchen.de/~schickt/bertram-add-for-bert-base-uncased.zip) |\n| `bertram-add-for-roberta-large`     | `ADD`         | `roberta-large`     | [📥 Download](https://www.cis.uni-muenchen.de/~schickt/bertram-add-for-roberta-large.zip) |\n\n## 📕 Citation\n\nIf you make use of the code in this repository, please cite the following paper:\n\n    @inproceedings{schick2020bertram,\n      title={{BERTRAM}: Improved Word Embeddings Have Big Impact on Contextualized Representations},\n      author={Schick, Timo and Sch{\\\"u}tze, Hinrich},\n      url={https://arxiv.org/abs/1910.07181},\n      booktitle={Proceedings of the 2020 Annual Conference of the Association for Computational Linguistics (ACL)},\n      year={2020}\n    } ","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftimoschick%2Fbertram","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftimoschick%2Fbertram","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftimoschick%2Fbertram/lists"}