{"id":13535300,"url":"https://github.com/facebookresearch/LAMA","last_synced_at":"2025-04-02T01:30:33.343Z","repository":{"id":37601732,"uuid":"178342783","full_name":"facebookresearch/LAMA","owner":"facebookresearch","description":" LAnguage Model Analysis","archived":false,"fork":false,"pushed_at":"2024-07-07T07:13:09.000Z","size":482,"stargazers_count":1333,"open_issues_count":39,"forks_count":180,"subscribers_count":72,"default_branch":"main","last_synced_at":"2024-08-02T08:10:08.802Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/facebookresearch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-03-29T06:06:05.000Z","updated_at":"2024-07-31T10:48:23.000Z","dependencies_parsed_at":"2022-07-12T16:33:08.679Z","dependency_job_id":"e6901c2c-89dd-470f-9035-0cdd7f8a3915","html_url":"https://github.com/facebookresearch/LAMA","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FLAMA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FLAMA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FLAMA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FLAMA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/facebookresearch","download_url":"https://codeload.github.com/facebookresearch/LAMA/tar.gz/r
efs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":222788514,"owners_count":17037777,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T08:00:53.042Z","updated_at":"2024-11-02T23:31:17.433Z","avatar_url":"https://github.com/facebookresearch.png","language":"Python","readme":"# LAMA: LAnguage Model Analysis\n\u003cimg align=\"middle\" src=\"img/logo.png\" height=\"256\" alt=\"LAMA\"\u003e\n\nLAMA is a probe for analyzing the factual and commonsense knowledge contained in pretrained language models. \u003cbr\u003e\n#### The dataset for the LAMA probe is available at https://dl.fbaipublicfiles.com/LAMA/data.zip \u003cbr\u003e\nLAMA contains a set of connectors to pretrained language models. \u003cbr\u003e\nLAMA exposes a transparent, unified interface to the following models:\n\n- Transformer-XL (Dai et al., 2019)\n- BERT (Devlin et al., 2018)\n- ELMo (Peters et al., 2018)\n- GPT (Radford et al., 2018)\n- RoBERTa (Liu et al., 2019)\n\nActually, LAMA is also a beautiful animal.\n\n## References\n\nThe LAMA probe is described in the following papers:\n\n```bibtex\n@inproceedings{petroni2019language,\n  title={Language Models as Knowledge Bases?},\n  author={F. Petroni and T. Rockt{\\\"{a}}schel and A. H. Miller and P. Lewis and A. Bakhtin and Y. Wu and S. 
Riedel},\n  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP)},\n  year={2019}\n}\n\n@inproceedings{petroni2020how,\n  title={How Context Affects Language Models' Factual Predictions},\n  author={Fabio Petroni and Patrick Lewis and Aleksandra Piktus and Tim Rockt{\\\"a}schel and Yuxiang Wu and Alexander H. Miller and Sebastian Riedel},\n  booktitle={Automated Knowledge Base Construction},\n  year={2020},\n  url={https://openreview.net/forum?id=025X0zPfn}\n}\n```\n\n## The LAMA probe\n\nTo reproduce our results:\n\n### 1. Create conda environment and install requirements\n\n(Optional) It is a good idea to use a separate conda environment, which can be created by running:\n```bash\nconda create -n lama37 -y python=3.7 \u0026\u0026 conda activate lama37\npip install -r requirements.txt\n```\n\n### 2. Download the data\n\n```bash\nwget https://dl.fbaipublicfiles.com/LAMA/data.zip\nunzip data.zip\nrm data.zip\n```\n\n### 3. Download the models\n\n#### DISCLAIMER: ~55 GB on disk\n\nInstall the spaCy model:\n```bash\npython3 -m spacy download en\n```\n\nDownload the models:\n```bash\nchmod +x download_models.sh\n./download_models.sh\n```\n\nThe script will create and populate a _pre-trained_language_models_ folder.\nIf you are interested in a particular model, please edit the script.\n\n\n### 4. Run the experiments\n\n```bash\npython scripts/run_experiments.py\n```\n\nResults will be logged in _output/_ and _last_results.csv_.\n\n## Other versions of LAMA\n\n### LAMA-UHN\n\nThis repository also provides a script (`scripts/create_lama_uhn.py`) to create the data used in (Poerner et al., 2019).\n\n### Negated-LAMA\nThis repository also gives the option to evaluate how pretrained language models handle negated probes (Kassner et al., 2019); set the flag `use_negated_probes` in `scripts/run_experiments.py`. 
Also, you should use this version of the LAMA probe: https://dl.fbaipublicfiles.com/LAMA/negated_data.tar.gz\n\n## What else can you do with LAMA?\n\n### 1. Encode a list of sentences\nand use the vectors in your downstream task!\n\n```bash\npip install -e git+https://github.com/facebookresearch/LAMA#egg=LAMA\n```\n\n```python\nimport argparse\nfrom lama.build_encoded_dataset import encode, load_encoded_dataset\n\nPARAMETERS = {\n    \"lm\": \"bert\",\n    \"bert_model_name\": \"bert-large-cased\",\n    \"bert_model_dir\": \"pre-trained_language_models/bert/cased_L-24_H-1024_A-16\",\n    \"bert_vocab_name\": \"vocab.txt\",\n    \"batch_size\": 32,\n}\n\nargs = argparse.Namespace(**PARAMETERS)\n\nsentences = [\n    [\"The cat is on the table .\"],  # single-sentence instance\n    [\"The dog is sleeping on the sofa .\", \"He makes happy noises .\"],  # two-sentence instance\n]\n\nencoded_dataset = encode(args, sentences)\nprint(\"Embedding shape: %s\" % str(encoded_dataset[0].embedding.shape))\nprint(\"Tokens: %r\" % encoded_dataset[0].tokens)\n\n# save the encoded dataset to disk\nencoded_dataset.save(\"test.pkl\")\n\n# load the encoded dataset from disk\nnew_encoded_dataset = load_encoded_dataset(\"test.pkl\")\nprint(\"Embedding shape: %s\" % str(new_encoded_dataset[0].embedding.shape))\nprint(\"Tokens: %r\" % new_encoded_dataset[0].tokens)\n```\n\n### 2. 
Fill a sentence with a gap\n\nUse the symbol `[MASK]` to specify the gap.\nOnly a single-token gap is supported, i.e., a single `[MASK]`.\n```bash\npython lama/eval_generation.py  \\\n--lm \"bert\"  \\\n--t \"The cat is on the [MASK].\"\n```\n\u003cimg align=\"middle\" src=\"img/cat_on_the_phone.png\" height=\"470\" alt=\"cat_on_the_phone\"\u003e\n\u003cimg align=\"middle\" src=\"img/cat_on_the_phone.jpg\" height=\"190\" alt=\"cat_on_the_phone\"\u003e\n\u003csub\u003e\u003csup\u003esource: https://commons.wikimedia.org/wiki/File:Bluebell_on_the_phone.jpg\u003c/sup\u003e\u003c/sub\u003e\n\nNote that you can use this functionality to answer _cloze-style_ questions, such as:\n\n```bash\npython lama/eval_generation.py  \\\n--lm \"bert\"  \\\n--t \"The theory of relativity was developed by [MASK] .\"\n```\n\n\n## Install LAMA with pip\n\nClone the repo:\n```bash\ngit clone git@github.com:facebookresearch/LAMA.git \u0026\u0026 cd LAMA\n```\nInstall as an editable package:\n```bash\npip install --editable .\n```\n\nIf you get an error on macOS, try running this instead:\n```bash\nCFLAGS=\"-Wno-deprecated-declarations -std=c++11 -stdlib=libc++\" pip install --editable .\n```\n\n\n## Language Model(s) options\n\nOption to indicate which language model(s) to use:\n* __--language-models/--lm__ : comma-separated list of language models (__REQUIRED__)\n\n### BERT\nBERT pretrained models can be loaded in two ways: (i) by passing the name of the model and using the Hugging Face cached version, or (ii) by passing the folder containing the vocabulary and the PyTorch pretrained model (see convert_tf_checkpoint_to_pytorch [here](https://github.com/huggingface/pytorch-pretrained-BERT) to convert the TensorFlow model to PyTorch).\n\n* __--bert-model-dir/--bmd__ : directory that contains the BERT pre-trained model and the vocabulary\n* __--bert-model-name/--bmn__ : name of the Hugging Face cached version of the BERT pre-trained model (default = 'bert-base-cased')\n* 
__--bert-vocab-name/--bvn__ : name of the vocabulary used to pre-train the BERT model (default = 'vocab.txt')\n\n\n### RoBERTa\n\n* __--roberta-model-dir/--rmd__ : directory that contains the RoBERTa pre-trained model and the vocabulary (__REQUIRED__)\n* __--roberta-model-name/--rmn__ : name of the RoBERTa pre-trained model (default = 'model.pt')\n* __--roberta-vocab-name/--rvn__ : name of the vocabulary used to pre-train the RoBERTa model (default = 'dict.txt')\n\n\n### ELMo\n\n* __--elmo-model-dir/--emd__ : directory that contains the ELMo pre-trained model and the vocabulary (__REQUIRED__)\n* __--elmo-model-name/--emn__ : name of the ELMo pre-trained model (default = 'elmo_2x4096_512_2048cnn_2xhighway')\n* __--elmo-vocab-name/--evn__ : name of the vocabulary used to pre-train the ELMo model (default = 'vocab-2016-09-10.txt')\n\n\n### Transformer-XL\n\n* __--transformerxl-model-dir/--tmd__ : directory that contains the pre-trained model and the vocabulary (__REQUIRED__)\n* __--transformerxl-model-name/--tmn__ : name of the pre-trained model (default = 'transfo-xl-wt103')\n\n\n### GPT\n\n* __--gpt-model-dir/--gmd__ : directory that contains the GPT pre-trained model and the vocabulary (__REQUIRED__)\n* __--gpt-model-name/--gmn__ : name of the GPT pre-trained model (default = 'openai-gpt')\n\n\n## Evaluate Language Model(s) Generation\n\nOptions:\n* __--text/--t__ : text to compute the generation for\n* __--i__ : interactive mode \u003cbr\u003e\nOne of the two is required.\n\nExample considering both BERT and ELMo:\n```bash\npython lama/eval_generation.py \\\n--lm \"bert,elmo\" \\\n--bmd \"pre-trained_language_models/bert/cased_L-24_H-1024_A-16/\" \\\n--emd \"pre-trained_language_models/elmo/original/\" \\\n--t \"The cat is on the [MASK].\"\n```\n\nExample considering only BERT with the default pre-trained model, in an interactive fashion:\n```bash\npython lama/eval_generation.py  \\\n--lm \"bert\"  \\\n--i\n```\n\n\n## Get Contextual Embeddings\n\n```bash\npython 
lama/get_contextual_embeddings.py \\\n--lm \"bert,elmo\" \\\n--bmn bert-base-cased \\\n--emd \"pre-trained_language_models/elmo/original/\"\n```\n\n## Unified vocabulary\nThe intersection of the vocabularies of all considered models:\n- [cased](https://dl.fbaipublicfiles.com/LAMA/common_vocab_cased.txt)\n- [lowercased](https://dl.fbaipublicfiles.com/LAMA/common_vocab_lowercased.txt)\n\n## Troubleshooting\n\nIf the module cannot be found, preface the python command with `PYTHONPATH=.`.\n\nIf the experiments fail on GPU memory allocation, try reducing the batch size.\n\n## Acknowledgements\n\n* [https://github.com/huggingface/pytorch-pretrained-BERT](https://github.com/huggingface/pytorch-pretrained-BERT)\n* [https://github.com/allenai/allennlp](https://github.com/allenai/allennlp)\n* [https://github.com/pytorch/fairseq](https://github.com/pytorch/fairseq)\n\n\n## Other References\n\n- __(Kassner et al., 2019)__ Nora Kassner, Hinrich Schütze. _Negated LAMA: Birds cannot fly_. arXiv preprint arXiv:1911.03343, 2019.\n\n- __(Poerner et al., 2019)__ Nina Poerner, Ulli Waltinger, and Hinrich Schütze. _BERT is Not a Knowledge Base (Yet): Factual Knowledge vs. Name-Based Reasoning in Unsupervised QA_. arXiv preprint arXiv:1911.03681, 2019.\n\n- __(Dai et al., 2019)__ Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. _Transformer-XL: Attentive language models beyond a fixed-length context_. CoRR, abs/1901.02860.\n\n- __(Peters et al., 2018)__ Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. _Deep contextualized word representations_. NAACL-HLT 2018.\n\n- __(Devlin et al., 2018)__ Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. _BERT: pre-training of deep bidirectional transformers for language understanding_. CoRR, abs/1810.04805.\n\n- __(Radford et al., 2018)__ Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. 
_Improving language understanding by generative pre-training_.\n\n- __(Liu et al., 2019)__ Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. _RoBERTa: A Robustly Optimized BERT Pretraining Approach_. arXiv preprint arXiv:1907.11692.\n\n\n## License\n\nLAMA is licensed under the CC-BY-NC 4.0 license. The text of the license can be found [here](LICENSE).\n","funding_links":[],"categories":["Factual Knowledge Probes","BERT language model and embedding:","Python","Papers","**Programming (learning)**"],"sub_categories":["Language Models","**Developer\\'s Tools**"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2FLAMA","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffacebookresearch%2FLAMA","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2FLAMA/lists"}