{"id":13766300,"url":"https://github.com/TalSchuster/CrossLingualContextualEmb","last_synced_at":"2025-05-10T21:33:23.581Z","repository":{"id":89679721,"uuid":"172386134","full_name":"TalSchuster/CrossLingualContextualEmb","owner":"TalSchuster","description":"Cross-Lingual Alignment of Contextual Word Embeddings","archived":false,"fork":false,"pushed_at":"2020-02-12T22:10:00.000Z","size":46,"stargazers_count":99,"open_issues_count":3,"forks_count":9,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-04-19T18:10:07.265Z","etag":null,"topics":["allennlp","bert","contextual-embeddings","crosslingual","elmo","nlp","pytorch","wordembeddings","zeroshot-learning"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TalSchuster.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-02-24T20:25:10.000Z","updated_at":"2024-11-28T18:23:42.000Z","dependencies_parsed_at":"2024-01-25T17:13:15.405Z","dependency_job_id":null,"html_url":"https://github.com/TalSchuster/CrossLingualContextualEmb","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TalSchuster%2FCrossLingualContextualEmb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TalSchuster%2FCrossLingualContextualEmb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TalSchuster%2FCrossLingualContextualEmb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
TalSchuster%2FCrossLingualContextualEmb/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TalSchuster","download_url":"https://codeload.github.com/TalSchuster/CrossLingualContextualEmb/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253486336,"owners_count":21916136,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["allennlp","bert","contextual-embeddings","crosslingual","elmo","nlp","pytorch","wordembeddings","zeroshot-learning"],"created_at":"2024-08-03T16:00:53.619Z","updated_at":"2025-05-10T21:33:23.319Z","avatar_url":"https://github.com/TalSchuster.png","language":"Python","funding_links":[],"categories":["Vector Mapping"],"sub_categories":[],"readme":"# CrossLingualELMo\nCross-Lingual Alignment of Contextual Word Embeddings\n\nThis repo will contain the code and models for the NAACL19 paper - [Cross-Lingual Alignment of Contextual Word Embeddings,  with Applications to Zero-shot Dependency Parsing](https://arxiv.org/abs/1902.09492)\n\nMore pieces of the code will be released soon.\n\n## Updates:\n\n* Computed anchors for English (to help with the alignment computation for more languages)\n\n* Alignment matrices for all layers of ELMO.\n\n* A script to compute anchors for a BERT model is now available.\n\n* The Multilingual ELMo is now merged to the [AllenNLP framework](https://github.com/allenai/allennlp) (version \u003e= 0.8.5). 
Anchors for other models can be computed using the code here.\n\n* \u003cdel\u003e We are working on merging the Multilingual ELMo to the AllenNLP framework. Hopefully we will get to finish it soon.\n\n* \u003cdel\u003e The multi-lingual parser code is now available at this [allennlp fork](https://github.com/TalSchuster/allennlp-MultiLang) (`requirements.txt` file of this repo is updated accordingly). See more details in the **Usage** section below.\n\n\n# Aligned Multi Lingual Deep Contextual Word Embeddings\n\n## Embeddings\n\nThe following models were trained on Wikipedia. We provide the alignment of the first LSTM output of ELMo to English. The English file contains the identity matrix divided by the average norm for that layer.\n\n| Language        | Model weights | Alignment matrix (First LSTM layer) *  |\n| ------------- |:-------------:| :-----:|\n| English     | [weights.hdf5](https://www.dropbox.com/s/1h62kc1qdcuyy2u/en_weights.hdf5) | [en_best_mapping.pth](https://www.dropbox.com/s/nufj4pxxgv5838r/en_best_mapping.pth) |\n| Spanish     | [weights.hdf5](https://www.dropbox.com/s/ygfjm7zmufl5gu2/es_weights.hdf5) | [es_best_mapping.pth](https://www.dropbox.com/s/6kqot8ssy66d5u0/es_best_mapping.pth) |\n| French     | [weights.hdf5](https://www.dropbox.com/s/mm64goxb8wbawhj/fr_weights.hdf5) | [fr_best_mapping.pth](https://www.dropbox.com/s/0zdlanjhajlgflm/fr_best_mapping.pth) |\n| Italian     | [weights.hdf5](https://www.dropbox.com/s/owfou7coi04dyxf/it_weights.hdf5) | [it_best_mapping.pth](https://www.dropbox.com/s/gg985snnhajhm5i/it_best_mapping.pth) |\n| Portuguese     | [weights.hdf5](https://www.dropbox.com/s/ul82jsal1khfw5b/pt_weights.hdf5) | [pt_best_mapping.pth](https://www.dropbox.com/s/skdfz6zfud24iup/pt_best_mapping.pth) |\n| Swedish     | [weights.hdf5](https://www.dropbox.com/s/boptz21zrs4h3nw/sv_weights.hdf5) | [sv_best_mapping.pth](https://www.dropbox.com/s/o7v64hciyifvs8k/sv_best_mapping.pth) |\n| German     | 
[weights.hdf5](https://www.dropbox.com/s/2kbjnvb12htgqk8/de_weights.hdf5) | [de_best_mapping.pth](https://www.dropbox.com/s/u9cg19o81lpm0h0/de_best_mapping.pth) |\n\n\\* Alignments for layer 0 (pre LSTM) and layer 2 (post LSTM) for all above languages - [alignments_0_2.zip](https://www.dropbox.com/s/ymnyptj3lupvcw7/alignments_0_2.zip)\n\n* Unsupervised alignments for layer 1 - [alignments_unsupervised.zip](https://www.dropbox.com/s/sgi86uc8stu70bg/alignments_unsupervised.zip)\n\n* Options file (for all models) - [options262.json](https://www.dropbox.com/s/ypjuzlf7kj957g3/options262.json)\n\n* Computed anchors for the English model - [english_anchors.zip](https://www.dropbox.com/s/8ad5oqhbh3xlnnf/english_anchors.zip)\n\n#### Download helpers:\n\n* To download all the ELMo models in the table, use `get_models.sh`.\n\n* To download all of the alignment matrices in the table, use `get_alignments.sh`.\n\n* Alternatively, if you want to use the models in an AllenNLP model, you can just add the path to the configuration file (see the examples in `allen_configs`).\n\n### Generating anchors\n\nTo generate your own anchors, use the `gen_anchors.py` script. You will need a trained ELMo model, text files with one sentence per line, and a vocab file with one token per line, listing the tokens for which you want to compute anchors.\nRun `gen_anchors.py -h` for more details.\n\n## Usage\n\n### Generating aligned contextual embeddings\n\nGiven the output of a specific layer from ELMo (the contextual embeddings), run:\n```\nimport numpy as np\nimport torch\naligning = torch.load(aligning_matrix_path)\naligned_embeddings = np.matmul(embeddings, aligning.transpose())\n```\n\nAn example can be seen in `demo.py`. \n\n### Replicating the zero-shot cross-lingual dependency parsing results\n\n1. 
Create an environment to install our fork of allennlp:\n\n```\nvirtualenv -p /usr/bin/python3.6 allennlp_env\n```\nor, if you are using conda:\n```\nconda create -n allennlp_env python=3.6\n```\n\n2. Activate the environment and install allennlp:\n\n```\nsource allennlp_env/bin/activate\npip install -r requirements.txt\n```\n\n3. Download the [uni-dep-tb](https://github.com/ryanmcd/uni-dep-tb) dataset (version 2) and follow the instructions to generate the [English PTB data](https://catalog.ldc.upenn.edu/LDC99T42)\n4. Train the model (the provided configuration is for 'es' as a target language):\n```\nTRAIN_PATHNAME='universal_treebanks_v2.0/std/**/*train.conll' \\\nDEV_PATHNAME='universal_treebanks_v2.0/std/**/*dev.conll' \\\nTEST_PATHNAME='universal_treebanks_v2.0/std/**/*test.conll' \\\nallennlp train training_config/multilang_dependency_parser.jsonnet -s path_to_output_dir;\n```\n\n\n### Using in any model\n\nThe alignments can be used with the [AllenNLP](https://allennlp.org) framework by taking any model with ELMo embeddings and replacing the paths in the configuration with our provided models.\n\nEach ELMo model was trained on the Wikipedia of the relevant language. To align the models, you will need to add the following code to your model:\n\nLoad the alignment matrix in the `__init__()` function:\n\n```\naligning_matrix_path = ... 
(pth file)\nself.aligning_matrix = torch.FloatTensor(torch.load(aligning_matrix_path))\nself.aligning = torch.nn.Linear(self.aligning_matrix.shape[1], self.aligning_matrix.shape[0], bias=False)\nself.aligning.weight = torch.nn.Parameter(self.aligning_matrix, requires_grad=False)\n```\n\nThen, simply apply the alignment on the embedded tokens in the `forward()` pass:\n```\nembedded_text = self.aligning(embedded_text)\n```\n\n\n\n\n# Citation\n\nIf you find this repo useful, please cite our paper.\n\n```\n@InProceedings{Schuster2019,\n    title = \"Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing\",\n    author = \"Schuster, Tal  and\n      Ram, Ori  and\n      Barzilay, Regina  and\n      Globerson, Amir\",\n    booktitle = \"Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)\",\n    month = jun,\n    year = \"2019\",\n    address = \"Minneapolis, Minnesota\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/N19-1162\",\n    pages = \"1599--1613\"\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTalSchuster%2FCrossLingualContextualEmb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FTalSchuster%2FCrossLingualContextualEmb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTalSchuster%2FCrossLingualContextualEmb/lists"}