{"id":13711595,"url":"https://github.com/HIT-SCIR/ELMoForManyLangs","last_synced_at":"2025-05-06T21:31:43.781Z","repository":{"id":41070651,"uuid":"140356522","full_name":"HIT-SCIR/ELMoForManyLangs","owner":"HIT-SCIR","description":"Pre-trained ELMo Representations for Many Languages","archived":false,"fork":true,"pushed_at":"2021-05-19T07:05:17.000Z","size":105,"stargazers_count":1461,"open_issues_count":53,"forks_count":241,"subscribers_count":44,"default_branch":"master","last_synced_at":"2025-04-22T08:57:12.787Z","etag":null,"topics":["elmo","multilingual","nlp"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"bozheng-hit/ELMo","license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HIT-SCIR.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-07-10T00:31:06.000Z","updated_at":"2025-02-26T01:41:17.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/HIT-SCIR/ELMoForManyLangs","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HIT-SCIR%2FELMoForManyLangs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HIT-SCIR%2FELMoForManyLangs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HIT-SCIR%2FELMoForManyLangs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HIT-SCIR%2FELMoForManyLangs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HIT-SCIR","download_url":"https://codeload.github.com/HIT-SCIR/ELMoForManyLangs/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252772042,"owners_count":21801838,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["elmo","multilingual","nlp"],"created_at":"2024-08-02T23:01:09.712Z","updated_at":"2025-05-06T21:31:43.511Z","avatar_url":"https://github.com/HIT-SCIR.png","language":"Python","funding_links":[],"categories":["Repositories","Models","Language models","NLP per Language"],"sub_categories":["Word embeddings","Language Models and Word Embeddings"],"readme":"Pre-trained ELMo Representations for Many Languages\n===================================================\n\nWe release our ELMo representations trained on many languages\nwhich helps us win the [CoNLL 2018 shared task on Universal Dependencies Parsing](http://universaldependencies.org/conll18/results.html)\naccording to LAS.\n\n## Technique Details\n\nWe use the same hyperparameter settings as [Peters et al. (2018)](https://arxiv.org/abs/1802.05365) for the biLM\nand the character CNN.\nWe train their parameters\non a set of 20-million-words data randomly\nsampled from the raw text released by the shared task (wikidump + common crawl) for each language.\nWe largely based ourselves on the code of [AllenNLP](https://allennlp.org/), but made the following changes:\n\n* We support unicode characters;\n* We use the *sample softmax* technique\nto make training on large vocabulary feasible ([Jean et al., 2015](https://arxiv.org/abs/1412.2007)).\nHowever, we use a window of words surrounding the target word\nas negative samples and it shows better performance in our preliminary experiments.\n\nThe training of ELMo on one language takes roughly 3 days on an NVIDIA P100 GPU.\n\n\n## Downloads\n\n|   |   |   |   |\n|---|---|---|---|\n| [Arabic](http://vectors.nlpl.eu/repository/11/136.zip) | [Bulgarian](http://vectors.nlpl.eu/repository/11/137.zip) | [Catalan](http://vectors.nlpl.eu/repository/11/138.zip) | [Czech](http://vectors.nlpl.eu/repository/11/139.zip) |\n| [Old Church Slavonic](http://vectors.nlpl.eu/repository/11/140.zip) | [Danish](http://vectors.nlpl.eu/repository/11/141.zip) | [German](http://vectors.nlpl.eu/repository/11/142.zip) | [Greek](http://vectors.nlpl.eu/repository/11/143.zip) |\n| [English](http://vectors.nlpl.eu/repository/11/144.zip) | [Spanish](http://vectors.nlpl.eu/repository/11/145.zip) | [Estonian](http://vectors.nlpl.eu/repository/11/146.zip) | [Basque](http://vectors.nlpl.eu/repository/11/147.zip) |\n| [Persian](http://vectors.nlpl.eu/repository/11/148.zip) | [Finnish](http://vectors.nlpl.eu/repository/11/149.zip) | [French](http://vectors.nlpl.eu/repository/11/150.zip) | [Irish](http://vectors.nlpl.eu/repository/11/151.zip) |\n| [Galician](http://vectors.nlpl.eu/repository/11/152.zip) | [Ancient Greek](http://vectors.nlpl.eu/repository/11/153.zip) | [Hebrew](http://vectors.nlpl.eu/repository/11/154.zip) | [Hindi](http://vectors.nlpl.eu/repository/11/155.zip) |\n| [Croatian](http://vectors.nlpl.eu/repository/11/156.zip) | [Hungarian](http://vectors.nlpl.eu/repository/11/157.zip) | [Indonesian](http://vectors.nlpl.eu/repository/11/158.zip) | [Italian](http://vectors.nlpl.eu/repository/11/159.zip) |\n| [Japanese](http://vectors.nlpl.eu/repository/11/160.zip) | [Korean](http://vectors.nlpl.eu/repository/11/161.zip) | [Latin](http://vectors.nlpl.eu/repository/11/162.zip) | [Latvian](http://vectors.nlpl.eu/repository/11/163.zip) |\n| [Norwegian Bokmål](http://vectors.nlpl.eu/repository/11/165.zip) | [Dutch](http://vectors.nlpl.eu/repository/11/164.zip) | [Norwegian Nynorsk](http://vectors.nlpl.eu/repository/11/166.zip) | [Polish](http://vectors.nlpl.eu/repository/11/167.zip) |\n| [Portuguese](http://vectors.nlpl.eu/repository/11/168.zip) | [Romanian](http://vectors.nlpl.eu/repository/11/169.zip) | [Russian](http://vectors.nlpl.eu/repository/11/170.zip) | [Slovak](http://vectors.nlpl.eu/repository/11/171.zip) |\n| [Slovene](http://vectors.nlpl.eu/repository/11/172.zip) | [Swedish](http://vectors.nlpl.eu/repository/11/173.zip) | [Turkish](http://vectors.nlpl.eu/repository/11/174.zip) | [Uyghur](http://vectors.nlpl.eu/repository/11/175.zip) |\n| [Ukrainian](http://vectors.nlpl.eu/repository/11/176.zip) | [Urdu](http://vectors.nlpl.eu/repository/11/177.zip) | [Vietnamese](http://vectors.nlpl.eu/repository/11/178.zip) | [Chinese](http://vectors.nlpl.eu/repository/11/179.zip) |\n\nThe models are hosted on the [NLPL Vectors Repository](http://wiki.nlpl.eu/index.php/Vectors/home).\n\n**ELMo for Simplified Chinese**\n\nWe also provided [simplified-Chinese ELMo](http://39.96.43.154/zhs.model.tar.bz2).\nIt was trained on xinhua proportion of [Chinese gigawords-v5](https://catalog.ldc.upenn.edu/ldc2011t13),\nwhich is different from the Wikipedia for traditional Chinese ELMo.\n\n## Pre-requirements\n\n* **must** python \u003e= 3.6 (if you use python3.5, you will encounter this issue https://github.com/HIT-SCIR/ELMoForManyLangs/issues/8)\n* pytorch 0.4\n* other requirements from allennlp\n\n## Usage\n\n\n### Install the package\n\nYou need to install the package to use the embeddings with the following commends\n```\npython setup.py install\n```\n\n### Set up the `config_path`\nAfter unzip the model, you will find a JSON file `${lang}.model/config.json`.\nPlease change the `\"config_path\"` field to the relative path to \nthe model configuration `cnn_50_100_512_4096_sample.json`.\nFor example, if your ELMo model is `zht.model/config.json` and your model configuration\nis `zht.model/cnn_50_100_512_4096_sample.json`, you need to change `\"config_path\"`\nin `zht.model/config.json` to `cnn_50_100_512_4096_sample.json`.\n\nIf there is no configuration `cnn_50_100_512_4096_sample.json` under `${lang}.model`,\nyou can copy the `elmoformanylangs/configs/cnn_50_100_512_4096_sample.json` into `${lang}.model`,\nor change the `\"config_path\"` into  `elmoformanylangs/configs/cnn_50_100_512_4096_sample.json`.\n\nSee [issue 27](https://github.com/HIT-SCIR/ELMoForManyLangs/issues/27) for more details. \n\n\n### Use ELMoForManyLangs in command line\n\nPrepare your input file in the [conllu format](http://universaldependencies.org/format.html), like\n```\n1   Sue    Sue    _   _   _   _   _   _   _\n2   likes  like   _   _   _   _   _   _   _\n3   coffee coffee _   _   _   _   _   _   _\n4   and    and    _   _   _   _   _   _   _\n5   Bill   Bill   _   _   _   _   _   _   _\n6   tea    tea    _   _   _   _   _   _   _\n```\nFileds should be separated by `'\\t'`. We only use the second column and space (`' '`) is supported in\nthis field (for Vietnamese, a word can contains spaces).\nDo remember tokenization!\n\nWhen it's all set, run\n\n```\n$ python -m elmoformanylangs test \\\n    --input_format conll \\\n    --input /path/to/your/input \\\n    --model /path/to/your/model \\\n    --output_prefix /path/to/your/output \\\n    --output_format hdf5 \\\n    --output_layer -1\n```\n\nIt will dump an hdf5 encoded `dict` onto the disk, where the key is `'\\t'` separated\nwords in the sentence and the value is it's 3-layer averaged ELMo representation.\nYou can also dump the cnn encoded word with `--output_layer 0`,\nthe first layer of the LsTM with `--output_layer 1` and the second layer\nof the LSTM with `--output_layer 2`.  \nWe are actively changing the interface to make it more adapted to the \nAllenNLP ELMo and more programmatically friendly.\n\n### Use ELMoForManyLangs programmatically\n\nThanks @voidism for contributing the API.\nBy using `Embedder` python object, you can use ELMo into your own code like this:\n\n```python\nfrom elmoformanylangs import Embedder\n\ne = Embedder('/path/to/your/model/')\n\nsents = [['今', '天', '天氣', '真', '好', '阿'],\n['潮水', '退', '了', '就', '知道', '誰', '沒', '穿', '褲子']]\n# the list of lists which store the sentences \n# after segment if necessary.\n\ne.sents2elmo(sents)\n# will return a list of numpy arrays \n# each with the shape=(seq_len, embedding_size)\n```\n\n#### the parameters to init Embedder:\n```python\nclass Embedder(model_dir='/path/to/your/model/', batch_size=64):\n```\n- **model_dir**: the absolute path from the repo top dir to you model dir.\n- **batch_size**: the batch_size you want when the model inference, you can specify it properly according to your gpu/cpu ram size. (default: 64)\n\n#### the parameters of the function sents2elmo:\n```python\ndef sents2elmo(sents, output_layer=-1):\n```\n- **sents**: the list of lists which store the sentences after segment if necessary.\n- **output_layer**: the target layer to output. \n    -  0 for the word encoder\n    -  1 for the first LSTM hidden layer\n    -  2 for the second LSTM hidden layer\n    -  -1 for an average of 3 layers. (default)\n    -  -2 for all 3 layers\n\n## Training Your Own ELMo\n\nPlease run \n```\n$ python -m elmoformanylangs.biLM train -h\n```\nto get more details about the ELMo training. \n\nHere is an example for training English ELMo.\n```\n$ less data/en.raw\n... (snip) ...\nNotable alumni\nAris Kalafatis ( Acting )\nLabour Party\nThey build an open nest in a tree hole , or man - made nest - boxes .\nLegacy\n... (snip) ...\n\n$ python -m elmoformanylangs.biLM train \\\n    --train_path data/en.raw \\\n    --config_path elmoformanylangs/configs/cnn_50_100_512_4096_sample.json \\\n    --model output/en \\\n    --optimizer adam \\\n    --lr 0.001 \\\n    --lr_decay 0.8 \\\n    --max_epoch 10 \\\n    --max_sent_len 20 \\\n    --max_vocab_size 150000 \\\n    --min_count 3\n```\nHowever, we\nneed to add that the training process is not very stable.\nIn some cases, we end up with a loss of `nan`. We are actively working on that and hopefully\nimprove it in the future.\n\n## Citation\n\nIf our ELMo gave you nice improvements, please cite us.\n\n```\n@InProceedings{che-EtAl:2018:K18-2,\n  author    = {Che, Wanxiang  and  Liu, Yijia  and  Wang, Yuxuan  and  Zheng, Bo  and  Liu, Ting},\n  title     = {Towards Better {UD} Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation},\n  booktitle = {Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},\n  month     = {October},\n  year      = {2018},\n  address   = {Brussels, Belgium},\n  publisher = {Association for Computational Linguistics},\n  pages     = {55--64},\n  url       = {http://www.aclweb.org/anthology/K18-2005}\n}\n```\n\nPlease also cite the \n[NLPL Vectors Repository](http://wiki.nlpl.eu/index.php/Vectors/home)\nfor hosting the models.\n```\n@InProceedings{fares-EtAl:2017:NoDaLiDa,\n  author    = {Fares, Murhaf  and  Kutuzov, Andrey  and  Oepen, Stephan  and  Velldal, Erik},\n  title     = {Word vectors, reuse, and replicability: Towards a community repository of large-text resources},\n  booktitle = {Proceedings of the 21st Nordic Conference on Computational Linguistics},\n  month     = {May},\n  year      = {2017},\n  address   = {Gothenburg, Sweden},\n  publisher = {Association for Computational Linguistics},\n  pages     = {271--276},\n  url       = {http://www.aclweb.org/anthology/W17-0237}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FHIT-SCIR%2FELMoForManyLangs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FHIT-SCIR%2FELMoForManyLangs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FHIT-SCIR%2FELMoForManyLangs/lists"}