{"id":13564776,"url":"https://github.com/facebookresearch/XLM","last_synced_at":"2025-04-03T21:31:46.654Z","repository":{"id":39674212,"uuid":"168776578","full_name":"facebookresearch/XLM","owner":"facebookresearch","description":"PyTorch original implementation of Cross-lingual Language Model Pretraining.","archived":true,"fork":false,"pushed_at":"2023-02-14T14:44:13.000Z","size":188,"stargazers_count":2900,"open_issues_count":127,"forks_count":495,"subscribers_count":56,"default_branch":"main","last_synced_at":"2025-02-23T00:14:07.681Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/facebookresearch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-02-02T00:15:33.000Z","updated_at":"2025-02-19T13:03:40.000Z","dependencies_parsed_at":"2022-07-13T12:00:57.895Z","dependency_job_id":"518a54c2-656d-4569-a14d-22cf5f9c5611","html_url":"https://github.com/facebookresearch/XLM","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FXLM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FXLM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FXLM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FXLM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/facebookresearch","download_url":"https://cod
eload.github.com/facebookresearch/XLM/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247083701,"owners_count":20880895,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T13:01:35.713Z","updated_at":"2025-04-03T21:31:46.131Z","avatar_url":"https://github.com/facebookresearch.png","language":"Python","funding_links":[],"categories":["Python","Industrial Strength NLP","AutoML NLP","Natural Language Processing","Industry Strength NLP","Paper implementations｜论文实现","Paper implementations","Urdu NLP Tools, Libraries and Models"],"sub_categories":["Conversation \u0026 Translation","Other libraries｜其他库:","Other libraries:","Language Models"],"readme":"# XLM\n\n**NEW:** Added [XLM-R](https://arxiv.org/abs/1911.02116) model.\n\nPyTorch original implementation of [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291). 
Includes:\n- [Monolingual language model pretraining (BERT)](#i-monolingual-language-model-pretraining-bert)\n- [Cross-lingual language model pretraining (XLM)](#ii-cross-lingual-language-model-pretraining-xlm)\n- [Applications: Supervised / Unsupervised MT (NMT / UNMT)](#iii-applications-supervised--unsupervised-mt)\n- [Applications: Cross-lingual text classification (XNLI)](#iv-applications-cross-lingual-text-classification-xnli)\n- [Product-Key Memory Layers (PKM)](#v-product-key-memory-layers-pkm)\n\n\n\u003cbr\u003e\n\u003cbr\u003e\n\n![Model](https://dl.fbaipublicfiles.com/XLM/xlm_figure.jpg)\n\n\u003cbr\u003e\n\u003cbr\u003e\n\nXLM supports multi-GPU and multi-node training, and contains code for:\n- **Language model pretraining**:\n    - **Causal Language Model** (CLM)\n    - **Masked Language Model** (MLM)\n    - **Translation Language Model** (TLM)\n- **GLUE** fine-tuning\n- **XNLI** fine-tuning\n- **Supervised / Unsupervised MT** training:\n    - Denoising auto-encoder\n    - Parallel data training\n    - Online back-translation\n\n## Installation\n\nInstall the python package in editable mode with\n```bash\npip install -e .\n```\n\n## Dependencies\n\n- Python 3\n- [NumPy](http://www.numpy.org/)\n- [PyTorch](http://pytorch.org/) (currently tested on version 0.4 and 1.0)\n- [fastBPE](https://github.com/facebookresearch/XLM/tree/master/tools#fastbpe) (generate and apply BPE codes)\n- [Moses](https://github.com/facebookresearch/XLM/tree/master/tools#tokenizers) (scripts to clean and tokenize text only - no installation required)\n- [Apex](https://github.com/nvidia/apex#quick-start) (for fp16 training)\n\n\n## I. Monolingual language model pretraining (BERT)\nIn what follows we explain how you can download and use our pretrained XLM (English-only) BERT model. 
Then we explain how you can train your own monolingual model, and how you can fine-tune it on the GLUE tasks.\n\n### Pretrained English model\nWe provide our pretrained **XLM_en** English model, trained with the MLM objective.\n\n| Languages        | Pretraining | Model                                                               | BPE codes                                                     | Vocabulary                                                     |\n| ---------------- | ----------- |:-------------------------------------------------------------------:|:-------------------------------------------------------------:| --------------------------------------------------------------:|\n| English          |     MLM     | [Model](https://dl.fbaipublicfiles.com/XLM/mlm_en_2048.pth)         | [BPE codes](https://dl.fbaipublicfiles.com/XLM/codes_en)      | [Vocabulary](https://dl.fbaipublicfiles.com/XLM/vocab_en)    |\n\nwhich obtains better performance than BERT (see the [GLUE benchmark](https://gluebenchmark.com/leaderboard)) while being trained on the same data:\n\nModel | Score | CoLA | SST2 | MRPC | STS-B | QQP | MNLI_m | MNLI_mm | QNLI | RTE | WNLI | AX\n|:---: |:---: |:---: | :---: |:---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n`BERT` | 80.5 | 60.5 | 94.9 | 89.3/85.4 | 87.6/86.5 | 72.1/89.3 | 86.7 | 85.9 | 92.7 | 70.1 | 65.1 | 39.6\n`XLM_en` | **82.8** | **62.9** | **95.6** | **90.7/87.1** | **88.8/88.2** | **73.2/89.8** | **89.1** | **88.5** | **94.0** | **76.0** | **71.9** | **44.7**\n\nIf you want to **play around with the model and its representations**, just download the model and take a look at our **[ipython notebook](https://github.com/facebookresearch/XLM/blob/master/generate-embeddings.ipynb)** demo.\n\nOur **XLM** PyTorch English model is trained on the same data as the pretrained **BERT** [TensorFlow](https://github.com/google-research/bert) model (Wikipedia + Toronto Book Corpus). 
Our implementation does not use the next-sentence prediction task and has only 12 layers but higher capacity (665M parameters). Overall, our model achieves better performance than the original BERT on all GLUE tasks (cf. table above for comparison).\n\n### Train your own monolingual BERT model\nIn what follows, we explain how you can train a similar model on your own data.\n\n### 1. Preparing the data\nFirst, get the monolingual data (English Wikipedia; the [TBC corpus](https://yknzhu.wixsite.com/mbweb) is no longer hosted).\n```\n# Download and tokenize Wikipedia data in 'data/wiki/en.{train,valid,test}'\n# Note: the tokenization includes lower-casing and accent-removal\n./get-data-wiki.sh en\n```\n\n[Install fastBPE](https://github.com/facebookresearch/XLM/tree/master/tools#fastbpe) and **learn BPE** vocabulary (with 30,000 codes here):\n```\nOUTPATH=data/processed/XLM_en/30k  # path where processed files will be stored\nFASTBPE=tools/fastBPE/fast  # path to the fastBPE tool\n\n# create output path\nmkdir -p $OUTPATH\n\n# learn bpe codes on the training set (or only use a subset of it)\n$FASTBPE learnbpe 30000 data/wiki/txt/en.train \u003e $OUTPATH/codes\n```\n\nNow **apply BPE** tokenization to train/valid/test files:\n```\n$FASTBPE applybpe $OUTPATH/train.en data/wiki/txt/en.train $OUTPATH/codes \u0026\n$FASTBPE applybpe $OUTPATH/valid.en data/wiki/txt/en.valid $OUTPATH/codes \u0026\n$FASTBPE applybpe $OUTPATH/test.en data/wiki/txt/en.test $OUTPATH/codes \u0026\n```\n\nand get the post-BPE vocabulary:\n```\ncat $OUTPATH/train.en | $FASTBPE getvocab - \u003e $OUTPATH/vocab \u0026\n```\n\n**Binarize the data** to limit the size of the data we load in memory:\n```\n# This will create three files: $OUTPATH/{train,valid,test}.en.pth\n# After that we're all set\npython preprocess.py $OUTPATH/vocab $OUTPATH/train.en \u0026\npython preprocess.py $OUTPATH/vocab $OUTPATH/valid.en \u0026\npython preprocess.py $OUTPATH/vocab $OUTPATH/test.en \u0026\n```\n\n### 
2. Train the BERT model\nTrain your BERT model (without the next-sentence prediction task) on the preprocessed data:\n\n```\n\npython train.py\n\n## main parameters\n--exp_name xlm_en                          # experiment name\n--dump_path ./dumped                       # where to store the experiment\n\n## data location / training objective\n--data_path $OUTPATH                       # data location\n--lgs 'en'                                 # considered languages\n--clm_steps ''                             # CLM objective (for training GPT-2 models)\n--mlm_steps 'en'                           # MLM objective\n\n## transformer parameters\n--emb_dim 2048                             # embeddings / model dimension (2048 is big, reduce if only 16Gb of GPU memory)\n--n_layers 12                              # number of layers\n--n_heads 16                               # number of heads\n--dropout 0.1                              # dropout\n--attention_dropout 0.1                    # attention dropout\n--gelu_activation true                     # GELU instead of ReLU\n\n## optimization\n--batch_size 32                            # sequences per batch\n--bptt 256                                 # sequences length  (streams of 256 tokens)\n--optimizer adam_inverse_sqrt,lr=0.00010,warmup_updates=30000,beta1=0.9,beta2=0.999,weight_decay=0.01,eps=0.000001  # optimizer (training is quite sensitive to this parameter)\n--epoch_size 300000                        # number of sentences per epoch\n--max_epoch 100000                         # max number of epochs (~infinite here)\n--validation_metrics _valid_en_mlm_ppl     # validation metric (when to save the best model)\n--stopping_criterion _valid_en_mlm_ppl,25  # stopping criterion (if criterion does not improve 25 times)\n--fp16 true                                # use fp16 training\n\n## bert parameters\n--word_mask_keep_rand '0.8,0.1,0.1'        # bert masking probabilities\n--word_pred '0.15'                         # 
predict 15 percent of the words\n\n## There are other parameters that are not specified here (see train.py).\n```\n\nTo [train with multiple GPUs](https://github.com/facebookresearch/XLM#how-can-i-run-experiments-on-multiple-gpus) use:\n```\nexport NGPU=8; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py\n```\n\n**Tips**: Even when the validation perplexity plateaus, keep training your model. The larger the batch size the better (so using multiple GPUs will improve performance). Tuning the learning rate (e.g. [0.0001, 0.0002]) should help.\n\n### 3. Fine-tune a pretrained model on GLUE tasks\nNow that the model is pretrained, let's **finetune** it. First, download and preprocess the **GLUE tasks**:\n\n```\n# Download and tokenize GLUE tasks in 'data/glue/{MNLI,QNLI,SST-2,STS-B}'\n\n./get-data-glue.sh\n\n# Preprocessing should be the same as for training.\n# If you removed lower-casing/accent-removal, it should be reflected here as well.\n```\n\nand **prepare the GLUE data** using the codes and vocab:\n```\n# by default this script uses the BPE codes and vocab of pretrained XLM_en. 
Modify in script if needed.\n./prepare-glue.sh\n```\n\nIn addition to the **train.py** script, we provide a complementary script **glue-xnli.py** to fine-tune a model on either GLUE or XNLI.\n\nYou can now **fine-tune the pretrained model** on one of the English GLUE tasks using this config:\n\n```\n# Config used for fine-tuning our pretrained English BERT model (mlm_en_2048.pth)\npython glue-xnli.py\n--exp_name test_xlm_en_glue              # experiment name\n--dump_path ./dumped                     # where to store the experiment\n--model_path mlm_en_2048.pth             # model location\n--data_path $OUTPATH                     # data location\n--transfer_tasks MNLI-m,QNLI,SST-2       # transfer tasks (GLUE tasks)\n--optimizer_e adam,lr=0.000025           # optimizer of the embedder, i.e. the pretrained model (lr \\in [0.000005, 0.000025, 0.000125])\n--optimizer_p adam,lr=0.000025           # optimizer of projection (lr \\in [0.000005, 0.000025, 0.000125])\n--finetune_layers \"0:_1\"                 # fine-tune all layers\n--batch_size 8                           # batch size (\\in [4, 8])\n--n_epochs 250                           # number of epochs\n--epoch_size 20000                       # number of sentences per epoch (relatively small on purpose)\n--max_len 256                            # max number of words in sentences\n--max_vocab -1                           # max number of words in vocab\n```\n**Tips**: You should sweep over the batch size (4 and 8) and the learning rate (5e-6, 2.5e-5, 1.25e-4) parameters.\n\n## II. Cross-lingual language model pretraining (XLM)\n\n### **XLM-R (new model)**\n[XLM-R](https://arxiv.org/abs/1911.02116) is the new state-of-the-art XLM model. XLM-R shows the possibility of training one model for many languages while not sacrificing per-language performance. It is trained on 2.5 TB of CommonCrawl data, in 100 languages. 
You can load XLM-R from torch.hub (PyTorch \u003e= 1.1):\n\n```python\n# XLM-R model\nimport torch\nxlmr = torch.hub.load('pytorch/fairseq', 'xlmr.large')\nxlmr.eval()\n```\n\nApply sentence-piece-model (SPM) encoding to input text:\n```python\nen_tokens = xlmr.encode('Hello world!')\nassert en_tokens.tolist() == [0, 35378,  8999, 38, 2]\nxlmr.decode(en_tokens)  # 'Hello world!'\n\nar_tokens = xlmr.encode('مرحبا بالعالم')\nassert ar_tokens.tolist() == [0, 665, 193478, 258, 1705, 77796, 2]\nxlmr.decode(ar_tokens) # 'مرحبا بالعالم'\n\nzh_tokens = xlmr.encode('你好，世界')\nassert zh_tokens.tolist() == [0, 6, 124084, 4, 3221, 2]\nxlmr.decode(zh_tokens)  # '你好，世界'\n```\n\nExtract features from XLM-R:\n\n```python\n# Extract the last layer's features\nlast_layer_features = xlmr.extract_features(zh_tokens)\nassert last_layer_features.size() == torch.Size([1, 6, 1024])\n\n# Extract all layers' features (layer 0 is the embedding layer)\nall_layers = xlmr.extract_features(zh_tokens, return_all_hiddens=True)\nassert len(all_layers) == 25\nassert torch.all(all_layers[-1] == last_layer_features)\n```\n\nXLM-R handles the following 100 languages: *Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil 
Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish*.\n\n### Pretrained cross-lingual language models\n\nWe provide large pretrained models for the 15 languages of [XNLI](https://github.com/facebookresearch/XNLI), and two other models in [17 and 100 languages](#the-17-and-100-languages).\n\n|Languages|Pretraining|Tokenization                          |  Model                                                              | BPE codes                                                            | Vocabulary                                                            |\n|---------|-----------|--------------------------------------| ------------------------------------------------------------------- | -------------------------------------------------------------------- | --------------------------------------------------------------------- |\n|15       |    MLM    |tokenize + lowercase + no accent + BPE| [Model](https://dl.fbaipublicfiles.com/XLM/mlm_xnli15_1024.pth)     | [BPE codes](https://dl.fbaipublicfiles.com/XLM/codes_xnli_15) (80k)  | [Vocabulary](https://dl.fbaipublicfiles.com/XLM/vocab_xnli_15) (95k)  |\n|15       | MLM + TLM |tokenize + lowercase + no accent + BPE| [Model](https://dl.fbaipublicfiles.com/XLM/mlm_tlm_xnli15_1024.pth) | [BPE codes](https://dl.fbaipublicfiles.com/XLM/codes_xnli_15) (80k)  | [Vocabulary](https://dl.fbaipublicfiles.com/XLM/vocab_xnli_15) (95k)  |\n|17       |    MLM    |tokenize + BPE                        | [Model](https://dl.fbaipublicfiles.com/XLM/mlm_17_1280.pth)         | [BPE codes](https://dl.fbaipublicfiles.com/XLM/codes_xnli_17) (175k) | [Vocabulary](https://dl.fbaipublicfiles.com/XLM/vocab_xnli_17) (200k) |\n|100      |    MLM    |tokenize + BPE                        | [Model](https://dl.fbaipublicfiles.com/XLM/mlm_100_1280.pth)        | [BPE codes](https://dl.fbaipublicfiles.com/XLM/codes_xnli_100) (175k)| 
[Vocabulary](https://dl.fbaipublicfiles.com/XLM/vocab_xnli_100) (200k)|\n\nwhich obtains better performance than mBERT on the [XNLI cross-lingual classification task](https://arxiv.org/abs/1809.05053):\n\nModel | lg | en | es | de | ar | zh | ur\n|:---: |:---: |:---: | :---: |:---: | :---: | :---: | :---: |\n`mBERT` | 102 | 81.4 | 74.3 | 70.5 | 62.1 | 63.8 | 58.3\n`XLM (MLM)` | 15 | 83.2 | 76.3 | 74.2 | 68.5 | 71.9 | 63.4\n`XLM (MLM+TLM)` | 15 | **85.0** | 78.9 | **77.8** | **73.1** | **76.5** | **67.3**\n`XLM (MLM)` | 17 | 84.8 | **79.4** | 76.2 | 71.5 | 75 | - \n`XLM (MLM)` | 100 | 83.7 | 76.6 | 73.6 | 67.4 | 71.7 | 62.9\n\nIf you want to play around with the model and its representations, just download the model and take a look at our [ipython notebook](https://github.com/facebookresearch/XLM/blob/master/generate-embeddings.ipynb) demo.\n\n#### The 17 and 100 Languages\n\nThe XLM-17 model includes these languages: en-fr-es-de-it-pt-nl-sv-pl-ru-ar-tr-zh-ja-ko-hi-vi\n\nThe XLM-100 model includes these languages: en-es-fr-de-zh-ru-pt-it-ar-ja-id-tr-nl-pl-simple-fa-vi-sv-ko-he-ro-no-hi-uk-cs-fi-hu-th-da-ca-el-bg-sr-ms-bn-hr-sl-zh_yue-az-sk-eo-ta-sh-lt-et-ml-la-bs-sq-arz-af-ka-mr-eu-tl-ang-gl-nn-ur-kk-be-hy-te-lv-mk-zh_classical-als-is-wuu-my-sco-mn-ceb-ast-cy-kn-br-an-gu-bar-uz-lb-ne-si-war-jv-ga-zh_min_nan-oc-ku-sw-nds-ckb-ia-yi-fy-scn-gan-tt-am\n\n### Train your own XLM model with MLM or MLM+TLM\nNow in what follows, we will explain how you can train an XLM model on your own data.\n\n### 1. 
Preparing the data\n**Monolingual data (MLM)**: Follow the same procedure as in [I.1](https://github.com/facebookresearch/XLM#1-preparing-the-data), and download multiple monolingual corpora, such as the Wikipedias.\n\nNote that we provide a [tokenizer script](https://github.com/facebookresearch/XLM/blob/master/tools/tokenize.sh):\n\n```\nlg=en\ncat my_file.$lg | ./tools/tokenize.sh $lg \u003e my_tokenized_file.$lg \u0026\n```\n\n**Parallel data (TLM)**: We provide download scripts for some language pairs in the *get-data-para.sh* script.\n```\n# Download and tokenize parallel data in 'data/wiki/para/en-zh.{en,zh}.{train,valid,test}'\n./get-data-para.sh en-zh \u0026\n```\n\nFor other language pairs, look at the [OPUS collection](http://opus.nlpl.eu/), and modify the get-data-para.sh script [here](https://github.com/facebookresearch/XLM/blob/master/get-data-para.sh#L179-L180) to add your own language pair.\n\nNow create your training set for the BPE vocabulary, for instance by taking 100M sentences from each monolingual corpus.\n```\n# build the training set for BPE tokenization (50k codes)\nOUTPATH=data/processed/XLM_en_zh/50k\nmkdir -p $OUTPATH\nshuf -r -n 10000000 data/wiki/train.en \u003e\u003e $OUTPATH/bpe.train\nshuf -r -n 10000000 data/wiki/train.zh \u003e\u003e $OUTPATH/bpe.train\n```\nThen learn the 50k BPE codes on the bpe.train file, as in the previous section. Apply BPE tokenization on the monolingual and parallel corpora, and binarize everything using *preprocess.py*:\n\n```\npair=en-zh\n\nfor lg in $(echo $pair | sed -e 's/\\-/ /g'); do\n  for split in train valid test; do\n    $FASTBPE applybpe $OUTPATH/$pair.$lg.$split data/wiki/para/$pair.$lg.$split $OUTPATH/codes\n    python preprocess.py $OUTPATH/vocab $OUTPATH/$pair.$lg.$split\n  done\ndone\n```\n\n### 2. 
Train the XLM model\nTrain your XLM (MLM only) on the preprocessed data:\n\n```\npython train.py\n\n## main parameters\n--exp_name xlm_en_zh                       # experiment name\n--dump_path ./dumped                       # where to store the experiment\n\n## data location / training objective\n--data_path $OUTPATH                       # data location\n--lgs 'en-zh'                              # considered languages\n--clm_steps ''                             # CLM objective (for training GPT-2 models)\n--mlm_steps 'en,zh'                        # MLM objective\n\n## transformer parameters\n--emb_dim 1024                             # embeddings / model dimension (2048 is big, reduce if only 16Gb of GPU memory)\n--n_layers 12                              # number of layers\n--n_heads 16                               # number of heads\n--dropout 0.1                              # dropout\n--attention_dropout 0.1                    # attention dropout\n--gelu_activation true                     # GELU instead of ReLU\n\n## optimization\n--batch_size 32                            # sequences per batch\n--bptt 256                                 # sequences length  (streams of 256 tokens)\n--optimizer adam,lr=0.0001                 # optimizer (training is quite sensitive to this parameter)\n--epoch_size 300000                        # number of sentences per epoch\n--max_epoch 100000                         # max number of epochs (~infinite here)\n--validation_metrics _valid_mlm_ppl        # validation metric (when to save the best model)\n--stopping_criterion _valid_mlm_ppl,25     # stopping criterion (if criterion does not improve 25 times)\n--fp16 true                                # use fp16 training\n\n## There are other parameters that are not specified here (see [here](https://github.com/facebookresearch/XLM/blob/master/train.py#L24-L198)).\n```\n\nHere the validation metrics *_valid_mlm_ppl* is the average of MLM perplexities.\n\n**MLM+TLM model**: If 
you want to **add TLM on top of MLM**, just add the \"en-zh\" language pair to mlm_steps:\n```\n--mlm_steps 'en,zh,en-zh'                  # MLM objective\n```\n\n**Tips**: You can also pretrain your model with MLM-only, and then continue training with MLM+TLM with the *--reload_model* parameter.\n\n\n### 3. Fine-tune XLM models (Applications, see below)\n\nCross-lingual language model (XLM) pretraining provides a strong starting point for cross-lingual understanding (XLU) tasks. In what follows, we present applications to machine translation (unsupervised and supervised) and cross-lingual classification (XNLI).\n\n\n## III. Applications: Supervised / Unsupervised MT\n\nXLMs can be used as a pretraining method for unsupervised or supervised neural machine translation.\n\n### Pretrained XLM(MLM) models\nThe English-French, English-German and English-Romanian models are the ones we used in the paper for MT pretraining. They are trained with monolingual data only, with the MLM objective. If you use these models, you should use the same data preprocessing / BPE codes to preprocess your data. 
See the preprocessing commands in [get-data-nmt.sh](https://github.com/facebookresearch/XLM/blob/master/get-data-nmt.sh).\n\n| Languages        | Pretraining | Model                                                               | BPE codes                                                     | Vocabulary                                                     |\n| ---------------- | ----------- |:-------------------------------------------------------------------:|:-------------------------------------------------------------:| --------------------------------------------------------------:|\n| English-French   |     MLM     | [Model](https://dl.fbaipublicfiles.com/XLM/mlm_enfr_1024.pth)       | [BPE codes](https://dl.fbaipublicfiles.com/XLM/codes_enfr)    | [Vocabulary](https://dl.fbaipublicfiles.com/XLM/vocab_enfr)    |\n| English-German   |     MLM     | [Model](https://dl.fbaipublicfiles.com/XLM/mlm_ende_1024.pth)       | [BPE codes](https://dl.fbaipublicfiles.com/XLM/codes_ende)    | [Vocabulary](https://dl.fbaipublicfiles.com/XLM/vocab_ende)    |\n| English-Romanian |     MLM     | [Model](https://dl.fbaipublicfiles.com/XLM/mlm_enro_1024.pth)       | [BPE codes](https://dl.fbaipublicfiles.com/XLM/codes_enro)    | [Vocabulary](https://dl.fbaipublicfiles.com/XLM/vocab_enro)    |\n\n\n### Download / preprocess data\n\nTo download the data required for the unsupervised MT experiments, simply run:\n\n```\ngit clone https://github.com/facebookresearch/XLM.git\ncd XLM\n```\n\nAnd one of the three commands below:\n\n```\n./get-data-nmt.sh --src en --tgt fr\n./get-data-nmt.sh --src de --tgt en\n./get-data-nmt.sh --src en --tgt ro\n```\n\nfor English-French, German-English, or English-Romanian experiments. 
The script will successively:\n- download Moses scripts, download and compile fastBPE\n- download, extract, tokenize, apply BPE to monolingual and parallel test data\n- binarize all datasets\n\nIf you want to use our pretrained models, you need to have an exactly identical vocabulary. Since small differences can happen during preprocessing, we recommend that you use our BPE codes and vocabulary (although you should get something almost identical if you learn the codes and compute the vocabulary yourself). This will ensure that the vocabulary of your preprocessed data perfectly matches the one of our pretrained models, and that there is not a word / index mismatch. To do so, simply run:\n\n```\nwget https://dl.fbaipublicfiles.com/XLM/codes_enfr\nwget https://dl.fbaipublicfiles.com/XLM/vocab_enfr\n\n./get-data-nmt.sh --src en --tgt fr --reload_codes codes_enfr --reload_vocab vocab_enfr\n```\n\n`get-data-nmt.sh` contains a few parameters defined at the beginning of the file:\n- `N_MONO` number of monolingual sentences for each language (default 5000000)\n- `CODES` number of BPE codes (default 60000)\n- `N_THREADS` number of threads in data preprocessing (default 16)\n\nThe default number of monolingual data is 5M sentences, but using more monolingual data will significantly improve the quality of pretrained models. In practice, the models we release for MT are trained on all NewsCrawl data available, i.e. 
about 260M, 200M and 65M sentences for German, English and French respectively.\n\nThe script should output a data summary that contains the location of all files required to start experiments:\n\n```\n===== Data summary\nMonolingual training data:\n    en: ./data/processed/en-fr/train.en.pth\n    fr: ./data/processed/en-fr/train.fr.pth\nMonolingual validation data:\n    en: ./data/processed/en-fr/valid.en.pth\n    fr: ./data/processed/en-fr/valid.fr.pth\nMonolingual test data:\n    en: ./data/processed/en-fr/test.en.pth\n    fr: ./data/processed/en-fr/test.fr.pth\nParallel validation data:\n    en: ./data/processed/en-fr/valid.en-fr.en.pth\n    fr: ./data/processed/en-fr/valid.en-fr.fr.pth\nParallel test data:\n    en: ./data/processed/en-fr/test.en-fr.en.pth\n    fr: ./data/processed/en-fr/test.en-fr.fr.pth\n```\n\n### Pretrain a language model (with MLM)\n\nThe following script will pretrain a model with the MLM objective for English and French:\n\n```\npython train.py\n\n## main parameters\n--exp_name test_enfr_mlm                # experiment name\n--dump_path ./dumped/                   # where to store the experiment\n\n## data location / training objective\n--data_path ./data/processed/en-fr/     # data location\n--lgs 'en-fr'                           # considered languages\n--clm_steps ''                          # CLM objective\n--mlm_steps 'en,fr'                     # MLM objective\n\n## transformer parameters\n--emb_dim 1024                          # embeddings / model dimension\n--n_layers 6                            # number of layers\n--n_heads 8                             # number of heads\n--dropout 0.1                           # dropout\n--attention_dropout 0.1                 # attention dropout\n--gelu_activation true                  # GELU instead of ReLU\n\n## optimization\n--batch_size 32                         # sequences per batch\n--bptt 256                              # sequences length\n--optimizer adam,lr=0.0001              # 
optimizer\n--epoch_size 200000                     # number of sentences per epoch\n--validation_metrics _valid_mlm_ppl     # validation metric (when to save the best model)\n--stopping_criterion _valid_mlm_ppl,10  # end experiment if stopping criterion does not improve\n```\n\nIf parallel data is available, the TLM objective can be used with `--mlm_steps 'en-fr'`. To train with both the MLM and TLM objective, you can use `--mlm_steps 'en,fr,en-fr'`. We provide models trained with the MLM objective for English-French, English-German and English-Romanian, along with the BPE codes and vocabulary used to preprocess the data.\n\n### Train on unsupervised MT from a pretrained model\n\nYou can now use the pretrained model for Machine Translation. To download a model trained with the command above on the MLM objective, and the corresponding BPE codes, run:\n\n```\nwget -c https://dl.fbaipublicfiles.com/XLM/mlm_enfr_1024.pth\n```\n\nIf you preprocessed your dataset in `./data/processed/en-fr/` with the provided BPE codes `codes_enfr` and vocabulary `vocab_enfr`, you can pretrain your NMT model with `mlm_enfr_1024.pth` and run:\n\n```\npython train.py\n\n## main parameters\n--exp_name unsupMT_enfr                                       # experiment name\n--dump_path ./dumped/                                         # where to store the experiment\n--reload_model 'mlm_enfr_1024.pth,mlm_enfr_1024.pth'          # model to reload for encoder,decoder\n\n## data location / training objective\n--data_path ./data/processed/en-fr/                           # data location\n--lgs 'en-fr'                                                 # considered languages\n--ae_steps 'en,fr'                                            # denoising auto-encoder training steps\n--bt_steps 'en-fr-en,fr-en-fr'                                # back-translation steps\n--word_shuffle 3                                              # noise for auto-encoding loss\n--word_dropout 0.1                              
              # noise for auto-encoding loss\n--word_blank 0.1                                              # noise for auto-encoding loss\n--lambda_ae '0:1,100000:0.1,300000:0'                         # scheduling on the auto-encoding coefficient\n\n## transformer parameters\n--encoder_only false                                          # use a decoder for MT\n--emb_dim 1024                                                # embeddings / model dimension\n--n_layers 6                                                  # number of layers\n--n_heads 8                                                   # number of heads\n--dropout 0.1                                                 # dropout\n--attention_dropout 0.1                                       # attention dropout\n--gelu_activation true                                        # GELU instead of ReLU\n\n## optimization\n--tokens_per_batch 2000                                       # use batches with a fixed number of words\n--batch_size 32                                               # batch size (for back-translation)\n--bptt 256                                                    # sequence length\n--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001  # optimizer\n--epoch_size 200000                                           # number of sentences per epoch\n--eval_bleu true                                              # also evaluate the BLEU score\n--stopping_criterion 'valid_en-fr_mt_bleu,10'                 # end experiment if stopping criterion does not improve\n--validation_metrics 'valid_en-fr_mt_bleu'                    # validation metric (when to save the best model)\n```\n\nThe parameters of your Transformer model have to be identical to the ones used for pretraining (or you will have to slightly modify the code to only reload existing parameters). 
After 8 epochs on 8 GPUs, the above command should give you something like this:\n\n```\nepoch               -\u003e     7\nvalid_fr-en_mt_bleu -\u003e 28.36\nvalid_en-fr_mt_bleu -\u003e 30.50\ntest_fr-en_mt_bleu  -\u003e 34.02\ntest_en-fr_mt_bleu  -\u003e 36.62\n```\n\n## IV. Applications: Cross-lingual text classification (XNLI)\nXLMs can be used to build cross-lingual classifiers. After fine-tuning an XLM model on an English training corpus (e.g. for sentiment analysis or natural language inference), the model is still able to make accurate predictions at test time in other languages, for which there is very little or no training data. This approach is usually referred to as \"zero-shot cross-lingual classification\".\n\n### Get the right tokenizers\n\nBefore running the scripts below, make sure you download the tokenizers from the [tools/](https://github.com/facebookresearch/XLM/tree/master/tools) directory.\n\n### Download / preprocess monolingual data\n\nFollow the same approach as in section 1 for the 15 languages:\n```\nfor lg in ar bg de el en es fr hi ru sw th tr ur vi zh; do\n  ./get-data-wiki.sh $lg\ndone\n```\n\nDownloading the Wikipedia dumps may take several hours. The *get-data-wiki.sh* script will automatically download the Wikipedia dumps, extract raw sentences, and clean and tokenize them. Note that in our experiments we also concatenated the [Toronto Book Corpus](http://yknzhu.wixsite.com/mbweb) to the English Wikipedia, but this dataset is no longer hosted.\n\nFor Chinese and Thai you will need a special tokenizer that you can install using the commands below. 
For all other languages, the data will be tokenized with Moses scripts.\n\n```\n# Thai - https://github.com/PyThaiNLP/pythainlp\npip install pythainlp\n\n# Chinese\ncd tools/\nwget https://nlp.stanford.edu/software/stanford-segmenter-2018-10-16.zip\nunzip stanford-segmenter-2018-10-16.zip\n```\n\n### Download parallel data\n\nThis script will download and tokenize the parallel data used for the TLM objective:\n\n```\nlg_pairs=\"ar-en bg-en de-en el-en en-es en-fr en-hi en-ru en-sw en-th en-tr en-ur en-vi en-zh\"\nfor lg_pair in $lg_pairs; do\n  ./get-data-para.sh $lg_pair\ndone\n```\n\n### Apply BPE and binarize\nApply BPE and binarize data similar to section 2.\n\n### Pretrain a language model (with MLM and TLM)\n\nThe following script will pretrain a model with the MLM and TLM objectives for the 15 XNLI languages:\n\n```\npython train.py\n\n## main parameters\n--exp_name train_xnli_mlm_tlm            # experiment name\n--dump_path ./dumped/                    # where to store the experiment\n\n## data location / training objective\n--data_path ./data/processed/XLM15/                   # data location\n--lgs 'ar-bg-de-el-en-es-fr-hi-ru-sw-th-tr-ur-vi-zh'  # considered languages\n--clm_steps ''                                        # CLM objective\n--mlm_steps 'ar,bg,de,el,en,es,fr,hi,ru,sw,th,tr,ur,vi,zh,en-ar,en-bg,en-de,en-el,en-es,en-fr,en-hi,en-ru,en-sw,en-th,en-tr,en-ur,en-vi,en-zh,ar-en,bg-en,de-en,el-en,es-en,fr-en,hi-en,ru-en,sw-en,th-en,tr-en,ur-en,vi-en,zh-en'  # MLM objective\n\n## transformer parameters\n--emb_dim 1024                           # embeddings / model dimension\n--n_layers 12                            # number of layers\n--n_heads 8                              # number of heads\n--dropout 0.1                            # dropout\n--attention_dropout 0.1                  # attention dropout\n--gelu_activation true                   # GELU instead of ReLU\n\n## optimization\n--batch_size 32                          # sequences per 
batch\n--bptt 256                               # sequence length\n--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001,weight_decay=0  # optimizer\n--epoch_size 200000                      # number of sentences per epoch\n--validation_metrics _valid_mlm_ppl      # validation metric (when to save the best model)\n--stopping_criterion _valid_mlm_ppl,10   # end experiment if stopping criterion does not improve\n```\n\n### Download XNLI data\n\nThis script will download and tokenize the XNLI corpus:\n```\n./get-data-xnli.sh\n```\n\n### Preprocess data\nThis script will apply BPE using the XNLI15 BPE codes and binarize the data.\n```\n./prepare-xnli.sh\n```\n\n### Fine-tune your XLM model on cross-lingual classification (XNLI)\n\nYou can now use the pretrained model for cross-lingual classification. To download a model trained with the command above on the MLM-TLM objective, run:\n\n```\nwget -c https://dl.fbaipublicfiles.com/XLM/mlm_tlm_xnli15_1024.pth\n```\n\nYou can now fine-tune the pretrained model on XNLI, or on one of the English GLUE tasks:\n\n```\npython glue-xnli.py\n--exp_name test_xnli_mlm_tlm             # experiment name\n--dump_path ./dumped/                    # where to store the experiment\n--model_path mlm_tlm_xnli15_1024.pth     # model location\n--data_path ./data/processed/XLM15       # data location\n--transfer_tasks XNLI,SST-2              # transfer tasks (XNLI or GLUE tasks)\n--optimizer_e adam,lr=0.000025           # optimizer of the embedder, i.e. the pretrained model (lr \\in [0.000005, 0.000025, 0.000125])\n--optimizer_p adam,lr=0.000025           # optimizer of projection (lr \\in [0.000005, 0.000025, 0.000125])\n--finetune_layers \"0:_1\"                 # fine-tune all layers\n--batch_size 8                           # batch size (\\in [4, 8])\n--n_epochs 250                           # number of epochs\n--epoch_size 20000                       # number of sentences per epoch\n--max_len 256                            # max number of words in 
sentences\n--max_vocab 95000                        # max number of words in vocab\n```\n\n## V. Product-Key Memory Layers (PKM)\n\nXLM also implements the Product-Key Memory layer (PKM) described in [[4]](https://arxiv.org/abs/1907.05242). To add a memory in (for instance) layers 4 and 7 of an encoder, you can simply provide `--use_memory true --mem_enc_positions 4,7` as arguments to `train.py` (and similarly `--mem_dec_positions` for the decoder). All memory layer parameters can be found [here](https://github.com/facebookresearch/XLM/blob/master/xlm/model/memory/memory.py#L225).\nA minimal implementation of the PKM layer, which uses the same configuration as in the paper, can be found in this **[ipython notebook](https://github.com/facebookresearch/XLM/blob/master/PKM-layer.ipynb)**.\n\n\n## Frequently Asked Questions\n\n### How can I run experiments on multiple GPUs?\n\nXLM supports both multi-GPU and multi-node training, and was tested with up to 128 GPUs. To run an experiment with multiple GPUs on a single machine, simply replace `python train.py` in the commands above with:\n\n```\nexport NGPU=8; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py\n```\n\nMulti-node training is automatically handled by SLURM.\n\n## References\n\nPlease cite [[1]](https://arxiv.org/abs/1901.07291) if you find the resources in this repository useful.\n\n### Cross-lingual Language Model Pretraining\n\n[1] G. Lample *, A. Conneau * [*Cross-lingual Language Model Pretraining*](https://arxiv.org/abs/1901.07291)\n\n\\* Equal contribution. Order has been determined with a coin flip.\n\n```\n@article{lample2019cross,\n  title={Cross-lingual Language Model Pretraining},\n  author={Lample, Guillaume and Conneau, Alexis},\n  journal={Advances in Neural Information Processing Systems (NeurIPS)},\n  year={2019}\n}\n```\n\n### XNLI: Evaluating Cross-lingual Sentence Representations\n\n[2] A. Conneau, G. Lample, R. Rinott, A. Williams, S. R. Bowman, H. 
Schwenk, V. Stoyanov [*XNLI: Evaluating Cross-lingual Sentence Representations*](https://arxiv.org/abs/1809.05053)\n\n```\n@inproceedings{conneau2018xnli,\n  title={XNLI: Evaluating Cross-lingual Sentence Representations},\n  author={Conneau, Alexis and Lample, Guillaume and Rinott, Ruty and Williams, Adina and Bowman, Samuel R and Schwenk, Holger and Stoyanov, Veselin},\n  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},\n  year={2018}\n}\n```\n\n### Phrase-Based \\\u0026 Neural Unsupervised Machine Translation\n\n[3] G. Lample, M. Ott, A. Conneau, L. Denoyer, MA. Ranzato [*Phrase-Based \u0026 Neural Unsupervised Machine Translation*](https://arxiv.org/abs/1804.07755)\n\n```\n@inproceedings{lample2018phrase,\n  title={Phrase-Based \\\u0026 Neural Unsupervised Machine Translation},\n  author={Lample, Guillaume and Ott, Myle and Conneau, Alexis and Denoyer, Ludovic and Ranzato, Marc'Aurelio},\n  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},\n  year={2018}\n}\n```\n\n### Large Memory Layers with Product Keys\n\n[4] G. Lample, A. Sablayrolles, MA. Ranzato, L. Denoyer, H. Jégou [*Large Memory Layers with Product Keys*](https://arxiv.org/abs/1907.05242)\n\n```\n@article{lample2019large,\n  title={Large Memory Layers with Product Keys},\n  author={Lample, Guillaume and Sablayrolles, Alexandre and Ranzato, Marc'Aurelio and Denoyer, Ludovic and J{\\'e}gou, Herv{\\'e}},\n  journal={Advances in Neural Information Processing Systems (NeurIPS)},\n  year={2019}\n}\n```\n\n### Unsupervised Cross-lingual Representation Learning at Scale\n\n[5] A. Conneau *, K. Khandelwal *, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzman, E. Grave, M. Ott, L. Zettlemoyer, V. 
Stoyanov [*Unsupervised Cross-lingual Representation Learning at Scale*](https://arxiv.org/abs/1911.02116)\n\n\\* Equal contribution\n\n```\n@article{conneau2019unsupervised,\n  title={Unsupervised Cross-lingual Representation Learning at Scale},\n  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},\n  journal={arXiv preprint arXiv:1911.02116},\n  year={2019}\n}\n```\n\n## License\n\nSee the [LICENSE](LICENSE) file for more details.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2FXLM","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffacebookresearch%2FXLM","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2FXLM/lists"}