{"id":19542056,"url":"https://github.com/bigscience-workshop/multilingual-modeling","last_synced_at":"2025-04-26T17:31:04.310Z","repository":{"id":36965376,"uuid":"402805733","full_name":"bigscience-workshop/multilingual-modeling","owner":"bigscience-workshop","description":"BLOOM+1: Adapting BLOOM model to support a new unseen language","archived":false,"fork":false,"pushed_at":"2024-03-02T07:54:24.000Z","size":337,"stargazers_count":71,"open_issues_count":19,"forks_count":15,"subscribers_count":16,"default_branch":"master","last_synced_at":"2025-04-04T16:41:45.636Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2212.09535","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bigscience-workshop.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-09-03T14:54:41.000Z","updated_at":"2025-03-02T16:41:49.000Z","dependencies_parsed_at":"2024-03-02T08:46:28.409Z","dependency_job_id":null,"html_url":"https://github.com/bigscience-workshop/multilingual-modeling","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigscience-workshop%2Fmultilingual-modeling","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigscience-workshop%2Fmultilingual-modeling/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigscience-workshop%2Fmultilingual-modeling/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/repositories/bigscience-workshop%2Fmultilingual-modeling/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bigscience-workshop","download_url":"https://codeload.github.com/bigscience-workshop/multilingual-modeling/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251025670,"owners_count":21524843,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-11T03:13:00.736Z","updated_at":"2025-04-26T17:31:03.998Z","avatar_url":"https://github.com/bigscience-workshop.png","language":"Python","readme":"# README\n\n\n### Notes\nThis repository is no longer actively maintained. It was created while the BLOOM+1 paper was being written, when we had to engineer the adapter modules ourselves because of the then-new BLOOM architecture.\n\nNow, however, adapters for BLOOM models are readily available (see [peft](https://github.com/huggingface/peft)), and language adaptation of these models (i.e., training of LLMs on monolingual corpora of a particular language) can be done by following official documentation such as the [peft blog post](https://huggingface.co/blog/peft), using the same pretraining objective, next-token prediction. \n\n---\n\nThis repository contains code for performing language adaptation of the multilingual pretrained large language models BLOOM-{560m,1b1,1b7,3b,7b1} to new unseen languages. 
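\n\nAs a side note on the Notes section above: the adapter-based recipe can nowadays be reproduced with [peft](https://github.com/huggingface/peft) directly. A minimal sketch (the model choice and LoRA hyperparameters below are illustrative, not taken from this repo):\n\n```python\nfrom transformers import AutoModelForCausalLM\nfrom peft import LoraConfig, TaskType, get_peft_model\n\n# Wrap BLOOM with LoRA adapters, then train on a monolingual corpus of the\n# target language with the usual next-token-prediction objective.\nmodel = AutoModelForCausalLM.from_pretrained('bigscience/bloom-560m')\nlora_cfg = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16, lora_dropout=0.05, target_modules=['query_key_value'])\nmodel = get_peft_model(model, lora_cfg)\nmodel.print_trainable_parameters()  # only the LoRA weights are trainable\n```\n\n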
Please refer to our ACL 2023 paper [BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting](https://aclanthology.org/2023.acl-long.653/).\n\nOur implementations support the following features:\n- finetuning new tokenizers and embedding layers to support the new scripts of unseen languages.\n- different embedding strategies: replacing the embedding layer entirely and training it from scratch, reinitializing the embedding layer but copying pretrained embeddings for the seen vocabulary, or extending the embedding layer to support new tokens. \n- more than 15 language adaptation strategies for pretrained BLOOM models, including continued pretraining and parameter-efficient finetuning such as BitFit ([Zaken et al., 2021](https://arxiv.org/abs/2106.10199)), (IA)^3 ([Liu et al., 2022](https://arxiv.org/abs/2205.05638)), LoRA ([Hu et al., 2021](https://arxiv.org/abs/2106.09685)), MAD-X ([Pfeiffer et al., 2020](https://aclanthology.org/2020.emnlp-main.617/)), composable sparse finetuning ([Ansell et al., 2022](https://github.com/cambridgeltl/composable-sft)), etc.\n- different evaluation settings:\n    - supervised fine-tuning or cross-lingual transfer: task-finetuning with (English) task adapters on the following tasks: WikiANN (NER tagging), XLSum (abstractive summarization) and XNLI (natural language inference). This is an artefact used for preliminary experiments of our BLOOM+1 work.\n    - zero-shot prompting on adapted language models, which is carried out in our [BLOOM+1](https://arxiv.org/abs/2212.09535) paper. This is done with a forked and modified version of EleutherAI's lm-evaluation-harness library. See branch [`bigscience-lm-adapt`](https://github.com/yongzx/lm-evaluation-harness/tree/bigscience-lm-adapt).\n\n\n## Installation\n1. Install the packages from [composable-sft](https://github.com/cambridgeltl/composable-sft). This is used for composable-SFT finetuning.\n2. Install the packages from [rational_activations](https://github.com/ml-research/rational_activations). 
You would need to follow the [Other CUDA/PyTorch] section for installation. This is used for adaptable-adapters. \n3. Install the packages from this repo using `pip install -r requirements.txt`. \n\nIf you encounter an error with `import transformers`, uninstall transformers using the command `pip uninstall transformers` and rerun step 3 to reinstall the version of `transformers` supported by the `adapter-transformers` library.\n\n## Experimental Setup (Language Adaptation)\n\n### Tokenizer and Tokenization of Dataset\nRun `tokenized4clm_sampled.py` to train the tokenizer on a subset of the OSCAR dataset.\n- `lang`: language name (e.g., \"de\", \"th\")\n- `model`: original tokenizer (e.g., \"bigscience/bloom-1b3\")\n- `tokenizer_dir`: path directory to save the tokenizer. The tokenizer will be saved as `tok_${model}_${lang}_oscar_${sample_size}samples_${vocab_size}vocab_{replace/extend}`\n- `cache_dir` (default is \"~/.cache/huggingface/transformers\"): cache directory for downloading the OSCAR dataset and GPT2 tokenizer.\n- `vocab_size`: vocab size of the tokenizer\n- `sample_size`: the number of samples (randomly selected) used to train the tokenizer\n- `tok_strategy`: extend, replace or overlap-replace\n\n```\ncache_dir=...\noutput_dir=...\nlang=...  # language\nsample_size=...  # training sample size\nvocab_size=...  # vocab size of tokenizer\ntok_strategy=...  # extend, replace, overlap-replace\nbigs_model=\"bigscience/bloom-1b3\"\n\ntokenizer_dir=\"${output_dir}/tok_$(basename $bigs_model)_${lang}_oscar_${sample_size}samples_${vocab_size}vocab_${tok_strategy}\"\n\npython ./scripts/lang_adapt/tokenized4clm_sampled.py \\\n--lang $lang \\\n--model $bigs_model \\\n--tokenizer_dir $tokenizer_dir \\\n--hf_cache_dir $cache_dir \\\n--vocab_size $vocab_size \\\n--sample_size $sample_size \\\n--tok_strategy $tok_strategy\n```\n---\n\n### Language Adaptation\nRun `madx_run_clm.py` to finetune the language model on a new language. 
\n- `LANG`: language name (e.g., \"de\", \"th\") on OSCAR\n- `DATA_SAMPLES`: training sample size\n- `VOCAB_SIZE`: vocab size of the tokenizer\n- `BIGS_MODEL`: bigscience model\n- `ADPT_STRATEGY`: language adaptation strategy \n    - `\"emb\"`: train only the embedding layer\n    - `\"continual-pretrain\"`: continued pretraining of the entire BLOOM model\n    - `\"emb-then-adpt\"`: train the embedding first, then the Pfeiffer adapter (sequential training)\n    - `\"pfeiffer\"`, `\"pfeiffer+inv\"`: Pfeiffer adapters ([Houlsby et al., 2019](https://arxiv.org/abs/1902.00751)) in transformer blocks, without or with invertible adapters in the embedding layer. The latter is also known as MAD-X ([Pfeiffer et al., 2020](https://aclanthology.org/2020.emnlp-main.617/)). \n    - `\"lora\"`: LoRA adapters in transformer blocks ([Hu et al., 2021](https://arxiv.org/abs/2106.09685))\n    - `\"aa\"`: adaptable adapters ([Moosavi et al., 2022](https://arxiv.org/abs/2205.01549))\n    - `\"ia3\"`, `\"ia3+inv\"`: (IA)^3 adapters in transformer blocks, without or with invertible adapters in the embedding layer ([Liu et al., 2022](https://arxiv.org/abs/2205.05638))\n    - `\"prefix_tuning\"`, `\"prefix_tuning_flat\"`: Prefix tuning in the input space, either parameterizing the prefix tokens with MLP layers (without `flat`) or optimizing them directly (with `flat`). ([Li \u0026 Liang, 2021](https://arxiv.org/abs/2101.00190))\n    - `\"prompt-tuning\"`: Prompt tuning with soft prompt tokens prepended to the input ([Lester et al., 2021](https://arxiv.org/abs/2104.08691))\n    - `\"sft\"`: Composable sparse finetuning ([Ansell et al., 2022](https://aclanthology.org/2022.acl-long.125/))\n    - `\"bitfit\"`, `\"bitfit+inv\"`: Finetuning bias terms, without or with invertible adapters in the embedding layer ([Zaken et al., 2021](https://arxiv.org/abs/2106.10199))\n    - `\"fish\"`: Finetuning FISH masks. 
([Sung et al., 2021](https://arxiv.org/abs/2111.09839))\n    - `\"compacter\"`, `\"compacterpp\"`: Compacter or compacter++ adapters in transformer blocks. ([Mahabadi et al., 2021](https://arxiv.org/abs/2106.04647))\n- `EMBD_SRATEGY`: embedding strategy. Either `\"replace\"` (replace the embedding layer entirely), `\"overlap-replace\"` (replace but initialize seen vocab with pretrained embedding), or `\"extend\"` (freeze seen vocab embeddings and add trainable embeddings for unseen vocab)\n- `TOK_STRATEGY`: tokenization strategy (either `\"replace\"` (for embedding strategy of \"replace\" and \"overlap-replace\") or `\"extend\"`)\n- `tokenizer_dir`: saved tokenizer directory (used in the tokenization script above)\n- `cache_dir`: (as above)\n- `output_dir`: directory to save adapted model\n- `logging_dir`: directory to log loss curves to tensorboard\n- `MAX_STEPS`: training steps\n- `EVAL_STEPS`: number of training steps between two evaluations\n- `SAVE_STEPS`: number of training steps between saving the checkpoints.\n```\nLANG=... # language\nDATA_SAMPLES=... # training sample size\nVOCAB_SIZE=... # vocab size of newly trained tokenizer\nBIGS_MODEL=\"bigscience/bloom-1b3\"\nADPT_STRATEGY=\"emb\"  # language adaptation strategy (train only embedding for now)\nEMBD_SRATEGY=...  # either \"replace\", \"overlap-replace\", or \"extend\"\nTOK_STRATEGY=... # either \"replace\" (for embedding strategy of \"replace\" and \"overlap-replace\") or \"extend\"\n\ntokenizer_dir=... # as above\ntokenizer_dir=\"${tokenizer_dir}/tok_${BIGS_MODEL##*/}_${LANG}_oscar_${DATA_SAMPLES}samples_${VOCAB_SIZE}vocab_${TOK_STRATEGY}\"\ncache_dir=... # as above\n\noutput_dir=... # directory to save adapted model\noutput_dir=\"${output_dir}/${BIGS_MODEL##*/}_${LANG}_${ADPT_STRATEGY}_${DATA_SAMPLES}samples_${VOCAB_SIZE}vocab_${EMBD_SRATEGY}\"\nlogging_dir=... 
# directory to log loss curves to tensorboard\nlogging_dir=\"${logging_dir}/${BIGS_MODEL##*/}_${LANG}_${ADPT_STRATEGY}_${DATA_SAMPLES}samples_${VOCAB_SIZE}vocab_${EMBD_SRATEGY}\"\n\nmkdir -p $output_dir\nmkdir -p $logging_dir\n\nMAX_STEPS=50000\nEVAL_STEPS=5000\nSAVE_STEPS=5000\n\npython ./scripts/lang_adapt/madx_run_clm.py \\\n    --seed 0 \\\n    --fp16 \\\n    --model_name_or_path $BIGS_MODEL \\\n    --tokenizer_name $tokenizer_dir \\\n    --dataset_name oscar \\\n    --cache_dir $cache_dir \\\n    --dataset_config_name \"unshuffled_deduplicated_${LANG}\" \\\n    --logging_dir $logging_dir \\\n    --report_to \"tensorboard\" \\\n    --learning_rate 0.001 \\\n    --do_train \\\n    --do_eval \\\n    --output_dir $output_dir \\\n    --preprocessing_num_workers 8 \\\n    --overwrite_output_dir \\\n    --per_device_train_batch_size 2 \\\n    --gradient_accumulation_steps 4 \\\n    --per_device_eval_batch_size 2 \\\n    --eval_accumulation_steps 4 \\\n    --eval_steps $EVAL_STEPS \\\n    --evaluation_strategy \"steps\" \\\n    --max_eval_samples 5000 \\\n    --save_steps $SAVE_STEPS \\\n    --save_strategy \"steps\" \\\n    --max_train_samples $DATA_SAMPLES \\\n    --max_steps $MAX_STEPS \\\n    --logging_steps 1000 \\\n    --lang_adapt_strategies $ADPT_STRATEGY \\\n    --embedding_strategies $EMBD_SRATEGY \\\n    --load_best_model_at_end \\\n    --gradient_checkpointing\n```\n\n**BLOOM+1 Reproduction**: See `./scripts/lang_adapt/example_scripts/run_clm_ru_madx_560m.sh` to reproduce the language adaptation of the BLOOM-560m model to Russian in our [BLOOM+1 paper](https://arxiv.org/abs/2212.09535).\n\n### Language Adaptation with DeepSpeed\n1. Replace `python` in `python ./scripts/lang_adapt/madx_run_clm.py` with `deepspeed --num_gpus=8 --master_port 60000`.\n2. 
Pass the DeepSpeed config file argument `--deepspeed \"/home/zhengxinyong/multilingual-modeling/scripts/lang_adapt/ds_config_zero2.json\"`.\n\nSee the example file at `./scripts/lang_adapt/example_scripts/run_clm_ru_madx_7b1_deepspeed.sh`, which adapts the BLOOM-7b1 model on 8 A100 GPUs on Google Cloud. \n\n## Experimental Setup (Evaluation)\n\n### Zero-Shot Prompting\n\nPrompt the adapted language model in a zero-shot fashion without any finetuning. You'll need to clone the `bigscience-lm-adapt` branch (`git clone -b bigscience-lm-adapt https://github.com/yongzx/lm-evaluation-harness`) to be able to run the experiments. \n\nThe following shows the evaluation commands for XNLI zero-shot prompting. You can find them in `lm-evaluation-harness/examples/`. \n\nFor BLOOM+1, the tasks used are: \n- `xnli` ([XNLI: Evaluating Cross-lingual Sentence Representations](https://arxiv.org/abs/1809.05053))\n- `amnli` ([AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages](https://arxiv.org/abs/2104.08726))\n- `pawsx` ([PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification](https://arxiv.org/abs/1908.11828))\n- `xcopa` ([XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning](https://arxiv.org/abs/2005.00333))\n- `xstory` (Multilingual [Story Cloze Test and ROCStories Corpora](https://cs.rochester.edu/nlp/rocstories/))\n- `xwino` ([Wino-X: Multilingual Winograd Schemas for Commonsense Reasoning and Coreference Resolution](https://aclanthology.org/2021.emnlp-main.670/))\n\n\n**Baseline or Model-Based (BitFit, FISH Mask, etc.)**\n```\npython3 lm-evaluation-harness/main.py \\\n--model bigscience \\\n--model_args tokenizer=\"bigscience/bloom-560m\",pretrained=\"ZYONG2/saved_models/bloom-560m_de_bitfit_100000samples_-1vocab_original-frozen\" \\\n--tasks xnli_de\n```\n\n**Using Adapters (MAD-X, Pfeiffer, IA3, LoRA, etc.)**\n```\npython3 lm-evaluation-harness/main.py \\\n--model bigscience \\\n--model_args 
tokenizer=\"bigscience/bloom-560m\",pretrained=\"bigscience/bloom-560m\",adapter_ckpt_folder=\"ZYONG2/saved_models/bloom-560m_de_ia3_100000samples_-1vocab_original-frozen/oscar_ia3_de\" \\\n--tasks xnli_de\n```\n\n### Supervised Finetuning or Cross-Lingual Transfer (only used for preliminary experiments before BLOOM was released)\n```\nOUTPUT_DIR=... # where checkpoints will be saved\nLANG=\"de\"\nCACHE_DIR=... # cache dir for saving/loading HF models and XNLI datasets.\nLR=1e-5\nMODEL_NAME=\"ZYONG2/bigscience/tr5b-1B3-multilingual-alpha-checkpoints\" # previous version of BLOOM pre-release\nTOKENIZER_NAME=\"ZYONG2/processed/011/oscar-de-tokenizer\"\n\n# language adapters checkpoint folder\nMADX_LANG_ADAPTER_NAME=\".../oscar_de\"\n\n# we finetune task adapters for XNLI\nFT_STRATEGIES=\"task_adapters\"\n\nmkdir -p $OUTPUT_DIR\npython adapters_xnli_de.py \\\n$OUTPUT_DIR \\\n--lang $LANG \\\n--cache_dir $CACHE_DIR \\\n--num_train_epochs 2 \\\n--learning_rate $LR \\\n--per_device_train_batch_size 8 \\\n--gradient_accumulation_steps 4 \\\n--pretrained_model $MODEL_NAME \\\n--tokenizer $TOKENIZER_NAME \\\n--do_train \\\n--do_eval_after_train \\\n--madx_lang_adapter $MADX_LANG_ADAPTER_NAME \\\n--finetune_strategies $FT_STRATEGIES \\\n--zero_shot\n```\n\nRemove `--zero_shot` for the supervised finetuning setting. \n\nSee example scripts in `./scripts/eval/task_ftscripts_xnli/`. 
`train_xnli_zero_shot.sh` is the batch script for XNLI finetuning, and `run_eval_xnli_zero_shot.sh` is for evaluating trained XNLI task adapters.\n\n## Citation\n```\n@inproceedings{yong-etal-2023-bloom,\n    title = \"{BLOOM}+1: Adding Language Support to {BLOOM} for Zero-Shot Prompting\",\n    author = \"Yong, Zheng Xin  and Schoelkopf, Hailey  and Muennighoff, Niklas  and Aji, Alham Fikri  and Adelani, David Ifeoluwa  and Almubarak, Khalid  and Bari, M Saiful  and Sutawika, Lintang  and Kasai, Jungo  and Baruwa, Ahmed  and Winata, Genta  and Biderman, Stella  and Raff, Edward  and Radev, Dragomir  and Nikoulina, Vassilina\",\n    editor = \"Rogers, Anna  and Boyd-Graber, Jordan  and Okazaki, Naoaki\",\n    booktitle = \"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)\",\n    month = jul,\n    year = \"2023\",\n    address = \"Toronto, Canada\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2023.acl-long.653\",\n    doi = \"10.18653/v1/2023.acl-long.653\",\n    pages = \"11682--11703\",\n}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigscience-workshop%2Fmultilingual-modeling","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbigscience-workshop%2Fmultilingual-modeling","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigscience-workshop%2Fmultilingual-modeling/lists"}