{"id":19542036,"url":"https://github.com/bigscience-workshop/bloom-dechonk","last_synced_at":"2025-04-26T17:31:04.699Z","repository":{"id":36953748,"uuid":"499670005","full_name":"bigscience-workshop/bloom-dechonk","owner":"bigscience-workshop","description":"A repo for running model shrinking experiments","archived":false,"fork":false,"pushed_at":"2022-06-21T10:46:35.000Z","size":71,"stargazers_count":10,"open_issues_count":0,"forks_count":4,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-04-04T16:41:49.389Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bigscience-workshop.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-06-03T22:42:16.000Z","updated_at":"2024-07-22T05:45:08.000Z","dependencies_parsed_at":"2022-08-02T22:00:45.931Z","dependency_job_id":null,"html_url":"https://github.com/bigscience-workshop/bloom-dechonk","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigscience-workshop%2Fbloom-dechonk","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigscience-workshop%2Fbloom-dechonk/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigscience-workshop%2Fbloom-dechonk/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigscience-workshop%2Fbloom-dechonk/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bigscience-workshop","download_url":"https://codeload.github.com/bigscie
nce-workshop/bloom-dechonk/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251025668,"owners_count":21524842,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-11T03:12:51.937Z","updated_at":"2025-04-26T17:31:04.436Z","avatar_url":"https://github.com/bigscience-workshop.png","language":"Python","readme":"# bloom-dechonk\nA repo for running model shrinking experiments.\n\n\n### References:\n* A PR that adds the Bloom model to HF Transformers: https://github.com/huggingface/transformers/pull/17474\n* Base model checkpoint: [bloom-6b3](https://huggingface.co/bigscience/bloom-6b3/tree/e1f323d102aee6128c6e5045b99bb8e5015f828f)\n* Training logs \u0026 config for the base model: [tr11f-6B3-logs](https://huggingface.co/bigscience/tr11f-6B3-logs/tensorboard)\n* The training code is based on [run_clm.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py)\nfrom transformers.\n* [__[public tensorboard with logs]__](https://huggingface.co/bigscience/dechonk-logs-1/tensorboard) - updated every 8 hours\n* [just in case] Cluster-specific script for updating tensorboard every 8 hours - [__[here]__](https://gist.github.com/justheuristic/ff549f7f6e0006469aa31bdcdcbb8855)\n* [Todo:] a list of model downsizing scripts - after merging https://github.com/bigscience-workshop/bloom-dechonk/pull/1\n* [Todo:] link to relevant discussion threads\n\n### Known issues:\n* warmup steps and total steps in the training script (below) were chosen by a random guess; they may be 
suboptimal,\n* the training / validation splits are *not* the same as in the main bloom training,\n* batch skipping is not properly validated; if you restart training, you may (or may not) train on some batches twice,\n* it would be better to turn env.sh into a Dockerfile, using Ubuntu as the parent layer\n\n\n### Setup\n\nThe code requires a recent version of `datasets` and a development version of Transformers that implements the Bloom model:\n```\npip install https://github.com/younesbelkada/transformers/archive/ba1d9fc05fda160bda968cc77c4c5dbb21049aa9.zip\npip install datasets==2.2.2 accelerate==0.9.0\nDS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 pip install deepspeed==0.6.5 \\\n  --global-option=\"build_ext\" --global-option=\"-j8\" --no-cache -v --disable-pip-version-check\n```\n\nThe full installation script can be found in [env.sh](./env.sh). It assumes a clean Ubuntu/Debian installation.\n__Please do not run this script before you look inside.__\n\n\n\n### Run experiment\n\n\nFirst, compress the model using an arbitrary technique:\n```python\nimport transformers\nmodel = transformers.BloomForCausalLM.from_pretrained(\"bigscience/bloom-6b3\", use_auth_token=True)\ntokenizer = transformers.AutoTokenizer.from_pretrained(\"bigscience/bloom-6b3\", use_auth_token=True)\nmodel = apply_your_model_compression_ideas(model, tokenizer)\nmodel.save_pretrained(\"./some/folder\")\ntokenizer.save_pretrained(\"./some/folder\")\n```\n\nThen, run the training script with the following command:\n```bash\nexport RUN_NAME=TODO_EXP_NAME_HERE\nexport INPUT_PATH=. 
SNAPSHOT_PATH=./snapshots LOGS_PATH=./logs OMP_NUM_THREADS=32\nexport DATASET_NAME_OR_PATH=TODO DATASET_CONFIG_NAME=TODO INITIAL_MODEL_PATH=./some/folder\n\ndeepspeed --num_gpus 8 ./run_clm.py --do_train --do_eval \\\n    --model_name $INITIAL_MODEL_PATH --tokenizer_name $INITIAL_MODEL_PATH \\\n    --dataset_name $DATASET_NAME_OR_PATH --dataset_config_name $DATASET_CONFIG_NAME --run_name $RUN_NAME \\\n    --block_size 2048 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 16 \\\n    --learning_rate 0.00008 --max_grad_norm 1.0 --lr_scheduler_type cosine --max_steps 31250 --warmup_steps 1000 \\\n    --adam_epsilon 1e-8 --weight_decay 0.1 --adam_beta1 0.9 --adam_beta2 0.95 --fp16=True --seed 42 \\\n    --cache_dir $INPUT_PATH/data/cache --output_dir $SNAPSHOT_PATH --overwrite_output_dir=True \\\n    --logging_dir $LOGS_PATH --report_to tensorboard --logging_first_step --logging_steps 100 \\\n    --evaluation_strategy steps --eval_steps 100 --prediction_loss_only --eval_subset_size 512 \\\n    --save_steps 500 --save_total_limit 2 --dataloader_num_workers 8 --deepspeed ds_config.json\n\n```\n\n__Note:__ depending on your training hardware, you may need to modify `ds_config.json` to enable ZeRO-3 or offloading.\nThe default settings roughly correspond to ZeRO-2.\n\nThe default training hyperparameters were adapted from https://huggingface.co/bigscience/tr11f-6B3-logs/tensorboard?scroll=1#text\nexcept the learning rate and warmup steps, which were chosen based on the model's learning rate at the initial checkpoint.\nThis code assumes 8 GPUs. 
For a different setup, change `gradient_accumulation_steps` or `per_device_train_batch_size`\nto keep the global batch size at 512 sequences, i.e. 2^20 (~1M) tokens.\n\n\n### Model shrinking code\n\nThe code requires a recent version of `datasets` and a development version of Transformers that implements the Bloom model:\n```\npip install https://github.com/younesbelkada/transformers/archive/ba1d9fc05fda160bda968cc77c4c5dbb21049aa9.zip\n```\nOnce you have these dependencies, you should be able to shrink any Bloom model by passing the following arguments to the `downsample_model.py` script:\n| Parameter                 |Description   |\n| :------------------------ |:-------------|\n| ```--model_name``` | Name of the model to downsize - must be on the Hub |\n| ```--output_model_name```  | Name of the output model - used to push it to the Hub or save it locally |\n| ```--hidden_downsampling_rate```  | Downsampling rate of the hidden dimension|\n| ```--layer_downsampling_rate```  | Downsampling rate of the attention blocks|\n| ```--aggregation_strategy```  | Aggregation strategy for the weight matrices - must be one of [`first`, `last`, `mean`]|\n| ```--layer_selection_strategy```  | Layer selection strategy for the attention layers - must be one of [`first`, `last`, `step`, `mean`]|\n| ```--push_to_hub```  | Flag enabling pushing the shrunken model to the Hub. 
It will push the model under the `bigscience` organization with the name `output_model_name` |\n\nThen run:\n```bash\npython downsample_model.py \\\n    --model_name [MODEL_NAME] --output_model_name [OUTPUT_MODEL_NAME] \\\n    --hidden_downsampling_rate [HIDDEN_DOWNSAMPLING_RATE] --layer_downsampling_rate [LAYER_DOWNSAMPLING_RATE] \\\n    --aggregation_strategy [AGGREGATION_STRATEGY] --layer_selection_strategy [LAYER_SELECTION_STRATEGY] \\\n    [--push_to_hub]\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigscience-workshop%2Fbloom-dechonk","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbigscience-workshop%2Fbloom-dechonk","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigscience-workshop%2Fbloom-dechonk/lists"}