# Optimus-CC [![DOI](https://zenodo.org/badge/553339236.svg)](https://zenodo.org/badge/latestdoi/553339236)
[ASPLOS'23] Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression (Accepted, to appear)

Our code is based on Megatron-LM (https://github.com/NVIDIA/Megatron-LM, v2.5) and PowerSGD (https://github.com/epfml/powersgd).

## Artifact Evaluation
This repository is for the AE (Artifact Evaluation) process of ASPLOS'23.

In the `ASPLOS23/` folder, scripts for pretraining (TABLE 2), speedup checks (TABLE 2, Fig. 10), memory-consumption checks (Fig. 12), compression/decompression throughput checks (Fig. 14), and cosine-similarity checks (Fig. 11) are available.
We give a detailed guideline for these evaluations in the `Evaluation Reproducing` section.
For the zero-shot accuracy check (TABLE 3 and TABLE 4), the process is more involved, so please refer to the `Zero-Shot Task Running` section.
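Both Compressed Backpropagation and the data-parallel compression used by Selective Stage Compression (described under `Megatron-CC Arguments` below) build on PowerSGD-style low-rank gradient compression with error feedback. For orientation, here is a minimal numpy sketch of one rank-r compression step — illustrative only; the function names and shapes are ours, not the repository's implementation:

```python
import numpy as np

def orthogonalize(m):
    # Orthonormalize the columns of m via reduced QR decomposition.
    q, _ = np.linalg.qr(m)
    return q

def powersgd_step(grad, q, error):
    """One PowerSGD-style rank-r compression step with error feedback.

    grad:  (n, m) gradient matrix
    q:     (m, r) right factor, warm-started from the previous step
    error: (n, m) residual left over from the previous step
    Returns (approx, new_q, new_error).
    """
    m = grad + error                  # fold in the residual (error feedback)
    p = orthogonalize(m @ q)          # (n, r) orthonormal left factor
    new_q = m.T @ p                   # (m, r) right factor
    approx = p @ new_q.T              # rank-r reconstruction of the gradient
    return approx, new_q, m - approx  # the new residual is what was lost

rng = np.random.default_rng(0)
grad = rng.standard_normal((64, 32))
q = rng.standard_normal((32, 4))      # compression rank r = 4
error = np.zeros_like(grad)
approx, q, error = powersgd_step(grad, q, error)
```

In the actual distributed setting, only the small factors (the analogues of `p` and `new_q`) are all-reduced instead of the full gradient, which is where the communication savings come from.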
Note that the training script for TABLE 4 is available in the TABLE 2 training script folder.
Other (non-main) figure experiments can be run by changing options in the speedup check scripts.

Dataset creation is explained in `Dataset Preprocessing`. Build the pretraining dataset following that guideline and use the binarized dataset.

For detailed arguments and settings, please refer to the explanations below.

## Licenses
For the baseline code (Megatron-LM and PowerSGD), please follow their licenses and guidelines.
The additional code is under the MIT license.

## Environment

We conducted pretraining experiments on 2.5B and 8.3B GPT models on a large data-center cluster with NVIDIA A100 GPUs. Our GPU boxes (nodes) are interconnected with 200 Gbps InfiniBand.

We used NGC's PyTorch Docker container version 20.12, with Python 3.8, PyTorch 1.8, CUDA 11.1, and NCCL 2.8.3.

In addition, we converted this image into a Singularity image for use with the IBM LSF scheduler.
(Use `singularity build --sandbox {singularity_name} {docker tar path}` after saving the Docker image as a `.tar`.)

Refer to https://github.com/NVIDIA/Megatron-LM/blob/main/README.md for distributed execution.

## Megatron-CC Arguments

Below is an explanation of the arguments that enable Megatron-CC's main schemes.

### A. Compressed Backpropagation (CB)

```shell
--inter_grad_comp \ # enable CB
--inter_grad_comp_rank 12 \ # set CB compression rank
--inter_grad_comp_epilogue_only \ # set CB epilogue only
--use_error_feedback # use Lazy Error Propagation (LEP) on CB
```

### B. Fused Embedding Synchronization (FE)

```shell
--emb_comm_opt # enable FE
```

### C. Selective Stage Compression (SC)
```shell
# SC needs data-parallel gradient compression
--grad_comp \ # enable data-parallel gradient compression
--grad_comp_rank 128 \ # set the data-parallel gradient compression rank
--grad_comp_warm_up 0.1 \ # set the PowerSGD warm-up period
--use_error_feedback \ # use error feedback on PowerSGD
--selective_grad_comp \ # enable selective stage compression (SC)
--selective_grad_comp_way # set how many stages to compress
```

If you want to check the validity of `Lazy Error Propagation (LEP)`'s orthogonality and averages, use the argument below.

```shell
--check_orth_error_acti
```

This argument prints out the cosine similarities and averages.

## Evaluation Reproducing

All evaluation-reproducing scripts are in `ASPLOS23/`. Basic pretraining scripts are in `examples/`.

These scripts use the IBM `LSF` scheduler, but they can be adapted to other scheduler formats (e.g., `Slurm`).

Below are the scripts that reproduce the main evaluation results in our paper.

Use them via `bsub < lsf_job_submit.sub` after replacing `{sh_script_for_experiment_to_execute}` in `lsf_job_submit.sub` with the appropriate `.sh` file.

Replace `{some_argument}` with proper values.

- TABLE 2 and Fig. 9: `ASPLOS23/tbl2_fig9_tbl4/*.sh`
  - The main experiment of our paper: pretraining the GPT-2.5B and GPT-8.3B models with each scheme. The training script for the non-LEP case is also included.
- Fig. 10: `ASPLOS23/fig10/*.sh`
  - Time check of each scheme. This script only measures the overall time of each scheme. To break the time down, follow an approach similar to a CPI stack: comment out the communication code for each communication.
- Fig. 11: `ASPLOS23/fig11/*.sh`
  - This script checks the averages and the cosine similarity of errors and intermediate activations.
- Fig. 12: `ASPLOS23/fig12/*.sh`
  - This script shows the maximum memory allocation of the baseline, CB, and CB+LEP.
- Fig. 14: `ASPLOS23/fig14/*.md`
  - Instructions for the compression and decompression throughput checks of GPT-2.5B, 8.3B, and 175B.

## Zero-Shot Task Running

To run the zero-shot tasks, the split model partitions must be merged into a single model.

Use `tools/ckpt_convert.py` to create the single model.

For example, if you used the TP8, PP4 setting for GPT-2.5B pretraining, use a command like the one below to create the single model.

```shell
WORLD_SIZE=8 python tools/ckpt_convert.py --model-type GPT --tensor-model-parallel-size 8 --pipeline-model-parallel-size 4 --target-pipeline-model-parallel-size 1 --vocab-file ~/student1/gpt2-vocab.json --merge-file ~/student1/gpt2-merges.txt --num-layers 52 --hidden-size 1920 --num-attention-heads 24 --seq-length 1024 --max-position-embeddings 1024 --load ~/student1/GPT-2.5B-Baseline/ --save ~/student1/GPT-2.5B-Baseline/merge/ --experiment_name merge_baseline --tokenizer-type GPT2BPETokenizer --activations-checkpoint-method uniform --data-impl mmap --DDP-impl local -i ~/student1/GPT-2.5B-Baseline/iter_0300000/ -o ~/student1/GPT-2.5B-Baseline/merged/ -i_g 1 -p 1
```

Now we have prepared a single model for running zero-shot tasks.

We'll use `lm-evaluation-harness` (https://github.com/EleutherAI/lm-evaluation-harness) to run the zero-shot tasks, so we need to convert the Megatron-LM checkpoint to an HF (HuggingFace) checkpoint.
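For intuition, the merge step above undoes tensor-parallel sharding: each weight matrix is split along one axis across the TP ranks, and merging simply concatenates the shards back in rank order. A toy numpy sketch (the axis conventions follow the usual Megatron-LM scheme; the shapes here are made up and this is not the actual `ckpt_convert.py` logic):

```python
import numpy as np

tp = 8  # tensor-model-parallel size, matching --tensor-model-parallel-size above
full = np.arange(64 * 16).reshape(64, 16).astype(np.float32)  # a stand-in weight

# Column-parallel layers shard the output dimension (axis 0 of the weight);
# row-parallel layers shard the input dimension (axis 1).
col_shards = np.split(full, tp, axis=0)
row_shards = np.split(full, tp, axis=1)

# Merging is the inverse: concatenate each rank's shard along its split axis.
merged_col = np.concatenate(col_shards, axis=0)
merged_row = np.concatenate(row_shards, axis=1)
```

The real script additionally walks the checkpoint's layer structure to pick the correct split axis per parameter and can also collapse pipeline stages, as the `--target-pipeline-model-parallel-size 1` flag indicates.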
Clone the `transformers` GitHub repository and run the code below for the conversion.

```shell
python transformers-4.17.0/src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py --config_file config.json {checkpoint_path}
```

Clone the `lm-evaluation-harness` GitHub repository and run the code below to run the zero-shot tasks.

```shell
python main.py --model gpt2 --model_args pretrained={checkpoint_path} --device cuda:0 --tasks lambada,hellaswag,piqa,mathqa,winogrande,race
```

Now you can get the zero-shot task results.

## Dataset Preprocessing

Dataset preprocessing uses the code in `tools/` and `tools/openwebtext/`.

### Libraries to install

```
pip install ftfy langdetect numpy torch pandas nltk sentencepiece boto3 tqdm regex bs4 newspaper3k htmlmin tldextract
git clone https://github.com/mattilyra/LSH
cd LSH
python setup.py install
```

### Download the dataset

1. Download the deduplicated URLs from [jcpeterson](https://mega.nz/#F!EZZD0YwJ!9_PlEQzdMVLaNdKv_ICNVQ!cc4RgQQZ)
2. Remove blacklisted URLs.

```
python blacklist_urls.py <path to the downloaded deduplicated URLs> <filename for clean URLs, e.g. clean_urls.txt>
```

3. Download the content from the clean URLs with [openwebtext's utilities](https://github.com/eukaryote31/openwebtext/blob/master/download.py).

4. Merge the contents into one loose JSON file with one JSON object per line of the format `{'text': text, 'url': unique_url}`. It is important for the URL to be unique.

### Prepare the data for GPT training

1. Perform ftfy, English detection, and remove documents with fewer than 128 tokens. This step can be sharded and run on shards.

```
python cleanup_dataset.py <input data file> <output cleaned data filename>
```

Additional cleanup (e.g. removing documents with fewer than 512 characters, or dataset-specific cleaning as for the stories and realnews datasets) can be done using `cleanup_fix_dataset.py`.
More details can be found by running `python cleanup_fix_dataset.py --help`.

2. Using LSH, find possible duplicates and store them in a file for later processing. The code supports saving and loading fingerprints for recurrent deduplication and is also multithreaded for faster processing. More details can be found via `python find_duplicates.py --help`.

```
python find_duplicates.py --inputs <pairwise list of input cleaned data files and keys, e.g. cc.json cc_id news.json news_id> --output <output possible duplicate urls filename>
```

3. Based on the similarity measure defined inside the function `is_similar` (default threshold: 0.9), group URLs that are similar. For each group, we keep only one URL and remove the rest.

```
python group_duplicate_urls.py <possible duplicate urls file> <output file containing similar urls>
```

4. Remove the similar documents detected in the last step.

```
python remove_group_duplicates.py <file containing similar documents> <cleaned data file> <output file containing deduplicated data>
```

5. Shuffle the dataset.

```
shuf <cleaned deduped data file> -o train_data.json
```

### Deduplicating ngrams

To deduplicate the downstream tasks (e.g. lambada, squad) from the training dataset, we run the following command.

```
python filter_ngrams.py --tasks <name of the task, e.g. lambada, squad> --dedup-dataset <training dataset to deduplicate> <json key> --output <output training dataset>
```

We use 13-grams by default for the deduplication. When we find a 13-gram match in a training document, we split the document into two pieces and remove the 13-gram along with 200 characters from both sides of the match. We also remove any split piece with fewer than 200 characters, or the entire document if it gets split more than 10 times.
These parameters can be changed using the corresponding arguments.

Only for the lambada task, we need to provide the path via `--lambada-path <path of the lambada test data>`.

Several other features (e.g. saving and loading the dictionary) have been added; see `python filter_ngrams.py --help` for details.
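The n-gram matching step described above can be sketched as follows — a toy illustration using n=3 for brevity (the real `filter_ngrams.py` uses 13-grams and additionally applies the document-splitting, 200-character, and 10-split rules):

```python
def ngrams(tokens, n=13):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def find_matches(doc_tokens, task_ngrams, n=13):
    """Start indices where any task n-gram occurs in the training document."""
    return [i for i in range(len(doc_tokens) - n + 1)
            if tuple(doc_tokens[i:i + n]) in task_ngrams]

# Toy example: a downstream-task sentence leaking into a training document.
task = "the quick brown fox jumps".split()
doc = "once upon a time the quick brown fox jumps over the lazy dog".split()
hits = find_matches(doc, ngrams(task, n=3), n=3)
```

Each hit would then trigger a split of the training document around the matched span, with the surrounding characters removed as described above.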