{"id":17871325,"url":"https://github.com/da03/hierarchical_diffusion_lm","last_synced_at":"2025-08-14T18:32:13.602Z","repository":{"id":65613614,"uuid":"575060743","full_name":"da03/hierarchical_diffusion_LM","owner":"da03","description":null,"archived":false,"fork":false,"pushed_at":"2022-12-06T17:13:33.000Z","size":101020,"stargazers_count":9,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-10-28T11:43:02.501Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/da03.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null}},"created_at":"2022-12-06T17:04:23.000Z","updated_at":"2024-04-22T07:59:43.000Z","dependencies_parsed_at":"2023-01-31T19:46:15.295Z","dependency_job_id":null,"html_url":"https://github.com/da03/hierarchical_diffusion_LM","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/da03%2Fhierarchical_diffusion_LM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/da03%2Fhierarchical_diffusion_LM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/da03%2Fhierarchical_diffusion_LM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/da03%2Fhierarchical_diffusion_LM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/da03","download_url":"https://codeload.github.com/da03/hierarchical_diffusion_LM/tar.gz/refs/heads/main","host":{"
name":"GitHub","url":"https://github.com","kind":"github","repositories_count":229854965,"owners_count":18134832,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-28T10:30:22.288Z","updated_at":"2024-12-15T17:42:26.611Z","avatar_url":"https://github.com/da03.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Diffusion-based hierarchical language modeling.\n\n## Dependencies\n\nPlease follow the instructions in [genslm](https://github.com/ramanathanlab/genslm/blob/main/docs/INSTALL.md) to set up the environment. This is particularly important if you plan to use DeepSpeed for distributed training.\n\nNext, install this repository in editable mode:\n\n```\npip install -e .\n```\n\n## Training with DeepSpeed Zero Stage 2\n\nFoundation models with up to 2.5B parameters (inclusive) can be trained with Zero Stage 2:\n\n```\nexport NODES=10\nexport GPUS_PER_NODE=4\nexport MASTER_ADDR=x3006c0s13b1n0.hsn.cm.polaris.alcf.anl.gov\nexport LR=1e-4\nexport EPOCHS=20\nexport TRAIN_BATCH_SIZE=2\nexport ACCUMULATION=1\nexport EVAL_BATCH_SIZE=1\nexport SAVE_TOTAL_LIMIT=5\nexport SAVE_FOLDER=2.5B_${NODES}nodes_deepspeed_diffusion_sep_checkpoints_${LR}\nexport TRAIN_FILE=data/sample_train.txt\nexport TEST_FILE=data/sample_val.txt\nexport CL_MODEL=/lus/eagle/projects/CVD-Mol-AI/yuntian/genomenewnaive/encoder_93810/run_l0.001_b32/checkpoints\nexport MODEL=EleutherAI/gpt-neox-20b # doesn't matter, will be ignored\ndeepspeed --num_gpus=${GPUS_PER_NODE} --num_nodes=${NODES} --master_addr=${MASTER_ADDR} --hostfile=hostfile 
--master_port=54321 examples/pytorch/language-modeling/run_clm_genslm_2.5B.py \\\n       --per_device_train_batch_size=${TRAIN_BATCH_SIZE} \\\n       --deepspeed=deepspeed_configs/zero2.json \\\n       --per_device_eval_batch_size=${EVAL_BATCH_SIZE} \\\n       --gradient_accumulation_steps=${ACCUMULATION} \\\n       --output_dir=${SAVE_FOLDER} \\\n       --model_type=${MODEL} \\\n       --model_name_or_path=${MODEL} \\\n       --do_train \\\n       --do_eval \\\n       --train_file=${TRAIN_FILE} \\\n       --validation_file=${TEST_FILE} --overwrite_output_dir --save_total_limit=${SAVE_TOTAL_LIMIT} \\\n       --learning_rate=${LR} --num_train_epochs=${EPOCHS} --load_best_model_at_end=True \\\n       --evaluation_strategy=epoch --save_strategy=epoch \\\n       --cl_model_name_or_path=${CL_MODEL} \\\n       --latent_dim=32 \\\n       --block_size 1024 --fp16 --prediction_loss_only\n\n```\n\n## Training with DeepSpeed Zero Stage 3\n\n```\nexport NODES=10\nexport GPUS_PER_NODE=4\nexport MASTER_ADDR=x3006c0s13b1n0.hsn.cm.polaris.alcf.anl.gov\nexport LR=1e-4\nexport EPOCHS=20\nexport TRAIN_BATCH_SIZE=2\nexport ACCUMULATION=1\nexport EVAL_BATCH_SIZE=1\nexport SAVE_TOTAL_LIMIT=5\nexport SAVE_FOLDER=2.5B_${NODES}nodes_deepspeed_diffusion_sep_checkpoints_${LR}\nexport TRAIN_FILE=data/sample_train.txt\nexport TEST_FILE=data/sample_val.txt\nexport CL_MODEL=/lus/eagle/projects/CVD-Mol-AI/yuntian/genomenewnaive/encoder_93810/run_l0.001_b32/checkpoints\nexport MODEL=EleutherAI/gpt-neox-20b # doesn't matter, will be ignored\ndeepspeed --num_gpus=${GPUS_PER_NODE} --num_nodes=${NODES} --master_addr=${MASTER_ADDR} --hostfile=hostfile --master_port=54321 examples/pytorch/language-modeling/run_clm_genslm_25B.py \\\n       --per_device_train_batch_size=${TRAIN_BATCH_SIZE} \\\n       --deepspeed=deepspeed_configs/zero3.json \\\n       --per_device_eval_batch_size=${EVAL_BATCH_SIZE} \\\n       --gradient_accumulation_steps=${ACCUMULATION} \\\n       --output_dir=${SAVE_FOLDER} \\\n       
--model_type=${MODEL} \\\n       --model_name_or_path=${MODEL} \\\n       --do_train \\\n       --do_eval \\\n       --train_file=${TRAIN_FILE} \\\n       --validation_file=${TEST_FILE} --overwrite_output_dir --save_total_limit=${SAVE_TOTAL_LIMIT} \\\n       --learning_rate=${LR} --num_train_epochs=${EPOCHS} --load_best_model_at_end=True \\\n       --evaluation_strategy=epoch --save_strategy=epoch \\\n       --cl_model_name_or_path=${CL_MODEL} \\\n       --latent_dim=32 \\\n       --block_size 1024 --fp16 --prediction_loss_only\n\n```\n\n## Generate\n\nTo generate, run\n\n```\nCUDA_VISIBLE_DEVICES=0 python examples/pytorch/language-modeling/generate_genslm_2.5B.py\n```\n\n## Citations\nIf you use our models in your research, please cite this paper:\n\n```\n@article{zvyagin2022genslms,\n  title={GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics.},\n  author={Zvyagin, Max T and Brace, Alexander and Hippe, Kyle and Deng, Yuntian and Zhang, Bin and Bohorquez, Cindy Orozco and Clyde, Austin and Kale, Bharat and Perez-Rivera, Danilo and Ma, Heng and others},\n  journal={bioRxiv},\n  year={2022},\n  publisher={Cold Spring Harbor Laboratory}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fda03%2Fhierarchical_diffusion_lm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fda03%2Fhierarchical_diffusion_lm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fda03%2Fhierarchical_diffusion_lm/lists"}