# Diffusion-based hierarchical language modeling

## Dependencies

Please follow the instructions in [genslm](https://github.com/ramanathanlab/genslm/blob/main/docs/INSTALL.md) to set up the environment. This is particularly important if you plan to use DeepSpeed for distributed training.

Next, install this repository in editable mode:

```
pip install -e .
```
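
If you are starting from a clean machine, a typical setup looks roughly like the sketch below. This is only a sketch: the environment name and Python version are assumptions, and the genslm INSTALL guide linked above remains the authoritative reference.

```
# Minimal sketch of an environment setup (name and versions are assumptions;
# follow the genslm INSTALL guide for the supported configuration).
conda create -n hdlm python=3.9 -y
conda activate hdlm
pip install deepspeed   # needed for the distributed training recipes below
pip install -e .        # install this repository in editable mode
```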

## Training with DeepSpeed ZeRO Stage 2

For foundation models with up to (and including) 2.5B parameters, we can train using ZeRO Stage 2:

```
export NODES=10
export GPUS_PER_NODE=4
export MASTER_ADDR=x3006c0s13b1n0.hsn.cm.polaris.alcf.anl.gov
export LR=1e-4
export EPOCHS=20
export TRAIN_BATCH_SIZE=2
export ACCUMULATION=1
export EVAL_BATCH_SIZE=1
export SAVE_TOTAL_LIMIT=5
export SAVE_FOLDER=2.5B_${NODES}nodes_deepspeed_diffusion_sep_checkpoints_${LR}
export TRAIN_FILE=data/sample_train.txt
export TEST_FILE=data/sample_val.txt
export CL_MODEL=/lus/eagle/projects/CVD-Mol-AI/yuntian/genomenewnaive/encoder_93810/run_l0.001_b32/checkpoints
export MODEL=EleutherAI/gpt-neox-20b # doesn't matter, will be ignored
deepspeed --num_gpus=${GPUS_PER_NODE} --num_nodes=${NODES} --master_addr=${MASTER_ADDR} --hostfile=hostfile --master_port=54321 examples/pytorch/language-modeling/run_clm_genslm_2.5B.py \
--per_device_train_batch_size=${TRAIN_BATCH_SIZE} \
--deepspeed=deepspeed_configs/zero2.json \
--per_device_eval_batch_size=${EVAL_BATCH_SIZE} \
--gradient_accumulation_steps=${ACCUMULATION} \
--output_dir=${SAVE_FOLDER} \
--model_type=${MODEL} \
--model_name_or_path=${MODEL} \
--do_train \
--do_eval \
--train_file=${TRAIN_FILE} \
--validation_file=${TEST_FILE} --overwrite_output_dir --save_total_limit=${SAVE_TOTAL_LIMIT} \
--learning_rate=${LR} --num_train_epochs=${EPOCHS} --load_best_model_at_end=True \
--evaluation_strategy=epoch --save_strategy=epoch \
--cl_model_name_or_path=${CL_MODEL} \
--latent_dim=32 \
--block_size 1024 --fp16 --prediction_loss_only

```
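
The launcher reads the participating nodes from `--hostfile=hostfile`. DeepSpeed hostfiles use one `hostname slots=N` entry per node, where `slots` is the number of GPUs on that node; for the 10-node, 4-GPU-per-node setup above, a hostfile would look like this (hostnames are placeholders):

```
# hostfile: one line per node, slots = GPUs on that node
node01 slots=4
node02 slots=4
# ... one line for each of the remaining nodes
```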

## Training with DeepSpeed ZeRO Stage 3

For larger foundation models (the command below uses the 25B training script), use ZeRO Stage 3, which additionally partitions the model parameters across GPUs:

```
export NODES=10
export GPUS_PER_NODE=4
export MASTER_ADDR=x3006c0s13b1n0.hsn.cm.polaris.alcf.anl.gov
export LR=1e-4
export EPOCHS=20
export TRAIN_BATCH_SIZE=2
export ACCUMULATION=1
export EVAL_BATCH_SIZE=1
export SAVE_TOTAL_LIMIT=5
export SAVE_FOLDER=2.5B_${NODES}nodes_deepspeed_diffusion_sep_checkpoints_${LR}
export TRAIN_FILE=data/sample_train.txt
export TEST_FILE=data/sample_val.txt
export CL_MODEL=/lus/eagle/projects/CVD-Mol-AI/yuntian/genomenewnaive/encoder_93810/run_l0.001_b32/checkpoints
export MODEL=EleutherAI/gpt-neox-20b # doesn't matter, will be ignored
deepspeed --num_gpus=${GPUS_PER_NODE} --num_nodes=${NODES} --master_addr=${MASTER_ADDR} --hostfile=hostfile --master_port=54321 examples/pytorch/language-modeling/run_clm_genslm_25B.py \
--per_device_train_batch_size=${TRAIN_BATCH_SIZE} \
--deepspeed=deepspeed_configs/zero3.json \
--per_device_eval_batch_size=${EVAL_BATCH_SIZE} \
--gradient_accumulation_steps=${ACCUMULATION} \
--output_dir=${SAVE_FOLDER} \
--model_type=${MODEL} \
--model_name_or_path=${MODEL} \
--do_train \
--do_eval \
--train_file=${TRAIN_FILE} \
--validation_file=${TEST_FILE} --overwrite_output_dir --save_total_limit=${SAVE_TOTAL_LIMIT} \
--learning_rate=${LR} --num_train_epochs=${EPOCHS} --load_best_model_at_end=True \
--evaluation_strategy=epoch --save_strategy=epoch \
--cl_model_name_or_path=${CL_MODEL} \
--latent_dim=32 \
--block_size 1024 --fp16 --prediction_loss_only

```
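
Compared with the Stage 2 recipe, only the DeepSpeed config (`deepspeed_configs/zero3.json`) and the training script change. If you need to adapt the config, a minimal ZeRO Stage 3 config consistent with the flags above (sharded parameters, fp16, batch sizes inherited from the Trainer via `auto`) looks roughly like the sketch below; prefer the `zero3.json` shipped with this repository.

```
{
  "zero_optimization": { "stage": 3 },
  "fp16": { "enabled": true },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```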

## Generate

To generate samples from a trained model, run:

```
CUDA_VISIBLE_DEVICES=0 python examples/pytorch/language-modeling/generate_genslm_2.5B.py
```
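
The script runs on a single GPU, selected through `CUDA_VISIBLE_DEVICES`. For example, to run it on the second GPU of the node instead:

```
CUDA_VISIBLE_DEVICES=1 python examples/pytorch/language-modeling/generate_genslm_2.5B.py
```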

## Citations
If you use our models in your research, please cite this paper:

```
@article{zvyagin2022genslms,
title={GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics.},
author={Zvyagin, Max T and Brace, Alexander and Hippe, Kyle and Deng, Yuntian and Zhang, Bin and Bohorquez, Cindy Orozco and Clyde, Austin and Kale, Bharat and Perez-Rivera, Danilo and Ma, Heng and others},
journal={bioRxiv},
year={2022},
publisher={Cold Spring Harbor Laboratory}
}
```