{"id":18610214,"url":"https://github.com/iamncj/yuangpt","last_synced_at":"2026-05-18T06:35:05.888Z","repository":{"id":163545190,"uuid":"453372293","full_name":"iamNCJ/YuanGPT","owner":"iamNCJ","description":"GPT-like Large Language Model Pretrained on Inspur's Yuan Dataset","archived":false,"fork":false,"pushed_at":"2023-02-15T17:21:57.000Z","size":577,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-08-22T06:51:41.787Z","etag":null,"topics":["gpt","gpt-2","large-language-models","llm","mlsys","pytorch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/iamNCJ.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-01-29T11:00:27.000Z","updated_at":"2023-05-10T17:29:55.000Z","dependencies_parsed_at":null,"dependency_job_id":"3633aa3c-1095-4e95-bfe6-6799cd73b40f","html_url":"https://github.com/iamNCJ/YuanGPT","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/iamNCJ/YuanGPT","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iamNCJ%2FYuanGPT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iamNCJ%2FYuanGPT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iamNCJ%2FYuanGPT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iamNCJ%2FYuanGPT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/o
wners/iamNCJ","download_url":"https://codeload.github.com/iamNCJ/YuanGPT/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iamNCJ%2FYuanGPT/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33167669,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-18T05:43:36.989Z","status":"ssl_error","status_checked_at":"2026-05-18T05:43:19.133Z","response_time":71,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gpt","gpt-2","large-language-models","llm","mlsys","pytorch"],"created_at":"2024-11-07T03:08:54.498Z","updated_at":"2026-05-18T06:35:05.873Z","avatar_url":"https://github.com/iamNCJ.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# YuanGenerativeLM\nGenerative Language Model Pretrained on Inspur's Yuan Dataset, codebase for ASC22 supercomputing competition\n\n## Project Structure\n\nTo simplify experiments on different distributed training frameworks, we decoupled the training code into `config`, `data`, `model` and `trainer` modules.\n\nThe idea of this decoupling is inspired by pytorch-lightning, however we decoupled it even further to make it more flexible when integrating with other frameworks.\n\n### `config` Module\n\nWe put all hyperparameters and configurations into `config` module for better tracing and logging.\n\n### `data` Module\n\nWe directly use 
`pytorch-lightning.LightningDataModule` since its interface is well-designed and easy to use.\n\n### `model` Module\n\nSince most distributed training frameworks need to wrap the model before or after model initialization, and `pytorch-lightning.LightningModule` has already exposed some problems when integrating multiple frameworks simultaneously, we decided to further decouple this module into a `BaseModel` class.\n\nThe `BaseModel` directly inherits from `nn.Module`, which is compatible with most distributed training frameworks. All implementations of the language model are derived from `BaseModel` and maintain only the model config, the model structure, the forward method, the loss function and the optimizer.\n\nCurrently, implemented models include:\n- native model: written in native PyTorch\n- huggingface model: written in HuggingFace's transformers\n\n### `trainer` Module\n\nEverything else, such as model initialization, training, validation and testing, goes into the `trainer` module. All training preparation and iterations are done here.\n\nCurrently, implemented trainers include:\n- PytorchLightning trainer: distributed training with pytorch-lightning, with deepspeed integration provided by the lightning team\n- PatrickStar Trainer\n\n## Distributed Launch\n\nBelow are examples of how to launch the training job on different distributed frameworks.\n\n### DDP in PyTorch-Lightning\n\n`num_nodes` must be set to the number of GPUs across all nodes; otherwise it will use the number of GPUs in the master node.\n\n```sh\ntorchrun --nnodes=2 --nproc_per_node=2 --master_addr GPU04 --master_port 9001 --node_rank 1 train.ddp_pl.py\n```\n\n### DeepSpeed in PyTorch-Lightning\n\n```sh\nOMP_NUM_THREADS=32 torchrun --nnodes=2 --nproc_per_node=2 --master_addr GPU04 --master_port 9001 --node_rank 1 train.ds_pl.py\n```\n\nNote that `OMP_NUM_THREADS` must be set when offload is used, since the optimizer then runs on the CPU. 
\n\n### Horovod in PyTorch-Lightning\n\n```sh\nhorovodrun -np 2 python train.hvd_pl.py\n```\n\nWe still prefer to use `torchrun`\n\n### PatrickStar\n\n```sh\ntorchrun --nnodes=1 --nproc_per_node=2 train.pstar.py\n```\n\n### Colossal AI\n```sh\nGLOO_SOCKET_IFNAME=ibs5 OMP_NUM_THREADS=32 torchrun --master_addr=\"172.25.2.105\" --master_port=29500 --nnodes=2 --node_rank=1 --nproc_per_node=2 train.col_ai.py --config=trainer/colossal_ai/strategy.py\n```\n\n## Run Profile\n\n```sh\nOMP_NUM_THREADS=32 nsys profile -o cpu_adam torchrun --nnodes=2 --nproc_per_node=2 --master_addr GPU04 --master_port 9001 --node_rank 0 train.ds_pl.py\n\nOMP_NUM_THREADS=32 nsys profile --gpu-metrics-device=all --gpuctxsw=true --nic-metrics=true --cuda-memory-usage=true --cudabacktrace=all torchrun  --nnodes=2 --nproc_per_node=2 train.col_ai.py --config=trainer/colossal_ai/strategy.py\n```\n\n## Docker Environment\n\n```sh\ndocker run -it --name pytorch --gpus all --privileged --cap-add=SYS_ADMIN --ipc=host --network=host --ulimit memlock=-1 --ulimit stack=67108864 --device=/dev/infiniband -v $(pwd):/workspace registry.cn-hangzhou.aliyuncs.com/ncj/pytorch bash\n```\n\nCheck details in [Dockerfile](./Dockerfile)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiamncj%2Fyuangpt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fiamncj%2Fyuangpt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiamncj%2Fyuangpt/lists"}