{"id":21437514,"url":"https://github.com/janspiry/distributed-pytorch-template","last_synced_at":"2025-07-14T15:30:50.337Z","repository":{"id":37737951,"uuid":"452535636","full_name":"Janspiry/distributed-pytorch-template","owner":"Janspiry","description":"This is a seed project for distributed PyTorch training, which was built to customize your network quickly","archived":false,"fork":false,"pushed_at":"2022-06-22T08:54:35.000Z","size":411,"stargazers_count":67,"open_issues_count":0,"forks_count":5,"subscribers_count":1,"default_branch":"main","last_synced_at":"2023-03-04T20:07:50.893Z","etag":null,"topics":["ddp","deep-learning","distributeddataparallel","logger","pytorch","seed-project","template"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Janspiry.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-01-27T04:12:13.000Z","updated_at":"2023-02-28T03:29:48.000Z","dependencies_parsed_at":"2022-08-08T21:30:50.034Z","dependency_job_id":null,"html_url":"https://github.com/Janspiry/distributed-pytorch-template","commit_stats":null,"previous_names":[],"tags_count":null,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Janspiry%2Fdistributed-pytorch-template","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Janspiry%2Fdistributed-pytorch-template/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Janspiry%2Fdistributed-pytorch-template/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Janspiry%2Fdistributed-pyt
orch-template/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Janspiry","download_url":"https://codeload.github.com/Janspiry/distributed-pytorch-template/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225983005,"owners_count":17555076,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ddp","deep-learning","distributeddataparallel","logger","pytorch","seed-project","template"],"created_at":"2024-11-23T00:20:43.733Z","updated_at":"2024-11-23T00:20:44.378Z","avatar_url":"https://github.com/Janspiry.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PyTorch Template Using DistributedDataParallel\n\nThis is a seed project for distributed PyTorch training, which was built to customize your network quickly. \n\n### Overview\n\nHere is an overview of what this template can do; most of these functions can be customized through the configuration file.\n\n![distributed pytorch template](misc/template.png)\n\n### Basic Functions\n\n- checkpoint/resume training\n- progress bar (using tqdm)\n- progress logs (using logging)\n- progress visualization (using tensorboard)\n- finetune (partial network parameters training)\n- learning rate scheduler\n- random seed (reproducibility)\n\n------\n### Features\n\n- distributed training using DistributedDataParallel\n- base class for extensibility\n- `.json` configuration file for most parameter tuning\n- support multiple networks/losses/metrics definition\n- debug mode for fast testing 🌟\n\n------\n### Usage\n\n#### You Need to Know \n\n1. 
The following cuDNN settings are used for training by default and may reduce the reproducibility of your code. Keep them in mind to avoid unexpected behavior.\n\n```python\ntorch.backends.cudnn.enabled = True\n# speed-reproducibility tradeoff https://pytorch.org/docs/stable/notes/randomness.html\nif seed \u003e= 0 and gl_seed \u003e= 0:  # slower, more reproducible\n    torch.backends.cudnn.deterministic = True\n    torch.backends.cudnn.benchmark = False\nelse:  # faster, less reproducible, default setting\n    torch.backends.cudnn.deterministic = False\n    torch.backends.cudnn.benchmark = True\n```\n\n2. The project supports custom classes/functions and parameters through the configuration file. You can define datasets, losses, networks, etc. in a specific format. Take `network` as an example:\n\n```yaml\n// import Network() class from models.network.py file with args\n\"which_networks\": [\n    {\n        \"name\": [\"models.network\", \"Network\"],\n        \"args\": { \"init_type\": \"kaiming\"}\n    }\n],\n\n// import multiple Networks from default file with args\n\"which_networks\": [ \n    {\"name\": \"Network1\", \"args\": {\"init_type\": \"kaiming\"}},\n    {\"name\": \"Network2\", \"args\": {\"init_type\": \"kaiming\"}},\n],\n\n// import multiple Networks from default file without args\n\"which_networks\" : [\n    \"Network1\", // equivalent to {\"name\": \"Network1\", \"args\": {}},\n    \"Network2\"\n]\n\n// more details can be found in the More details parts below and the init_objs function in praser.py\n```\n\n\n\n#### Start\n\nRun `run.py` with your settings.\n\n```bash\npython run.py\n```\n\nMore options can be found in `run.py` and `config/base.json`.\n\n\n#### Customize Dataset\n\nThe dataset part decides what data is fed into the network. You can define the dataset with the following steps:\n\n1. Put your dataset under the `data` folder. See `dataset.py` in this folder as an example.\n2. Edit the **\\[datasets\\]\\[train|test\\]** part in `config/base.json` to import and initialize the dataset. 
\n\n```yaml\n\"datasets\": { // train or test\n    \"train\": { \n        \"which_dataset\": {  // import designated dataset using args \n            \"name\": [\"data.dataset\", \"Dataset\"], \n            \"args\":{ // args to init dataset\n                \"data_root\": \"/data/jlw/datasets/comofod\"\n            } \n        },\n        \"dataloader\":{\n            \"validation_split\": 0.1, // percent or number\n            \"args\":{ // args to init dataloader\n                \"batch_size\": 2, // batch size in every gpu\n                \"num_workers\": 4,\n                \"shuffle\": true,\n                \"pin_memory\": true,\n                \"drop_last\": true\n            }\n        }\n    },\n}\n```\n\n##### More details\n\n- You can import a dataset from a new file. The `name` key can be a list giving the file name and the class/function name, or a single string giving a class name in the default file (`data.dataset.py`). An example is as follows:\n\n```yaml\n\"name\": [\"data.dataset\", \"Dataset\"], // import Dataset() class from data.dataset.py\n\"name\": \"Dataset\", // import Dataset() class from default file\n```\n\n- You can control and record more parameters through the configuration file. Taking `data_root` as an example, you just need to add it to the `args` dict and edit the corresponding class to accept this value:\n\n```yaml\n\"args\":{ // args to init dataset\n    \"data_root\": \"your data path\"\n} \n```\n\n```python\nclass Dataset(data.Dataset):\n    def __init__(self, data_root, phase='train', image_size=[256, 256], loader=pil_loader):\n        imgs = make_dataset(data_root)  # data_root value comes from the configuration file\n```\n\n\n\n#### Customize Network\n\nThe network part defines the structure of your network. You can define your network with the following steps:\n\n1. Put your network under the `models` folder. See `network.py` in this folder as an example.\n2. 
Edit the **\\[model\\][which_networks]** part in `config/base.json` to import and initialize your networks. This part is a list. \n\n```yaml\n\"which_networks\": [ // import designated list of networks using args\n    {\n        \"name\": \"Network\",\n        \"args\": { // args to init network\n            \"init_type\": \"kaiming\" \n        }\n    }\n],\n```\n\n##### More details\n\n- You can import networks from a new file. The `name` key can be a list giving the file name and the class/function name, or a single string giving a class name in the default file (`models.network.py`). An example is as follows:\n\n```yaml\n\"name\": [\"models.network\", \"Network\"], // import Network() class from models.network.py\n\"name\": \"Network\", // import Network() class from default file\n```\n\n- You can control and record more parameters through the configuration file. Taking `init_type` as an example, you just need to add it to the `args` dict and edit the corresponding class to accept this value:\n\n```yaml\n\"args\": { // args to init network\n    \"init_type\": \"kaiming\" \n}\n```\n\n```python\nimport torch.nn as nn\n\nclass BaseNetwork(nn.Module):\n    def __init__(self, init_type='kaiming', gain=0.02):\n        super(BaseNetwork, self).__init__()  # init_type value comes from the configuration file\n\nclass Network(BaseNetwork):\n    def __init__(self, in_channels=3, **kwargs):\n        super(Network, self).__init__(**kwargs)  # get init_type value and pass it to the base network\n```\n\n- You can import multiple networks. Import the networks in the configuration file and use them in your model.\n\n```yaml\n\"which_networks\": [ \n    {\"name\": \"Network1\", \"args\": {}},\n    {\"name\": \"Network2\", \"args\": {}},\n],\n```\n\n\n\n\n#### Customize Model(Trainer)\n\nThe model part defines your training process, including optimizers, losses, process control, etc. You can define your model with the following steps:\n\n1. Put your model under the `models` folder. See `model.py` in this folder as an example.\n2. 
Edit the **\\[model\\][which_model]** part in `config/base.json` to import and initialize your model.\n\n```yaml\n\"which_model\": { // import designated model(trainer) using args \n    \"name\": [\"models.model\", \"Model\"],\n    \"args\": { // args to init model\n    } \n}, \n```\n\n##### More details\n\n- You can import a model from a new file. The `name` key can be a list giving the file name and the class/function name, or a single string giving a class name in the default file (`models.model.py`). An example is as follows:\n\n```yaml\n\"name\": [\"models.model\", \"Model\"], // import Model() class / function (not recommended) from models.model.py (default is [models.model.py])\n\"name\": \"Model\", // import Model() class from default file\n```\n\n- You can control and record more parameters through the configuration file. Please refer to the `More details` parts above.\n\n\n##### Losses and Metrics\n\nLosses and metrics are defined in the configuration file. You can also control and record more parameters through the configuration file; please refer to the `More details` parts above.\n\n```yaml\n\"which_metrics\": [\"mae\"], \n\"which_losses\": [\"mse_loss\"] \n```\n\nAfter the above steps, you need to rewrite several functions for your network and dataset, using `base_model.py`/`model.py` as references. \n\n##### Init step\n\nSee the `__init__()` functions as an example.\n\n##### Training/validation step\n\nSee the `train_step()/val_step()` functions as an example.\n\n##### Checkpoint/Resume training\n\nSee the `save_everything()/load_everything()` functions as an example.\n\n\n\n#### Debug mode\n\nSometimes you want to run through the whole process quickly to make sure the project works, which is what debug mode is for.\n\nThis mode will reduce the dataset size and speed up the training process. 
You just need to run the file with the `-d` option and edit the `debug` dict in the configuration file.\n\n```bash\npython run.py -d\n```\n\n```yaml\n\"debug\": { // args in debug mode, which will replace args in train\n    \"val_epoch\": 1,\n    \"save_checkpoint_epoch\": 1,\n    \"log_iter\": 30,\n    \"data_len\": 50 // percent or number, change the size of dataloader to debug_split.\n}\n```\n\n\n\n#### Customize More \n\nYou can set the random seed and the experiment path in the configuration file. We will add more useful basic functions with related instructions. **Contributions for more extensive customization and code enhancements are welcome.**\n\n------\n### Todo\n\nHere are some basic functions and examples that this repository has implemented or plans to implement:\n\n- [x] basic dataset/data_loader with validation split\n- [x] basic networks with weight initialization\n- [x] basic model (trainer)\n- [x] checkpoint/resume training\n- [x] progress bar (using tqdm)\n- [x] progress logs (using logging)\n- [x] progress visualization (using tensorboard)\n- [x] multi-gpu support (using DistributedDataParallel and torch.multiprocessing)\n- [x] finetune (partial network parameters training)\n- [x] learning rate scheduler\n- [x] random seed (reproducibility)\n- [x] multiple optimizers and schedulers by configuration file\n- [ ] parser arguments customization\n- [ ] more network examples \n\n\n------\n### Acknowledgements\n\nWe benefited a lot from the following projects:\n\n\u003e 1. https://github.com/Janspiry/Image-Super-Resolution-via-Iterative-Refinement\n\u003e 2. https://github.com/researchmm/PEN-Net-for-Inpainting\n\u003e 3. https://github.com/tczhangzhi/pytorch-distributed\n\u003e 4. 
https://github.com/victoresque/pytorch-template","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjanspiry%2Fdistributed-pytorch-template","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjanspiry%2Fdistributed-pytorch-template","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjanspiry%2Fdistributed-pytorch-template/lists"}