{"id":18322524,"url":"https://github.com/tencentarc/common_trainer","last_synced_at":"2025-04-05T23:31:02.970Z","repository":{"id":109180104,"uuid":"577657843","full_name":"TencentARC/common_trainer","owner":"TencentARC","description":"Common template for pytorch project. Easy to extent and modify for new project.","archived":false,"fork":false,"pushed_at":"2022-12-13T08:28:35.000Z","size":135,"stargazers_count":12,"open_issues_count":0,"forks_count":3,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-21T13:23:19.363Z","etag":null,"topics":["computer-vision","deep-learning","machine-learning","pytorch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TencentARC.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-12-13T08:28:13.000Z","updated_at":"2023-08-06T04:19:41.000Z","dependencies_parsed_at":"2023-03-30T12:04:31.724Z","dependency_job_id":null,"html_url":"https://github.com/TencentARC/common_trainer","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2Fcommon_trainer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2Fcommon_trainer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2Fcommon_trainer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2Fcommon_trainer/manifests","owner_ur
l":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TencentARC","download_url":"https://codeload.github.com/TencentARC/common_trainer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247415783,"owners_count":20935383,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","deep-learning","machine-learning","pytorch"],"created_at":"2024-11-05T18:25:01.241Z","updated_at":"2025-04-05T23:31:02.964Z","avatar_url":"https://github.com/TencentARC.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# A PyTorch template for deep learning projects\nAn easy-to-use template for PyTorch deep learning projects.\n\n------------------------------------------------------------------------\n## Start a new proj\n- Use `python start_new_proj.py --proj_name xx --proj_loc /path/to/proj_parent_dir` to extend the template into a new project.\nYou only need to implement the data, model, loss, metric, and progress_img_saver.\nAll other functions for training and evaluation are provided.\n- When you start a new proj with proj_name, the custom lib will be renamed to your proj_name. CamelCase (like ProjName) is recommended.\n\n------------------------------------------------------------------------\n## Installation\n- Install the required libs with `pip install -r requirements`. The major libs are: torch, numpy, loguru, tensorboard, pyyaml.\n- `pre-commit install` to install pre-commit for formatting. 
`pre-commit run --all-files` for checking all files.\n\n------------------------------------------------------------------------\n## Main Function\nUse `python train.py --config configs/default.yaml` to start training.\nRefer to `configs/default.yaml` for all params.\n\n------------------------------------------------------------------------\n## CPU training\n- Setting `--gpu_ids -1` will only use cpu, which is good for debugging. Refer to `scripts/cpu.sh` for more detail.\n\n## GPU and Multi process\n- Use launch: You can refer to `script/gpu.sh` for training on gpu.\nSingle/Multi-gpu training on a local machine or on distributed machines is supported.\n- Use slurm: You can refer to `script/slurm.sh` for training on gpu using `slurm`.\nSingle/Multi-gpu training on a local machine or on distributed machines is supported.\n\n`@master_only` on a function allows only the `rank=0` node to run it.\n\n------------------------------------------------------------------------\n## Config\n- Use yaml to save configs. They are mainly saved at `configs/`. If you want to set or update a value\nby argument, you can directly add `--arg value` on the command line.\n\n- All arguments in yaml are nested in levels, and input arguments should be written as `--level1.level2...`\n\n------------------------------------------------------------------------\n## Logging\n- We use `loguru` to save and show the log. Only the `rank=0` process shows the log. You can `add_log` and set msg_level.\n\n------------------------------------------------------------------------\n## Resume training\n- You can set `--resume` to the checkpoint_path, or to the checkpoint folder, in which case it will load `lastest.pt.tar`.\nBut this only reads the model; you have to set `--configs xxx` to the configs in the existing expr folder.\n\n- In `resume` mode, if you set `progress.start_epoch` to `-1`, it will resume training.\n\n- If `progress.start_epoch` is `0`, it will load the weight and fine-tune from epoch 0. 
You should set\na different expr name like `xxx_finetune` for separation.\n\n------------------------------------------------------------------------\n## Reproduce an old experiment\n- All updated configs will be saved in the experiment. You just need to run `job.sh` in the exp to reproduce the result.\n\n- The script starts cpu training. You need to modify `job.sh` to use gpu.\n\n------------------------------------------------------------------------\n## Model\n- You can add your model at `custom.models` with `xxx_model.py`.\n\n- Add `@MODEL_REGISTRY.register()` to the class for registration.\n\n- Some backbones/components are provided in `common.models`.\n\n------------------------------------------------------------------------\n## Dataset\n- `dir.data_dir` in config is the main data_dir for all datasets. Do not point it at any single dataset.\nYou should modify your `custom.xx_dataset.py` to set the path for your specific dataset.\n\n- You can add your dataset at `custom.datasets` with `xxx_dataset.py`.\n\n- Add `@DATASET_REGISTRY.register()` to the class for registration.\n\nTo set the dataset used in train/val/eval, set\n```\ndataset:\n    train:\n        type: xxDataset\n        augmentation:\n            xxx:\n    val:\n    eval:\n```\nIf val/eval is missing, validation and eval will not be done during training.\n\n## Data Transforms\n- You can modify the function `custom.dataset.transform.get_transforms` to choose the data transformation.\n\n- Some basic functions are provided in `common.dataset.transform.augmentation`.\n\n------------------------------------------------------------------------\n## Loss\n- You can add your loss at `custom.loss` with `xxx_loss.py`.\n\n- Add `@LOSS_REGISTRY.register()` to the class for registration.\n\nTo set loss\n```\nloss:\n    loss1:\n        weight: 1.0\n        other: xxx\n        augmentation:\n    loss2:\n        weight: 2.0\n```\n- Weights will be combined in loss_factory in `custom.loss.__init__`, so you don't need to 
multiply the weight in\neach implementation.\n\n- When implementing a loss, you have to move `inputs` to the `output` device. Refer to `custom.loss.img_loss` for an example.\n\n\nThe resulting loss dict will be:\n```\nloss:\n    names: [loss1, loss2, ...]\n    loss1: xx.xx\n    loss2: xx.xx\n    ...\n    sum: xx.xx\n```\n\n------------------------------------------------------------------------\n## Metric\n- Similar to Loss, all metrics are calculated at once. But you don't need to set weights here, and no 'sum' is calculated.\n\n- Add `@METRIC_REGISTRY.register()` to the class for registration.\n\n- When implementing a metric, you have to move `inputs` to the `output` device. Refer to `custom.metric.custom_metric` for an example.\n\n- The resulting metric dict will be:\n```\nmetric:\n    names: [metric1, metric2, ...]\n    metric1: xx.xx\n    metric2: xx.xx\n    ...\n```\n\n------------------------------------------------------------------------\n## Grad clip\n- Gradient clipping on the whole model is supported by `clip_gradients`.\nYou can set `clip_warm` to a positive number in order to use `clip_gradients_warmup` after the warmup period.\n\n------------------------------------------------------------------------\n## Valid\n- Validation will be performed on the `val` dataset every `progress.epoch_val` epochs. The monitor will record results like loss and imgs.\n\n- You can specify the valid cfgs in `dataset.val` to change the dataset details.\n\n- If `progress.save_progress_val` is `True`, it will save `progress.max_samples_val` results into `experiments/expr_name/progress/val`.\n\n------------------------------------------------------------------------\n## Eval\n- Evaluation will be performed on the `eval` dataset every `progress.epoch_eval` epochs.\nAll results will be locally recorded in `experiments/expr_name/eval` for each epoch.\nBut generally you should not run it during training. 
Local evaluation is better to avoid over-fitting.\n\n- You can specify the eval cfgs in `dataset.eval` to change the dataset details.\n\n- A metric is needed for quantitative evaluation.\n\n- If `progress.init_eval` is `True`, it will evaluate with the init model or the resumed model.\n\n### Local Evaluation\n- If you want to evaluate a trained model, you can use `python evaluate.py` and set `--configs configs/eval.yaml` and\n`--model_pt /path/to/model` for evaluation. Results will be written to `--dir.eval_dir results/eval_sample`.\n\n- `eval.yaml` should contain params for `--dataset.eval`, `--model`, `--metric`.\n\n------------------------------------------------------------------------\n## Tests\n- Tests for the `common` class and `custom` are in `tests`. You should implement your own tests for the `custom` class when needed.\n\n- We use unittest. You can run\n  - `python -m unittest test_file` on tests in the whole file.\n  - `python -m unittest discover test_dir` on tests in the whole directory.\n  - `python -m unittest test_dir.test_file.test_method` on the test for a single func.\n\n------------------------------------------------------------------------\n## Monitor and Progress saver\n- A tensorboard monitor will be used during training to record train/val loss, vals, images, etc.\n\n- All results in progress will be saved in `experiments/expr_name/event`. 
Use `tensorboard --logdir=experiments/expr_name/event` to check.\n\n- At the same time, if you set `progress.local_progress` to True, imgs will be written to `experiments/expr_name/progress`.\n\n- Change `render_progress_img` in `custom_trainer` for different visual results.\n\n------------------------------------------------------------------------\n## CUDA extension\nWe provide a simple sample CUDA extension for an add_matrix function, and a python wrapper\nto use it like a `torch.nn.Module`.\nFor more detail, please see the [official doc](https://pytorch.org/tutorials/advanced/cpp_extension.html).\n\nInstall it by going into `custom/ops` and running `python setup.py install`, or run `sh ./scripts/install_ops.sh`.\n\nRun it by `python custom/ops/add_matrix.py` or\nrun tests by `python -m unittest tests/tests_custom/tests_ops/tests_ops.py`.\n\n### Develop new ops\nYou need to add a new folder in `custom/ops/` to hold the source cpp-wrapper and cuda implementation.\n\nIt is suggested to put a python wrapper under `custom/ops/func.py` to expose the func.\n\n### `__global__`, `__device__`, `__host__`: keywords\n- `__global__`: called by cpu, runs on gpu. Function must be `void`.\n- `__device__`: called by gpu, runs on gpu\n- `__host__`: called by cpu, runs on cpu\n- `__host__ __device__`: both cpu and gpu\n- `__global__ __host__` is not allowed.\n\n### grid-block-thread\n`grid - block - thread` is the hierarchy of GPU computation units.\n- index = blockIdx.x * blockDim.x + threadIdx.x = the thread id in a grid\n- stride = blockDim.x = total num of threads in a block. 
Commonly a block can be used to handle one batch.\n- stride = blockDim.x * gridDim.x  = total num of threads in a grid\n  - looping with this stride is called a `grid-stride loop`\n#### 2d and 1d\n- 2d/1d grid/block are all supported based on your input tensor shape.\n- Refer to [doc1](http://www.mathcs.emory.edu/~cheung/Courses/355/Syllabus/94-CUDA/2D-grids.html) and [doc2](https://blog.csdn.net/canhui_wang/article/details/51730264) for detail.\n\n### PackedAccessor\nTo pass a tensor into a cuda kernel, you can use\n```\n    AT_DISPATCH_FLOATING_TYPES(A.scalar_type(), \"sample_cuda\",  // dispatches on the actual scalar type\n    ([\u0026] {\n        kernel_func\u003cscalar_t\u003e\u003c\u003c\u003cblocks, threads\u003e\u003e\u003e(\n            A.data_ptr\u003cscalar_t\u003e(), B.data_ptr\u003cscalar_t\u003e()\n        );\n    }));\n```\nIf you use `A.data_ptr\u003cscalar_t\u003e()` to send the pointer, it will be hard to access the elements in the kernel func.\n\nYou can instead use `PackedAccessor`, which is like\n`torch::PackedTensorAccessor\u003cscalar_t, 2, torch::RestrictPtrTraits, size_t\u003e()` to allow easier access.\n\n### cal_grad in forward\nIn some cases, it is helpful to store by-products for the backward grad calculation. But in pure inference mode, it is not\ngood to do such calculation during the forward pass. It is helpful to pass an indicator to the customized forward pass.\n\nThis indicator should be [`any(input.requires_grad)` and `torch.is_grad_enabled()`] to check\nwhether any input requires_grad and whether grad is enabled (i.e. not inside a no_grad context). 
In the `.cu` kernel, you should implement the\ngrad calculation yourself.\n\n------------------------------------------------------------------------\n## More to do:\n- inference, demo\n- onnx or other implementation\n- deploy and web server\n- online project homepage\n- colab\n- setup.py\n\n------------------------------------------------------------------------\n## Acknowledgement\nThis project template refers to:\n- https://github.com/xinntao/ProjectTemplate-Python\n- https://github.com/ventusff/neurecon#volume-rendering--3d-implicit-surface\n- https://github.com/kwea123/pytorch_cppcuda_practice\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftencentarc%2Fcommon_trainer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftencentarc%2Fcommon_trainer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftencentarc%2Fcommon_trainer/lists"}