{"id":35126037,"url":"https://github.com/advancedphotonsource/generic_trainer","last_synced_at":"2026-05-17T23:34:35.044Z","repository":{"id":225190936,"uuid":"765313344","full_name":"AdvancedPhotonSource/generic_trainer","owner":"AdvancedPhotonSource","description":"A model-agnostic and customizable PyTorch trainer with the support of multi-node training on HPCs. ","archived":false,"fork":false,"pushed_at":"2025-07-22T17:19:25.000Z","size":344,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-07-22T19:14:19.403Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AdvancedPhotonSource.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-29T17:21:18.000Z","updated_at":"2025-07-22T17:19:28.000Z","dependencies_parsed_at":"2024-04-12T18:24:25.882Z","dependency_job_id":"7649a628-5a09-419a-982d-4654bb1ce15e","html_url":"https://github.com/AdvancedPhotonSource/generic_trainer","commit_stats":null,"previous_names":["mdw771/generic_trainer","advancedphotonsource/generic_trainer"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/AdvancedPhotonSource/generic_trainer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AdvancedPhotonSource%2Fgeneric_trainer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AdvancedPhotonSource%2Fgeneric_trainer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/repositories/AdvancedPhotonSource%2Fgeneric_trainer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AdvancedPhotonSource%2Fgeneric_trainer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AdvancedPhotonSource","download_url":"https://codeload.github.com/AdvancedPhotonSource/generic_trainer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AdvancedPhotonSource%2Fgeneric_trainer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33159104,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-17T22:39:12.733Z","status":"ssl_error","status_checked_at":"2026-05-17T22:39:10.741Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-12-28T02:40:33.603Z","updated_at":"2026-05-17T23:34:35.040Z","avatar_url":"https://github.com/AdvancedPhotonSource.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# A generic PyTorch trainer with multi-node support\n\nThis repository contains a generic PyTorch trainer that can be used for various\nprojects. 
It is designed to be model-agnostic and (maximally) task-agnostic.\nUsers can customize the following through the configuration interface:\n- model class and parameters\n- model checkpoint\n- dataset object\n- expected predictions\n- loss functions (allows different loss functions for different predictions and Tikhonov regularizers)\n- parallelization (single-node multi-GPU, multi-node)\n- optimizer class and parameters\n- training parameters (learning rate, batch size per process, ...)\n\nIf more freedom is needed, one can also create a subclass of the trainer\nand override certain methods.\n\n## Installation\n\nTo install `generic_trainer` as a Python package, clone the GitHub\nrepository, then run\n```\npip install -e .\n```\n\nThis command should automatically install the dependencies, which are\nspecified in `pyproject.toml`. If you prefer not to install the\ndependencies, run\n```\npip install --no-deps -e .\n```\nThe `-e` flag makes the installation editable, *i.e.*, any\nmodifications made in the source code will be reflected when you\n`import generic_trainer` in Python without reinstalling the package. \n\n## Usage \n\nIn general, model training using `generic_trainer` involves the following steps:\n\n1. Create a model that is a subclass of `torch.nn.Module`.\n2. Create a model configuration class that is a subclass of `ModelParameters` and contains the arguments of the constructor of the model class.\n3. Create a dataset object of a subclass of `torch.utils.data.Dataset` that implements the essential methods `__len__` and `__getitem__`.\n4. Instantiate a `TrainingConfig` object, and plug in the class handles or objects created above, along with other configurations and parameters.\n5. 
Run the following:\n```\ntrainer.build()\ntrainer.run_training()\n```\nExample scripts are available in `examples`, and users are strongly encouraged to refer to them.\n\n### Config objects\n\nMost configurations and parameters are passed to the trainer through the `OptionContainer`\nobjects defined in `configs.py`. `TrainingConfig` is the main config object for training. \nSome of its fields accept other config objects. For example, model parameters are provided by\npassing an object of a subclass of `ModelParameters` to `TrainingConfig.model_params`; \nsimilarly, parallelization options are provided by passing an object of `ParallelizationConfig` to\n`TrainingConfig.parallelization_params`.\n\n### Model\n\nThe model definition should be given to the trainer through `TrainingConfig.model_class`\nand `TrainingConfig.model_params`. The former should be the handle of a subclass of\n`torch.nn.Module`, and the latter should be an object of a subclass of `ModelParameters`.\nThe fields of the model parameter config object must match the arguments in the `__init__`\nmethod of the model class.\n\n### Dataset\n\nData are passed to the trainer through `TrainingConfig.dataset`. This field expects an\nobject of a subclass of `torch.utils.data.Dataset`. Please refer to\n[PyTorch's official documentation](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html)\nfor a guide on creating custom dataset classes for your own data.\n\nThe provided dataset is assumed to contain both training and validation data. Inside the\ntrainer, it will be randomly split into a training dataset and a validation dataset. To\ncontrol the ratio of train-validation separation, use `TrainingConfig.validation_ratio`.\n\n### Optimizer\n\nOne can specify an optimizer by passing the class handle of the desired optimizer to\n`TrainingConfig.optimizer`, and its arguments (if any) to `TrainingConfig.optimizer_params`\nas a `dict`. 
The learning rate should NOT be included in the optimizer parameter object, as\nit is set elsewhere. \n\nFor example, the following code instructs the trainer to use `AdamW` with `weight_decay=0.01`:\n```\nconfigs = TrainingConfig(\n    ...\n    optimizer=torch.optim.AdamW,\n    optimizer_params={'weight_decay': 0.01},\n    ...\n)\n```\n\n### Prediction names\n\nThe trainer needs to know the names and order of the model's predictions. These are set\nthrough `TrainingConfig.pred_names` as a list or tuple of strings, with each string being\nthe name of a prediction. The names can be anything as long as they suggest the nature\nof the prediction. The length of the list or tuple matters more, as it tells the trainer\nhow many model predictions to expect. \n\n### Loss functions\n\nThe loss function can be customized through `TrainingConfig.loss_function`. This field takes\neither a single Callable or a list/tuple of Callables with the signature \n`loss_func(preds, labels)`. When using loss functions from PyTorch,\nthey should be instantiated objects instead of the class handles (*e.g.*, `nn.CrossEntropyLoss()`\ninstead of `nn.CrossEntropyLoss`, because the Callable that has the required signature is\nthe `forward` method of the object). \n\nCurrently, additional arguments to the loss function\nare not allowed, but one can create a loss function class subclassing `nn.Module`, and\nset extra arguments through its constructor. For example, to apply a weight to a particular\nloss function, one can create the loss as\n```\nimport torch\nimport torch.nn as nn\n\nclass MyLoss(nn.Module):\n    def __init__(self, weight=0.01):\n        super().__init__()  # initialize nn.Module before setting attributes\n        self.weight = weight\n\n    def forward(self, preds, labels):\n        # weighted mean squared error\n        return self.weight * torch.mean((preds - labels) ** 2)\n```\n\nWhen a list or tuple of Callables is passed to `loss_function`, the trainer applies the loss functions\nto the predictions defined in `TrainingConfig.pred_names`, one per prediction and in the same order. 
If there\nare more loss functions than entries in `pred_names`, the rest are treated as regularizers\nand they should have a signature of `loss_func(pred1, pred2, ...)`. When encountering these\nloss functions, the trainer first tries calling them with keyword arguments \n`loss_func(pred_name_1=pred_1, pred_name_2=pred_2, ...)`. This is done in case the function's arguments\ncome in a different order from the predictions. If the argument names do not match, it then\npasses the predictions as positional arguments like `loss_func(pred_1, pred_2, ...)`.\n\n### Parallelization\n\nThe trainer should work with either single-node (default) or multi-node parallelization.\nTo run multi-node training, one should create a `ParallelizationConfig` object, set\n`parallelization_type` to `'multi_node'`, then pass the config object to \n`TrainingConfig.parallelization_params`.\n\nWhen `parallelization_type` is `single_node`, the trainer wraps the model \nobject with `torch.nn.DataParallel`,\nallowing it to use all GPUs available on a single machine. \n\nIf `parallelization_type` is set to `multi_node`, the trainer instead wraps the model\nobject with `torch.nn.parallel.DistributedDataParallel` (DDP), which should allow it to\nwork with multiple processes that are potentially distributed over multiple nodes on an HPC.\n\nIn order to run multi-node training on an HPC like ALCF's Polaris, one should\nlaunch multiple processes when submitting the job to the HPC's job scheduler. \nPyTorch DDP's official documentation says jobs should be launched using `torchrun` in this case,\nbut it was found that in some cases, \njobs launched with `torchrun` would run into GPU visibility-related exceptions\nwhen using the NCCL backend. 
Instead, we figured out that the job may also be launched with\n`aprun`, the standard multi-processing run command with Cobalt (Theta) or PBS (Polaris) scheduler.\nSome environment variables need to be set in the Python script, as shown in the \n[ALCF training material repository](https://github.com/argonne-lcf/ai-science-training-series/tree/13bd951ca01dd432f4939c309834252de2a493e9/06_distributedTraining/DDP). \nAn example of multi-node training on Polaris\nis available in `examples/hpc/ddp_training_with_dummy_data.py`. ","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadvancedphotonsource%2Fgeneric_trainer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fadvancedphotonsource%2Fgeneric_trainer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadvancedphotonsource%2Fgeneric_trainer/lists"}