{"id":13935572,"url":"https://github.com/coqui-ai/Trainer","last_synced_at":"2025-07-19T20:33:17.137Z","repository":{"id":37911630,"uuid":"415948435","full_name":"coqui-ai/Trainer","owner":"coqui-ai","description":"🐸  - A general purpose model trainer, as flexible as it gets","archived":false,"fork":false,"pushed_at":"2024-03-07T12:54:52.000Z","size":269,"stargazers_count":193,"open_issues_count":19,"forks_count":112,"subscribers_count":12,"default_branch":"main","last_synced_at":"2024-10-22T19:42:15.572Z","etag":null,"topics":["ai","data-science","deep-learning","machine-learning","pytorch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/coqui-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":null,"code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-10-11T13:54:26.000Z","updated_at":"2024-10-22T09:26:47.000Z","dependencies_parsed_at":"2024-06-18T15:39:08.952Z","dependency_job_id":null,"html_url":"https://github.com/coqui-ai/Trainer","commit_stats":{"total_commits":169,"total_committers":11,"mean_commits":"15.363636363636363","dds":"0.20118343195266275","last_synced_commit":"82db96743c8ceeac93632d85bcf8a9b99b5404bf"},"previous_names":[],"tags_count":32,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/coqui-ai%2FTrainer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/coqui-ai%2FTrainer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/coqui-ai%2FTrainer/releases","manifests_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/coqui-ai%2FTrainer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/coqui-ai","download_url":"https://codeload.github.com/coqui-ai/Trainer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":226677109,"owners_count":17666007,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","data-science","deep-learning","machine-learning","pytorch"],"created_at":"2024-08-07T23:01:53.725Z","updated_at":"2024-11-27T03:30:49.739Z","avatar_url":"https://github.com/coqui-ai.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\u003cimg src=\"https://user-images.githubusercontent.com/1402048/151947958-0bcadf38-3a82-4b4e-96b4-a38d3721d737.png\" align=\"right\" height=\"255px\" /\u003e\u003c/p\u003e\n\n# 👟 Trainer\nAn opinionated general purpose model trainer on PyTorch with a simple code base.\n\n## Installation\n\nFrom GitHub:\n\n```console\ngit clone https://github.com/coqui-ai/Trainer\ncd Trainer\nmake install\n```\n\nFrom PyPI:\n\n```console\npip install trainer\n```\n\nPrefer installing from GitHub, as it is more stable.\n\n## Implementing a model\nSubclass [```TrainerModel()```](trainer/model.py) and override its functions.\n\n\n## Training a model with auto-optimization\nSee the [MNIST example](examples/train_mnist.py).\n\n\n## Training a model with advanced optimization\nWith 👟 you can define the whole optimization cycle as you want, as in the GAN example below. 
It enables more under-the-hood control and flexibility for advanced training loops.\n\nYou only need to call the ```scaled_backward()``` function to handle mixed precision training.\n\n```python\n...\n\ndef optimize(self, batch, trainer):\n    imgs, _ = batch\n\n    # sample noise\n    z = torch.randn(imgs.shape[0], 100)\n    z = z.type_as(imgs)\n\n    # train discriminator\n    imgs_gen = self.generator(z)\n    logits = self.discriminator(imgs_gen.detach())\n    fake = torch.zeros(imgs.size(0), 1)\n    fake = fake.type_as(imgs)\n    loss_fake = trainer.criterion(logits, fake)\n\n    valid = torch.ones(imgs.size(0), 1)\n    valid = valid.type_as(imgs)\n    logits = self.discriminator(imgs)\n    loss_real = trainer.criterion(logits, valid)\n    loss_disc = (loss_real + loss_fake) / 2\n\n    # step discriminator\n    _, _ = self.scaled_backward(loss_disc, None, trainer, trainer.optimizer[0])\n\n    # update weights only once every grad_accum_steps steps\n    if trainer.total_steps_done % trainer.grad_accum_steps == 0:\n        trainer.optimizer[0].step()\n        trainer.optimizer[0].zero_grad()\n\n    # train generator\n    imgs_gen = self.generator(z)\n\n    valid = torch.ones(imgs.size(0), 1)\n    valid = valid.type_as(imgs)\n\n    logits = self.discriminator(imgs_gen)\n    loss_gen = trainer.criterion(logits, valid)\n\n    # step generator\n    _, _ = self.scaled_backward(loss_gen, None, trainer, trainer.optimizer[1])\n    if trainer.total_steps_done % trainer.grad_accum_steps == 0:\n        trainer.optimizer[1].step()\n        trainer.optimizer[1].zero_grad()\n    return {\"model_outputs\": logits}, {\"loss_gen\": loss_gen, \"loss_disc\": loss_disc}\n\n...\n```\n\nSee the [GAN training example](examples/train_simple_gan.py) with gradient accumulation.\n\n\n## Training with Batch Size Finder\nSee the test script [here](tests/test_train_batch_size_finder.py) for training with the batch size finder.\n\n\nThe batch size finder starts at a default batch size (2048, but it can also be user-defined) and searches for the largest 
batch size that fits on your hardware. Expect it to run several trainings until it finds one. To use it, call ```trainer.fit_with_largest_batch_size(starting_batch_size=2048)``` instead of ```trainer.fit()```, where ```starting_batch_size``` is the batch size at which the search starts. This is useful when you want to use as much GPU memory as possible.\n\n## Training with DDP\n\n```console\n$ python -m trainer.distribute --script path/to/your/train.py --gpus \"0,1\"\n```\n\nWe don't use ```.spawn()``` to initiate multi-GPU training because it imposes certain limitations:\n\n- Everything must be picklable.\n- ```.spawn()``` trains the model in subprocesses and the model in the main process is not updated.\n- DataLoader with N processes gets really slow when N is large.\n\n## Training with [Accelerate](https://huggingface.co/docs/accelerate/index)\n\nSetting `use_accelerate` in `TrainingArgs` to `True` will enable training with Accelerate.\n\nYou can also use it for multi-GPU or distributed training.\n\n```console\nCUDA_VISIBLE_DEVICES=\"0,1,2\" accelerate launch --multi_gpu --num_processes 3 train_recipe_autoregressive_prompt.py\n```\n\nSee the [Accelerate docs](https://huggingface.co/docs/accelerate/basic_tutorials/launch).\n\n## Adding a callback\n👟 supports callbacks to customize your runs. 
You can either set callbacks in your model implementations or give them\nexplicitly to the Trainer.\n\nPlease check `trainer.utils.callbacks` to see available callbacks.\n\nHere is how you provide an explicit callback to a 👟Trainer object, e.g. for weight reinitialization.\n\n```python\ndef my_callback(trainer):\n    print(\" \u003e My callback was called.\")\n\ntrainer = Trainer(..., callbacks={\"on_init_end\": my_callback})\ntrainer.fit()\n```\n\n## Profiling example\n\n- Create the torch profiler as you like and pass it to the trainer.\n    ```python\n    import torch\n    profiler = torch.profiler.profile(\n        activities=[\n            torch.profiler.ProfilerActivity.CPU,\n            torch.profiler.ProfilerActivity.CUDA,\n        ],\n        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),\n        on_trace_ready=torch.profiler.tensorboard_trace_handler(\"./profiler/\"),\n        record_shapes=True,\n        profile_memory=True,\n        with_stack=True,\n    )\n    prof = trainer.profile_fit(profiler, epochs=1, small_run=64)\n    ```\n- Run TensorBoard.\n    ```console\n    tensorboard --logdir=\"./profiler/\"\n    ```\n\n## Supported Experiment Loggers\n- [TensorBoard](https://www.tensorflow.org/tensorboard) - actively maintained\n- [ClearML](https://clear.ml/) - actively maintained\n- [MLFlow](https://mlflow.org/)\n- [Aim](https://aimstack.io/)\n- [WandB](https://wandb.ai/)\n\nTo add a new logger, you must subclass [BaseDashboardLogger](trainer/logging/base_dash_logger.py) and override its functions.\n\n## Anonymized Telemetry\nWe constantly seek to improve 🐸 for the community. 
To understand the community's needs better and address them accordingly, we collect stripped-down anonymized usage stats when you run the trainer.\n\nIf you prefer not to share them, you can opt out by setting the environment variable `TRAINER_TELEMETRY=0`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcoqui-ai%2FTrainer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcoqui-ai%2FTrainer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcoqui-ai%2FTrainer/lists"}