{"id":15630738,"url":"https://github.com/galeone/dynamic-training-bench","last_synced_at":"2025-04-06T10:10:31.832Z","repository":{"id":12343137,"uuid":"71651079","full_name":"galeone/dynamic-training-bench","owner":"galeone","description":"Simplify the training and tuning of Tensorflow models","archived":false,"fork":false,"pushed_at":"2022-03-12T08:18:53.000Z","size":629,"stargazers_count":214,"open_issues_count":1,"forks_count":30,"subscribers_count":15,"default_branch":"master","last_synced_at":"2025-03-30T09:09:10.334Z","etag":null,"topics":["convolutional-neural-networks","dataset","models","neural-network","tensorboard","tensorflow","training"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/galeone.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-10-22T16:25:31.000Z","updated_at":"2025-03-25T20:37:26.000Z","dependencies_parsed_at":"2022-08-07T06:16:56.377Z","dependency_job_id":null,"html_url":"https://github.com/galeone/dynamic-training-bench","commit_stats":null,"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/galeone%2Fdynamic-training-bench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/galeone%2Fdynamic-training-bench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/galeone%2Fdynamic-training-bench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/galeone%2Fdynamic-training-bench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
owners/galeone","download_url":"https://codeload.github.com/galeone/dynamic-training-bench/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247464220,"owners_count":20942970,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["convolutional-neural-networks","dataset","models","neural-network","tensorboard","tensorflow","training"],"created_at":"2024-10-03T10:36:05.313Z","updated_at":"2025-04-06T10:10:31.809Z","avatar_url":"https://github.com/galeone.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Dynamic Training Bench: DyTB\n===========================\n\nStop wasting your time rewriting the training, evaluation \u0026 visualization procedures for your ML model: let DyTB do the work for you!\n\nDyTB is compatible with: **Tensorflow 1.x \u0026 Python \u003e= 3.5**\n\n# Features\n\n1. Dramatically easy to use\n2. Object Oriented: models and inputs are interfaces to implement\n3. End-to-end training of ML models\n4. Fine tuning\n5. Transfer learning\n6. Easy model comparison\n7. Metrics visualization\n8. Easy statistics\n9. Hyperparameters oriented: change hyperparameters to see how they affect the performance\n10. Automatic checkpoint save of the best model with respect to a metric\n11. Usable as a library or a CLI tool\n\n---\n\n# Getting started: python library\n\n**TL;DR**: `pip install dytb` + [python-notebook with a complete example](examples/VGG-Cifar10-100-TransferLearning-FineTuning.ipynb).\n\nThe standard workflow is extremely simple:\n\n1. 
Define or pick a predefined model\n2. Define or pick a predefined dataset\n3. Train!\n\n## Define or pick a predefined Model\n\nDyTB comes with some common ML models, like LeNet \u0026 VGG. If you want to test how these models perform when trained on different datasets and/or with different hyperparameters, just use them.\n\nIf you want to define your own model instead, just implement one of the [available interfaces](dytb/models/interfaces.py), depending on the kind of ML model you want to build. The available interfaces are:\n\n1. Classifier\n2. Autoencoder\n3. Regressor\n4. Detector\n\nIt's recommended, but not strictly required, to use the wrappers built around the Tensorflow methods to define the model: these wrappers create logs and visualizations for you.\nThe wrappers are documented and intuitive: you can find them in the [dytb/models/utils.py](dytb/models/utils.py) file.\n\nDyTB provides different models that can be used as they are or as examples of correct implementations.\nEvery model in the [dytb/models/predefined/](dytb/models/predefined/) folder is a valid example.\n\nIn general, the model definition is just the implementation of 2 methods:\n\n1. `get`, which implements the model itself\n2. `loss`, which implements the loss function\n\nEach method must return every value that its documentation requires, even if your model doesn't use it.\n\nE.g.: even if you never use an `is_training_` boolean placeholder in your model definition, define it and return it anyway.\n\n## Define or pick a predefined Input\n\nDyTB comes with some common ML benchmarks, like Cifar10, Cifar100 \u0026 MNIST. You can use them to train your model and measure its performance, or you can define your own input source by implementing the Input interface that you can find here:\n\n1. [dytb/inputs/interfaces.py](dytb/inputs/interfaces.py)\n\nThe interface implementation should follow these points:\n\n1. 
Implement the `__init__` method: this method must download the dataset and apply the desired transformations to its elements. There are some utility functions defined in the [`inputs/utils.py`](inputs/utils.py) file that can be used.\nThis method is executed as the first operation when the dataset object is created, so caching the results is recommended.\n2. Implement the `num_classes` method: this method must return the number of classes of the dataset. If your dataset has no labels, just return 0.\n3. Implement the `num_examples(input_type)` method: this method accepts an `InputType` enumeration, defined in `inputs/utils.py`.\nThis enumeration has 3 possible values: `InputType.train`, `InputType.validation`, `InputType.test`. The method must return the number of examples for each of these values.\n4. Implement the `inputs` method. The `inputs` method is a general method that should return the real values of the dataset, related to the `InputType` passed, without any augmentation. 
The augmentations are defined at training time.\n\n**Note**: `inputs` must return a Tensorflow queue of `value, label` pairs.\n\nThe best way to understand how to build the input source is to look at the examples in the [dytb/inputs/predefined/](dytb/inputs/predefined/) folder.\nA small, working example worth looking at is Cifar10: [dytb/inputs/predefined/Cifar10.py](dytb/inputs/predefined/Cifar10.py).\n\n## Train\n\nTraining while measuring predefined metrics is extremely easy. Let's see a complete example:\n\n```python\nimport pprint\nimport tensorflow as tf\nfrom dytb.inputs.predefined import Cifar10\nfrom dytb.train import train\nfrom dytb.models.predefined.VGG import VGG\n\n# Instantiate the model\nvgg = VGG()\n\n# Instantiate the CIFAR-10 input source\ncifar10 = Cifar10.Cifar10()\n\n# 1: Train VGG on Cifar10 for 50 epochs\n# Place the train process on GPU:0\ndevice = '/gpu:0'\nwith tf.device(device):\n    info = train(\n        model=vgg,\n        dataset=cifar10,\n        hyperparameters={\n            \"epochs\": 50,\n            \"batch_size\": 50,\n            \"regularizations\": {\n                \"l2\": 1e-5,\n                \"augmentation\": {\n                    \"name\": \"FlipLR\",\n                    \"fn\": tf.image.random_flip_left_right,\n                    # factor is the estimated amount of augmentation\n                    # that \"fn\" introduces.\n                    # In this case, \"fn\" doubles the training set size\n                    # Thus, an epoch is now seen as the original\n                    # training set size * 2\n                    \"factor\": 2,\n                }\n            },\n            \"gd\": {\n                \"optimizer\": tf.train.AdamOptimizer,\n                \"args\": {\n                    \"learning_rate\": 1e-3,\n                    \"beta1\": 0.9,\n                    \"beta2\": 0.99,\n                    \"epsilon\": 1e-8\n                }\n            }\n        
})\n```\n\nFinish!\n\nAt the end of the training process `info` will contain some useful information, let's (pretty) print them:\n\n```python\npprint.pprint(info, indent=4)\n```\n\n```\n{   'args': {   'batch_size': 50,\n                'checkpoint_path': '',\n                'comment': '',\n                'dataset': \u003cdytb.inputs.Cifar10.Cifar10 object at 0x7f896c19a1d0\u003e,\n                'epochs': 2,\n                'exclude_scopes': '',\n                'force_restart': False,\n                'gd': {   'args': {   'beta1': 0.9,\n                                      'beta2': 0.99,\n                                      'epsilon': 1e-08,\n                                      'learning_rate': 0.001},\n                          'optimizer': \u003cclass 'tensorflow.python.training.adam.AdamOptimizer'\u003e},\n                'lr_decay': {'enabled': False, 'epochs': 25, 'factor': 0.1},\n                'model': \u003cdytb.models.VGG.VGG object at 0x7f896c19a128\u003e,\n                'regularizations': {   'augmentation': \u003cfunction random_flip_left_right at 0x7f89109cb0d0\u003e,\n                                       'l2': 1e-05},\n                'trainable_scopes': ''},\n    'paths': {   'best': '/mnt/data/pgaleone/dytb_work/examples/log/VGG/CIFAR-10_Adam_l2=1e-05_fliplr/best',\n                 'current': '/mnt/data/pgaleone/dytb_work/examples',\n                 'log': '/mnt/data/pgaleone/dytb_work/examples/log/VGG/CIFAR-10_Adam_l2=1e-05_fliplr'},\n    'stats': {   'dataset': 'CIFAR-10',\n                 'model': 'VGG',\n                 'test': 0.55899998381733895,\n                 'train': 0.5740799830555916,\n                 'validation': 0.55899998381733895},\n    'steps': {'decay': 25000, 'epoch': 1000, 'log': 100, 'max': 2000}}\n```\n\n---\n\nHere you can see a complete example of training, continue an interrupted training, fine tuning \u0026 transfer learning: [python-notebook with a complete 
example](examples/VGG-Cifar10-100-TransferLearning-FineTuning.ipynb).\n\n# Getting started: CLI\n\nThe only prerequisite is to install DyTB via pip.\n\n```\npip install --upgrade dytb\n```\n\nDyTB adds two executables to your $PATH: `dytb_train` and `dytb_evaluate`.\n\nThe CLI workflow is the same as the library one, with 2 differences:\n\n## 1. Interface implementations\n\nIf you define your own input source / model, it must be placed into the appropriate folder:\n\n- For models: [scripts/models/](scripts/models)\n- For inputs: [scripts/inputs/](scripts/inputs)\n\n**Rule**: the class name must be equal to the file name. E.g.: `class LeNet` in the `LeNet.py` file.\n\nIf you want to use a predefined input/model you don't need to do anything.\n\n## 2. Train via CLI\n\nEvery single hyperparameter (except for the augmentations) definable in the Python version can be passed as a CLI argument to the `dytb_train` script.\n\nA single model can be trained using various hyper-parameters, such as the learning rate, the weight decay penalty applied, the exponential learning rate decay, the optimizer and its parameters, ...\n\nDyTB allows training a model with different hyper-parameters and automatically logs every training process, allowing the developer to compare them visually.\n\nMoreover, if a training process is interrupted, DyTB automatically resumes it from the last saved training step.\n\n## Example\n\n```\n# LeNet: no regularization\ndytb_train --model LeNet --dataset MNIST\n\n# LeNet: L2 regularization with value 1e-5\ndytb_train --model LeNet --dataset MNIST --l2_penalty 1e-5\n\n# LeNet: L2 regularization with value 1e-2\ndytb_train --model LeNet --dataset MNIST --l2_penalty 1e-2\n\n# LeNet: L2 regularization with value 1e-2, initial learning rate of 1e-4\n# The default optimization algorithm is MomentumOptimizer, so we can change the momentum value\n# The optimizer parameters are passed as a json string\ndytb_train --model LeNet --dataset MNIST --l2_penalty 1e-2 \\\n   
 --optimizer_args '{\"learning_rate\": 1e-4, \"momentum\": 0.5}'\n\n# If, for some reason, we interrupt this training process, rerunning the same command\n# will resume the training process from the last saved training step.\n# If we want to delete every saved model and log, we can pass the --restart flag\ndytb_train --model LeNet --dataset MNIST --l2_penalty 1e-2 \\\n    --optimizer_args '{\"learning_rate\": 1e-4, \"momentum\": 0.5}' --restart\n```\n\nThe commands above will create 4 different models. Every model has its own log folder, and all of them share the same root folder.\n\nIn particular, in the `log` folder there'll be a `LeNet` folder and, within this folder, 4 more folders, each one with a name that contains the hyper-parameters previously defined.\nThis allows visualizing the 4 models in the same graphs, using Tensorboard, and easily understanding which one performs better.\n\nNo matter what interface has been implemented, the script to run is **always** `train.py`: it's capable of identifying the type of the model and using the right training procedure.\n\nA complete list of the available tunable parameters can be obtained by running `dytb_train --help`.\n\nFor reference, a part of the output of `dytb_train --help`:\n\n```\nusage: train.py [-h] --model --dataset\n  -h, --help            show this help message and exit\n  --model {\u003clist of models in the models/ folder, without the .py suffix\u003e}\n  --dataset {\u003clist of inputs in the inputs/folder, without the .py suffix}\n  --batch_size BATCH_SIZE\n  --restart             restart the training process DELETING the old\n                        checkpoint files\n  --lr_decay            enable the learning rate decay\n  --lr_decay_epochs LR_DECAY_EPOCHS\n                        decay the learning rate every lr_decay_epochs epochs\n  --lr_decay_factor LR_DECAY_FACTOR\n                        decay of lr_decay_factor the initial learning rate\n                        after 
lr_decay_epochs epochs\n  --l2_penalty L2_PENALTY\n                        L2 penalty term to apply ad the trained parameters\n  --optimizer {\u003clist of tensorflow available optimizers\u003e}\n                        the optimizer to use\n  --optimizer_args OPTIMIZER_ARGS\n                        the optimizer parameters\n  --epochs EPOCHS       number of epochs to train the model\n  --train_device TRAIN_DEVICE\n                        the device on which place the the model during the\n                        trining phase\n  --comment COMMENT     comment string to preprend to the model name\n  --exclude_scopes EXCLUDE_SCOPES\n                        comma separated list of scopes of variables to exclude\n                        from the checkpoint restoring.\n  --checkpoint_path CHECKPOINT_PATH\n                        the path to a checkpoint from which load the model\n```\n\n# Best models \u0026 results\n\nWhether you use the CLI or the library, DyTB saves the \"best\" model, with respect to the default metric used for the trained model, in the log folder of every model.\n\nFor example, for the `LeNet` model created with the first command in the previous script, the following directory structure is created:\n\n```\nlog/LeNet/\n|---MNIST_Momentum\n|-----best\n|-----train\n|-----validation\n```\n\nThe `train` and `validation` folders contain the logs that Tensorboard uses to display train and validation metrics in the same graphs.\n\nThe `best` folder contains one single checkpoint file that is the model with the highest quality obtained during the training phase.\n\nThis model is used at the end of the training process to evaluate the model performance.\n\nMoreover, it's possible to run the evaluation of any checkpoint file (in the `log/\u003cMODEL\u003e` folder or in the `log/\u003cMODEL\u003e/best` folder) using the `dytb_evaluate` script.\n\nFor example:\n\n```\n# Evaluate the validation accuracy\ndytb_evaluate --model LeNet \\\n              
--dataset MNIST \\\n              --checkpoint_path log/LeNet/MNIST_Momentum/\n# outputs something like: validation accuracy = 0.993\n\n# Evaluate the test accuracy\ndytb_evaluate --model LeNet \\\n              --dataset MNIST \\\n              --checkpoint_path log/LeNet/MNIST_Momentum/ \\\n              --test\n# outputs something like: test accuracy = 0.993\n```\n\n# Fine Tuning \u0026 network surgery\n\nA trained model can be used to build a new model exploiting the learned parameters: this helps to speed up the learning process of new models.\n\nDyTB lets you restore a model from its checkpoint file, remove the layers that aren't needed by the new model, and add new layers to train.\n\nFor example, a VGG model trained on the Cifar10 dataset can be used to train a VGG model on the Cifar100 dataset.\n\nThe examples are for the CLI version, **but the same parameters can be used in the Python library**.\n\n```\ndytb_train \\\n    --model VGG \\\n    --dataset Cifar100 \\\n    --checkpoint_path log/VGG/Cifar10_Momentum/best/ \\\n    --exclude_scopes softmax_linear\n```\n\nThis training process loads the \"best\" VGG model weights trained on Cifar10 from the `checkpoint_path`, then the weights are used to initialize the VGG model (so the VGG model must be compatible, at least for the non-excluded scopes, with the loaded model) except for the layers under the `exclude_scopes` list.\n\nThen the `softmax_linear` layers are replaced with the ones defined in the `VGG` model, which, when trained on Cifar100, adapt themselves to output 100 classes instead of 10.\n\nSo the above command starts a new training from the pre-trained model and trains the new output layer (with 100 outputs) that the VGG model defines, refining every other imported weight.\n\nIf you don't want to train the imported weights, you have to point out which scopes to train, using `trainable_scopes`:\n\n```\ndytb_train \\\n    --model VGG \\\n    --dataset Cifar100 \\\n    --checkpoint_path 
log/VGG/Cifar10_Momentum/best/ \\\n    --exclude_scopes softmax_linear \\\n    --trainable_scopes softmax_linear\n```\nWith the above command you're instructing DyTB to exclude the `softmax_linear` scope when restoring the checkpoint file and to train only the scope named `softmax_linear` in the newly defined model.\n\n# Data visualization\n\nBy running Tensorboard\n\n```\ntensorboard --logdir log/\u003cMODEL\u003e\n```\n\nit's possible to visualize the trend of the loss, the validation measures, the input values and so on.\nTo see some of the produced output, have a look at the implementation of the Convolutional Autoencoder, described here: https://pgaleone.eu/neural-networks/deep-learning/2016/12/13/convolutional-autoencoders-in-tensorflow/#visualization\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgaleone%2Fdynamic-training-bench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgaleone%2Fdynamic-training-bench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgaleone%2Fdynamic-training-bench/lists"}