{"id":21389484,"url":"https://github.com/feifeibear/dist-tensorflow","last_synced_at":"2025-03-16T12:46:56.799Z","repository":{"id":85016969,"uuid":"114935858","full_name":"feifeibear/dist-tensorflow","owner":"feifeibear","description":"Tensorflow test for supercomputer","archived":false,"fork":false,"pushed_at":"2017-12-20T22:13:20.000Z","size":1419,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-01-23T00:41:16.707Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/feifeibear.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-12-20T22:11:10.000Z","updated_at":"2017-12-20T22:13:42.000Z","dependencies_parsed_at":null,"dependency_job_id":"f977519a-25ae-41c8-ab81-11c5a2f9881d","html_url":"https://github.com/feifeibear/dist-tensorflow","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feifeibear%2Fdist-tensorflow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feifeibear%2Fdist-tensorflow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feifeibear%2Fdist-tensorflow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feifeibear%2Fdist-tensorflow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/feifeibear","download_url":"https
://codeload.github.com/feifeibear/dist-tensorflow/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243871652,"owners_count":20361378,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-22T12:26:42.134Z","updated_at":"2025-03-16T12:46:56.766Z","avatar_url":"https://github.com/feifeibear.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# Introduction\nThis repo is the TensorFlow code for our oral paper in NIPS 2017 ([TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning](https://arxiv.org/abs/1705.07878)).\n\nFor the code of [NIPS 2016](http://papers.nips.cc/paper/6504-learning-structured-sparsity-in-deep-neural-networks.pdf) and [ICCV 2017](https://arxiv.org/abs/1703.09746) to accelerate the inference of DNNs, please go to [here](https://github.com/wenwei202/caffe).\n\nThis is a modified copy of TensorFlow [inception](https://github.com/tensorflow/models/tree/master/inception) (with original contributions kept). \n\n**In this workspace, `inception` refers to all types of neural networks in a general way.**\n\n**Note that there is name abuse because of history reasons. All \"bingrad/binary gradient/binarizing\" in code comments, help info and filenames essentially refers to \"terngrad/ternary gradient/ternarizing\". Will update them, but the code is correct and is exactly for terngrad**\n\n*More tutorials will be updated. 
Feel free to open an issue if you have any questions.*\n\n# Dependencies\nTested stable dependencies:\n* python 2.7 (Anaconda)\n* Tensorflow v1.0.0 and v1.3.0\n* cudnn 5.1.5\n* bazel release 0.4.4\n\nTesting with python 3.6.1 (Anaconda) and Tensorflow 1.1.0 (installed from python wheel) is pending. Known issues (mainly due to the move to python 3):\n* use `pickle` instead of `cPickle` python package\n* use `range` instead of `xrange`\n* use `dict.items` instead of `dict.iteritems` \n* `TypeError: 'RGB' has type str, but expected one of: bytes`: use `b'RGB'` instead of `'RGB'`\n* ...\n\n\n# Build all\n```\ncd ${TERNGRAD_ROOT}/terngrad\n./build_all.sh\n```\n# Dataset preparation\n## Download and generate mnist TFRecord\n\n```\ncd ${TERNGRAD_ROOT}/slim\n# Generate train-mnist.tfrecord and test-mnist.tfrecord\nexport DATA_PATH=\"${HOME}/dataset/mnist-data/\"\npython download_and_convert_data.py --dataset_name mnist --dataset_dir ${DATA_PATH}\n```\n\n## Download and generate cifar-10 TFRecord\n\n```\ncd ${TERNGRAD_ROOT}/slim\n# Generate train-cifar10.tfrecord and test-cifar10.tfrecord\nexport DATA_PATH=\"${HOME}/dataset/cifar10-data/\" # the directory of database\npython download_and_convert_data.py --dataset_name cifar10 --dataset_dir ${DATA_PATH}\n\n# Instead of putting all training examples in one tfrecord file, we can split them by enabling --shard\n# This is useful for distributed training by data parallelism, where we should split data across nodes\n# Generate train-xxxxx-of-xxxxx (1000 shards by default) and test-00000-of-00001 tfrecord shards\nexport DATA_PATH=\"${HOME}/dataset/cifar10-shard-data/\" # the directory of database\npython download_and_convert_data.py \\\n--dataset_name cifar10 \\\n--dataset_dir ${DATA_PATH} \\\n--shard True\n```\n\n## Download and generate ImageNet TFRecord\n\nBefore generating, set `RAW_PIXEL=True` in `${TERNGRAD_ROOT}/terngrad/inception/data/download_and_preprocess_imagenet.sh` to store the raw RGB pixels of images in TFRecord.\n\nStoring 
raw pixels can save JPG decoding time but burden storage read bandwidth. Set `RAW_PIXEL=True` if high-speed external storage (like SSD) is used but decoder like in CPU cannot feed as fast as training (like in multi-GPUs).\n\nWhen `RAW_PIXEL=True`, setting `RESIZE_DIMEN` to a positive value enables image resizing before writing them to TFRecord files.\n\n```\n# location of where to place the ImageNet data\n# If ILSVRC2012_img_train.tar and ILSVRC2012_img_val.tar were downloaded before, \n# copy them in ${DATA_DIR} to save time\nDATA_DIR=/tmp/\n\n# build the preprocessing script.\nbazel build inception/download_and_preprocess_imagenet\n\n# run it\nbazel-bin/inception/download_and_preprocess_imagenet \"${DATA_DIR}\"\n```\n# Examples on multi-gpu mode\n## Training CifarNet by TernGrad with Adam\n```\ncd ${TERNGRAD_ROOT}/terngrad\n./run_multi_gpus_cifar10.sh\n```\n[run_multi_gpus_cifar10.sh](/terngrad/run_multi_gpus_cifar10.sh#L5-L33) is a training script on cifar-10, which\n* creates a subfolder under `$ROOT_WORKSPACE/${DATASET_NAME}_xxx/` to store the training data (`${ROOT_WORKSPACE}/${DATASET_NAME}_training_data/`), evaluating data (`${ROOT_WORKSPACE}/${DATASET_NAME}_eval_data/`) and logs (`${ROOT_WORKSPACE}/${DATASET_NAME}_info/`). The subfolder name or log filename is similar to `cifar10_cifar10_alexnet_24_adam_1_0.0002_2.5_0_0.004_0.9_1_128_2_Tue_Sep_19_15-27-51_EDT_2017`. \n* starts training and \n* starts evaluating.  \nYou can change those environments to play.\n\nUse `--help` to check descriptions for usage of python executables. For example,\n```\nbazel-bin/inception/cifar10_train --help\n```\nSome important configurations related to TernGrad:\n```\n--size_to_binarize SIZE_TO_BINARIZE\n    The min number of parameters in a Variable (weights/biases) to enable ternarizing this Variable. \n    1 means ternarizing all. 
You can use this to exclude some small Variables\n--num_gpus NUM_GPUS   \n    How many GPUs to use.\n--num_nodes NUM_NODES\n    How many virtual nodes to use. One GPU can have multiple nodes \n    This enables emulating N workers in M GPUs, where N \u003e M\n    This is a good feature to verify TernGrad algorithm when multiple GPUs are unavailable\n--grad_bits [32/1]\n    The number of gradient bits. Either 32 or 1. 32 for floating, and 1 for terngrad \n    (I know ternary is not 1 bit. This is just an argument to use either floating or terngrad. \n    We may consider to extend it to more options ranging from terngrad to floating, \n    like 2 bits, 4 bits, 8 bits, etc)\n--clip_factor CLIP_FACTOR\n    The factor of stddev to clip gradients (0.0 means no clipping). \n    This is the value of c in gradient clipping technique.\n    2.5 works well in general\n--quantize_logits [True/False]\n--noquantize_logits                        \n    If quantize the gradients in the last logits layer. \n    (sometimes, a skew distribution of gradients in last layer may affect the effectiveness of terngrad.\n    asymmetric ternary levels may be more effective)\n```\nMore explanations are also covered in [run_multi_gpus_cifar10.sh](/terngrad/run_multi_gpus_cifar10.sh#L5-L33).\n\nMore training bash scripts are in [terngrad](/terngrad), which have similar arguments. \n\n# Examples on distributed-node mode\n## ssh setup\nWe provide a single script to remotely launch all workers and parameter servers.\nTo authorize access to each other, all machines must share the same ssh key. Please follow this [tutorial](https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/) to setup. 
After generating the key, you can simply copy the keys (`~/.ssh/id_rsa` and `~/.ssh/id_rsa.pub`) to all machines.\nAt the first run, you may need to answer `yes` to\n```\nThe authenticity of host '10.236.176.29 (10.236.176.29)' can't be established.\nECDSA key fingerprint is SHA256:jkfjkdslajfklsjaflkjs/jowufuf98e8e8eu9.\nAre you sure you want to continue connecting (yes/no)?\n```\n\nOur bash script uses `ssh` to log in and start worker/ps. If some variables are configured in `.bashrc` and are necessary for training (e.g., the `PATH` of anaconda), you may need to source `~/.bashrc` in `~/.bash_profile` or `~/.profile` by adding \n```\nif [ -f ~/.bashrc ]; then\n  . ~/.bashrc\nfi\n```\nIn some linux distributions (e.g. ubuntu) you may need to comment out the following lines in `~/.bashrc`\n```\n# If not running interactively, don't do anything\ncase $- in\n    *i*) ;;\n      *) return;;\nesac\n```\n\n## A toy example\n[run_dist_cifar10.sh](/terngrad/run_dist_cifar10.sh) is a toy example that launches one parameter server and two workers on `localhost`.\nBefore starting, we must split the cifar-10 dataset into two parts:\n`$HOME/dataset/cifar10-data-shard-500-999` and `$HOME/dataset/cifar10-data-shard-0-499`, from which the workers fetch data in parallel, each training its own model replica.\n\nThe python executable is `bazel-bin/inception/cifar10_distributed_train`, of which most arguments are similar to `bazel-bin/inception/cifar10_train` for multi-gpu mode but with\n```\n--job_name JOB_NAME   One of \"ps\", \"worker\"\n--task_id TASK_ID     Task ID of the worker/replica running the training.\n--ps_hosts PS_HOSTS   Comma-separated list of hostname:port for the\n                      parameter server jobs. e.g.\n                      'machine1:2222,machine2:1111,machine2:2222'\n--worker_hosts WORKER_HOSTS\n                      Comma-separated list of hostname:port for the worker\n                      jobs. e.g. 
'machine1:2222,machine2:1111,machine2:2222'\n```\nFor more details, type `bazel-bin/inception/cifar10_distributed_train --help` or go [here](#backup-inception-in-tensorflow).\n\n## A single script to launch all\nConfig [config_dist.sh](terngrad/config_dist.sh) and run [run_dist.sh](/terngrad/run_dist.sh). \n\nYou only need to configure `config_dist.sh` (including workers, ps, gpu devices and dataset paths), and write a `WORKER_SCRIPT` to specify how to start a worker. [run_single_worker_cifarnet.sh](/terngrad/run_single_worker_cifarnet.sh) and [run_single_worker_alexnet.sh](/terngrad/run_single_worker_alexnet.sh) are two `WORKER_SCRIPT` examples, which basically set hyperparameters and start training.\n\nUsage is explained within `config_dist.sh` script. \n\nBy default, results are saved in `${HOME}/tmp/`.\n\nWe also provide [split_dataset.sh](/terngrad/split_dataset.sh) to *locally* split shards of training dataset. Usage: `./split_dataset.sh \u003cpath-of-dataset-to-be-split\u003e \u003ctotal_workers\u003e \u003cworker_index\u003e`.\nIt will create a subfolder under `\u003cpath-of-dataset-to-be-split\u003e` named as `worker_\u003cworker_index\u003e_of_\u003ctotal_workers\u003e`, and create links to shard files belonging to this worker.\nFor example,\n```\n$ cd ~/dataset/imagenet-data/\n$ ls -l\n  -rw-rw-r-- 1 wew57 wew57 144582762 Mar 24  2017 train-00000-of-01024\n  -rw-rw-r-- 1 wew57 wew57 148475588 Mar 24  2017 train-00001-of-01024\n  -rw-rw-r-- 1 wew57 wew57 150196808 Mar 24  2017 train-00002-of-01024\n  ...\n  -rw-rw-r-- 1 wew57 wew57 144180160 Mar 24  2017 train-01021-of-01024\n  -rw-rw-r-- 1 wew57 wew57 140903282 Mar 24  2017 train-01022-of-01024\n  -rw-rw-r-- 1 wew57 wew57 138485470 Mar 24  2017 train-01023-of-01024\n$ cd /home/wew57/github/users/wenwei202/terngrad/terngrad\n$ ./split_dataset.sh ~/dataset/imagenet-data/ 16 1\n  Splitting to /home/wew57/dataset/imagenet-data//worker_1_of_16 ...\n$ ls -l /home/wew57/dataset/imagenet-data//worker_1_of_16\n  
lrwxrwxrwx 1 wew57 wew57 55 Sep 30 17:30 train-00064-of-01024 -\u003e /home/wew57/dataset/imagenet-data//train-00064-of-01024\n  lrwxrwxrwx 1 wew57 wew57 55 Sep 30 17:30 train-00065-of-01024 -\u003e /home/wew57/dataset/imagenet-data//train-00065-of-01024\n  lrwxrwxrwx 1 wew57 wew57 55 Sep 30 17:30 train-00066-of-01024 -\u003e /home/wew57/dataset/imagenet-data//train-00066-of-01024\n  ...\n  lrwxrwxrwx 1 wew57 wew57 55 Sep 30 17:30 train-00125-of-01024 -\u003e /home/wew57/dataset/imagenet-data//train-00125-of-01024\n  lrwxrwxrwx 1 wew57 wew57 55 Sep 30 17:30 train-00126-of-01024 -\u003e /home/wew57/dataset/imagenet-data//train-00126-of-01024\n  lrwxrwxrwx 1 wew57 wew57 55 Sep 30 17:30 train-00127-of-01024 -\u003e /home/wew57/dataset/imagenet-data//train-00127-of-01024\n```\n\nYou can stop all tasks with [stop_dist.sh](terngrad/stop_dist.sh).\n\nCurrently, distributed-node mode only supports 32bit gradients. It will take a while to hack the highly-encapsulated `SyncReplicasOptimizer` to integrate TernGrad. We will keep updating.\n\n# Python executables\nBash scripts essentially call python executables. We list the python commands here for agile development,\ntaking 32bit gradients as examples.\n\nNote that, in TernGrad, parameters are allocated in each GPU to reduce communication because we can communicate quantized gradients instead of floating parameters.\nBy default, the program saves parameters in all GPUs. To evaluate/test, use `--tower \u003cgpu_id\u003e` to specify which GPU's parameters you want to test on. (We will try to remove this feature to save storage, because the parameter sets are identical across all GPUs). 
Alternatively, you can use `--save_tower 0` in training executables to avoid saving duplicated parameters, in which case, `--tower \u003cgpu_id\u003e` is unnecessary during evaluation/testing.\n\n## Build and run evaluating/training LeNet on mnist\n```\ncd ${TERNGRAD_ROOT}/terngrad\nbazel build inception/mnist_train\nbazel build inception/mnist_eval\n\nbazel-bin/inception/mnist_train \\\n--optimizer momentum \\\n--initial_learning_rate 0.01 \\\n--learning_rate_decay_type polynomial \\\n--max_steps 10000 \\\n--net lenet \\\n--image_size 28 \\\n--num_gpus 2 \\\n--batch_size 64 \\\n--train_dir /tmp/mnist_train \\\n--data_dir ~/dataset/mnist-data/\n\nbazel-bin/inception/mnist_eval \\\n--data_dir ~/dataset/mnist-data/ \\\n--net lenet \\\n--image_size 28 \\\n--batch_size 100 \\\n--checkpoint_dir /tmp/mnist_train  \\\n--restore_avg_var True \\\n--eval_interval_secs 300 \\\n--eval_dir /tmp/mnist_eval \\\n--subset test\n```\n\n## Build and run evaluating/training on cifar-10\n```\ncd ${TERNGRAD_ROOT}/terngrad\nbazel build inception/cifar10_train\nbazel build inception/cifar10_eval\n\nbazel-bin/inception/cifar10_train \\\n--optimizer adam \\\n--initial_learning_rate 0.0002 \\\n--num_epochs_per_decay 256 \\\n--max_steps 200000 \\\n--net cifar10_alexnet \\\n--image_size 24 \\\n--num_gpus 2 \\\n--batch_size 128 \\\n--train_dir /tmp/cifar10_train \\\n--data_dir ~/dataset/cifar10-data/ \n\nbazel-bin/inception/cifar10_eval \\\n--data_dir ~/dataset/cifar10-data/ \\\n--net cifar10_alexnet \\\n--image_size 24 \\\n--batch_size 50 \\\n--checkpoint_dir /tmp/cifar10_train  \\\n--restore_avg_var True \\\n--eval_interval_secs 300 \\\n--eval_dir /tmp/cifar10_eval \\\n--subset test\n\n```\n\n## Build and run ImageNet\n\n```\ncd ${TERNGRAD_ROOT}/terngrad\nbazel build inception/imagenet_train\nbazel build inception/imagenet_eval\n\nbazel-bin/inception/imagenet_train \\\n--optimizer momentum \\\n--net alexnet \\\n--image_size 224 \\\n--num_gpus 2 \\\n--batch_size 256 \\\n--train_dir 
/tmp/imagenet_train \\\n--data_dir ~/dataset/imagenet-data/\n\n\nbazel-bin/inception/imagenet_eval \\\n--data_dir ~/dataset/imagenet-data/ \\\n--net alexnet \\\n--image_size 224 \\\n--batch_size 50 \\\n--checkpoint_dir /tmp/imagenet_train  \\\n--restore_avg_var True \\\n--eval_dir /tmp/imagenet_eval\n\n```\n\n# Open questions\n  1. How will TernGrad work with asynchronous SGD \n  2. How to reduce variance of TernGrad when the larger variance introduces some accuracy loss\n  3. How will TernGrad work when server-to-worker gradients are ternarized in the same way\n\n# Backup (Inception in TensorFlow)\n\n[ImageNet](http://www.image-net.org/) is a common academic data set in machine\nlearning for training an image recognition system. Code in this directory\ndemonstrates how to use TensorFlow to train and evaluate a type of convolutional\nneural network (CNN) on this academic data set. In particular, we demonstrate\nhow to train the Inception v3 architecture as specified in:\n\n_Rethinking the Inception Architecture for Computer Vision_\n\nChristian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew\nWojna\n\nhttp://arxiv.org/abs/1512.00567\n\nThis network achieves 21.2% top-1 and 5.6% top-5 error for single frame\nevaluation with a computational cost of 5 billion multiply-adds per inference\nand with using less than 25 million parameters. 
Below is a visualization of the\nmodel architecture.\n\n\u003ccenter\u003e\n![Inception-v3 Architecture](g3doc/inception_v3_architecture.png)\n\u003c/center\u003e\n\n## Description of Code\n\nThe code base provides three core binaries for:\n\n*   Training an Inception v3 network from scratch across multiple GPUs and/or\n    multiple machines using the ImageNet 2012 Challenge training data set.\n*   Evaluating an Inception v3 network using the ImageNet 2012 Challenge\n    validation data set.\n*   Retraining an Inception v3 network on a novel task and back-propagating the\n    errors to fine tune the network weights.\n\nThe training procedure employs synchronous stochastic gradient descent across\nmultiple GPUs. The user may specify the number of GPUs they wish to harness. The\nsynchronous training performs *batch-splitting* by dividing a given batch across\nmultiple GPUs.\n\nThe training set up is nearly identical to the section [Training a Model Using\nMultiple GPU Cards](https://www.tensorflow.org/tutorials/deep_cnn/index.html#training-a-model-using-multiple-gpu-cards)\nwhere we have substituted the CIFAR-10 model architecture with Inception v3. The\nprimary differences with that setup are:\n\n*   Calculate and update the batch-norm statistics during training so that they\n    may be substituted in during evaluation.\n*   Specify the model architecture using a (still experimental) higher level\n    language called TensorFlow-Slim.\n\nFor more details about TensorFlow-Slim, please see the [Slim README](inception/slim/README.md).\nPlease note that this higher-level language is still\n*experimental* and the API may change over time depending on usage and\nsubsequent research.\n\n## Getting Started\n\n**NOTE** Before doing anything, we first need to build TensorFlow from source\nand install it as a PIP package. 
Please follow the instructions at [Installing\nFrom Source]\n(https://www.tensorflow.org/get_started/os_setup.html#create-the-pip-package-and-install).\n\nBefore you run the training script for the first time, you will need to download\nand convert the ImageNet data to native TFRecord format. The TFRecord format\nconsists of a set of sharded files where each entry is a serialized `tf.Example`\nproto. Each `tf.Example` proto contains the ImageNet image (JPEG encoded) as\nwell as metadata such as label and bounding box information. See\n[`parse_example_proto`](inception/image_processing.py) for details.\n\nWe provide a single [script](inception/data/download_and_preprocess_imagenet.sh) for\ndownloading and converting ImageNet data to TFRecord format. Downloading and\npreprocessing the data may take several hours (up to half a day) depending on\nyour network and computer speed. Please be patient.\n\nTo begin, you will need to sign up for an account with [ImageNet]\n(http://image-net.org) to gain access to the data. Look for the sign up page,\ncreate an account and request an access key to download the data.\n\nAfter you have `USERNAME` and `PASSWORD`, you are ready to run our script. Make\nsure that your hard disk has at least 500 GB of free space for downloading and\nstoring the data. Here we select `DATA_DIR=$HOME/dataset/imagenet-data` as such a\nlocation but feel free to edit accordingly.\n\nWhen you run the below script, please enter *USERNAME* and *PASSWORD* when\nprompted. This will occur at the very beginning. 
Once these values are entered,\nyou will not need to interact with the script again.\n\n```shell\n# location of where to place the ImageNet data\nDATA_DIR=$HOME/dataset/imagenet-data\n\n# build the preprocessing script.\nbazel build inception/download_and_preprocess_imagenet\n\n# run it\nbazel-bin/inception/download_and_preprocess_imagenet \"${DATA_DIR}\"\n```\n\nThe final line of the script output should read:\n\n```shell\n2016-02-17 14:30:17.287989: Finished writing all 1281167 images in data set.\n```\n\nWhen the script finishes you will find 1024 training and 128 validation\nfiles in the `DATA_DIR`. The files will match the patterns `train-?????-of-01024`\nand `validation-?????-of-00128`, respectively.\n\n[Congratulations!](https://www.youtube.com/watch?v=9bZkp7q19f0) You are now\nready to train or evaluate with the ImageNet data set.\n\n## How to Train from Scratch\n\n**WARNING** Training an Inception v3 network from scratch is a computationally\nintensive task and depending on your compute setup may take several days or even\nweeks.\n\n*Before proceeding* please read the [Convolutional Neural Networks](https://www.tensorflow.org/tutorials/deep_cnn/index.html)\ntutorial; in particular, focus on [Training a Model Using Multiple GPU Cards](https://www.tensorflow.org/tutorials/deep_cnn/index.html#training-a-model-using-multiple-gpu-cards).\nThe model training method is nearly identical to that described in the\nCIFAR-10 multi-GPU model training. Briefly, the model training\n\n*   Places an individual model replica on each GPU. 
Splits the batch across the\n    GPUs.\n*   Updates model parameters synchronously by waiting for all GPUs to finish\n    processing a batch of data.\n\nThe training procedure is encapsulated by this diagram of how operations and\nvariables are placed on CPU and GPUs respectively.\n\n\u003cdiv style=\"width:40%; margin:auto; margin-bottom:10px; margin-top:20px;\"\u003e\n  \u003cimg style=\"width:100%\" src=\"https://www.tensorflow.org/images/Parallelism.png\"\u003e\n\u003c/div\u003e\n\nEach tower computes the gradients for a portion of the batch and the gradients\nare combined and averaged across the multiple towers in order to provide a\nsingle update of the Variables stored on the CPU.\n\nA crucial aspect of training a network of this size is *training speed* in terms\nof wall-clock time. The training speed is dictated by many factors -- most\nimportantly the batch size and the learning rate schedule. Both of these\nparameters are heavily coupled to the hardware set up.\n\nThe batch size is a difficult parameter to tune as it requires\nbalancing memory demands of the model, memory available on the GPU and speed of\ncomputation. Generally speaking, employing larger batch sizes leads to more\nefficient computation and potentially more efficient training steps.\n\nWe have tested several hardware setups for training this model from scratch but\nwe emphasize that depending on your hardware set up, you may need to adapt the\nbatch size and learning rate schedule.\n\nPlease see the comments in `inception_train.py` for a few selected learning rate\nplans based on some selected hardware setups.\n\nTo train this model, you simply need to specify the following:\n\n```shell\n# Build the model. 
Note that we need to make sure the TensorFlow is ready to\n# use before this as this command will not build TensorFlow.\nbazel build inception/imagenet_train\n\n# run it\nbazel-bin/inception/imagenet_train --num_gpus=1 --batch_size=32 --train_dir=/tmp/imagenet_train --data_dir=/tmp/imagenet_data\n```\n\nThe model reads in the ImageNet training data from `--data_dir`. If you followed\nthe instructions in [Getting Started](#getting-started), then set\n`--data_dir=\"${DATA_DIR}\"`. The script assumes that there exists a set of\nsharded TFRecord files containing the ImageNet data. If you have not created\nTFRecord files, please refer to [Getting Started](#getting-started).\n\nHere is the output of the above command line when running on a Tesla K40c:\n\n```shell\n2016-03-07 12:24:59.922898: step 0, loss = 13.11 (5.3 examples/sec; 6.064 sec/batch)\n2016-03-07 12:25:55.206783: step 10, loss = 13.71 (9.4 examples/sec; 3.394 sec/batch)\n2016-03-07 12:26:28.905231: step 20, loss = 14.81 (9.5 examples/sec; 3.380 sec/batch)\n2016-03-07 12:27:02.699719: step 30, loss = 14.45 (9.5 examples/sec; 3.378 sec/batch)\n2016-03-07 12:27:36.515699: step 40, loss = 13.98 (9.5 examples/sec; 3.376 sec/batch)\n2016-03-07 12:28:10.220956: step 50, loss = 13.92 (9.6 examples/sec; 3.327 sec/batch)\n2016-03-07 12:28:43.658223: step 60, loss = 13.28 (9.6 examples/sec; 3.350 sec/batch)\n...\n```\n\nIn this example, a log entry is printed every 10 steps and the line includes the\ntotal loss (starts around 13.0-14.0) and the speed of processing in terms of\nthroughput (examples / sec) and batch speed (sec/batch).\n\nThe number of GPU devices is specified by `--num_gpus` (which defaults to 1).\nSpecifying `--num_gpus` greater than 1 splits the batch evenly across the\nGPU cards.\n\n```shell\n# Build the model. 
Note that we need to make sure the TensorFlow is ready to\n# use before this as this command will not build TensorFlow.\nbazel build inception/imagenet_train\n\n# run it\nbazel-bin/inception/imagenet_train --num_gpus=2 --batch_size=64 --train_dir=/tmp/imagenet_train\n```\n\nThis model splits the batch of 64 images across 2 GPUs and calculates the\naverage gradient by waiting for both GPUs to finish calculating the gradients\nfrom their respective data (see the diagram above). Generally speaking, using larger\nnumbers of GPUs leads to higher throughput as well as the opportunity to use\nlarger batch sizes. In turn, larger batch sizes imply better estimates of the\ngradient, enabling the usage of higher learning rates. In summary, using more\nGPUs simply results in faster training.\n\nNote that the batch size is a difficult parameter to tune as it requires\nbalancing memory demands of the model, memory available on the GPU and speed of\ncomputation. Generally speaking, employing larger batch sizes leads to more\nefficient computation and potentially more efficient training steps.\n\nNote that there is considerable noise in the loss function on individual steps\nin the previous log. Because of this noise, it is difficult to discern how well\na model is learning. One way around this is to launch TensorBoard\npointing to the directory containing the events log.\n\n```shell\ntensorboard --logdir=/tmp/imagenet_train\n```\n\nTensorBoard has access to the many Summaries produced by the model that describe\nmultitudes of statistics tracking the model behavior and the quality of the\nlearned model. In particular, TensorBoard tracks an exponentially smoothed\nversion of the loss. 
In practice, it is far easier to judge how well a model\nlearns by monitoring the smoothed version of the loss.\n\n## How to Train from Scratch in a Distributed Setting\n\n**NOTE** Distributed TensorFlow requires version 0.8 or later.\n\nDistributed TensorFlow lets us use multiple machines to train a model faster.\nThis is quite different from the training with multiple GPU towers on a single\nmachine where all parameters and gradients computation are in the same place. We\ncoordinate the computation across multiple machines by employing a centralized\nrepository for parameters that maintains a unified, single copy of model\nparameters. Each individual machine sends gradient updates to the centralized\nparameter repository which coordinates these updates and sends back updated\nparameters to the individual machines running the model training.\n\nWe term each machine that runs a copy of the training a `worker` or `replica`.\nWe term each machine that maintains model parameters a `ps`, short for\n`parameter server`. Note that we might have more than one machine acting as a\n`ps` as the model parameters may be sharded across multiple machines.\n\nVariables may be updated with synchronous or asynchronous gradient updates. 
One\nmay construct an [`Optimizer`](https://www.tensorflow.org/api_docs/python/train.html#optimizers) in TensorFlow\nthat constructs the necessary graph for either case, as diagrammed below in the\nTensorFlow [whitepaper](http://download.tensorflow.org/paper/whitepaper2015.pdf):\n\n\u003cdiv style=\"width:40%; margin:auto; margin-bottom:10px; margin-top:20px;\"\u003e\n  \u003cimg style=\"width:100%\"\n  src=\"https://www.tensorflow.org/images/tensorflow_figure7.png\"\u003e\n\u003c/div\u003e\n\nIn [a recent paper](https://arxiv.org/abs/1604.00981), synchronous gradient\nupdates have been shown to reach higher accuracy in a shorter amount of time.\nIn this distributed Inception example we employ synchronous gradient updates.\n\nNote that in this example each replica has a single tower that uses one GPU.\n\nThe command-line flags `worker_hosts` and `ps_hosts` specify available servers.\nThe same binary will be used for both the `worker` jobs and the `ps` jobs.\nThe command-line flag `job_name` will be used to specify what role a task will be\nplaying and `task_id` will be used to identify which one of the jobs it is\nrunning. Several things to note here:\n\n*   The numbers of `ps` and `worker` tasks are inferred from the lists of hosts\n    specified in the flags. The `task_id` should be within the range `[0,\n    num_ps_tasks)` for `ps` tasks and `[0, num_worker_tasks)` for `worker`\n    tasks.\n*   `ps` and `worker` tasks can run on the same machine, as long as that machine\n    has sufficient resources to handle both tasks. 
Note that the `ps` task does\n    not benefit from a GPU, so it should not attempt to use one (see below).\n*   Multiple `worker` tasks can run on the same machine with multiple GPUs so\n    machine_A with 2 GPUs may have 2 workers while machine_B with 1 GPU just has\n    1 worker.\n*   The default learning rate schedule works well for a wide range of number of\n    replicas [25, 50, 100] but feel free to tune it for even better results.\n*   The command line of both `ps` and `worker` tasks should include the complete\n    list of `ps_hosts` and `worker_hosts`.\n*   There is a chief `worker` among all workers which defaults to `worker` 0.\n    The chief will be in charge of initializing all the parameters, writing out\n    the summaries and the checkpoint. The checkpoint and summary will be in the\n    `train_dir` of the host for `worker` 0.\n*   Each worker processes a batch_size number of examples but each gradient\n    update is computed from all replicas. Hence, the effective batch size of\n    this model is batch_size * num_workers.\n\n```shell\n# Build the model. 
Note that we need to make sure the TensorFlow is ready to\n# use before this as this command will not build TensorFlow.\nbazel build inception/imagenet_distributed_train\n\n# To start worker 0, go to the worker0 host and run the following (Note that\n# task_id should be in the range [0, num_worker_tasks):\nbazel-bin/inception/imagenet_distributed_train \\\n--batch_size=32 \\\n--data_dir=$HOME/imagenet-data \\\n--job_name='worker' \\\n--task_id=0 \\\n--ps_hosts='ps0.example.com:2222' \\\n--worker_hosts='worker0.example.com:2222,worker1.example.com:2222'\n\n# To start worker 1, go to the worker1 host and run the following (Note that\n# task_id should be in the range [0, num_worker_tasks):\nbazel-bin/inception/imagenet_distributed_train \\\n--batch_size=32 \\\n--data_dir=$HOME/imagenet-data \\\n--job_name='worker' \\\n--task_id=1 \\\n--ps_hosts='ps0.example.com:2222' \\\n--worker_hosts='worker0.example.com:2222,worker1.example.com:2222'\n\n# To start the parameter server (ps), go to the ps host and run the following (Note\n# that task_id should be in the range [0, num_ps_tasks):\nbazel-bin/inception/imagenet_distributed_train \\\n--job_name='ps' \\\n--task_id=0 \\\n--ps_hosts='ps0.example.com:2222' \\\n--worker_hosts='worker0.example.com:2222,worker1.example.com:2222'\n```\n\nIf you have installed a GPU-compatible version of TensorFlow, the `ps` will also\ntry to allocate GPU memory although it is not helpful. This could potentially\ncrash the worker on the same machine as it has little to no GPU memory to\nallocate. To avoid this, you can prepend the previous command to start `ps`\nwith: `CUDA_VISIBLE_DEVICES=''`\n\n```shell\nCUDA_VISIBLE_DEVICES='' bazel-bin/inception/imagenet_distributed_train \\\n--job_name='ps' \\\n--task_id=0 \\\n--ps_hosts='ps0.example.com:2222' \\\n--worker_hosts='worker0.example.com:2222,worker1.example.com:2222'\n```\n\nIf you have run everything correctly, you should see a log in each `worker` job\nthat looks like the following. 
Note the training speed varies depending on your\nhardware and the first several steps could take much longer.\n\n```shell\nINFO:tensorflow:PS hosts are: ['ps0.example.com:2222', 'ps1.example.com:2222']\nINFO:tensorflow:Worker hosts are: ['worker0.example.com:2222', 'worker1.example.com:2222']\nI tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job ps -\u003e {ps0.example.com:2222, ps1.example.com:2222}\nI tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job worker -\u003e {localhost:2222, worker1.example.com:2222}\nI tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:202] Started server with target: grpc://localhost:2222\nINFO:tensorflow:Created variable global_step:0 with shape () and init \u003cfunction zeros_initializer at 0x7f6aa014b140\u003e\n\n...\n\nINFO:tensorflow:Created variable logits/logits/biases:0 with shape (1001,) and init \u003cfunction _initializer at 0x7f6a77f3cf50\u003e\nINFO:tensorflow:SyncReplicas enabled: replicas_to_aggregate=2; total_num_replicas=2\nINFO:tensorflow:2016-04-13 01:56:26.405639 Supervisor\nINFO:tensorflow:Started 2 queues for processing input data.\nINFO:tensorflow:global_step/sec: 0\nINFO:tensorflow:Worker 0: 2016-04-13 01:58:40.342404: step 0, loss = 12.97(0.0 examples/sec; 65.428  sec/batch)\nINFO:tensorflow:global_step/sec: 0.0172907\n...\n```\n\nand a log in each `ps` job that looks like the following:\n\n```shell\nINFO:tensorflow:PS hosts are: ['ps0.example.com:2222', 'ps1.example.com:2222']\nINFO:tensorflow:Worker hosts are: ['worker0.example.com:2222', 'worker1.example.com:2222']\nI tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job ps -\u003e {localhost:2222, ps1.example.com:2222}\nI tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job worker -\u003e {worker0.example.com:2222, 
worker1.example.com:2222}\nI tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:202] Started server with target: grpc://localhost:2222\n```\n\n[Congratulations!](https://www.youtube.com/watch?v=9bZkp7q19f0) You are now\ntraining Inception in a distributed manner.\n\n## How to Evaluate\n\nEvaluating an Inception v3 model on the ImageNet 2012 validation data set\nrequires running a separate binary.\n\nThe evaluation procedure is nearly identical to [Evaluating a Model](https://www.tensorflow.org/tutorials/deep_cnn/index.html#evaluating-a-model)\ndescribed in the [Convolutional Neural Network](https://www.tensorflow.org/tutorials/deep_cnn/index.html) tutorial.\n\n**WARNING** Be careful not to run the evaluation and training binary on the same\nGPU or else you might run out of memory. Consider running the evaluation on a\nseparate GPU if available or suspending the training binary while running the\nevaluation on the same GPU.\n\nBriefly, one can evaluate the model by running:\n\n```shell\n# Build the model. Note that we need to make sure the TensorFlow is ready to\n# use before this as this command will not build TensorFlow.\nbazel build inception/imagenet_eval\n\n# run it\nbazel-bin/inception/imagenet_eval --checkpoint_dir=/tmp/imagenet_train --eval_dir=/tmp/imagenet_eval\n```\n\nNote that we point `--checkpoint_dir` to the location of the checkpoints saved\nby `inception_train.py` above. Running the above command results in the\nfollowing output:\n\n```shell\n2016-02-17 22:32:50.391206: precision @ 1 = 0.735\n...\n```\n\nThe script periodically calculates the precision @ 1 over the entire validation\ndata. The precision @ 1 measures how often the highest scoring\nprediction from the model matches the ImageNet label -- in this case, 73.5%. If\nyou wish to run the eval just once and not periodically, append the `--run_once`\noption.\n\nMuch like the training script, `imagenet_eval.py` also exports summaries that\nmay be visualized in TensorBoard. 
These summaries calculate additional\nstatistics on the predictions (e.g. recall @ 5) as well as monitor the\nstatistics of the model activations and weights during evaluation.\n\n## How to Fine-Tune a Pre-Trained Model on a New Task\n\n### Getting Started\n\nMuch like training the ImageNet model, we must first convert a new data set to\nthe sharded TFRecord format, in which each entry is a serialized `tf.Example` proto.\n\nWe have provided a script demonstrating how to do this for a small data set of\na few thousand flower images spread across 5 labels:\n\n```shell\ndaisy, dandelion, roses, sunflowers, tulips\n```\n\nThere is a single automated script that downloads the data set and converts it\nto the TFRecord format. Much like the ImageNet data set, each record in the\nTFRecord format is a serialized `tf.Example` proto whose entries include a\nJPEG-encoded string and an integer label. Please see [`parse_example_proto`](inception/image_processing.py) for details.\n\nThe script takes a few minutes to download and process the images, depending on\nyour network connection speed. You will need about 200MB\nof free disk space. Here we select `FLOWERS_DATA_DIR=$HOME/flowers-data` as the location,\nbut feel free to edit accordingly.\n\n```shell\n# location of where to place the flowers data\nFLOWERS_DATA_DIR=$HOME/flowers-data\n\n# build the preprocessing script.\nbazel build inception/download_and_preprocess_flowers\n\n# run it\nbazel-bin/inception/download_and_preprocess_flowers \"${FLOWERS_DATA_DIR}\"\n```\n\nIf the script runs successfully, the final line of the terminal output should\nlook like:\n\n```shell\n2016-02-24 20:42:25.067551: Finished writing all 3170 images in data set.\n```\n\nWhen the script finishes you will find 2 shards for the training and validation\nfiles in the `FLOWERS_DATA_DIR`. 
The files will match the patterns `train-?????-of-00001`\nand `validation-?????-of-00001`, respectively.\n\n**NOTE** If you wish to prepare a custom image data set for transfer learning,\nyou will need to invoke [`build_image_data.py`](inception/data/build_image_data.py) on\nyour custom data set. Please see the associated options and assumptions behind\nthis script by reading the comments section of [`build_image_data.py`](inception/data/build_image_data.py). Also, if your custom data has a different\nnumber of examples or classes, you need to change the appropriate values in\n[`imagenet_data.py`](inception/imagenet_data.py).\n\nThe second piece you will need is a trained Inception v3 image model. You have\nthe option of either training one yourself (see [How to Train from Scratch](#how-to-train-from-scratch) for details) or you can download a pre-trained\nmodel like so:\n\n```shell\n# location of where to place the Inception v3 model\nINCEPTION_MODEL_DIR=$HOME/inception-v3-model\n# create the directory if it does not exist yet\nmkdir -p ${INCEPTION_MODEL_DIR}\ncd ${INCEPTION_MODEL_DIR}\n\n# download the Inception v3 model\ncurl -O http://download.tensorflow.org/models/image/imagenet/inception-v3-2016-03-01.tar.gz\ntar xzf inception-v3-2016-03-01.tar.gz\n\n# this will create a directory called inception-v3 which contains the following files.\n\u003e ls inception-v3\nREADME.txt\ncheckpoint\nmodel.ckpt-157585\n```\n\n[Congratulations!](https://www.youtube.com/watch?v=9bZkp7q19f0) You are now\nready to fine-tune your pre-trained Inception v3 model with the flower data set.\n\n### How to Retrain a Trained Model on the Flowers Data\n\nWe are now ready to fine-tune a pre-trained Inception-v3 model on the flowers\ndata set. This requires two distinct changes to our training procedure:\n\n1.  Build the exact same model as previously except we change the number of\n    labels in the final classification layer.\n\n2.  
Restore all weights from the pre-trained Inception-v3 except for the final\n    classification layer; this will get randomly initialized instead.\n\nWe can perform these two operations by specifying two flags:\n`--pretrained_model_checkpoint_path` and `--fine_tune`. The first flag is a\nstring that points to the path of a pre-trained Inception-v3 model. If this flag\nis specified, it will load the entire model from the checkpoint before the\nscript begins training.\n\nThe second flag `--fine_tune` is a boolean that indicates whether the last\nclassification layer should be randomly initialized or restored. You may set\nthis flag to false if you wish to continue training a pre-trained model from a\ncheckpoint. If you set this flag to true, you can train a new classification\nlayer from scratch.\n\nIn order to understand how `--fine_tune` works, please see the discussion on\n`Variables` in the TensorFlow-Slim [`README.md`](inception/slim/README.md).\n\nPutting this all together you can retrain a pre-trained Inception-v3 model on\nthe flowers data set with the following command.\n\n```shell\n# Build the model. 
Note that we need to make sure the TensorFlow is ready to\n# use before this as this command will not build TensorFlow.\nbazel build inception/flowers_train\n\n# Path to the downloaded Inception-v3 model. The archive extracts into an\n# inception-v3 sub-directory of the download location.\nMODEL_PATH=\"${INCEPTION_MODEL_DIR}/inception-v3/model.ckpt-157585\"\n\n# Directory where the flowers data resides.\nFLOWERS_DATA_DIR=/tmp/flowers-data/\n\n# Directory where to save the checkpoint and events files.\nTRAIN_DIR=/tmp/flowers_train/\n\n# Run the fine-tuning on the flowers data set starting from the pre-trained\n# Inception-v3 model.\nbazel-bin/inception/flowers_train \\\n  --train_dir=\"${TRAIN_DIR}\" \\\n  --data_dir=\"${FLOWERS_DATA_DIR}\" \\\n  --pretrained_model_checkpoint_path=\"${MODEL_PATH}\" \\\n  --fine_tune=True \\\n  --initial_learning_rate=0.001 \\\n  --input_queue_memory_factor=1\n```\n\nWe have added a few extra options to the training procedure.\n\n*   Fine-tuning a model on a separate data set requires significantly lowering the\n    initial learning rate. We set the initial learning rate to 0.001.\n*   The flowers data set is quite small so we shrink the size of the shuffling\n    queue of examples. See [Adjusting Memory Demands](#adjusting-memory-demands)\n    for more details.\n\nThe training script only reports the loss. To evaluate the quality of the\nfine-tuned model, you will need to run `flowers_eval`:\n\n```shell\n# Build the model. 
Note that we need to make sure the TensorFlow is ready to\n# use before this as this command will not build TensorFlow.\nbazel build inception/flowers_eval\n\n# Directory where we saved the fine-tuned checkpoint and events files.\nTRAIN_DIR=/tmp/flowers_train/\n\n# Directory where the flowers data resides.\nFLOWERS_DATA_DIR=/tmp/flowers-data/\n\n# Directory where to save the evaluation events files.\nEVAL_DIR=/tmp/flowers_eval/\n\n# Evaluate the fine-tuned model on a hold-out of the flower data set.\nbazel-bin/inception/flowers_eval \\\n  --eval_dir=\"${EVAL_DIR}\" \\\n  --data_dir=\"${FLOWERS_DATA_DIR}\" \\\n  --subset=validation \\\n  --num_examples=500 \\\n  --checkpoint_dir=\"${TRAIN_DIR}\" \\\n  --input_queue_memory_factor=1 \\\n  --run_once\n```\n\nWe find that the evaluation arrives at roughly 93.4% precision@1 after the model\nhas been running for 2000 steps.\n\n```shell\nSuccesfully loaded model from /tmp/flowers/model.ckpt-1999 at step=1999.\n2016-03-01 16:52:51.761219: starting evaluation on (validation).\n2016-03-01 16:53:05.450419: [20 batches out of 20] (36.5 examples/sec; 0.684sec/batch)\n2016-03-01 16:53:05.450471: precision @ 1 = 0.9340 recall @ 5 = 0.9960 [500 examples]\n```\n\n## How to Construct a New Dataset for Retraining\n\nOne can use the existing scripts supplied with this model to build a new dataset\nfor training or fine-tuning. The main script to employ is\n[`build_image_data.py`](inception/data/build_image_data.py). 
Briefly, this script takes a\nstructured directory of images and converts it to a sharded `TFRecord` that can\nbe read by the Inception model.\n\nIn particular, you will need to create directories of training and validation\nimages, `$TRAIN_DIR` and `$VALIDATION_DIR`, arranged as such:\n\n```shell\n  $TRAIN_DIR/dog/image0.jpeg\n  $TRAIN_DIR/dog/image1.jpg\n  $TRAIN_DIR/dog/image2.png\n  ...\n  $TRAIN_DIR/cat/weird-image.jpeg\n  $TRAIN_DIR/cat/my-image.jpeg\n  $TRAIN_DIR/cat/my-image.JPG\n  ...\n  $VALIDATION_DIR/dog/imageA.jpeg\n  $VALIDATION_DIR/dog/imageB.jpg\n  $VALIDATION_DIR/dog/imageC.png\n  ...\n  $VALIDATION_DIR/cat/weird-image.PNG\n  $VALIDATION_DIR/cat/that-image.jpg\n  $VALIDATION_DIR/cat/cat.JPG\n  ...\n```\n\nEach sub-directory in `$TRAIN_DIR` and `$VALIDATION_DIR` corresponds to a unique\nlabel for the images that reside within that sub-directory. The images may be\nJPEG or PNG images. We do not support other image types currently.\n\nOnce the data is arranged in this directory structure, we can run\n`build_image_data.py` on the data to generate the sharded `TFRecord` dataset.\nEach entry of the `TFRecord` is a serialized `tf.Example` protocol buffer. A\ncomplete list of information contained in the `tf.Example` is described in the\ncomments of `build_image_data.py`.\n\nTo run `build_image_data.py`, you can run the following command line:\n\n```shell\n# location where to save the TFRecord data.\nOUTPUT_DIRECTORY=$HOME/my-custom-data/\n\n# build the preprocessing script.\nbazel build inception/build_image_data\n\n# convert the data.\nbazel-bin/inception/build_image_data \\\n  --train_directory=\"${TRAIN_DIR}\" \\\n  --validation_directory=\"${VALIDATION_DIR}\" \\\n  --output_directory=\"${OUTPUT_DIRECTORY}\" \\\n  --labels_file=\"${LABELS_FILE}\" \\\n  --train_shards=128 \\\n  --validation_shards=24 \\\n  --num_threads=8\n```\n\nwhere the `$OUTPUT_DIRECTORY` is the location of the sharded `TFRecords`. 
The\n`$LABELS_FILE` is a text file, read by the script, that lists\nall of the labels. For instance, in the case of the flowers data set, the\n`$LABELS_FILE` contained the following data:\n\n```shell\ndaisy\ndandelion\nroses\nsunflowers\ntulips\n```\n\nNote that each row of the labels file corresponds to an entry in the final\nclassifier in the model. That is, `daisy` corresponds to the classifier for\nentry `1`; `dandelion` is entry `2`, etc. We skip label `0` as a background\nclass.\n\nRunning this script produces files that look like the following:\n\n```shell\n  $OUTPUT_DIRECTORY/train-00000-of-00024\n  $OUTPUT_DIRECTORY/train-00001-of-00024\n  ...\n  $OUTPUT_DIRECTORY/train-00023-of-00024\n\nand\n\n  $OUTPUT_DIRECTORY/validation-00000-of-00008\n  $OUTPUT_DIRECTORY/validation-00001-of-00008\n  ...\n  $OUTPUT_DIRECTORY/validation-00007-of-00008\n```\n\nwhere 24 and 8 are example values of `--train_shards` and `--validation_shards`,\nthe number of shards for each dataset. Generally speaking, we aim to select the number of shards such\nthat roughly 1024 images reside in each shard. Once this data set is built, you\nare ready to train or fine-tune an Inception model on this data set.\n\nNote, if you are piggybacking on the flowers retraining scripts, be sure to\nupdate `num_classes()` and `num_examples_per_epoch()` in `flowers_data.py`\nto correspond with your data.\n\n## Practical Considerations for Training a Model\n\nThe model architecture and training procedure are heavily dependent on the\nhardware used to train the model. If you wish to train or fine-tune this model\non your machine **you will need to adjust and empirically determine a good set\nof training hyper-parameters for your setup**. What follows are some general\nconsiderations for novices.\n\n### Finding Good Hyperparameters\n\nRoughly 5-10 hyper-parameters govern the speed at which a network is trained. 
In\naddition to `--batch_size` and `--num_gpus`, there are several constants defined\nin [inception_train.py](inception/inception_train.py) which dictate the learning\nschedule.\n\n```python\nRMSPROP_DECAY = 0.9                # Decay term for RMSProp.\nMOMENTUM = 0.9                     # Momentum in RMSProp.\nRMSPROP_EPSILON = 1.0              # Epsilon term for RMSProp.\nINITIAL_LEARNING_RATE = 0.1        # Initial learning rate.\nNUM_EPOCHS_PER_DECAY = 30.0        # Epochs after which learning rate decays.\nLEARNING_RATE_DECAY_FACTOR = 0.16  # Learning rate decay factor.\n```\n\nThere are many papers that discuss the various tricks and trade-offs associated\nwith training a model with stochastic gradient descent. For those new to the\nfield, some great references are:\n\n*   Y Bengio, [Practical recommendations for gradient-based training of deep\n    architectures](http://arxiv.org/abs/1206.5533)\n*   I Goodfellow, Y Bengio and A Courville, [Deep Learning](http://www.deeplearningbook.org/)\n\nWhat follows is a summary of some general advice for identifying appropriate\nmodel hyper-parameters in the context of this particular model training setup.\nNamely, this library provides *synchronous* updates to model parameters based on\nbatch-splitting the model across multiple GPUs.\n\n*   Higher learning rates lead to faster training. Too high a learning rate\n    leads to instability and will cause model parameters to diverge to infinity\n    or NaN.\n\n*   Larger batch sizes lead to higher quality estimates of the gradient and\n    permit training the model with higher learning rates.\n\n*   Often the GPU memory is a bottleneck that prevents employing larger batch\n    sizes. 
Employing more GPUs allows one to use larger batch sizes because\n    this model splits the batch across the GPUs.\n\n**NOTE** If one wishes to train this model with *asynchronous* gradient updates,\none will need to substantially alter this model and new considerations need to\nbe factored into hyperparameter tuning. See [Large Scale Distributed Deep\nNetworks](http://research.google.com/archive/large_deep_networks_nips2012.html)\nfor a discussion in this domain.\n\n### Adjusting Memory Demands\n\nTraining this model places large demands on both CPU and GPU memory. Let's\ndiscuss each item in turn.\n\nGPU memory is relatively small compared to CPU memory. Two items dictate the\namount of GPU memory employed -- model architecture and batch size. Assuming\nthat you keep the model architecture fixed, the sole parameter governing the GPU\ndemand is the batch size. A good rule of thumb is to try to employ as large a\nbatch size as will fit on the GPU.\n\nIf you run out of GPU memory, either lower the `--batch_size` or employ more\nGPUs on your desktop. The model performs batch-splitting across GPUs, thus N\nGPUs can handle N times the batch size of 1 GPU.\n\nThe model requires a large amount of CPU memory as well. We have tuned the model\nto employ about 20GB of CPU memory. Thus, having access to about 40GB of CPU\nmemory would be ideal.\n\nIf that is not possible, you can tune down the memory demands of the model by\nlowering `--input_queue_memory_factor`. Images are preprocessed asynchronously\nwith respect to the main training across `--num_preprocess_threads` threads. The\npreprocessed images are stored in a shuffling queue from which each GPU performs a\ndequeue operation in order to receive a `batch_size` worth of images.\n\nIn order to guarantee good shuffling across the data, we maintain a large\nshuffling queue of 1024 x `input_queue_memory_factor` images. For the current\nmodel architecture, this corresponds to about 4GB of CPU memory. 
You may lower\n`input_queue_memory_factor` in order to decrease the memory footprint. Keep in\nmind though that lowering this value drastically may result in a model with\nslightly lower predictive accuracy when training from scratch. Please see\ncomments in [`image_processing.py`](inception/image_processing.py) for more details.\n\n## Troubleshooting\n\n#### The model runs out of CPU memory.\n\nIn lieu of buying more CPU memory, an easy fix is to decrease\n`--input_queue_memory_factor`. See [Adjusting Memory Demands]\n(#adjusting-memory-demands).\n\n#### The model runs out of GPU memory.\n\nThe data is not able to fit on the GPU card. The simplest solution is to\ndecrease the batch size of the model. Otherwise, you will need to think about a\nmore sophisticated method for specifying the training which cuts up the model\nacross multiple `session.run()` calls or partitions the model across multiple\nGPUs. See [Using GPUs](https://www.tensorflow.org/how_tos/using_gpu/index.html)\nand [Adjusting Memory Demands](#adjusting-memory-demands) for more information.\n\n#### The model training results in NaN's.\n\nThe learning rate of the model is too high. Turn down your learning rate.\n\n#### I wish to train a model with a different image size.\n\nThe simplest solution is to artificially resize your images to `299x299` pixels.\nSee [Images](https://www.tensorflow.org/api_docs/python/image.html) section for\nmany resizing, cropping and padding methods. Note that the entire model\narchitecture is predicated on a `299x299` image, thus if you wish to change the\ninput image size, then you may need to redesign the entire model architecture.\n\n#### What hardware specification are these hyper-parameters targeted for?\n\nWe targeted a desktop with 128GB of CPU ram connected to 8 NVIDIA Tesla K40 GPU\ncards but we have run this on desktops with 32GB of CPU ram and 1 NVIDIA Tesla\nK40. 
You can get a sense of the various training configurations we tested by\nreading the comments in [`inception_train.py`](inception/inception_train.py).\n\n#### How do I continue training from a checkpoint in a distributed setting?\n\nYou only need to make sure that the checkpoint is in a location that can be\nreached by all of the `ps` tasks. By specifying the checkpoint location with\n`--train_dir`, the `ps` servers will load the checkpoint before commencing\ntraining.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffeifeibear%2Fdist-tensorflow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffeifeibear%2Fdist-tensorflow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffeifeibear%2Fdist-tensorflow/lists"}