{"id":15035076,"url":"https://github.com/vahidk/effectivepytorch","last_synced_at":"2025-05-14T14:07:48.939Z","repository":{"id":41177570,"uuid":"252383936","full_name":"vahidk/EffectivePyTorch","owner":"vahidk","description":"PyTorch tutorials and best practices.","archived":false,"fork":false,"pushed_at":"2025-03-24T04:30:13.000Z","size":51,"stargazers_count":1680,"open_issues_count":0,"forks_count":170,"subscribers_count":50,"default_branch":"master","last_synced_at":"2025-04-08T13:00:08.261Z","etag":null,"topics":["deep-learning","ebook","machine-learning","neural-network","pytorch"],"latest_commit_sha":null,"homepage":"https://twitter.com/VahidK","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vahidk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-04-02T07:21:57.000Z","updated_at":"2025-04-01T09:37:16.000Z","dependencies_parsed_at":"2025-05-14T14:04:05.842Z","dependency_job_id":null,"html_url":"https://github.com/vahidk/EffectivePyTorch","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vahidk%2FEffectivePyTorch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vahidk%2FEffectivePyTorch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vahidk%2FEffectivePyTorch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vahidk%2FEffectivePyTorch/manifests","owner_url":"https://repos.ecosyste.ms/api
/v1/hosts/GitHub/owners/vahidk","download_url":"https://codeload.github.com/vahidk/EffectivePyTorch/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254160026,"owners_count":22024567,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","ebook","machine-learning","neural-network","pytorch"],"created_at":"2024-09-24T20:27:26.425Z","updated_at":"2025-05-14T14:07:48.918Z","avatar_url":"https://github.com/vahidk.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Effective PyTorch\n\nTable of Contents\n=================\n## Part I: PyTorch Fundamentals\n1.  [PyTorch basics](#basics)\n2.  [Encapsulate your model with Modules](#modules)\n3.  [Broadcasting the good and the ugly](#broadcast)\n4.  [Take advantage of the overloaded operators](#overloaded_ops)\n5.  [Optimizing runtime with TorchScript](#torchscript)\n6.  [Building efficient custom data loaders](#dataloader)\n7.  [Numerical stability in PyTorch](#stable)\n8.  [Faster training with automatic mixed precision](#amp)\n---\n\n_To install PyTorch follow the [instructions on the official website](https://pytorch.org/):_\n```\npip install torch torchvision\n```\n\n_We aim to gradually expand this series by adding new articles and keep the content up to date with the latest releases of PyTorch API. 
If you have suggestions on how to improve this series or find the explanations ambiguous, feel free to create an issue, send patches, or reach out by email._\n\n# Part I: PyTorch Fundamentals\n\u003ca name=\"fundamentals\"\u003e\u003c/a\u003e\n\n## PyTorch basics\n\u003ca name=\"basics\"\u003e\u003c/a\u003e\nPyTorch is one of the most popular libraries for numerical computation and currently is amongst the most widely used libraries for performing machine learning research. In many ways PyTorch is similar to NumPy, with the additional benefit that PyTorch allows you to perform your computations on CPUs, GPUs, and TPUs without any material change to your code. PyTorch also makes it easy to distribute your computation across multiple devices or machines. One of the most important features of PyTorch is automatic differentiation. It allows computing the gradients of your functions analytically in an efficient manner which is crucial for training machine learning models using gradient descent method. Our goal here is to provide a gentle introduction to PyTorch and discuss best practices for using PyTorch.\n\nThe first thing to learn about PyTorch is the concept of Tensors. Tensors are simply multidimensional arrays. A PyTorch Tensor is very similar to a NumPy array with some ~~magical~~ additional functionality.\n\nA tensor can store a scalar value:\n```python\nimport torch\na = torch.tensor(3)\nprint(a)  # tensor(3)\n```\n\nor an array:\n```python\nb = torch.tensor([1, 2])\nprint(b)  # tensor([1, 2])\n```\n\na matrix:\n```python\nc = torch.zeros([2, 2])\nprint(c)  # tensor([[0., 0.], [0., 0.]])\n```\n\nor any arbitrary dimensional tensor:\n```python\nd = torch.rand([2, 2, 2])\n```\n\nTensors can be used to perform algebraic operations efficiently. One of the most commonly used operations in machine learning applications is matrix multiplication. 
Say you want to multiply two random matrices of size 3x5 and 5x4. This can be done with the matrix multiplication (@) operator:\n```python\nimport torch\n\nx = torch.randn([3, 5])\ny = torch.randn([5, 4])\nz = x @ y\n\nprint(z)\n```\n\nSimilarly, to add two vectors, you can do:\n```python\nx = torch.randn([5])\ny = torch.randn([5])\nz = x + y\n```\n\nTo convert a tensor into a numpy array you can call Tensor's numpy() method:\n```python\nprint(z.numpy())\n```\n\nAnd you can always convert a numpy array into a tensor by:\n```python\nimport numpy as np\n\nx = torch.tensor(np.random.normal(size=[3, 5]))\n```\n\n### Automatic differentiation\n\nThe most important advantage of PyTorch over NumPy is its automatic differentiation functionality which is very useful in optimization applications such as optimizing parameters of a neural network. Let's try to understand it with an example.\n\nSay you have a composite function which is a chain of two functions: `g(u(x))`.\nTo compute the derivative of `g` with respect to `x` we can use the chain rule which states that: `dg/dx = dg/du * du/dx`. PyTorch can analytically compute the derivatives for us.\n\nTo compute the derivatives in PyTorch first we create a tensor and set its `requires_grad` to true. We can use tensor operations to define our functions. We assume `u` is a quadratic function and `g` is a simple linear function:\n```python\nx = torch.tensor(1.0, requires_grad=True)\n\ndef u(x):\n  return x * x\n\ndef g(u):\n  return -u\n```\n\nIn this case our composite function is `g(u(x)) = -x*x`. So its derivative with respect to `x` is `-2x`. At point `x=1`, this is equal to `-2`.\n\nLet's verify this. This can be done using the `grad` function in PyTorch:\n```python\ndgdx = torch.autograd.grad(g(u(x)), x)[0]\nprint(dgdx)  # tensor(-2.)\n```\n\n### Curve fitting\n\nTo understand how powerful automatic differentiation can be let's have a look at another example. Assume that we have samples from a curve (say `f(x) = 5x^2 + 3`) and we want to estimate `f(x)` based on these samples. 
We define a parametric function `g(x, w) = w0 x^2 + w1 x + w2`, which is a function of the input `x` and latent parameters `w`. Our goal is then to find the latent parameters such that `g(x, w) ≈ f(x)`. This can be done by minimizing the following loss function: `L(w) = Σ (f(x) - g(x, w))^2`. Although there's a closed form solution for this simple problem, we opt to use a more general approach that can be applied to any arbitrary differentiable function, and that is using stochastic gradient descent. We simply compute the average gradient of `L(w)` with respect to `w` over a set of sample points and move in the opposite direction.\n\nHere's how it can be done in PyTorch:\n\n```python\nimport numpy as np\nimport torch\n\n# Assuming we know that the desired function is a polynomial of 2nd degree, we\n# allocate a vector of size 3 to hold the coefficients and initialize it with\n# random noise.\nw = torch.randn([3, 1], requires_grad=True)\n\n# We use the Adam optimizer with learning rate set to 0.1 to minimize the loss.\nopt = torch.optim.Adam([w], 0.1)\n\ndef model(x):\n    # We define yhat to be our estimate of y.\n    f = torch.stack([x * x, x, torch.ones_like(x)], 1)\n    yhat = torch.squeeze(f @ w, 1)\n    return yhat\n\ndef compute_loss(y, yhat):\n    # The loss is defined to be the mean squared error distance between our\n    # estimate of y and its true value.\n    loss = torch.nn.functional.mse_loss(yhat, y)\n    return loss\n\ndef generate_data():\n    # Generate some training data based on the true function\n    x = torch.rand(100) * 20 - 10\n    y = 5 * x * x + 3\n    return x, y\n\ndef train_step():\n    x, y = generate_data()\n\n    yhat = model(x)\n    loss = compute_loss(y, yhat)\n\n    opt.zero_grad()\n    loss.backward()\n    opt.step()\n\nfor _ in range(1000):\n    train_step()\n\nprint(w.detach().numpy())\n```\nBy running this piece of code you should see a result close to this:\n```python\n[4.9924135, 0.00040895029, 
3.4504161]\n```\nWhich is a relatively close approximation to our parameters.\n\nThis is just the tip of the iceberg for what PyTorch can do. Many problems such as optimizing large neural networks with millions of parameters can be implemented efficiently in PyTorch in just a few lines of code. PyTorch takes care of scaling across multiple devices and threads, and supports a variety of platforms.\n\n## Encapsulate your model with Modules\n\u003ca name=\"modules\"\u003e\u003c/a\u003e\nIn the previous example we used bare-bones tensors and tensor operations to build our model. To make your code slightly more organized it's recommended to use PyTorch's modules. A module is simply a container for your parameters and encapsulates model operations. For example say you want to represent a linear model `y = ax + b`. This model can be represented with the following code:\n\n```python\nimport torch\n\nclass Net(torch.nn.Module):\n\n  def __init__(self):\n    super().__init__()\n    self.a = torch.nn.Parameter(torch.rand(1))\n    self.b = torch.nn.Parameter(torch.rand(1))\n\n  def forward(self, x):\n    yhat = self.a * x + self.b\n    return yhat\n```\n\nTo use this model in practice you instantiate the module and simply call it like a function:\n```python\nx = torch.arange(100, dtype=torch.float32)\n\nnet = Net()\ny = net(x)\n```\n\nParameters are essentially tensors with `requires_grad` set to true. It's convenient to use parameters because you can simply retrieve them all with the module's `parameters()` method:\n```python\nfor p in net.parameters():\n    print(p)\n```\n\nNow, say you have an unknown function `y = 5x + 3 + some noise`, and you want to optimize the parameters of your model to fit this function.  
You can start by sampling some points from your function:\n```python\nx = torch.arange(100, dtype=torch.float32) / 100\ny = 5 * x + 3 + torch.rand(100) * 0.3\n```\n\nSimilar to the previous example, you can define a loss function and optimize the parameters of your model as follows:\n```python\ncriterion = torch.nn.MSELoss()\noptimizer = torch.optim.SGD(net.parameters(), lr=0.01)\n\nfor i in range(10000):\n  net.zero_grad()\n  yhat = net(x)\n  loss = criterion(yhat, y)\n  loss.backward()\n  optimizer.step()\n\nprint(net.a, net.b) # Should be close to 5 and 3\n```\n\nPyTorch comes with a number of predefined modules. One such module is `torch.nn.Linear` which is a more general form of a linear function than what we defined above. We can rewrite our module above using `torch.nn.Linear` like this:\n\n```python\nclass Net(torch.nn.Module):\n\n  def __init__(self):\n    super().__init__()\n    self.linear = torch.nn.Linear(1, 1)\n\n  def forward(self, x):\n    yhat = self.linear(x.unsqueeze(1)).squeeze(1)\n    return yhat\n```\n\nNote that we used `squeeze` and `unsqueeze` since `torch.nn.Linear` operates on a batch of vectors as opposed to scalars.\n\nBy default calling `parameters()` on a module will return the parameters of all its submodules:\n```python\nnet = Net()\nfor p in net.parameters():\n    print(p)\n```\n\nThere are some predefined modules that act as a container for other modules. The most commonly used container module is `torch.nn.Sequential`. As its name implies it's used to stack multiple modules (or layers) on top of each other. For example to stack two Linear layers with a `ReLU` nonlinearity in between you can do:\n\n```python\nmodel = torch.nn.Sequential(\n    torch.nn.Linear(64, 32),\n    torch.nn.ReLU(),\n    torch.nn.Linear(32, 10),\n)\n```\n\n## Broadcasting the good and the ugly\n\u003ca name=\"broadcast\"\u003e\u003c/a\u003e\nPyTorch supports broadcasting elementwise operations. 
Normally when you want to perform operations like addition and multiplication, you need to make sure that shapes of the operands match, e.g. you can’t add a tensor of shape `[3, 2]` to a tensor of shape `[3, 4]`. But there’s a special case and that’s when you have a singleton dimension (a dimension of size 1). PyTorch implicitly tiles the tensor across its singleton dimensions to match the shape of the other operand. So it’s valid to add a tensor of shape `[3, 2]` to a tensor of shape `[3, 1]`.\n\n```python\nimport torch\n\na = torch.tensor([[1., 2.], [3., 4.]])\nb = torch.tensor([[1.], [2.]])\n# c = a + b.repeat([1, 2])\nc = a + b\n\nprint(c)\n```\n\nBroadcasting allows us to perform implicit tiling which makes the code shorter, and more memory efficient, since we don’t need to store the result of the tiling operation. One neat place that this can be used is when combining features of varying length. In order to concatenate features of varying length we commonly tile the input tensors, concatenate the result and apply some nonlinearity. This is a common pattern across a variety of neural network architectures:\n\n```python\na = torch.rand([5, 3, 5])\nb = torch.rand([5, 1, 6])\n\nlinear = torch.nn.Linear(11, 10)\n\n# concat a and b and apply nonlinearity\ntiled_b = b.repeat([1, 3, 1])\nc = torch.cat([a, tiled_b], 2)\nd = torch.nn.functional.relu(linear(c))\n\nprint(d.shape)  # torch.Size([5, 3, 10])\n```\n\nBut this can be done more efficiently with broadcasting. We use the fact that `f(m(x + y))` is equal to `f(mx + my)`. 
So we can do the linear operations separately and use broadcasting to do implicit concatenation:\n\n```python\na = torch.rand([5, 3, 5])\nb = torch.rand([5, 1, 6])\n\nlinear1 = torch.nn.Linear(5, 10)\nlinear2 = torch.nn.Linear(6, 10)\n\npa = linear1(a)\npb = linear2(b)\nd = torch.nn.functional.relu(pa + pb)\n\nprint(d.shape)  # torch.Size([5, 3, 10])\n```\n\nIn fact this piece of code is pretty general and can be applied to tensors of arbitrary shape as long as broadcasting between tensors is possible:\n\n```python\nclass Merge(torch.nn.Module):\n    def __init__(self, in_features1, in_features2, out_features, activation=None):\n        super().__init__()\n        self.linear1 = torch.nn.Linear(in_features1, out_features)\n        self.linear2 = torch.nn.Linear(in_features2, out_features)\n        self.activation = activation\n\n    def forward(self, a, b):\n        pa = self.linear1(a)\n        pb = self.linear2(b)\n        c = pa + pb\n        if self.activation is not None:\n            c = self.activation(c)\n        return c\n```\n\nSo far we discussed the good part of broadcasting. But what’s the ugly part you may ask? Implicit assumptions almost always make debugging harder to do. Consider the following example:\n\n```python\na = torch.tensor([[1.], [2.]])\nb = torch.tensor([1., 2.])\nc = torch.sum(a + b)\n\nprint(c)\n```\n\nWhat do you think the value of `c` would be after evaluation? If you guessed 6, that’s wrong. It’s going to be 12. This is because when the ranks of two tensors don’t match, PyTorch automatically expands the first dimension of the tensor with lower rank before the elementwise operation, so the result of addition would be `[[2, 3], [3, 4]]`, and reducing over all elements would give us 12.\n\nThe way to avoid this problem is to be as explicit as possible. 
Had we specified which dimension we would want to reduce across, catching this bug would have been much easier:\n\n```python\na = torch.tensor([[1.], [2.]])\nb = torch.tensor([1., 2.])\nc = torch.sum(a + b, 0)\n\nprint(c)\n```\n\nHere the value of `c` would be `[5, 7]`, and we immediately would guess based on the shape of the result that there’s something wrong. A general rule of thumb is to always specify the dimensions in reduction operations and when using `torch.squeeze`.\n\n## Take advantage of the overloaded operators\n\u003ca name=\"overloaded_ops\"\u003e\u003c/a\u003e\nJust like NumPy, PyTorch overloads a number of python operators to make PyTorch code shorter and more readable.\n\nThe slicing op is one of the overloaded operators that can make indexing tensors very easy:\n```python\nz = x[begin:end]  # z = torch.narrow(x, 0, begin, end-begin)\n```\nBe very careful when using this op though. The slicing op, like any other op, has some overhead. Because it's a common and innocent-looking op, it may get overused a lot, which may lead to inefficiencies. To understand how inefficient this op can be let's look at an example. We want to manually perform reduction across the rows of a matrix:\n```python\nimport torch\nimport time\n\nx = torch.rand([500, 10])\n\nz = torch.zeros([10])\n\nstart = time.time()\nfor i in range(500):\n    z += x[i]\nprint(\"Took %f seconds.\" % (time.time() - start))\n```\nThis runs quite slowly and the reason is that we are calling the slice op 500 times, which adds a lot of overhead. 
A better choice would have been to use the `torch.unbind` op to slice the matrix into a list of vectors all at once:\n```python\nz = torch.zeros([10])\nfor x_i in torch.unbind(x):\n    z += x_i\n```\nThis is significantly (~30% on my machine) faster.\n\nOf course, the right way to do this simple reduction is to use the `torch.sum` op to do this in one op:\n```python\nz = torch.sum(x, dim=0)\n```\nwhich is extremely fast (~100x faster on my machine).\n\nPyTorch also overloads a range of arithmetic and logical operators:\n```python\nz = -x      # z = torch.neg(x)\nz = x + y   # z = torch.add(x, y)\nz = x - y   # z = torch.sub(x, y)\nz = x * y   # z = torch.mul(x, y)\nz = x / y   # z = torch.div(x, y)\nz = x // y  # z = torch.floor_divide(x, y)\nz = x % y   # z = torch.remainder(x, y)\nz = x ** y  # z = torch.pow(x, y)\nz = x @ y   # z = torch.matmul(x, y)\nz = x \u003e y   # z = torch.gt(x, y)\nz = x \u003e= y  # z = torch.ge(x, y)\nz = x \u003c y   # z = torch.lt(x, y)\nz = x \u003c= y  # z = torch.le(x, y)\nz = abs(x)  # z = torch.abs(x)\nz = x \u0026 y   # z = torch.bitwise_and(x, y)\nz = x | y   # z = torch.bitwise_or(x, y)\nz = x ^ y   # z = torch.bitwise_xor(x, y)\nz = ~x      # z = torch.bitwise_not(x)\nz = x == y  # z = torch.eq(x, y)\nz = x != y  # z = torch.ne(x, y)\n```\n\nYou can also use the augmented version of these ops. For example `x += y` and `x **= 2` are also valid.\n\nNote that Python doesn't allow overloading `and`, `or`, and `not` keywords.\n\n\n## Optimizing runtime with TorchScript\n\u003ca name=\"torchscript\"\u003e\u003c/a\u003e\nPyTorch is optimized to perform operations on large tensors. Doing many operations on small tensors is quite inefficient in PyTorch. So, whenever possible you should rewrite your computations in batch form to reduce overhead and improve performance. If there's no way you can manually batch your operations, using TorchScript may improve your code's performance. 
TorchScript is simply a subset of Python functions that are recognized by PyTorch. PyTorch can automatically optimize your TorchScript code using its just in time (jit) compiler and reduce some overheads.\n\nLet's look at an example. A very common operation in ML applications is \"batch gather\". This operation can simply be written as `output[i] = input[i, index[i]]`. This can be simply implemented in PyTorch as follows:\n```python\nimport torch\ndef batch_gather(tensor, indices):\n    output = []\n    for i in range(tensor.size(0)):\n        output += [tensor[i][indices[i]]]\n    return torch.stack(output)\n```\n\nTo implement the same function using TorchScript simply use the `torch.jit.script` decorator:\n```python\n@torch.jit.script\ndef batch_gather_jit(tensor, indices):\n    output = []\n    for i in range(tensor.size(0)):\n        output += [tensor[i][indices[i]]]\n    return torch.stack(output)\n```\nIn my tests this is about 10% faster.\n\nBut nothing beats manually batching your operations. A vectorized implementation in my tests is 100 times faster:\n```python\ndef batch_gather_vec(tensor, indices):\n    shape = list(tensor.shape)\n    flat_first = torch.reshape(\n        tensor, [shape[0] * shape[1]] + shape[2:])\n    # Create the offsets on the same device as the input instead of assuming CUDA.\n    offset = torch.reshape(\n        torch.arange(shape[0], device=tensor.device) * shape[1],\n        [shape[0]] + [1] * (len(indices.shape) - 1))\n    output = flat_first[indices + offset]\n    return output\n```\n\n## Building efficient custom data loaders\n\u003ca name=\"dataloader\"\u003e\u003c/a\u003e\n\nIn the last lesson we talked about writing efficient PyTorch code. But to make your code run with maximum efficiency you also need to load your data efficiently into your device's memory. Fortunately PyTorch offers a tool to make data loading easy. It's called a `DataLoader`. 
A `DataLoader` uses multiple workers to simultaneously load data from a `Dataset` and optionally uses a `Sampler` to sample data entries and form a batch.\n\nIf you can randomly access your data, using a `DataLoader` is very easy: You simply need to implement a `Dataset` class that implements `__getitem__` (to read each data item) and `__len__` (to return the number of items in the dataset) methods. For example here's how to load images from a given directory:\n\n```python\nimport glob\nimport os\nimport cv2\nimport torch\n\nclass ImageDirectoryDataset(torch.utils.data.Dataset):\n    def __init__(self, path, pattern):\n        super().__init__()\n        self.paths = list(glob.glob(os.path.join(path, pattern)))\n\n    def __len__(self):\n        return len(self.paths)\n\n    def __getitem__(self, index):\n        # The DataLoader's sampler decides which index to read.\n        path = self.paths[index]\n        return cv2.imread(path, 1)\n```\n\nTo load all jpeg images from a given directory you can then do the following:\n```python\ndataloader = torch.utils.data.DataLoader(ImageDirectoryDataset(\"/data/imagenet\", \"*.jpg\"), num_workers=8)\nfor data in dataloader:\n    pass  # do something with data\n```\n\nHere we are using 8 workers to simultaneously read our data from the disk. You can tune the number of workers on your machine for optimal results.\n\nUsing a `DataLoader` to read data with random access may be ok if you have fast storage or if your data items are large. But imagine having a network file system with slow connection. Requesting individual files this way can be extremely slow and would probably end up becoming the bottleneck of your training pipeline.\n\nA better approach is to store your data in a contiguous file format which can be read sequentially. For example if you have a large collection of images you can use tar to create a single archive and extract files from the archive sequentially in python. To do this you can use PyTorch's `IterableDataset`. 
To create an `IterableDataset` class you only need to implement an `__iter__` method which sequentially reads and yields data items from the dataset.\n\nA naive implementation would look like this:\n\n```python\nimport tarfile\nimport cv2\nimport numpy as np\nimport torch\n\ndef tar_image_iterator(path):\n    tar = tarfile.open(path, \"r\")\n    for tar_info in tar:\n        file = tar.extractfile(tar_info)\n        content = file.read()\n        # cv2.imdecode expects a 1-D uint8 array rather than raw bytes.\n        yield cv2.imdecode(np.frombuffer(content, dtype=np.uint8), 1)\n        file.close()\n        tar.members = []\n    tar.close()\n\nclass TarImageDataset(torch.utils.data.IterableDataset):\n    def __init__(self, path):\n        super().__init__()\n        self.path = path\n\n    def __iter__(self):\n        yield from tar_image_iterator(self.path)\n```\n\nBut there's a major problem with this implementation. If you try to use DataLoader to read from this dataset with more than one worker you'd observe a lot of duplicated images:\n\n```python\ndataloader = torch.utils.data.DataLoader(TarImageDataset(\"/data/imagenet.tar\"), num_workers=8)\nfor data in dataloader:\n    pass  # data contains duplicated items\n```\n\nThe problem is that each worker creates a separate instance of the dataset and each would start from the beginning of the dataset. 
One way to avoid this is, instead of having one tar file, to split your data into `num_workers` separate tar files and load each with a separate worker:\n\n```python\nclass TarImageDataset(torch.utils.data.IterableDataset):\n    def __init__(self, paths):\n        super().__init__()\n        self.paths = paths\n\n    def __iter__(self):\n        worker_info = torch.utils.data.get_worker_info()\n        # For simplicity we assume num_workers is equal to number of tar files\n        if worker_info is None or worker_info.num_workers != len(self.paths):\n            raise ValueError(\"Number of workers doesn't match number of files.\")\n        yield from tar_image_iterator(self.paths[worker_info.worker_id])\n```\n\nThis is how our dataset class can be used:\n```python\ndataloader = torch.utils.data.DataLoader(\n    TarImageDataset([\"/data/imagenet_part1.tar\", \"/data/imagenet_part2.tar\"]), num_workers=2)\nfor data in dataloader:\n    pass  # do something with data\n```\n\nWe discussed a simple strategy to avoid the duplicated entries problem. The [tfrecord](https://github.com/vahidk/tfrecord) package uses slightly more sophisticated strategies to shard your data on the fly.\n\n## Numerical stability in PyTorch\n\u003ca name=\"stable\"\u003e\u003c/a\u003e\nWhen using any numerical computation library such as NumPy or PyTorch, it's important to note that writing mathematically correct code doesn't necessarily lead to correct results. You also need to make sure that the computations are stable.\n\nLet's start with a simple example. Mathematically, it's easy to see that `x * y / y = x` for any non zero value of `y`. But let's see if that's always true in practice:\n```python\nimport numpy as np\n\nx = np.float32(1)\n\ny = np.float32(1e-50)  # y would be stored as zero\nz = x * y / y\n\nprint(z)  # prints nan\n```\n\nThe reason for the incorrect result is that `y` is simply too small for float32 type. 
A similar problem occurs when `y` is too large:\n\n```python\ny = np.float32(1e39)  # y would be stored as inf\nz = x * y / y\n\nprint(z)  # prints nan\n```\n\nThe smallest positive value that float32 type can represent is 1.4013e-45 and anything below that would be stored as zero. Also, any number beyond 3.40282e+38, would be stored as inf.\n\n```python\nprint(np.nextafter(np.float32(0), np.float32(1)))  # prints 1.4013e-45\nprint(np.finfo(np.float32).max)  # prints 3.40282e+38\n```\n\nTo make sure that your computations are stable, you want to avoid values with very small or very large absolute value. This may sound very obvious, but these kinds of problems can become extremely hard to debug especially when doing gradient descent in PyTorch. This is because you not only need to make sure that all the values in the forward pass are within the valid range of your data types, but also you need to make sure of the same for the backward pass (during gradient computation).\n\nLet's look at a real example. We want to compute the softmax over a vector of logits. A naive implementation would look something like this:\n```python\nimport torch\n\ndef unstable_softmax(logits):\n    exp = torch.exp(logits)\n    return exp / torch.sum(exp)\n\nprint(unstable_softmax(torch.tensor([1000., 0.])).numpy())  # prints [ nan, 0.]\n```\nNote that computing the exponential of logits, even for relatively small numbers, produces gigantic results that are out of float32 range. The largest valid logit for our naive softmax implementation is `ln(3.40282e+38) = 88.7`. Anything beyond that leads to a nan outcome.\n\nBut how can we make this more stable? The solution is rather simple. It's easy to see that `exp(x - c) / Σ exp(x - c) = exp(x) / Σ exp(x)`. Therefore we can subtract any constant from the logits and the result would remain the same. We choose this constant to be the maximum of logits. 
This way the domain of the exponential function would be limited to `[-inf, 0]`, and consequently its range would be `[0.0, 1.0]` which is desirable:\n\n```python\nimport torch\n\ndef softmax(logits):\n    exp = torch.exp(logits - torch.max(logits))\n    return exp / torch.sum(exp)\n\nprint(softmax(torch.tensor([1000., 0.])).numpy())  # prints [ 1., 0.]\n```\n\nLet's look at a more complicated case. Suppose we have a classification problem. We use the softmax function to produce probabilities from our logits. We then define our loss function to be the cross entropy between our predictions and the labels. Recall that cross entropy for a categorical distribution can be simply defined as `xe(p, q) = -Σ p_i log(q_i)`. So a naive implementation of the cross entropy would look like this:\n\n```python\ndef unstable_softmax_cross_entropy(labels, logits):\n    logits = torch.log(softmax(logits))\n    return -torch.sum(labels * logits)\n\nlabels = torch.tensor([0.5, 0.5])\nlogits = torch.tensor([1000., 0.])\n\nxe = unstable_softmax_cross_entropy(labels, logits)\n\nprint(xe.numpy())  # prints inf\n```\n\nNote that in this implementation as the softmax output approaches zero, the log's output approaches negative infinity which causes instability in our computation. 
We can rewrite this by expanding the softmax and doing some simplifications:\n\n```python\ndef softmax_cross_entropy(labels, logits, dim=-1):\n    scaled_logits = logits - torch.max(logits)\n    normalized_logits = scaled_logits - torch.logsumexp(scaled_logits, dim)\n    return -torch.sum(labels * normalized_logits)\n\nlabels = torch.tensor([0.5, 0.5])\nlogits = torch.tensor([1000., 0.])\n\nxe = softmax_cross_entropy(labels, logits)\n\nprint(xe.numpy())  # prints 500.0\n```\n\nWe can also verify that the gradients are computed correctly:\n```python\nlogits.requires_grad_(True)\nxe = softmax_cross_entropy(labels, logits)\ng = torch.autograd.grad(xe, logits)[0]\nprint(g.numpy())  # prints [0.5, -0.5]\n```\n\nLet me remind you again that extra care must be taken when doing gradient descent to make sure that the range of your functions as well as the gradients for each layer are within a valid range. Exponential and logarithmic functions when used naively are especially problematic because they can map small numbers to enormous ones and the other way around.\n\n\n## Faster training with mixed precision\n\u003ca name=\"amp\"\u003e\u003c/a\u003e\nBy default tensors and model parameters in PyTorch are stored in 32-bit floating point precision. Training neural networks using 32-bit floats is usually stable and doesn't cause major numerical issues, however neural networks have been shown to perform quite well in 16-bit and even lower precisions. Computation in lower precisions can be significantly faster on modern GPUs. It also has the extra benefit of using less memory enabling training larger models and/or with larger batch sizes which can boost the performance further. The problem though is that training in 16 bits often becomes very unstable because the precision is usually not enough to perform some operations like accumulations.\n\nTo help with this problem PyTorch supports training in mixed precision. 
In a nutshell, mixed-precision training is done by performing some expensive operations (like convolutions and matrix multiplications) in 16-bit by casting down the inputs, while performing other numerically sensitive operations like accumulations in 32-bit. This way we get most of the benefits of 16-bit computation while avoiding its main drawbacks. Next we talk about using `autocast` and `GradScaler` to do automatic mixed-precision training.\n\n### Autocast\n\n`autocast` helps improve runtime performance by automatically casting down data to 16-bit for some computations. To understand how it works let's look at an example:\n\n```python\nimport torch\n\nx = torch.rand([32, 32]).cuda()\ny = torch.rand([32, 32]).cuda()\n\nwith torch.amp.autocast(\"cuda\"):\n  a = x + y\n  b = x @ y\nprint(a.dtype)  # prints torch.float32\nprint(b.dtype)  # prints torch.float16\n```\n\nNote that both `x` and `y` are 32-bit tensors, but `autocast` performs the matrix multiplication in 16-bit while keeping the addition in 32-bit. What if one of the operands is in 16-bit?\n\n```python\nimport torch\n\nx = torch.rand([32, 32]).cuda()\ny = torch.rand([32, 32]).cuda().half()\n\nwith torch.amp.autocast(\"cuda\"):\n  a = x + y\n  b = x @ y\nprint(a.dtype)  # prints torch.float32\nprint(b.dtype)  # prints torch.float16\n```\n\nAgain, `autocast` casts down the 32-bit operand to 16-bit to perform the matrix multiplication, but it doesn't change the addition operation. By default, adding two tensors of different floating-point types in PyTorch promotes the result to the higher precision.\n\nIn practice, you can trust `autocast` to do the right casting to improve runtime efficiency. 
The important thing is to keep all your forward pass computations under the `autocast` context:\n\n```python\nmodel = ...\nloss_fn = ...\n\nwith torch.amp.autocast(\"cuda\"):\n  outputs = model(inputs)\n  loss = loss_fn(outputs, targets)\n```\n\nThis may be all you need if you have a relatively stable optimization problem and if you use a relatively low learning rate. Adding this one line of extra code can cut your training time by up to half on modern hardware.\n\n### GradScaler\n\nAs we mentioned in the beginning of this section, 16-bit precision may not always be enough for some computations. One particular case of interest is representing gradient values, a great portion of which are usually small values. Representing them with 16-bit floats often leads to underflow (i.e. they'd be represented as zeros). This makes training neural networks very unstable. `GradScaler` is designed to resolve this issue. It takes as input your loss value and multiplies it by a large scalar, inflating gradient values, and therefore making them representable in 16-bit precision. It then scales them down during the gradient update to ensure parameters are updated correctly. This is generally what `GradScaler` does. But under the hood `GradScaler` is a bit smarter than that. Inflating the gradients may actually result in overflows, which is equally bad. So `GradScaler` actually monitors the gradient values and if it detects overflows it skips the update and scales down the scale factor according to a configurable schedule. (The default schedule usually works but you may need to adjust that for your use case.)\n\nUsing `GradScaler` is very easy in practice:\n\n```python\nscaler = torch.amp.GradScaler()\n\nloss = ...\noptimizer = ...  # an instance of torch.optim.Optimizer\n\nscaler.scale(loss).backward()\nscaler.step(optimizer)\nscaler.update()\n```\n\nNote that we first create an instance of `GradScaler`. 
In the training loop we call `GradScaler.scale` to scale the loss before calling backward to produce inflated gradients. We then use `GradScaler.step`, which (may) update the model parameters. We then call `GradScaler.update`, which updates the scale factor if needed. That's all!\n\nThe following sample code showcases mixed precision training on a synthetic problem of learning to generate a checkerboard from image coordinates. You can paste it into a [Google Colab](https://colab.research.google.com/) notebook, set the backend to GPU and compare the single and mixed-precision performance. Note that this is a small toy example; in practice, with larger networks, you may see larger boosts in performance using mixed precision.\n\n### An Example\n\n### Generating a checkerboard\n\n```python\nimport torch\nimport matplotlib.pyplot as plt\nimport time\n\ndef grid(width, height):\n  hrange = torch.arange(width).unsqueeze(0).repeat([height, 1]).div(width)\n  vrange = torch.arange(height).unsqueeze(1).repeat([1, width]).div(height)\n  output = torch.stack([hrange, vrange], 0)\n  return output\n\n\ndef checker(width, height, freq):\n  hrange = torch.arange(width).reshape([1, width]).mul(freq / width / 2.0).fmod(1.0).gt(0.5)\n  vrange = torch.arange(height).reshape([height, 1]).mul(freq / height / 2.0).fmod(1.0).gt(0.5)\n  output = hrange.logical_xor(vrange).float()\n  return output\n\n# Note the inputs are grid coordinates and the target is a checkerboard\ninputs = grid(512, 512).unsqueeze(0).cuda()\ntargets = checker(512, 512, 8).unsqueeze(0).unsqueeze(1).cuda()\n```\n\n### Defining a convolutional neural network\n\n```python\nclass Net(torch.jit.ScriptModule):\n  def __init__(self):\n    super().__init__()\n    self.net = torch.nn.Sequential(\n      torch.nn.Conv2d(2, 256, 1),\n      torch.nn.BatchNorm2d(256),\n      torch.nn.ReLU(),\n      torch.nn.Conv2d(256, 256, 1),\n      torch.nn.BatchNorm2d(256),\n      torch.nn.ReLU(),\n      torch.nn.Conv2d(256, 256, 1),\n      
torch.nn.BatchNorm2d(256),\n      torch.nn.ReLU(),\n      torch.nn.Conv2d(256, 1, 1))\n\n  @torch.jit.script_method\n  def forward(self, x):\n    return self.net(x)\n```\n\n### Single precision training\n```python\nnet = Net().cuda()\nloss_fn = torch.nn.MSELoss()\nopt = torch.optim.Adam(net.parameters(), 0.001)\n\nstart_time = time.time()\n\nfor i in range(500):\n  opt.zero_grad()\n  outputs = net(inputs)\n  loss = loss_fn(outputs, targets)\n  loss.backward()\n  opt.step()\nprint(loss)\n\nprint(time.time() - start_time)\n\nplt.subplot(1,2,1); plt.imshow(outputs.squeeze().detach().cpu());\nplt.subplot(1,2,2); plt.imshow(targets.squeeze().cpu()); plt.show()\n```\n\n### Mixed precision training\n```python\nnet = Net().cuda()\nloss_fn = torch.nn.MSELoss()\nopt = torch.optim.Adam(net.parameters(), 0.001)\n\nscaler = torch.amp.GradScaler()\n\nstart_time = time.time()\n\nfor i in range(500):\n  opt.zero_grad()\n  with torch.amp.autocast(\"cuda\"):\n    outputs = net(inputs)\n    loss = loss_fn(outputs, targets)\n  scaler.scale(loss).backward()\n  scaler.step(opt)\n  scaler.update()\nprint(loss)\n\nprint(time.time() - start_time)\n\nplt.subplot(1,2,1); plt.imshow(outputs.squeeze().detach().cpu().float());\nplt.subplot(1,2,2); plt.imshow(targets.squeeze().cpu().float()); plt.show()\n```\n\n\n### Reference\n- https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvahidk%2Feffectivepytorch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvahidk%2Feffectivepytorch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvahidk%2Feffectivepytorch/lists"}