{"id":15030873,"url":"https://github.com/lartpang/pytorchtricks","last_synced_at":"2025-04-09T04:04:10.963Z","repository":{"id":37392691,"uuid":"229664075","full_name":"lartpang/PyTorchTricks","owner":"lartpang","description":"Some tricks of pytorch... :star:","archived":false,"fork":false,"pushed_at":"2024-06-20T07:40:54.000Z","size":169,"stargazers_count":1183,"open_issues_count":0,"forks_count":128,"subscribers_count":32,"default_branch":"master","last_synced_at":"2025-04-09T04:04:02.567Z","etag":null,"topics":["pytorch","pytorch-trick","pytorch-tutorial","tricks"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lartpang.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-12-23T02:45:29.000Z","updated_at":"2025-04-07T00:58:47.000Z","dependencies_parsed_at":"2024-09-30T17:00:31.266Z","dependency_job_id":null,"html_url":"https://github.com/lartpang/PyTorchTricks","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lartpang%2FPyTorchTricks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lartpang%2FPyTorchTricks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lartpang%2FPyTorchTricks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lartpang%2FPyTorchTricks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lartpang","download_url":"https://codel
oad.github.com/lartpang/PyTorchTricks/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247974715,"owners_count":21026742,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pytorch","pytorch-trick","pytorch-tutorial","tricks"],"created_at":"2024-09-24T20:14:25.972Z","updated_at":"2025-04-09T04:04:10.941Z","avatar_url":"https://github.com/lartpang.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Some Tricks of PyTorch\n\n## changelog\n\n* 2019-11-29: Updated some model design tips and inference acceleration content, and added an introductory link for apex. ~~Also removed tfrecord: can PyTorch use it? As far as I remembered it could not, so I removed it~~ (struck through to mark the removal :\u003c)\n* 2019-11-30: Added the meaning of MAC and a link to the ShuffleNetV2 paper.\n* 2019-12-02: I previously said PyTorch cannot use tfrecord; today I saw an answer under \u003chttps://www.zhihu.com/question/358632497\u003e and learned otherwise.\n* 2019-12-23: Added several introductory articles on model compression and quantization.\n* 2020-02-07: Excerpted a few caveats from articles and added them to the Code Level section.\n* 2020-04-30:\n  * Added a GitHub backup of this document.\n  * Added links introducing the fusion of convolution and BN layers.\n  * A note: for many of the articles and answers I referenced earlier, I did not tie each link to the content summarized from it, so readers with questions about that content may have been unable to reach the original authors. My apologies for that.\n  * Adjusted some content so that it lines up with its reference links as closely as possible.\n* 2020-05-18: Added some tips on saving GPU memory in PyTorch and lightly adjusted the formatting. Also fixed an earlier mistake: the suggestion `non_blocking=False` should have been `non_blocking=True`.\n* 2021-01-06: Adjusted some of the notes on reading image data.\n* 2021-01-13: Added one more inference acceleration strategy. 
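I think I should update the GitHub document first; updating the Zhihu answers is a hassle and gives no way to compare changes, which makes it laborious.\n* 2022-06-26: Reorganized the formatting and content layout, and added more references along with some recently found techniques that work well.\n* 2024-06-20: Lightly adjusted the formatting and added an approach to faster data loading based on the `tar` format and `IterableDataset`.\n\n## Speeding Up PyTorch\n\n\u003e [!note]\n\u003e Original document: \u003chttps://www.yuque.com/lart/ugkv9f/ugysgn\u003e\n\u003e\n\u003e Disclaimer: most of the content comes from posts on Zhihu and other blogs and is only collected and listed here. More suggestions are welcome.\n\nZhihu answers (upvotes welcome):\n\n* [pytorch dataloader 数据加载占用了大部分时间, 各位大佬都是怎么解决的? - 人民艺术家的回答 - 知乎](https://www.zhihu.com/question/307282137/answer/907835663)\n* [使用 pytorch 时, 训练集数据太多达到上千万张, Dataloader 加载很慢怎么办? - 人民艺术家的回答 - 知乎](https://www.zhihu.com/question/356829360/answer/907832358)\n\n### Faster Preprocessing\n\n* Minimize the preprocessing done each time data is read. Consider applying fixed operations such as `resize` ahead of time, saving the results, and using them directly during training.\n* Move preprocessing to the GPU.\n  * On Linux, [`NVIDIA/DALI`](https://github.com/NVIDIA/DALI) is an option.\n  * Use Tensor-based image processing operations.\n\n### Faster IO\n\n* mmcv provides fairly efficient and comprehensive support for reading data: [OpenMMLab：MMCV 核心组件分析(三): FileClient](https://zhuanlan.zhihu.com/p/339190576)\n\n#### Use Faster Image Processing\n\n* `opencv` is generally faster than `PIL`.\n  * Note that `PIL` loads lazily, which makes its `open` look faster than `opencv`'s `imread`, but at that point the data has not actually been fully loaded. Call the `load()` method on the object returned by `open` to load the data explicitly; only then is the timing comparison fair.\n* For reading `jpeg`, try `jpeg4py`.\n* Store images as `bmp` (reduces decoding time).\n* A discussion of the speed of different image processing libraries: [Python 的各种 imread 函数在实现方式和读取速度上有何区别？ - 知乎](https://www.zhihu.com/question/48762352)\n\n#### Consolidate Data into a Single Contiguous File (Fewer Reads)\n\nWhen reading a large number of small files, store them in a contiguous file format that can be read sequentially. [Options include `TFRecord (Tensorflow)`, `recordIO`, `hdf5`, `pth`, `n5`, `lmdb`, and others](https://github.com/Lyken17/Efficient-PyTorch#data-loader).\n\n* `TFRecord`: \u003chttps://github.com/vahidk/tfrecord\u003e\n* `lmdb` database:\n  * \u003chttps://github.com/Fangyh09/Image2LMDB\u003e\n  * \u003chttps://blog.csdn.net/P_LarT/article/details/103208405\u003e\n  * \u003chttps://github.com/lartpang/PySODToolBox/blob/master/ForBigDataset/ImageFolder2LMDB.py\u003e\n* [An implementation based on `Tar` files and `IterableDataset`](https://github.com/vahidk/EffectivePyTorch?tab=readme-ov-file#building-efficient-custom-data-loaders)\n\n#### Prefetch Data\n\nPrefetch the data needed for the next iteration. Examples:\n\n* [如何给你 PyTorch 里的 Dataloader 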
打鸡血 - MKFMIKU 的文章 - 知乎](https://zhuanlan.zhihu.com/p/66145913)\n* [给 pytorch 读取数据加速 - 体 hi 的文章 - 知乎](https://zhuanlan.zhihu.com/p/72956595)\n\n#### Use Memory\n\n* Load the data directly into memory.\n  * Read the images and keep them in a persistent container object.\n    * See [`--cache`](https://github.com/ultralytics/yolov5/blob/19f33cbae29ac2127dd877b52e228c178dda6086/utils/dataloaders.py#L521-L534) in YoloV5.\n* Map memory as a disk (a RAM disk).\n\n#### Use an SSD\n\nReplace mechanical hard drives with NVMe SSDs. Taken from [如何给你 PyTorch 里的 Dataloader 打鸡血 - MKFMIKU 的文章 - 知乎](https://zhuanlan.zhihu.com/p/66145913)\n\n### Training Strategies\n\n#### Low-Precision Training\n\nDuring training, replace the original precision (`FP32`) with a lower-precision representation (`FP16`, or even `INT8`, binary networks, or ternary networks).\n\nThis saves some GPU memory and speeds things up, but watch out for operations that are numerically unsafe in low precision, such as mean and sum.\n\n* Introductory articles on mixed-precision training:\n  * [由浅入深的混合精度训练教程](https://mp.weixin.qq.com/s/6FCumAWa8fZ1r7xwIRC9ow)\n* Mixed-precision support provided by [`NVIDIA/Apex`](https://github.com/nvidia/apex).\n  * [PyTorch 必备神器 | 唯快不破：基于 Apex 的混合精度加速](https://mp.weixin.qq.com/s/HQnI8rzPvZN6Q_5c8d1nVQ)\n  * [Pytorch 安装 APEX 疑难杂症解决方案 - 陈瀚可的文章 - 知乎](https://zhuanlan.zhihu.com/p/80386137)\n* [`torch.cuda.amp`, available since PyTorch 1.6](https://pytorch.org/docs/stable/notes/amp_examples.html), for mixed-precision support.\n\n#### Larger Batches\n\nWith a fixed number of epochs, a larger batch usually means a shorter training time. Large batches raise their own concerns, however, such as hyperparameter settings and GPU memory usage, which is another widely studied area.\n\n* Hyperparameter settings\n  * Accurate, large minibatch SGD: training imagenet in 1 hour, [paper](https://arxiv.org/abs/1706.02677)\n* Reducing GPU memory usage\n  * [Gradient Accumulation](https://pytorch.org/docs/stable/notes/amp_examples.html#gradient-accumulation)\n  * [Gradient Checkpointing](https://pytorch.org/docs/1.11/checkpoint.html#torch-utils-checkpoint)\n    * Training deep nets with sublinear memory cost, [paper](https://arxiv.org/abs/1604.06174)\n  * In-Place Operation\n    * In-Place Activated BatchNorm for Memory-Optimized Training of DNNs, [paper](https://arxiv.org/abs/1712.02616), [code](https://github.com/mapillary/inplace_abn)\n\n### Code Level\n\n#### Library Settings\n\n* Setting `torch.backends.cudnn.benchmark = True` before the training loop can speed up computation. Because the cuDNN algorithms for convolutions with different kernel sizes perform differently, the autotuner can run a benchmark to find the best one. Enable this setting when your input size rarely changes. If the input size changes often, the autotuner has to benchmark too frequently, which can hurt performance. It can speed up forward and backward propagation by 1.27x to 1.70x.\n* 
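Use page-locked (pinned) memory, i.e. set [`pin_memory=True`](https://pytorch.org/docs/stable/data.html#memory-pinning) in the DataLoader.\n* Choose a suitable `num_workers`; for a detailed discussion see [Pytorch 提速指南 - 云梦的文章 - 知乎](https://zhuanlan.zhihu.com/p/39752167).\n* [`optimizer.zero_grad(set_to_none=False)`](https://pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html#torch-optim-optimizer-zero-grad): setting `set_to_none=True` here lowers the memory footprint and can modestly improve performance (it has been the default since PyTorch 2.0). It also changes some behavior; see the documentation for details. With `set_to_none=False`, `model.zero_grad()` or `optimizer.zero_grad()` executes a `memset` over all parameters and then updates the gradients with read-and-write operations. Setting the gradients to `None` skips the `memset` and updates the gradients with write-only operations, so it is faster.\n* During validation and inference (not during backpropagation), switch to `eval` mode and disable gradient computation with `torch.no_grad`.\n* Consider using the [channels_last](https://pytorch.org/docs/stable/tensor_attributes.html#torch-memory-format) memory format.\n* [Use `DistributedDataParallel` instead of `DataParallel`](https://pytorch.org/docs/stable/notes/cuda.html#use-nn-parallel-distributeddataparallel-instead-of-multiprocessing-or-nn-dataparallel). With multiple GPUs, even on a single node, always prefer `DistributedDataParallel` over `DataParallel`: it is multi-process and creates one process per GPU, which bypasses the Python global interpreter lock (GIL) and improves speed.\n\n#### Model\n\n* Do not initialize any variables you will not use. Initialization and `forward` are separate in PyTorch, so a module is still initialized even if `forward` never uses it.\n* [`@torch.jit.script`](https://pytorch.org/docs/stable/generated/torch.jit.script.html#torch.jit.script): use the PyTorch JIT to fuse pointwise operations into a single CUDA kernel. PyTorch is optimized for operations on large tensors, and running many operations on small tensors is very inefficient. Where possible, rewrite computations in batched form to reduce overhead and improve performance. If you cannot batch the operations by hand, TorchScript can improve the code's performance instead. TorchScript is a subset of Python that PyTorch can validate, and PyTorch automatically optimizes TorchScript code through its just-in-time (JIT) compiler. Batching the operations by hand remains the better option.\n* When using FP16 with mixed precision, set the sizes in every architecture design to multiples of 8.\n* A convolution layer placed directly before BN can drop its bias: mathematically, the bias is cancelled by BN's mean subtraction. This saves model parameters and runtime memory.\n\n#### Data\n\n* Set the batch size to a multiple of 8 to maximize GPU memory utilization.\n* Run NumPy-style operations on the GPU wherever possible.\n* Use `del` to free memory.\n* Avoid unnecessary data transfers between devices.\n* When creating a tensor, specify the device directly instead of creating it first and then moving it to the target device.\n* Use [`torch.from_numpy(ndarray)`](https://pytorch.org/docs/stable/generated/torch.from_numpy.html#torch-from-numpy) or [`torch.as_tensor(data, dtype=None, 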
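device=None)`](https://pytorch.org/docs/stable/generated/torch.as_tensor.html#torch-as-tensor), which share memory and so avoid allocating new storage; see the corresponding documentation for usage details and caveats. If both the source and target devices are the CPU, `torch.from_numpy` and `torch.as_tensor` do not copy the data. If the source is a NumPy array, `torch.from_numpy` is faster. If the source is a tensor with the same dtype and device, `torch.as_tensor` avoids copying. `torch.as_tensor` also accepts a Python list or tuple, but those inputs are copied.\n* Use non-blocking transfers, i.e. set `non_blocking=True`. This attempts an asynchronous conversion where possible, for example when converting a CPU tensor in pinned memory to a CUDA tensor.\n\n### Optimizing the Optimizer\n\n* Store the model parameters in one contiguous block of memory to reduce the time spent in `optimizer.step()`.\n  * [`contiguous_pytorch_params`](https://github.com/PhilJd/contiguous_pytorch_params)\n* Use the [fused building blocks](https://nvidia.github.io/apex/optimizers.html) in APEX.\n\n### Model Design\n\n#### CNN\n\n* ShuffleNetV2, [paper](https://arxiv.org/pdf/1807.11164.pdf).\n  * Equal input and output channels: the MAC (`memory access cost`, the time spent on memory accesses) of a convolution layer is minimized when its input and output channel counts are equal, and the model runs fastest in that case.\n  * Fewer convolution groups: excessive group operations increase MAC and slow the model down.\n  * Fewer branches: the fewer branches a model has, the faster it runs.\n  * Fewer `element-wise` operations: `element-wise` operations cost far more time than their FLOPs suggest, so keep them to a minimum. `depthwise convolution` likewise has low FLOPs but high MAC.\n\n#### Vision Transformer\n\n* TRT-ViT: TensorRT-oriented Vision Transformer, [paper](https://arxiv.org/abs/2205.09579), [commentary](https://www.yuque.com/lart/papers/pghqxg).\n  * stage-level: Transformer blocks are best placed in the later stages of the model, which gives the best trade-off between efficiency and performance.\n  * stage-level: a shallow-then-deep stage design improves performance.\n  * block-level: a block that mixes Transformer and BottleNeck is more effective than a Transformer alone.\n  * block-level: a global-then-local block design helps compensate for performance issues.\n\n#### General Ideas\n\n* Reduce complexity: for example model trimming and pruning, which reduce the number of layers and parameters.\n* Change the model structure: for example model distillation, which obtains a small model through knowledge distillation.\n\n### Inference Acceleration\n\n#### Half Precision and Weight Quantization\n\nDuring inference, replace the original precision (`FP32`) with a lower-precision representation (`FP16`, or even `INT8`, binary networks, or ternary networks).\n\n* `TensorRT` is NVIDIA's neural network inference engine. It supports post-training 8-bit quantization, using a cross-entropy-based quantization algorithm that works by minimizing the difference between two distributions.\n* PyTorch has supported quantization since version 1.3. The implementation is based on QNNPACK and supports post-training quantization, dynamic quantization, and quantization-aware training.\n* `Distiller` is Intel's open-source model optimization tool built on PyTorch, so it also supports PyTorch's quantization techniques.\n* Microsoft's `NNI` integrates several quantization-aware training algorithms and supports multiple open-source frameworks such as `PyTorch/TensorFlow/MXNet/Caffe2`.\n\nFor more details, see [有三 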
AI:【杂谈】当前模型量化有哪些可用的开源工具?](https://mp.weixin.qq.com/s/3uUwf9vQmQ4jkGjLxzb9aQ).\n\n#### Operator Fusion\n\n* [模型推理加速技巧：融合 BN 和 Conv 层 - 小小将的文章 - 知乎](https://zhuanlan.zhihu.com/p/110552861)\n* [网络 inference 阶段 conv 层和 BN 层的融合 - autocyz 的文章 - 知乎](https://zhuanlan.zhihu.com/p/48005099)\n* [PyTorch itself provides similar functionality](https://pytorch.org/docs/1.3.0/quantization.html#torch.quantization.fuse_modules)\n\n#### Re-Parameterization\n\n* [RepVGG](https://arxiv.org/abs/2101.03697)\n  * [RepVGG|让你的 ConVNet 一卷到底，plain 网络首次超过 80%top1 精度](https://mp.weixin.qq.com/s/M4Kspm6hO3W8fXT_JqoEhA)\n\n### Profiling\n\n* Python ships with several profiling modules: `profile`, `cProfile`, and `hotshot` (the last exists only in Python 2). They are used in much the same way; the main difference is whether the module is written in pure Python or in C.\n* [PyTorch Profiler](https://pytorch.org/docs/stable/profiler.html?highlight=profile#module-torch.profiler) is a tool that collects performance metrics during training and inference. Its context manager API can be used to better understand which model operators are the most expensive, examine their input shapes and stack traces, study device kernel activity, and visualize the execution trace.\n\n### Recommended Projects\n\n* [Model compression implemented in PyTorch](https://github.com/666DZY666/model-compression):\n  * Quantization: 8/4/2 bits (dorefa), ternary/binary (twn/bnn/xnor-net).\n  * Pruning: normal, regular, and channel pruning for grouped convolution structures.\n  * Grouped convolution structures.\n  * BN fusion for binary feature quantization.\n\n### Further Reading\n\n* [pytorch dataloader 数据加载占用了大部分时间, 各位大佬都是怎么解决的? - 知乎](https://www.zhihu.com/question/307282137)\n* [使用 pytorch 时, 训练集数据太多达到上千万张, Dataloader 加载很慢怎么办? - 知乎](https://www.zhihu.com/question/356829360)\n* [PyTorch 有哪些坑/bug? - 知乎](https://www.zhihu.com/question/67209417)\n* [Optimizing PyTorch training code](https://sagivtech.com/2017/09/19/optimizing-pytorch-training-code/)\n* [26 秒单 GPU 训练 CIFAR10, Jeff Dean 也点赞的深度学习优化技巧 - 机器之心的文章 - 知乎](https://zhuanlan.zhihu.com/p/79020733)\n* [线上模型加入几个新特征训练后上线, tensorflow serving 预测时间为什么比原来慢 20 多倍? - TzeSing 的回答 - 知乎](https://www.zhihu.com/question/354086469/answer/894235805)\n* [深度学习模型压缩](https://www.yuque.com/lart/gw5mta)\n* [今天, 你的模型加速了吗? 
这里有 5 个方法供你参考(附代码解析)](https://mp.weixin.qq.com/s/_ATSwwVqigvqmDB0Y9lOAQ)\n* [pytorch 常见的坑汇总 - 郁振波的文章 - 知乎](https://zhuanlan.zhihu.com/p/77952356)\n* [Pytorch 提速指南 - 云梦的文章 - 知乎](https://zhuanlan.zhihu.com/p/39752167)\n* [优化 PyTorch 的速度和内存效率（2022）](https://mp.weixin.qq.com/s/ShgNdizIPzeXOREoz8rgJA)\n\n## Saving GPU Memory in PyTorch\n\n\u003e Original document: \u003chttps://www.yuque.com/lart/ugkv9f/nvffyf\u003e\n\u003e\n\u003e Compiled from: Pytorch 有什么节省内存 (显存) 的小技巧? - 知乎 \u003chttps://www.zhihu.com/question/274635237\u003e\n\n### Use In-Place Operations\n\n* Enable `inplace` wherever an operation supports it. For example, `relu` accepts `inplace=True`.\n* `batchnorm` and certain activation functions can be packed together as [`inplace_abn`](https://github.com/mapillary/inplace_abn).\n\n### Loss Function\n\nDeleting the loss at the end of each iteration saves only a little GPU memory, but it is better than nothing. See [Tensor to Variable and memory freeing best practices](https://discuss.pytorch.org/t/tensor-to-variable-and-memory-freeing-best-practices/6000/2)\n\n### Mixed Precision\n\nThis saves some GPU memory and speeds things up, but watch out for operations that are numerically unsafe in low precision, such as mean and sum.\n\n* Introductory articles on mixed-precision training:\n  * [由浅入深的混合精度训练教程](https://mp.weixin.qq.com/s/6FCumAWa8fZ1r7xwIRC9ow)\n* Mixed-precision support provided by [`NVIDIA/Apex`](https://github.com/nvidia/apex).\n  * [PyTorch 必备神器 | 唯快不破：基于 Apex 的混合精度加速](https://mp.weixin.qq.com/s/HQnI8rzPvZN6Q_5c8d1nVQ)\n  * [Pytorch 安装 APEX 疑难杂症解决方案 - 陈瀚可的文章 - 知乎](https://zhuanlan.zhihu.com/p/80386137)\n* [`torch.cuda.amp`, available since PyTorch 1.6](https://pytorch.org/docs/stable/notes/amp_examples.html), for mixed-precision support.\n\n### Managing Operations That Do Not Need Backpropagation\n\n* For forward passes that need no backpropagation, such as validation and inference, wrap the code in `torch.no_grad`.\n  * Note that `model.eval()` is not equivalent to `torch.no_grad()`; see this discussion: ['model.eval()' vs 'with torch.no_grad()'](https://discuss.pytorch.org/t/model-eval-vs-with-torch-no-grad/19615)\n* Set `requires_grad` to `False` on variables that need no gradient so they stay out of the backward pass, which avoids spending GPU memory on unnecessary gradients.\n* Remove gradient paths that do not need to be computed:\n  * [Stochastic Backpropagation: A Memory Efficient Strategy for Training Video Models](https://arxiv.org/abs/2203.16755); commentary is available at:\n    * \u003chttps://www.yuque.com/lart/papers/xu5t00\u003e\n    * \u003chttps://blog.csdn.net/P_LarT/article/details/124978961\u003e\n\n### GPU Memory Cleanup\n\n
* `torch.cuda.empty_cache()` is a step up from `del`: `nvidia-smi` shows a clear change in GPU memory afterwards, although the peak memory usage during training does not seem to change. Give it a try: [How can we release GPU memory cache?](https://discuss.pytorch.org/t/how-can-we-release-gpu-memory-cache/14530)\n* Use `del` to delete intermediate variables you no longer need, or reduce usage by `replacing variables`.\n\n### Gradient Accumulation\n\nSplit a batch of `batchsize=64` into two batches of 32, run forward and backward on each so that the gradients accumulate, and then update the parameters once. Note that this affects layers that depend on `batchsize`, such as `batchnorm`.\n\nThe [PyTorch documentation](https://pytorch.org/docs/stable/notes/amp_examples.html#gradient-accumulation) gives an example that combines gradient accumulation with mixed precision.\n\nGradient accumulation can also speed up distributed training; see [[原创][深度][PyTorch] DDP 系列第三篇：实战与技巧 - 996 黄金一代的文章 - 知乎](https://zhuanlan.zhihu.com/p/250471767)\n\n### Gradient Checkpointing\n\nPyTorch provides [`torch.utils.checkpoint`](https://pytorch.org/docs/1.11/checkpoint.html#torch-utils-checkpoint). It works by re-running the forward pass of each checkpointed segment during backpropagation.\n\nThe paper [Training Deep Nets with Sublinear Memory Cost](https://arxiv.org/abs/1604.06174) uses gradient checkpointing to reduce memory from O(N) to O(sqrt(N)). The deeper the model, the more memory this method saves, without a noticeable slowdown.\n\n* [PyTorch 之 Checkpoint 机制解析](https://www.yuque.com/lart/ugkv9f/azvnyg)\n* [torch.utils.checkpoint 简介 和 简易使用](https://blog.csdn.net/one_six_mix/article/details/93937091)\n* [A PyTorch implementation of Sublinear Memory Cost](https://github.com/Lyken17/pytorch-memonger), taken from [Pytorch 有什么节省内存(显存)的小技巧? - Lyken 的回答 - 知乎](https://www.zhihu.com/question/274635237/answer/755102181)\n\n### Related Tools\n\n* Code that helps you monitor GPU memory during training with PyTorch: [https://github.com/Oldpan/Pytorch-Memory-Utils](https://github.com/Oldpan/Pytorch-Memory-Utils)\n* Just less than nvidia-smi? [https://github.com/wookayin/gpustat](https://github.com/wookayin/gpustat)\n\n### References\n\n* [Pytorch 有什么节省内存(显存)的小技巧? - 郑哲东的回答 - 知乎](https://www.zhihu.com/question/274635237/answer/573633662)\n* [浅谈深度学习: 如何计算模型以及中间变量的显存占用大小](https://oldpan.me/archives/how-to-calculate-gpu-memory)\n* [如何在 Pytorch 中精细化利用显存](https://oldpan.me/archives/how-to-use-memory-pytorch)\n* [Pytorch 有什么节省显存的小技巧? 
- 陈瀚可的回答 - 知乎](https://www.zhihu.com/question/274635237/answer/756144739)\n* [PyTorch 显存机制分析 - Connolly 的文章 - 知乎](https://zhuanlan.zhihu.com/p/424512257)\n\n## Other Tips\n\n### Reproducibility\n\nSee the [relevant section](https://pytorch.org/docs/stable/notes/randomness.html#reproducibility) of the documentation.\n\n#### Enforcing Deterministic Operations\n\n[Avoid nondeterministic algorithms](https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms).\n\nIn PyTorch, [`torch.use_deterministic_algorithms()`](https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html#torch.use_deterministic_algorithms) forces deterministic algorithms to be used in place of nondeterministic ones, and raises an error if an operation is known to be nondeterministic (and has no deterministic alternative).\n\n#### Setting the Random Seed\n\n```python\nimport os\nimport random\n\nimport numpy as np\nimport torch\n\n\ndef seed_torch(seed=1029):\n    random.seed(seed)\n    os.environ['PYTHONHASHSEED'] = str(seed)\n    np.random.seed(seed)\n    torch.manual_seed(seed)\n    torch.cuda.manual_seed(seed)\n    torch.cuda.manual_seed_all(seed)  # if you are using multi-GPU.\n    # Disable cuDNN autotuning and request deterministic cuDNN kernels.\n    torch.backends.cudnn.benchmark = False\n    torch.backends.cudnn.deterministic = True\n\n\nseed_torch()\n```\n\nAdapted from \u003chttps://www.zdaiot.com/MLFrameworks/Pytorch/Pytorch%E9%9A%8F%E6%9C%BA%E7%A7%8D%E5%AD%90/\u003e\n\n#### A Hidden Bug in DataLoader Before PyTorch 1.9\n\nFor details, see [可能 95%的人还在犯的 PyTorch 错误 - serendipity 的文章 - 知乎](https://zhuanlan.zhihu.com/p/523239005)\n\nFor a fix, see the [documentation](https://pytorch.org/docs/stable/notes/randomness.html#dataloader):\n\n```python\nimport random\nimport numpy\nimport torch\nfrom torch.utils.data import DataLoader\n\ndef seed_worker(worker_id):\n    # Derive a per-worker seed from the seed PyTorch assigned to this worker.\n    worker_seed = torch.initial_seed() % 2**32\n    numpy.random.seed(worker_seed)\n    random.seed(worker_seed)\n\nDataLoader(..., worker_init_fn=seed_worker)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flartpang%2Fpytorchtricks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flartpang%2Fpytorchtricks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flartpang%2Fpytorchtricks/lists"}