{"id":13689343,"url":"https://github.com/Stonesjtu/pytorch_memlab","last_synced_at":"2025-05-01T23:33:54.413Z","repository":{"id":51527710,"uuid":"188495333","full_name":"Stonesjtu/pytorch_memlab","owner":"Stonesjtu","description":"Profiling and inspecting memory in pytorch","archived":false,"fork":false,"pushed_at":"2024-08-06T06:18:03.000Z","size":211,"stargazers_count":1018,"open_issues_count":10,"forks_count":37,"subscribers_count":12,"default_branch":"master","last_synced_at":"2024-11-04T17:25:52.722Z","etag":null,"topics":["cuda-memory","memory-profiler","pytorch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Stonesjtu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-05-24T22:39:13.000Z","updated_at":"2024-11-03T03:59:22.000Z","dependencies_parsed_at":"2024-01-16T07:23:48.720Z","dependency_job_id":"21c68920-147c-4f95-b7c0-6e9fd9e30bb2","html_url":"https://github.com/Stonesjtu/pytorch_memlab","commit_stats":{"total_commits":64,"total_committers":9,"mean_commits":7.111111111111111,"dds":0.125,"last_synced_commit":"43e4d09b1f710bdc278e8deaa8d28ba9c3a2f62b"},"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Stonesjtu%2Fpytorch_memlab","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Stonesjtu%2Fpytorch_memlab/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Stonesjtu%2Fpytorch_memlab/releases","manifests_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Stonesjtu%2Fpytorch_memlab/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Stonesjtu","download_url":"https://codeload.github.com/Stonesjtu/pytorch_memlab/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224282239,"owners_count":17285793,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda-memory","memory-profiler","pytorch"],"created_at":"2024-08-02T15:01:44.254Z","updated_at":"2024-11-12T13:31:30.327Z","avatar_url":"https://github.com/Stonesjtu.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"pytorch_memlab\n======\n[![Test](https://github.com/Stonesjtu/pytorch_memlab/actions/workflows/test.yml/badge.svg)](https://github.com/Stonesjtu/pytorch_memlab/actions/workflows/test.yml)\n[![Upload Python Package](https://github.com/Stonesjtu/pytorch_memlab/actions/workflows/pypi-publish.yml/badge.svg)](https://github.com/Stonesjtu/pytorch_memlab/actions/workflows/pypi-publish.yml)\n![PyPI](https://img.shields.io/pypi/v/pytorch_memlab.svg)\n[![CodeQL: Python](https://github.com/Stonesjtu/pytorch_memlab/actions/workflows/github-code-scanning/codeql/badge.svg)](https://github.com/Stonesjtu/pytorch_memlab/actions/workflows/github-code-scanning/codeql)\n![PyPI - Downloads](https://img.shields.io/pypi/dm/pytorch_memlab.svg)\n\nA simple and accurate **CUDA** memory management laboratory for pytorch,\nit consists of different parts about the memory:\n\n- Features:\n\n  - Memory Profiler: 
A `line_profiler` style CUDA memory profiler with simple API.\n  - Memory Reporter: A reporter to inspect tensors occupying the CUDA memory.\n  - Courtesy: An interesting feature to temporarily move all the CUDA tensors into\n    CPU memory for courtesy, and of course to transfer them back afterwards.\n  - IPython support through `%mlrun`/`%%mlrun` line/cell magic\n    commands.\n\n\n- Table of Contents\n  * [Installation](#installation)\n  * [User-Doc](#user-doc)\n    + [Memory Profiler](#memory-profiler)\n    + [IPython support](#ipython-support)\n    + [Memory Reporter](#memory-reporter)\n    + [Courtesy](#courtesy)\n    + [ACK](#ack)\n  * [CHANGES](#changes)\n\nInstallation\n-----\n\n- Released version:\n```bash\npip install pytorch_memlab\n```\n\n- Newest version:\n```bash\npip install git+https://github.com/stonesjtu/pytorch_memlab\n```\n\nWhat's for\n-----\n\nOut-Of-Memory errors in pytorch happen frequently, for newbies and\nexperienced programmers alike. A common reason is that most people don't really\nlearn the underlying memory management philosophy of pytorch and GPUs.\nThey write memory-inefficient code and complain about pytorch eating too\nmuch CUDA memory.\n\nIn this repo, I'm going to share some useful tools to help debug OOM errors, or\nto inspect the underlying mechanism if anyone is interested.\n\n\nUser-Doc\n-----\n\n### Memory Profiler\n\nThe memory profiler is a modification of python's `line_profiler`; it gives\nthe memory usage info for each line of code in the specified function/method.\n\n#### Sample:\n\n```python\nimport torch\nfrom pytorch_memlab import LineProfiler\n\ndef inner():\n    torch.nn.Linear(100, 100).cuda()\n\ndef outer():\n    linear = torch.nn.Linear(100, 100).cuda()\n    linear2 = torch.nn.Linear(100, 100).cuda()\n    inner()\n\nwith LineProfiler(outer, inner) as prof:\n    outer()\nprof.display()\n```\n\nAfter the script finishes or is interrupted by keyboard, it gives the following\nprofiling info if you're in a Jupyter notebook:\n\n\u003cp 
align=\"center\"\u003e\u003cimg src=\"readme-output.png\" width=\"640\"\u003e\u003c/p\u003e\n\nor the following info if you're in a text-only terminal:\n\n```\n## outer\n\nactive_bytes reserved_bytes line  code\n         all            all\n        peak           peak\n       0.00B          0.00B    7  def outer():\n      40.00K          2.00M    8      linear = torch.nn.Linear(100, 100).cuda()\n      80.00K          2.00M    9      linear2 = torch.nn.Linear(100, 100).cuda()\n     120.00K          2.00M   10      inner()\n\n\n## inner\n\nactive_bytes reserved_bytes line  code\n         all            all\n        peak           peak\n      80.00K          2.00M    4  def inner():\n     120.00K          2.00M    5      torch.nn.Linear(100, 100).cuda()\n```\n\nAn explanation of what each column means can be found in the [Torch documentation](https://pytorch.org/docs/stable/cuda.html#torch.cuda.memory_stats). The name of any field from `memory_stats()`\ncan be passed to `display()` to view the corresponding statistic.\n\nIf you use `profile` decorator, the memory statistics are collected during\nmultiple runs and only the maximum one is displayed at the end.\nWe also provide a more flexible API called `profile_every` which prints the\nmemory info every *N* times of function execution. You can simply replace\n`@profile` with `@profile_every(1)` to print the memory usage for each\nexecution.\n\nThe `@profile` and `@profile_every` can also be mixed to gain more control\nof the debugging granularity.\n\n- You can also add the decorator in the module class:\n\n```python\nclass Net(torch.nn.Module):\n    def __init__(self):\n        super().__init__()\n    @profile\n    def forward(self, inp):\n        #do_something\n```\n\n- The *Line Profiler* profiles the memory usage of CUDA device 0 by default,\nyou may want to switch the device to profile by `set_target_gpu`. 
The GPU\nselection is global, which means you have to remember which GPU you are\nprofiling on during the whole process:\n\n```python\nimport torch\nfrom pytorch_memlab import profile, set_target_gpu\n@profile\ndef func():\n    net1 = torch.nn.Linear(1024, 1024).cuda(0)\n    set_target_gpu(1)\n    net2 = torch.nn.Linear(1024, 1024).cuda(1)\n    set_target_gpu(0)\n    net3 = torch.nn.Linear(1024, 1024).cuda(0)\n\nfunc()\n```\n\n\nMore samples can be found in `test/test_line_profiler.py`.\n\n### IPython support\n\nMake sure you have `IPython` installed, or have installed `pytorch-memlab` with\n`pip install pytorch-memlab[ipython]`.\n\nFirst, load the extension:\n\n```python\n%load_ext pytorch_memlab\n```\n\nThis makes the `%mlrun` and `%%mlrun` line/cell magics available for use. For\nexample, in a new cell run the following to profile an entire cell:\n\n```python\n%%mlrun -f func\nimport torch\nfrom pytorch_memlab import profile, set_target_gpu\ndef func():\n    net1 = torch.nn.Linear(1024, 1024).cuda(0)\n    set_target_gpu(1)\n    net2 = torch.nn.Linear(1024, 1024).cuda(1)\n    set_target_gpu(0)\n    net3 = torch.nn.Linear(1024, 1024).cuda(0)\n```\n\nOr you can invoke the profiler for a single statement via the `%mlrun` line\nmagic.\n\n```python\nimport torch\nfrom pytorch_memlab import profile, set_target_gpu\ndef func(input_size):\n    net1 = torch.nn.Linear(input_size, 1024).cuda(0)\n%mlrun -f func func(2048)\n```\n\nSee `%mlrun?` for help on what arguments are supported. 
You can set the GPU\ndevice to profile, dump profiling results to a file, and return the\n`LineProfiler` object for post-profile inspection.\n\nFind out more by checking out the [demo Jupyter notebook](./demo.ipynb)\n\n\n### Memory Reporter\n\nAs *Memory Profiler* only gives the overall memory usage information by lines,\nlower-level memory usage information can be obtained from *Memory Reporter*.\n\n*Memory reporter* iterates over all the `Tensor` objects and gets the underlying\n`UntypedStorage` (previously `Storage`) object to get the actual memory usage instead of the surface\n`Tensor.size`.\n\n\u003e see [UntypedStorage](https://pytorch.org/docs/stable/storage.html#torch.UntypedStorage) for detailed\n\u003e  information\n\n#### Sample\n\n- A minimal one:\n\n```python\nimport torch\nfrom pytorch_memlab import MemReporter\nlinear = torch.nn.Linear(1024, 1024).cuda()\nreporter = MemReporter()\nreporter.report()\n```\noutputs:\n```\nElement type                                            Size  Used MEM\n-------------------------------------------------------------------------------\nStorage on cuda:0\nParameter0                                      (1024, 1024)     4.00M\nParameter1                                           (1024,)     4.00K\n-------------------------------------------------------------------------------\nTotal Tensors: 1049600  Used Memory: 4.00M\nThe allocated memory on cuda:0: 4.00M\n-------------------------------------------------------------------------------\n```\n\n- You can also pass in a model object for automatic name inference.\n\n```python\nimport torch\nfrom pytorch_memlab import MemReporter\n\nlinear = torch.nn.Linear(1024, 1024).cuda()\ninp = torch.Tensor(512, 1024).cuda()\n# pass in a model to automatically infer the tensor names\nreporter = MemReporter(linear)\nout = linear(inp).mean()\nprint('========= before backward =========')\nreporter.report()\nout.backward()\nprint('========= after backward 
=========')\nreporter.report()\n```\n\noutputs:\n```\n========= before backward =========\nElement type                                            Size  Used MEM\n-------------------------------------------------------------------------------\nStorage on cuda:0\nweight                                          (1024, 1024)     4.00M\nbias                                                 (1024,)     4.00K\nTensor0                                          (512, 1024)     2.00M\nTensor1                                                 (1,)   512.00B\n-------------------------------------------------------------------------------\nTotal Tensors: 1573889  Used Memory: 6.00M\nThe allocated memory on cuda:0: 6.00M\n-------------------------------------------------------------------------------\n========= after backward =========\nElement type                                            Size  Used MEM\n-------------------------------------------------------------------------------\nStorage on cuda:0\nweight                                          (1024, 1024)     4.00M\nweight.grad                                     (1024, 1024)     4.00M\nbias                                                 (1024,)     4.00K\nbias.grad                                            (1024,)     4.00K\nTensor0                                          (512, 1024)     2.00M\nTensor1                                                 (1,)   512.00B\n-------------------------------------------------------------------------------\nTotal Tensors: 2623489  Used Memory: 10.01M\nThe allocated memory on cuda:0: 10.01M\n-------------------------------------------------------------------------------\n```\n\n\n- The reporter automatically deals with the sharing weights parameters:\n\n```python\nimport torch\nfrom pytorch_memlab import MemReporter\n\nlinear = torch.nn.Linear(1024, 1024).cuda()\nlinear2 = torch.nn.Linear(1024, 1024).cuda()\nlinear2.weight = linear.weight\ncontainer = torch.nn.Sequential(\n    
linear, linear2\n)\ninp = torch.Tensor(512, 1024).cuda()\n# pass in a model to automatically infer the tensor names\n\nout = container(inp).mean()\nout.backward()\n\n# verbose shows how storage is shared across multiple Tensors\nreporter = MemReporter(container)\nreporter.report(verbose=True)\n```\n\noutputs:\n```\nElement type                                            Size  Used MEM\n-------------------------------------------------------------------------------\nStorage on cuda:0\n0.weight                                        (1024, 1024)     4.00M\n0.weight.grad                                   (1024, 1024)     4.00M\n0.bias                                               (1024,)     4.00K\n0.bias.grad                                          (1024,)     4.00K\n1.bias                                               (1024,)     4.00K\n1.bias.grad                                          (1024,)     4.00K\nTensor0                                          (512, 1024)     2.00M\nTensor1                                                 (1,)   512.00B\n-------------------------------------------------------------------------------\nTotal Tensors: 2625537  Used Memory: 10.02M\nThe allocated memory on cuda:0: 10.02M\n-------------------------------------------------------------------------------\n```\n\n- You can better understand the memory layout for more complicated module:\n\n```python\nimport torch\nfrom pytorch_memlab import MemReporter\n\nlstm = torch.nn.LSTM(1024, 1024).cuda()\nreporter = MemReporter(lstm)\nreporter.report(verbose=True)\ninp = torch.Tensor(10, 10, 1024).cuda()\nout, _ = lstm(inp)\nout.mean().backward()\nreporter.report(verbose=True)\n```\n\nAs shown below, the `(-\u003e)` indicates the re-use of the same storage back-end\noutputs:\n```\nElement type                                            Size  Used MEM\n-------------------------------------------------------------------------------\nStorage on cuda:0\nweight_ih_l0                              
      (4096, 1024)    32.03M\nweight_hh_l0(-\u003eweight_ih_l0)                    (4096, 1024)     0.00B\nbias_ih_l0(-\u003eweight_ih_l0)                           (4096,)     0.00B\nbias_hh_l0(-\u003eweight_ih_l0)                           (4096,)     0.00B\nTensor0                                       (10, 10, 1024)   400.00K\n-------------------------------------------------------------------------------\nTotal Tensors: 8499200  Used Memory: 32.42M\nThe allocated memory on cuda:0: 32.52M\nMemory differs due to the matrix alignment\n-------------------------------------------------------------------------------\nElement type                                            Size  Used MEM\n-------------------------------------------------------------------------------\nStorage on cuda:0\nweight_ih_l0                                    (4096, 1024)    32.03M\nweight_ih_l0.grad                               (4096, 1024)    32.03M\nweight_hh_l0(-\u003eweight_ih_l0)                    (4096, 1024)     0.00B\nweight_hh_l0.grad(-\u003eweight_ih_l0.grad)          (4096, 1024)     0.00B\nbias_ih_l0(-\u003eweight_ih_l0)                           (4096,)     0.00B\nbias_ih_l0.grad(-\u003eweight_ih_l0.grad)                 (4096,)     0.00B\nbias_hh_l0(-\u003eweight_ih_l0)                           (4096,)     0.00B\nbias_hh_l0.grad(-\u003eweight_ih_l0.grad)                 (4096,)     0.00B\nTensor0                                       (10, 10, 1024)   400.00K\nTensor1                                       (10, 10, 1024)   400.00K\nTensor2                                        (1, 10, 1024)    40.00K\nTensor3                                        (1, 10, 1024)    40.00K\n-------------------------------------------------------------------------------\nTotal Tensors: 17018880         Used Memory: 64.92M\nThe allocated memory on cuda:0: 65.11M\nMemory differs due to the matrix 
alignment\n-------------------------------------------------------------------------------\n```\n\nNOTICE:\n\u003e When forwarding with `grad_mode=True`, pytorch maintains tensor buffers for\n\u003e future Back-Propagation, in C level. So these buffers are not going to be\n\u003e managed or collected by pytorch. But if you store these intermediate results\n\u003e as python variables, then they will be reported.\n\n- You can also filter the device to report on by passing extra arguments:\n`report(device=torch.device(0))`\n\n- A failed example due to pytorch's C side tensor buffers\n\nIn the following example, a temp buffer is created at `inp * (inp + 2)` to\nstore both `inp` and `inp + 2`, unfortunately python only knows the existence\nof inp, so we have *2M* memory lost, which is the same size of Tensor `inp`.\n\n```python\nimport torch\nfrom pytorch_memlab import MemReporter\n\nlinear = torch.nn.Linear(1024, 1024).cuda()\ninp = torch.Tensor(512, 1024).cuda()\n# pass in a model to automatically infer the tensor names\nreporter = MemReporter(linear)\nout = linear(inp * (inp + 2)).mean()\nreporter.report()\n```\n\noutputs:\n```\nElement type                                            Size  Used MEM\n-------------------------------------------------------------------------------\nStorage on cuda:0\nweight                                          (1024, 1024)     4.00M\nbias                                                 (1024,)     4.00K\nTensor0                                          (512, 1024)     2.00M\nTensor1                                                 (1,)   512.00B\n-------------------------------------------------------------------------------\nTotal Tensors: 1573889  Used Memory: 6.00M\nThe allocated memory on cuda:0: 8.00M\nMemory differs due to the matrix alignment or invisible gradient buffer tensors\n-------------------------------------------------------------------------------\n```\n\n\n### Courtesy\n\nSometimes people would like to preempt your 
running task, but you don't want\nto save a checkpoint and then reload it. All they actually need is GPU resources\n(CPU resources and CPU memory are typically spare in GPU clusters), so\nyou can move all your workspaces from GPU to CPU and then halt your task until\na restart signal is triggered, instead of saving and loading checkpoints and\nbootstrapping from scratch.\n\nStill under development, but you can have fun with:\n```python\nfrom pytorch_memlab import Courtesy\n\n# num_iteration, something_happens and wait_for_restart_signal are placeholders\niamcourtesy = Courtesy()\nfor i in range(num_iteration):\n    if something_happens:\n        iamcourtesy.yield_memory()\n        wait_for_restart_signal()\n        iamcourtesy.restore()\n```\n\n#### Known Issues\n\n- As stated above in *Memory Reporter*, intermediate tensors are not covered\nproperly, so you may want to insert such courtesy logic after `backward` or\nbefore `forward`.\n- Currently the CUDA context of pytorch requires about 1 GB of CUDA memory, which\nmeans that even when all Tensors are on CPU, 1 GB of CUDA memory is wasted :-(. 
However\nit's still under investigation if I can fully destroy the context and then\nre-init.\n\n\n### ACK\n\nI suffered a lot debugging weird memory usage during my 3-years of developing\nefficient Deep Learning models, and of course learned a lot from the great\nopen source community.\n\n## CHANGES\n\n\n##### 0.3.0 (2023-7-29)\n  - Fix `DataFrame.drop` for pandas 1.5+\n##### 0.2.4 (2021-10-28)\n  - Fix colab error (#35)\n  - Support python3.8 (#38)\n  - Support sparse tensor (#30)\n##### 0.2.3 (2020-12-01)\n  - Fix name mapping in `MemReporter` (#24)\n  - Fix reporter without model input (#22 #25)\n##### 0.2.2 (2020-10-23)\n  - Fix memory leak in `MemReporter`\n##### 0.2.1 (2020-06-18)\n  - Fix `line_profiler` not found\n##### 0.2.0 (2020-06-15)\n  - Add jupyter notebook figure and ipython support\n##### 0.1.0 (2020-04-17)\n  - Add ipython magic support (#8)\n##### 0.0.4 (2019-10-08)\n  - Add gpu switch for line-profiler(#2)\n  - Add device filter for reporter\n##### 0.0.3 (2019-06-15)\n  - Install dependency for pip installation\n##### 0.0.2 (2019-06-04)\n  - Fix statistics shift in loop\n##### 0.0.1 (2019-05-28)\n  - initial release\n\n## Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=stonesjtu/pytorch_memlab\u0026type=Date)](https://star-history.com/#stonesjtu/pytorch_memlab\u0026Date)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FStonesjtu%2Fpytorch_memlab","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FStonesjtu%2Fpytorch_memlab","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FStonesjtu%2Fpytorch_memlab/lists"}