{ "info": { "author": "Kaiyu Shi", "author_email": "skyisno.1@gmail.com", "bugtrack_url": null, "classifiers": [ "Programming Language :: Python", "Topic :: Software Development :: Libraries :: Python Modules" ], "description": "pytorch_memlab\n======\n[![Build Status](https://travis-ci.com/Stonesjtu/pytorch_memlab.svg?token=vyTdxHbi1PCRzV6disHp&branch=master)](https://travis-ci.com/Stonesjtu/pytorch_memlab)\n![PyPI](https://img.shields.io/pypi/v/pytorch_memlab.svg)\n![PyPI - Downloads](https://img.shields.io/pypi/dm/pytorch_memlab.svg)\n\nA simple and accurate **CUDA** memory management laboratory for pytorch.\nIt consists of several memory-related parts:\n - A `line_profiler` style CUDA memory profiler with a simple API.\n - A reporter to inspect tensors occupying the CUDA memory.\n - A courtesy feature that temporarily moves all the CUDA tensors into\n   CPU memory, and of course transfers them back afterwards.\n\nInstallation\n-----\n\n- Released version:\n```bash\npip install pytorch_memlab\n```\n\n- Newest version:\n```bash\npip install git+https://github.com/stonesjtu/pytorch_memlab\n```\n\nWhat it's for\n-----\n\nOut-Of-Memory errors in pytorch happen frequently, for newbies and\nexperienced programmers alike. 
A common reason is that most people never really\nlearn the underlying memory management philosophy of pytorch and GPUs.\nThey write memory-inefficient code and then complain about pytorch eating too\nmuch CUDA memory.\n\nIn this repo, I'm going to share some useful tools to help debug OOM errors, or\nto inspect the underlying mechanism if anyone is interested.\n\n\nUser-Doc\n-----\n\n### Memory Profiler\n\nThe memory profiler is a modification of python's `line_profiler`; it gives\nthe memory usage info for each line of code in the specified function/method.\n\n#### Sample:\n\n```python\nimport torch\nfrom pytorch_memlab import profile\n\n@profile\ndef work():\n    linear = torch.nn.Linear(100, 100).cuda()\n    linear2 = torch.nn.Linear(100, 100).cuda()\n    linear3 = torch.nn.Linear(100, 100).cuda()\n\n```\n\nAfter the script finishes or is interrupted from the keyboard, it prints the following\nprofiling info.\n\n```\nFunction: work at line 71\n\nLine #  Max usage  Peak usage  diff max  diff peak  Line Contents\n===============================================================\n    71                                              @profile\n    72                                              def work():\n    73                                                  # comment\n    74    885.00K       1.00M    40.00K      0.00B      linear = torch.nn.Linear(100, 100).cuda()\n    75    925.00K       1.00M    40.00K      0.00B      linear_2 = torch.nn.Linear(100, 100).cuda()\n    76    965.00K       1.00M    40.00K      0.00B      linear_3 = torch.nn.Linear(100, 100).cuda()\n\n```\n\nTerminology:\n - `Max usage`: the max of pytorch's allocated memory (the finish memory),\n   i.e. the memory usage after this line is executed.\n - `Peak usage`: the max of pytorch's cached memory (the peak memory),\n   i.e. the peak memory usage during the execution of this line. 
Pytorch caches\n   CUDA memory in atomic blocks of 1M, so the cached memory is unchanged in the\n   sample above.\n - `diff max`: the difference in `Max usage` caused by this line\n - `diff peak`: the difference in `Peak usage` caused by this line\n\n\nIf you use the `profile` decorator, the memory statistics are collected over\nmultiple runs and only the maximum is displayed at the end.\nWe also provide a more flexible API called `profile_every`, which prints the\nmemory info every *N* executions of the function. You can simply replace\n`@profile` with `@profile_every(1)` to print the memory usage for each\nexecution.\n\n`@profile` and `@profile_every` can also be mixed to gain more control\nover the debugging granularity.\n\n- You can also add the decorator to a method of a module class:\n\n```python\nimport torch\nfrom pytorch_memlab import profile\n\nclass Net(torch.nn.Module):\n    def __init__(self):\n        super().__init__()\n\n    @profile\n    def forward(self, inp):\n        pass  # do_something\n```\n\n- The *Line Profiler* profiles the memory usage of CUDA device 0 by default;\nyou can switch the profiled device with `set_target_gpu`. 
The gpu\nselection is global, which means you have to keep track of which gpu you are\nprofiling throughout the whole process:\n\n```python\nimport torch\nfrom pytorch_memlab import profile, set_target_gpu\n\n@profile\ndef func():\n    net1 = torch.nn.Linear(1024, 1024).cuda(0)\n    set_target_gpu(1)\n    net2 = torch.nn.Linear(1024, 1024).cuda(1)\n    set_target_gpu(0)\n    net3 = torch.nn.Linear(1024, 1024).cuda(0)\n```\n\n\nMore samples can be found in `test/test_line_profiler.py`\n\n\n### Memory Reporter\n\nAs *Memory Profiler* only gives the overall memory usage information by line,\nmore low-level memory usage information can be obtained with *Memory Reporter*.\n\n*Memory reporter* iterates over all the `Tensor` objects and gets the underlying\n`Storage` object to obtain the actual memory usage instead of the surface\n`Tensor.size`.\n\n#### Sample\n\n- A minimal one:\n\n```python\nimport torch\nfrom pytorch_memlab import MemReporter\n\nlinear = torch.nn.Linear(1024, 1024).cuda()\nreporter = MemReporter()\nreporter.report()\n```\noutputs:\n```\nElement type Size Used MEM\n-------------------------------------------------------------------------------\nStorage on cuda:0\nParameter0 (1024, 1024) 4.00M\nParameter1 (1024,) 4.00K\n-------------------------------------------------------------------------------\nTotal Tensors: 1049600 Used Memory: 4.00M\nThe allocated memory on cuda:0: 4.00M\n-------------------------------------------------------------------------------\n```\n\n- You can also pass in a model object for automatic name inference.\n\n```python\nimport torch\nfrom pytorch_memlab import MemReporter\n\nlinear = torch.nn.Linear(1024, 1024).cuda()\ninp = torch.Tensor(512, 1024).cuda()\n# pass in a model to automatically infer the tensor names\nreporter = MemReporter(linear)\nout = linear(inp).mean()\nprint('========= before backward =========')\nreporter.report()\nout.backward()\nprint('========= after backward =========')\nreporter.report()\n```\n\noutputs:\n```\n========= before backward =========\nElement type Size Used 
MEM\n-------------------------------------------------------------------------------\nStorage on cuda:0\nweight (1024, 1024) 4.00M\nbias (1024,) 4.00K\nTensor0 (512, 1024) 2.00M\nTensor1 (1,) 512.00B\n-------------------------------------------------------------------------------\nTotal Tensors: 1573889 Used Memory: 6.00M\nThe allocated memory on cuda:0: 6.00M\n-------------------------------------------------------------------------------\n========= after backward =========\nElement type Size Used MEM\n-------------------------------------------------------------------------------\nStorage on cuda:0\nweight (1024, 1024) 4.00M\nweight.grad (1024, 1024) 4.00M\nbias (1024,) 4.00K\nbias.grad (1024,) 4.00K\nTensor0 (512, 1024) 2.00M\nTensor1 (1,) 512.00B\n-------------------------------------------------------------------------------\nTotal Tensors: 2623489 Used Memory: 10.01M\nThe allocated memory on cuda:0: 10.01M\n-------------------------------------------------------------------------------\n```\n\n\n- The reporter automatically deals with the sharing weights parameters:\n\n```python\nlinear = torch.nn.Linear(1024, 1024).cuda()\nlinear2 = torch.nn.Linear(1024, 1024).cuda()\nlinear2.weight = linear.weight\ncontainer = torch.nn.Sequential(\n linear, linear2\n)\ninp = torch.Tensor(512, 1024).cuda()\n# pass in a model to automatically infer the tensor names\n\nout = container(inp).mean()\nout.backward()\n\n# verbose shows how storage is shared across multiple Tensors\nreporter = MemReporter(container)\nreporter.report(verbose=True)\n```\n\noutputs:\n```\nElement type Size Used MEM\n-------------------------------------------------------------------------------\nStorage on cuda:0\n0.weight (1024, 1024) 4.00M\n0.weight.grad (1024, 1024) 4.00M\n0.bias (1024,) 4.00K\n0.bias.grad (1024,) 4.00K\n1.bias (1024,) 4.00K\n1.bias.grad (1024,) 4.00K\nTensor0 (512, 1024) 2.00M\nTensor1 (1,) 
512.00B\n-------------------------------------------------------------------------------\nTotal Tensors: 2625537 Used Memory: 10.02M\nThe allocated memory on cuda:0: 10.02M\n-------------------------------------------------------------------------------\n```\n\n- You can better understand the memory layout for more complicated module:\n\n```python\nlstm = torch.nn.LSTM(1024, 1024).cuda()\nreporter = MemReporter(lstm)\nreporter.report(verbose=True)\ninp = torch.Tensor(10, 10, 1024).cuda()\nout, _ = lstm(inp)\nout.mean().backward()\nreporter.report(verbose=True)\n```\n\nAs shown below, the `(->)` indicates the re-use of the same storage back-end\noutputs:\n```\nElement type Size Used MEM\n-------------------------------------------------------------------------------\nStorage on cuda:0\nweight_ih_l0 (4096, 1024) 32.03M\nweight_hh_l0(->weight_ih_l0) (4096, 1024) 0.00B\nbias_ih_l0(->weight_ih_l0) (4096,) 0.00B\nbias_hh_l0(->weight_ih_l0) (4096,) 0.00B\nTensor0 (10, 10, 1024) 400.00K\n-------------------------------------------------------------------------------\nTotal Tensors: 8499200 Used Memory: 32.42M\nThe allocated memory on cuda:0: 32.52M\nMemory differs due to the matrix alignment\n-------------------------------------------------------------------------------\nElement type Size Used MEM\n-------------------------------------------------------------------------------\nStorage on cuda:0\nweight_ih_l0 (4096, 1024) 32.03M\nweight_ih_l0.grad (4096, 1024) 32.03M\nweight_hh_l0(->weight_ih_l0) (4096, 1024) 0.00B\nweight_hh_l0.grad(->weight_ih_l0.grad) (4096, 1024) 0.00B\nbias_ih_l0(->weight_ih_l0) (4096,) 0.00B\nbias_ih_l0.grad(->weight_ih_l0.grad) (4096,) 0.00B\nbias_hh_l0(->weight_ih_l0) (4096,) 0.00B\nbias_hh_l0.grad(->weight_ih_l0.grad) (4096,) 0.00B\nTensor0 (10, 10, 1024) 400.00K\nTensor1 (10, 10, 1024) 400.00K\nTensor2 (1, 10, 1024) 40.00K\nTensor3 (1, 10, 1024) 40.00K\n-------------------------------------------------------------------------------\nTotal 
Tensors: 17018880 Used Memory: 64.92M\nThe allocated memory on cuda:0: 65.11M\nMemory differs due to the matrix alignment\n-------------------------------------------------------------------------------\n```\n\nNOTICE:\n> When forwarding with `grad_mode=True`, pytorch maintains tensor buffers for\n> future Back-Propagation at the C level. So these buffers are not going to be\n> managed or collected by pytorch. But if you store these intermediate results\n> as python variables, then they will be reported.\n\n- You can also filter the device to report on by passing extra arguments:\n`report(device=torch.device(0))`\n\n- A failing example due to pytorch's C-side tensor buffers:\n\nIn the following example, a temp buffer is created at `inp * (inp + 2)` to\nstore both `inp` and `inp + 2`. Unfortunately python only knows about the existence\nof `inp`, so *2M* of memory is unaccounted for, which is the same size as Tensor `inp`.\n\n```python\nimport torch\nfrom pytorch_memlab import MemReporter\n\nlinear = torch.nn.Linear(1024, 1024).cuda()\ninp = torch.Tensor(512, 1024).cuda()\n# pass in a model to automatically infer the tensor names\nreporter = MemReporter(linear)\nout = linear(inp * (inp + 2)).mean()\nreporter.report()\n```\n\noutputs:\n```\nElement type Size Used MEM\n-------------------------------------------------------------------------------\nStorage on cuda:0\nweight (1024, 1024) 4.00M\nbias (1024,) 4.00K\nTensor0 (512, 1024) 2.00M\nTensor1 (1,) 512.00B\n-------------------------------------------------------------------------------\nTotal Tensors: 1573889 Used Memory: 6.00M\nThe allocated memory on cuda:0: 8.00M\nMemory differs due to the matrix alignment or invisible gradient buffer tensors\n-------------------------------------------------------------------------------\n```\n\n\n### Courtesy\n\nSometimes people would like to preempt your running task, but you don't want\nto save a checkpoint and then load it. All they actually need is GPU resources\n(CPU resources and CPU memory are typically spare in GPU clusters), so\nyou can 
move all your workspaces from GPU to CPU and then halt your task until\na restart signal is triggered, instead of saving and loading checkpoints and\nbootstrapping from scratch.\n\nStill under development, but you can have fun with:\n```python\nfrom pytorch_memlab import Courtesy\n\niamcourtesy = Courtesy()\nfor i in range(num_iteration):\n    if something_happens:\n        iamcourtesy.yield_memory()\n        wait_for_restart_signal()\n        iamcourtesy.restore()\n```\n\n#### Known Issues\n\n- As stated above in `Memory_Reporter`, intermediate tensors are not covered\nproperly, so you may want to insert such courtesy logic after `backward` or\nbefore `forward`.\n- Currently the CUDA context of pytorch requires about 1 GB of CUDA memory, which\nmeans that even if all Tensors are on CPU, 1 GB of CUDA memory is wasted :-(. However,\nit's still under investigation whether I can fully destroy the context and then\nre-init.\n\n\n### ACK\n\nI suffered a lot debugging weird memory usage during my 3 years of developing\nefficient Deep Learning models, and of course learned a lot from the great\nopen source community.\n\n## CHANGES\n\n##### 0.0.4 (2019-10-08)\n - Add gpu switch for line-profiler(#2)\n - Add device filter for reporter\n##### 0.0.3 (2019-06-15)\n - Install dependency for pip installation\n##### 0.0.2 (2019-06-04)\n - Fix statistics shift in loop\n##### 0.0.1 (2019-05-28)\n - Initial release", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/Stonesjtu/pytorch_memlab", "keywords": "pytorch memory profile", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "pytorch-memlab", "package_url": "https://pypi.org/project/pytorch-memlab/", "platform": "", "project_url": "https://pypi.org/project/pytorch-memlab/", "project_urls": { "Homepage": "https://github.com/Stonesjtu/pytorch_memlab" }, "release_url": "https://pypi.org/project/pytorch-memlab/0.0.4/", 
"requires_dist": null, "requires_python": "", "summary": "A lab to do simple and accurate memory experiments on pytorch", "version": "0.0.4" }, "last_serial": 5943075, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "8702353ab5a42c5682ef73904fd0d354", "sha256": "6b68bf4f322d435c6b6b4433eb6cf9f6a77a4261f6cdcdd142d26810e755eb6e" }, "downloads": -1, "filename": "pytorch_memlab-0.0.1.tar.gz", "has_sig": false, "md5_digest": "8702353ab5a42c5682ef73904fd0d354", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10911, "upload_time": "2019-05-28T01:44:04", "url": "https://files.pythonhosted.org/packages/d2/57/9b54f945c424414a078f550fa4afd80f9135b29c9f57fc3602f199f25a4a/pytorch_memlab-0.0.1.tar.gz" } ], "0.0.2": [ { "comment_text": "", "digests": { "md5": "f7151be7a84aba1ee19ff5847c4223b3", "sha256": "526b2dfd6ccc30ef03b869f55cf1c8c7d856d444370003241100bfbbf47f1bd1" }, "downloads": -1, "filename": "pytorch_memlab-0.0.2.tar.gz", "has_sig": false, "md5_digest": "f7151be7a84aba1ee19ff5847c4223b3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11187, "upload_time": "2019-06-04T19:19:36", "url": "https://files.pythonhosted.org/packages/d2/61/9473c395678bf17aea8d67c60939e27d353dea5701ce5396b04895fa47b0/pytorch_memlab-0.0.2.tar.gz" } ], "0.0.3": [ { "comment_text": "", "digests": { "md5": "eb3e2fe840c170217d46367961e68a10", "sha256": "1cb4a473a2c06c1f5c39fd246201afe13b54d040129eafeb512e715e0871f3b7" }, "downloads": -1, "filename": "pytorch_memlab-0.0.3.tar.gz", "has_sig": false, "md5_digest": "eb3e2fe840c170217d46367961e68a10", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14280, "upload_time": "2019-06-16T01:23:44", "url": "https://files.pythonhosted.org/packages/42/4d/7183128757e5b5059d71b78a052c17f3c99c02724be66ead1c4266b7ce28/pytorch_memlab-0.0.3.tar.gz" } ], "0.0.4": [ { "comment_text": "", "digests": { "md5": "c0a056c61a6a98ca2ad461bb4f7118ed", 
"sha256": "f547a68dc2a0e24ce6b945323d09414eee5866090143f01de902563cea68521b" }, "downloads": -1, "filename": "pytorch_memlab-0.0.4.tar.gz", "has_sig": false, "md5_digest": "c0a056c61a6a98ca2ad461bb4f7118ed", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15357, "upload_time": "2019-10-08T06:29:14", "url": "https://files.pythonhosted.org/packages/ad/a6/06c7847e745eed79689f71bb4ca42432a4316b7b1bffdcb0fcd3e05eb966/pytorch_memlab-0.0.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "c0a056c61a6a98ca2ad461bb4f7118ed", "sha256": "f547a68dc2a0e24ce6b945323d09414eee5866090143f01de902563cea68521b" }, "downloads": -1, "filename": "pytorch_memlab-0.0.4.tar.gz", "has_sig": false, "md5_digest": "c0a056c61a6a98ca2ad461bb4f7118ed", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15357, "upload_time": "2019-10-08T06:29:14", "url": "https://files.pythonhosted.org/packages/ad/a6/06c7847e745eed79689f71bb4ca42432a4316b7b1bffdcb0fcd3e05eb966/pytorch_memlab-0.0.4.tar.gz" } ] }