{"id":13741026,"url":"https://github.com/itamarst/dask-memusage","last_synced_at":"2025-09-14T06:34:10.758Z","repository":{"id":54953338,"uuid":"150315365","full_name":"itamarst/dask-memusage","owner":"itamarst","description":"A low-impact profiler to figure out how much memory each task in Dask is using","archived":false,"fork":false,"pushed_at":"2023-04-10T02:30:45.000Z","size":25,"stargazers_count":24,"open_issues_count":9,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-07-25T22:19:19.753Z","etag":null,"topics":["dask","memory","profiler","profiling","python"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/itamarst.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-09-25T19:01:40.000Z","updated_at":"2024-07-28T19:48:06.000Z","dependencies_parsed_at":"2022-08-14T07:20:37.561Z","dependency_job_id":null,"html_url":"https://github.com/itamarst/dask-memusage","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/itamarst/dask-memusage","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itamarst%2Fdask-memusage","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itamarst%2Fdask-memusage/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itamarst%2Fdask-memusage/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itamarst%2Fdask-memusage/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/itamarst","download_url":"h
ttps://codeload.github.com/itamarst/dask-memusage/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itamarst%2Fdask-memusage/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271767983,"owners_count":24817592,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-23T02:00:09.327Z","response_time":69,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dask","memory","profiler","profiling","python"],"created_at":"2024-08-03T04:00:54.644Z","updated_at":"2025-08-23T20:19:47.307Z","avatar_url":"https://github.com/itamarst.png","language":"Python","funding_links":[],"categories":["Packages"],"sub_categories":[],"readme":"# dask-memusage\n\nIf you're using Dask with tasks that use a lot of memory, RAM is your bottleneck for parallelism.\nThat means you want to know how much memory each task uses:\n\n1. So you can set the highest parallelism level (processes or threads) for each machine, given available RAM.\n2. 
In order to know where to focus memory optimization efforts.\n\n`dask-memusage` is an MIT-licensed statistical memory profiler for Dask's Distributed scheduler that can help you with both these problems.\n\n`dask-memusage` polls your processes for memory usage and records the minimum and maximum usage for each task in the Dask execution graph in a CSV:\n\n```csv\ntask_key,min_memory_mb,max_memory_mb\n\"('from_sequence-map-sum-part-e15703211a549e75b11c63e0054b53e5', 0)\",44.84765625,96.98046875\n\"('from_sequence-map-sum-part-e15703211a549e75b11c63e0054b53e5', 1)\",47.015625,97.015625\n\"('sum-part-e15703211a549e75b11c63e0054b53e5', 0)\",0,0\n\"('sum-part-e15703211a549e75b11c63e0054b53e5', 1)\",0,0\nsum-aggregate-apply-no_allocate-4c30eb545d4c778f0320d973d9fc8ea6,0,0\napply-no_allocate-4c30eb545d4c778f0320d973d9fc8ea6,47.265625,47.265625\n```\n\nYou may also find the [Fil memory profiler](https://pythonspeed.com/fil) useful in tracking down which specific parts of your code are responsible for peak memory allocations.\n\n## Example\n\nHere's a working standalone program using `dask-memusage`; notice you just need to add two lines of code:\n\n```python\nfrom time import sleep\nimport numpy as np\nfrom dask.bag import from_sequence\nfrom dask import compute\nfrom dask.distributed import Client, LocalCluster\n\nfrom dask_memusage import install  # \u003c-- IMPORT\n\ndef allocate_50mb(x):\n    \"\"\"Allocate 50MB of RAM.\"\"\"\n    sleep(1)\n    arr = np.ones((50, 1024, 1024), dtype=np.uint8)\n  
  sleep(1)\n    return x * 2\n\ndef no_allocate(y):\n    \"\"\"Don't allocate any memory.\"\"\"\n    return y * 2\n\ndef make_bag():\n    \"\"\"Create a bag.\"\"\"\n    return from_sequence(\n        [1, 2], npartitions=2\n    ).map(allocate_50mb).sum().apply(no_allocate)\n\ndef main():\n    cluster = LocalCluster(n_workers=2, threads_per_worker=1,\n                           memory_limit=None)\n    install(cluster.scheduler, \"memusage.csv\")  # \u003c-- INSTALL\n    client = Client(cluster)\n    compute(make_bag())\n\nif __name__ == '__main__':\n    main()\n```\n\n## Usage\n\n*Important:* Make sure your workers only have a single thread! Otherwise the results will be wrong.\n\n### Installation\n\nOn the machine where you are running the Distributed scheduler, run:\n\n```console\n$ pip install dask_memusage\n```\n\nOr if you're using Conda:\n\n```console\n$ conda install -c conda-forge dask-memusage\n```\n\n### API usage\n\n```python\n# Add to your Scheduler object, which is e.g. your LocalCluster's scheduler\n# attribute:\nfrom dask_memusage import install\ninstall(scheduler, \"/tmp/memusage.csv\")\n```\n\n### CLI usage\n\n```console\n$ dask-scheduler --preload dask_memusage --memusage-csv /tmp/memusage.csv\n```\n\n## Limitations\n\n* Again, make sure you only have one thread per worker process.\n* This is statistical profiling, running every 10ms.\n  Tasks that take less than that won't have accurate information.\n\n## Help\n\nNeed help? File a ticket at https://github.com/itamarst/dask-memusage/issues/new\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fitamarst%2Fdask-memusage","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fitamarst%2Fdask-memusage","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fitamarst%2Fdask-memusage/lists"}