{"id":15697027,"url":"https://github.com/dask/dask-pyspy","last_synced_at":"2025-05-08T23:29:52.842Z","repository":{"id":151585386,"uuid":"356076325","full_name":"dask/dask-pyspy","owner":"dask","description":"Profile the dask distributed scheduler with py-spy and viztracer","archived":false,"fork":false,"pushed_at":"2025-04-17T13:57:56.000Z","size":514,"stargazers_count":9,"open_issues_count":1,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-18T04:29:33.968Z","etag":null,"topics":["dask","profiling","py-spy","viztracer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dask.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-04-08T23:37:42.000Z","updated_at":"2025-04-17T13:58:00.000Z","dependencies_parsed_at":null,"dependency_job_id":"9ca73cf0-6b58-4767-9b5e-ba5484f68a98","html_url":"https://github.com/dask/dask-pyspy","commit_stats":null,"previous_names":["dask/dask-pyspy"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dask%2Fdask-pyspy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dask%2Fdask-pyspy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dask%2Fdask-pyspy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dask%2Fdask-pyspy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dask","download_url":"https://codeload.github.
com/dask/dask-pyspy/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253163080,"owners_count":21864023,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dask","profiling","py-spy","viztracer"],"created_at":"2024-10-03T19:10:51.843Z","updated_at":"2025-05-08T23:29:52.794Z","avatar_url":"https://github.com/dask.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# dask-pyspy\n\nProfile dask [distributed](https://github.com/dask/distributed) clusters with [py-spy](https://github.com/benfred/py-spy).\n\n```python\nimport dask\nimport distributed\n\nfrom dask_pyspy import pyspy\n\nclient = distributed.Client()\n\ndf = dask.datasets.timeseries(\n    start=\"2000-01-01\",\n    end=\"2000-01-14\",\n    partition_freq=\"1h\",\n    freq=\"60s\",\n)\n\nwith pyspy(\"worker-profiles\"):\n    df.set_index(\"id\").mean().compute()\n```\n\nUsing `pyspy` or `pyspy_on_scheduler` attaches a profiler to the Python process, records a profile, and sends the file(s) back to the client.\n\nBy default, py-spy profiles are recorded in [speedscope](https://www.speedscope.app/) format.\n\n`dask-pyspy` (and, transitively, `py-spy`) must be installed in the environment where the scheduler is running.\n\n`dask-pyspy` tries hard to work out-of-the-box, but if your cluster is running inside Docker, or on macOS, you'll need to configure things so it's allowed to run. 
See the [privileges for py-spy](#privileges-for-py-spy) section.\n\n## Installation\n\n```\npython -m pip install dask-pyspy\n```\n\nMake sure this package is also installed in the software environment of your cluster!\n\n## Usage\n\nThe `pyspy` and `pyspy_on_scheduler` functions are context managers. Entering them starts py-spy on the workers / scheduler. Exiting them stops py-spy, sends the profile data back to the client, and writes it to disk.\n\n```python\nwith pyspy_on_scheduler(\"scheduler-profile.json\"):\n    # Profile the scheduler.\n    # Writes to the `scheduler-profile.json` file locally.\n    x.compute()\n\nwith pyspy(\"worker-profiles\"):\n    # Most basic usage.\n    # Writes a profile per worker to the `worker-profiles` directory locally.\n    # Files are named by worker addresses.\n    x.compute()\n\nwith pyspy(\"worker-profiles\", native=True):\n    # Collect stack traces from native extensions written in Cython, C or C++.\n    # You should usually turn this on to get much richer profiling information.\n    # However, only recommended when your cluster is running on Linux.\n    x.compute()\n\nwith pyspy(\"worker-profiles\", workers=2):\n    # Only profile 2 workers (selected randomly)\n    x.compute()\n\nwith pyspy(\"worker-profiles\", workers=['tcp://10.0.1.2:4567', 'tcp://10.0.1.3:5678']):\n    # Profile specific workers by specifying their addresses\n    x.compute()\n\nwith pyspy(\"worker-profiles\", format=\"flamegraph\", gil=True, idle=False, nonblocking=True, extra_pyspy_args=[\"--foo\", \"bar\"]):\n    # Look, you can pass any arguments you want to `py-spy`!\n    # Refer to the `py-spy` command reference for what these mean.\n    # You don't usually need to do this though. 
We've picked good defaults for you.\n    x.compute()\n\nwith pyspy(\"worker-profiles\", log_level=\"info\"):\n    # Set py-spy's internal log level.\n    # Useful if py-spy isn't behaving.\n    # Refer to the ``env_logger`` crate for details:\n    # https://docs.rs/env_logger/latest/env_logger/index.html#enabling-logging\n    x.compute()\n```\n\nFor more information, refer to the docstrings of the functions.\n\nBy default, profiles are recorded in speedscope format, so just drop them into https://www.speedscope.app to view them.\n\n### Tips \u0026 tricks\n\nThis is a handy pattern:\n\n```python\nwith pyspy(\"worker-profiles\"):\n    input(\"Press enter when done profiling\")\n    # or maybe:\n    # time.sleep(10)\n```\n\nWays you can use it:\n\n#### Profiling a cluster that's already running\n\n1. Open a second terminal/Jupyter session/etc.\n1. In that session, connect to your existing cluster.\n1. Run the block above. Press enter to stop profiling once you feel like you've got enough.\n\n#### Profiling part of a longer computation\n\n```python\npersisted = my_thing.persist()\n\nwith pyspy(\"worker-profiles\"):\n    # Watch the dashboard, hit enter when you think you've got enough.\n    input(\"Press enter when done profiling\")\n\n# optional, to get actual result:\npersisted.compute()\ndel persisted\n```\n\n## Privileges for py-spy\n\n**tl;dr:**\n* On macOS clusters, you have to launch your cluster with `sudo`\n* For `docker run`, pass `--cap-add SYS_PTRACE`, or download this newer [`seccomp.json`](https://github.com/moby/moby/blob/d39b075302c27f77b2de413697a5aacb034d8286/profiles/seccomp/default.json) file and use `--security-opt seccomp=default.json`.\n* On Windows clusters, you're on your own, sorry.\n\nYou may need to run the dask process as root for py-spy to be able to profile it (especially on macOS). See https://github.com/benfred/py-spy#when-do-you-need-to-run-as-sudo.\n\nIn a Docker container, `dask-pyspy` will \"just work\" for Docker/moby versions \u003e= 21.xx. 
As of right now (Nov 2022), Docker 21.xx doesn't exist yet, so read on.\n\n[moby/moby#42083](https://github.com/moby/moby/pull/42083/files) allowlisted by default the `process_vm_readv` system call that py-spy uses, which used to be blocked unless you set `--cap-add SYS_PTRACE`. Allowing this specific system call in unprivileged containers has been safe to do for a while (since Linux kernel versions \u003e 4.8), but just wasn't enabled in Docker. So your options right now are:\n* (low/no security impact) Download the newer [`seccomp.json`](https://github.com/moby/moby/blob/d39b075302c27f77b2de413697a5aacb034d8286/profiles/seccomp/default.json) file from moby/master and pass it to Docker via `--security-opt seccomp=default.json`.\n* (more convenient) Pass `--cap-add SYS_PTRACE` to Docker. This enables more than you need, but it's one less step.\n\nOn Ubuntu-based containers, ptrace system calls are [further blocked](https://www.kernel.org/doc/Documentation/admin-guide/LSM/Yama.rst): processes are prohibited from ptracing each other even within the same UID. To work around this, `dask-pyspy` automatically uses [`prctl(2)`](https://man7.org/linux/man-pages/man2/prctl.2.html) to mark the scheduler process as ptrace-able by itself and any child processes, then launches py-spy as a child process.\n\n## Caveats\n\n* If you're running something that crashes your cluster, you probably won't be able to get a profile out of it. Transferring results back to the client relies on a stable connection and things in dask all working properly.\n* Profiling slows things down. Especially if using `pyspy_on_scheduler`, expect noticeably slower results. This is probably not a thing you want to have always-on.\n* This package is very much in development. I made it for my personal use and am sharing in case it's useful. Please don't be mad if it breaks.\n\n## Development\n\nInstall [Poetry](https://python-poetry.org/docs/#installation). 
To create a virtual environment, install dev dependencies, and install the package for local development:\n\n```\n$ poetry install\n```\n\nThere is one very very basic end-to-end test for py-spy. Running it requires Docker and docker-compose, though the building and running of the containers is managed by [pytest-docker-compose](https://github.com/pytest-docker-compose/pytest-docker-compose), so all you have to do is:\n\n```\n$ pytest tests\n```\nand wait a long time for the image to build and run.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdask%2Fdask-pyspy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdask%2Fdask-pyspy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdask%2Fdask-pyspy/lists"}