{"id":14865233,"url":"https://github.com/leptonai/gpud","last_synced_at":"2026-04-03T14:02:16.292Z","repository":{"id":253428575,"uuid":"843471391","full_name":"leptonai/gpud","owner":"leptonai","description":"GPUd automates monitoring, diagnostics, and issue identification for GPUs","archived":false,"fork":false,"pushed_at":"2026-04-02T03:46:35.000Z","size":39127,"stargazers_count":481,"open_issues_count":8,"forks_count":61,"subscribers_count":10,"default_branch":"main","last_synced_at":"2026-04-02T16:58:21.782Z","etag":null,"topics":["gpu","kubernetes","monitoring","nvidia","nvidia-gpu"],"latest_commit_sha":null,"homepage":"https://gpud.ai","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/leptonai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-08-16T15:32:11.000Z","updated_at":"2026-04-01T01:41:35.000Z","dependencies_parsed_at":"2024-09-05T14:48:01.160Z","dependency_job_id":"443ab54a-0de8-4db3-abae-16cfaca37d87","html_url":"https://github.com/leptonai/gpud","commit_stats":{"total_commits":202,"total_committers":11,"mean_commits":"18.363636363636363","dds":"0.16831683168316836","last_synced_commit":"44f485d2d6b252a23f68c1f1bc1209d2ec30413d"},"previous_names":["leptonai/gpud"],"tags_count":232,"template":false,"template_full_name":null,"purl":"pkg:github/leptonai/gpud","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leptonai%2Fgpud","tags_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leptonai%2Fgpud/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leptonai%2Fgpud/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leptonai%2Fgpud/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/leptonai","download_url":"https://codeload.github.com/leptonai/gpud/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leptonai%2Fgpud/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31355684,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-03T08:03:20.796Z","status":"ssl_error","status_checked_at":"2026-04-03T08:00:37.834Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gpu","kubernetes","monitoring","nvidia","nvidia-gpu"],"created_at":"2024-09-20T00:01:30.945Z","updated_at":"2026-04-03T14:02:16.214Z","avatar_url":"https://github.com/leptonai.png","language":"Go","funding_links":[],"categories":["Go"],"sub_categories":[],"readme":"\u003cimg src=\"./assets/gpud.svg\" height=\"100\" alt=\"GPUd logo\"\u003e\n\n[![Go Report Card](https://goreportcard.com/badge/github.com/leptonai/gpud)](https://goreportcard.com/report/github.com/leptonai/gpud)\n![GitHub release (latest 
SemVer)](https://img.shields.io/github/v/release/leptonai/gpud?sort=semver)\n[![Go Reference](https://pkg.go.dev/badge/github.com/leptonai/gpud.svg)](https://pkg.go.dev/github.com/leptonai/gpud)\n[![codecov](https://codecov.io/gh/leptonai/gpud/graph/badge.svg?token=G8MGRK9X4A)](https://codecov.io/gh/leptonai/gpud)\n\n## Overview\n\n[GPUd](https://www.gpud.ai) is designed to ensure GPU efficiency and reliability by actively monitoring GPUs and effectively managing AI/ML workloads.\n\n## Why GPUd\n\nGPUd is built on years of experience operating large-scale GPU clusters at Meta, Alibaba Cloud, Uber, and Lepton AI. It is carefully designed to be self-contained and to integrate seamlessly with other systems such as Docker, containerd, Kubernetes, and NVIDIA ecosystems.\n\n- **First-class GPU support**: GPUd is GPU-centric, providing a unified view of critical GPU metrics and issues.\n- **Easy to run at scale**: GPUd is a self-contained binary that runs on any machine with a low footprint.\n- **Production grade**: GPUd is used in [DGX Cloud Lepton](https://www.nvidia.com/en-us/data-center/dgx-cloud-lepton/)'s production infrastructure.\n\nMost importantly, GPUd operates with minimal CPU and memory overhead in a non-critical path and requires only read-only operations. See [*architecture*](./docs/ARCHITECTURE.md) for more details.\n\n## Get Started\n\nThe fastest way to see `gpud` in action is to watch our 40-second demo video below. 
For more detailed guides, see our [Tutorials page](./docs/TUTORIALS.md).\n\n\u003ca href=\"https://www.youtube.com/watch?v=sq-7_Zrv7-8\" target=\"_blank\"\u003e\n\u003cimg src=\"https://i3.ytimg.com/vi/sq-7_Zrv7-8/maxresdefault.jpg\" alt=\"gpud-2025-06-01-01-install-and-scan\" /\u003e\n\u003c/a\u003e\n\n### Installation\n\nTo install from the official release on a Linux amd64 (x86_64) machine:\n\n```bash\ncurl -fsSL https://pkg.gpud.dev/install.sh | sh\n```\n\nTo install the latest published version explicitly:\n\n```bash\ncurl -fsSL https://pkg.gpud.dev/install.sh | sh -s $(curl -fsSL https://pkg.gpud.dev/unstable_latest.txt)\n```\n\nThe install script also currently supports other architectures (e.g., arm64) and OSes (e.g., macOS).\n\n---\n\n### Run GPUd on a Host\n\nThis section covers running `gpud` directly on a host machine.\n\n#### Resource Requirements (for Lepton Platform)\n\nIf you plan to join the Lepton platform (using the `--token` flag), your node must meet these minimum requirements:\n\n**Minimum:**\n- **3 CPU cores** (2-core instances will fail to join because kubelet and system pods require a minimum of 3 cores)\n- 4 GiB memory\n\n**Recommended:**\n- 4+ CPU cores (e.g., AWS c6a.xlarge)\n- 8+ GiB memory\n\n**Why these requirements:** GPUd periodically reads system files from `/sys/class/infiniband/`, `/proc/`, and other paths to collect telemetry data. On nodes with less than 4 GiB memory, the Linux page cache cannot retain these files between polling cycles, causing every read to hit the disk and resulting in excessive I/O (measured at 5+ MB/s on 2 GiB nodes vs. 0 MB/s on larger nodes). 
The 4 GiB minimum ensures sufficient page cache for GPUd to operate as a lightweight daemon without causing disk I/O pressure.\n\nFor complete hardware, software, and network requirements, see the official [NVIDIA DGX Cloud Lepton BYOC Requirements](https://docs.nvidia.com/dgx-cloud/lepton/compute/bring-your-own-compute/requirements/).\n\n\u003e **Note:** These requirements apply only when joining the Lepton platform; standalone `gpud` operation has lower requirements.\n\n#### With `systemd` (Recommended for Linux)\n\n**Start the service:**\n\n```bash\nsudo gpud up [--token \u003cDGXC_LEPTON_AI_TOKEN\u003e]\n```\n\n\u003e **Note:** The optional `--token` connects `gpud` to the Lepton Platform. You can get a token from the [Settings \u003e Tokens page](https://dashboard.dgxc-lepton.nvidia.com) on your dashboard.\n\n```bash\ngpud up \\\n--token \u003cDGXC_LEPTON_AI_TOKEN\u003e \\\n--node-group \u003cDGXC_LEPTON_NODE_GROUP\u003e\n```\n\n**Stop the service:**\n\n```bash\nsudo gpud down\n```\n\n**Uninstall:**\n\n```bash\nsudo rm /usr/local/bin/gpud\nsudo rm /etc/systemd/system/gpud.service\n```\n\n#### Without `systemd` (e.g., macOS)\n\n**Run in the foreground:**\n\n```bash\ngpud run [--token \u003cLEPTON_AI_TOKEN\u003e]\n```\n\n**Run in the background:**\n\n```bash\nnohup sudo /usr/local/bin/gpud run [--token \u003cLEPTON_AI_TOKEN\u003e] \u0026\u003e\u003e \u003cyour_log_file_path\u003e \u0026\n```\n\n**Uninstall:**\n\n```bash\nsudo rm /usr/local/bin/gpud\n```\n\n---\n\n### Run GPUd with Kubernetes\n\nThe recommended way to deploy GPUd on Kubernetes is with our official [Helm chart](./deployments/helm/gpud/README.md).\n\n### Build with Docker\n\nA Dockerfile is provided to build a container image from source. 
For complete instructions, please see our [Docker guide in CONTRIBUTING.md](CONTRIBUTING.md#building-with-docker).\n\n---\n\n## Key Features\n\n- Monitors critical GPU and GPU fabric metrics (power, temperature).\n- Reports GPU and GPU fabric status (nvidia-smi parser, error checking).\n- Detects critical GPU and GPU fabric errors (kmsg, hardware slowdown, NVML Xid event, DCGM).\n- Monitors overall system metrics (CPU, memory, disk).\n\nCheck out [*components*](./docs/COMPONENTS.md) for a detailed list of components and their features.\n\n## Integration\n\nFor users looking to set up a platform to collect and process data from gpud, please refer to [INTEGRATION](./docs/INTEGRATION.md).\n\n## FAQs\n\n### Does GPUd send data to lepton.ai?\n\nGPUd collects a small anonymous usage signal by default to help the engineering team better understand usage frequencies. The data is strictly anonymized and **does not contain any sensitive data**. You can disable this behavior by setting `GPUD_NO_USAGE_STATS=true`. If GPUd is run with systemd (default option for the `gpud up` command), you can add the line `GPUD_NO_USAGE_STATS=true` to the `/etc/default/gpud` environment file and restart the service.\n\nIf you opt in to logging in to the Lepton AI platform, GPUd periodically sends system runtime information about the host to the platform so that it can report more useful GPU health states. This information covers only system workload and health, and contains no user data. The data is sent over secure channels.\n\n### How to update GPUd?\n\nGPUd is still in active development, regularly releasing new versions for critical bug fixes and new features. We strongly recommend always being on the latest version of GPUd.\n\nWhen GPUd is registered with the Lepton platform, the platform will automatically update GPUd to the latest version. 
To disable auto-updates when GPUd runs under `systemd` (the default for the `gpud up` command), add the line `FLAGS=\"--enable-auto-update=false\"` to the `/etc/default/gpud` environment file and restart the service.\n\n## Learn more\n\n- [Why GPUd](./docs/WHY.md)\n- [Install GPUd](./docs/INSTALL.md)\n- [GPUd components](./docs/COMPONENTS.md)\n- [GPUd architecture](./docs/ARCHITECTURE.md)\n\n## Contributing\n\nPlease see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on how to contribute to this project.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleptonai%2Fgpud","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fleptonai%2Fgpud","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleptonai%2Fgpud/lists"}