{"id":31588828,"url":"https://github.com/mchmarny/gpuid","last_synced_at":"2026-03-07T01:07:42.265Z","repository":{"id":313196181,"uuid":"1050072298","full_name":"mchmarny/gpuid","owner":"mchmarny","description":"Monitor pods on GPU-accelerated node in Kubernetes cluster and update nodes with chassis and GPU labels serial numbers. Supports serial number export to various state backends for tracking, monitoring, and analyses.","archived":false,"fork":false,"pushed_at":"2025-09-30T04:33:34.000Z","size":217,"stargazers_count":0,"open_issues_count":4,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-06T02:34:01.487Z","etag":null,"topics":["deployment","go","gpu","kubernetes","kustomize","nvidia","postgresql","s3"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mchmarny.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE-OF-CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-03T22:59:04.000Z","updated_at":"2025-09-23T15:41:27.000Z","dependencies_parsed_at":"2025-09-04T14:24:04.941Z","dependency_job_id":"60263cef-ca84-4c61-b376-bdfda98df4a8","html_url":"https://github.com/mchmarny/gpuid","commit_stats":null,"previous_names":["mchmarny/gpuid"],"tags_count":36,"template":false,"template_full_name":"mchmarny/rolesetter","purl":"pkg:github/mchmarny/gpuid","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mchmarny%2Fgpuid","tags_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/mchmarny%2Fgpuid/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mchmarny%2Fgpuid/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mchmarny%2Fgpuid/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mchmarny","download_url":"https://codeload.github.com/mchmarny/gpuid/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mchmarny%2Fgpuid/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30204475,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-06T19:07:06.838Z","status":"ssl_error","status_checked_at":"2026-03-06T18:57:34.882Z","response_time":250,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deployment","go","gpu","kubernetes","kustomize","nvidia","postgresql","s3"],"created_at":"2025-10-06T02:25:37.203Z","updated_at":"2026-03-07T01:07:42.251Z","avatar_url":"https://github.com/mchmarny.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![main](https://github.com/mchmarny/gpuid/actions/workflows/main.yaml/badge.svg)](https://github.com/mchmarny/gpuid/actions/workflows/main.yaml)\n[![release](https://github.com/mchmarny/gpuid/actions/workflows/release.yaml/badge.sv
g)](https://github.com/mchmarny/gpuid/actions/workflows/release.yaml)\n![Issues](https://img.shields.io/github/issues/mchmarny/gpuid)\n![PRs](https://img.shields.io/github/issues-pr/mchmarny/gpuid)\n![Go Report Card](https://goreportcard.com/badge/github.com/mchmarny/gpuid)\n[![codecov](https://codecov.io/gh/mchmarny/gpuid/branch/main/graph/badge.svg)](https://codecov.io/gh/mchmarny/gpuid)\n\n# GPU and Chassis Serial Number Exporter (gpuid)\n\nMonitor pods on GPU-accelerated nodes in a Kubernetes cluster and label those nodes with their chassis and GPU serial numbers. Supports serial number export to various state backends for tracking, monitoring, and analysis.\n\n## Why\n\nGPU-accelerated Kubernetes nodes in operator-managed services (e.g. EKS in AWS or GKE in GCP) are ephemeral VMs that can run on top of physical hosts which change over time. Multiple VMs may run on a single physical host over time, so to preserve break-fix context for these nodes it's crucial to:\n\n- Track GPU health and utilization across physical hardware\n- Correlate GPU performance issues with specific hardware units\n- Maintain audit trails for GPU resource allocation\n- Monitor GPU lifecycle in multi-tenant environments\n\n`gpuid` provides a lightweight, scalable solution for GPU inventory management in Kubernetes clusters.\n\n## Features\n\n- HTTP, PostgreSQL DB, and S3 exporters\n- Connection pooling, retry logic, health checks\n- Structured logging with contextual information\n- Prometheus-compatible observability metrics for monitoring\n- SLSA build attestation and Sigstore attestation validation\n- Node labels with the GPU and chassis serial numbers\n  \n```shell\n# H100 (no chassis): \ngpuid.github.com/gpu-0=1652823054567\ngpuid.github.com/gpu-1=1652823055642\ngpuid.github.com/gpu-2=1652823055647\ngpuid.github.com/gpu-3=1652823055931\ngpuid.github.com/gpu-4=1652923033989\ngpuid.github.com/gpu-5=1652923034028\ngpuid.github.com/gpu-6=1652923034291\ngpuid.github.com/gpu-7=1653023018213\n\n# 
GB200:\ngpuid.github.com/chassis=1821325191344\ngpuid.github.com/gpu-0=1761025346025\ngpuid.github.com/gpu-1=1761125340419\n```\n\n\u003e GB200 nodes have 4 GPUs but only 2 unique serial numbers. These GPUs come in dual-die packaging where 2 GPUs are stitched together with NVLink-C2C on the same module.\n\n## Available Exporters\n\n**gpuid** supports multiple data export backends:\n\n* **Stdout**: Development and debugging (default)\n* **HTTP**: POSTs to HTTP endpoints\n* **PostgreSQL**: Batch inserts into PostgreSQL database\n* **S3**: Puts CSV object into S3-compatible bucket\n\n### Stdout Exporter\n\n**Type**: `stdout` (default)\n**Purpose**: Development and debugging, outputs JSON to stdout\n**Configuration**: No additional environment variables required\n\n```yaml\nenv:\n  - name: CLUSTER_NAME\n    value: 'validation'\n```\n\n### HTTP Exporter\n\n**Type**: `http`\n**Purpose**: Send GPU data to HTTP endpoints via POST requests\n**Features**: Bearer token authentication, configurable timeouts, automatic retries\n\n```yaml\nenv:\n  - name: EXPORTER_TYPE\n    value: 'http'\n  - name: CLUSTER_NAME\n    value: 'validation'\n  - name: HTTP_ENDPOINT\n    value: 'https://api.example.com/gpu-data'\n  - name: HTTP_TIMEOUT\n    value: '30s'\n  - name: HTTP_AUTH_TOKEN\n    valueFrom:\n      secretKeyRef:\n        name: http-credentials\n        key: token\n```\n\n\n### PostgreSQL Exporter\n\n**Type**: `postgres`\n**Purpose**: Database storage with full ACID compliance\n**Features**: Connection pooling, automatic schema management, batch processing\n\n```yaml\nenv:\n  - name: EXPORTER_TYPE\n    value: 'postgres'\n  - name: CLUSTER_NAME\n    value: 'validation'\n  - name: POSTGRES_PORT\n    value: '5432'\n  - name: POSTGRES_DB\n    value: 'gpuid'\n  - name: POSTGRES_TABLE\n    value: 'serials'\n  - name: POSTGRES_HOST\n    valueFrom:\n      secretKeyRef:\n        name: db-credentials\n        key: host\n  - name: POSTGRES_USER\n    valueFrom:\n      secretKeyRef:\n        
name: db-credentials\n        key: username\n  - name: POSTGRES_PASSWORD\n    valueFrom:\n      secretKeyRef:\n        name: db-credentials\n        key: password\n```\n\n### Amazon S3 Exporter\n\n**Type**: `s3`\n**Purpose**: Cloud storage with time-based partitioning\n**Features**: Automatic partitioning, batch uploads, configurable prefixes\n\n```yaml\nenv:\n  - name: EXPORTER_TYPE\n    value: 's3'\n  - name: CLUSTER_NAME\n    value: 'validation'\n  # GPU Serial Number Provider\n  - name: NAMESPACE\n    value: 'gpu-operator'\n  - name: LABEL_SELECTOR\n    value: 'app=nvidia-device-plugin-daemonset'\n  # S3 Exporter Configuration\n  - name: S3_BUCKET\n    value: 'gpuids'\n  - name: S3_PREFIX\n    value: 'serial-numbers'\n  - name: S3_REGION\n    value: 'us-east-1'\n  - name: S3_PARTITION_PATTERN\n    value: 'year=%Y/month=%m/day=%d/hour=%H'\n  # AWS Credentials from Kubernetes Secret\n  - name: AWS_ACCESS_KEY_ID\n    valueFrom:\n      secretKeyRef:\n        name: s3-credentials\n        key: AWS_ACCESS_KEY_ID\n  - name: AWS_SECRET_ACCESS_KEY\n    valueFrom:\n      secretKeyRef:\n        name: s3-credentials\n        key: AWS_SECRET_ACCESS_KEY\n```\n\n## Usage\n\n### Download \n\nDownload and expand either the zip or tar version of the `gpuid` and `policy` artifacts from https://github.com/mchmarny/gpuid/releases/latest.\n\n### Deployment\n\n1. 
**Configure the deployment** by updating the specific overlay that corresponds to your backend type:\n\n* `stdout` (default) - [deployments/gpuid/overlays/stdout/patch-deployment.yaml](deployments/gpuid/overlays/stdout/patch-deployment.yaml)\n* `http` - [deployments/gpuid/overlays/http/patch-deployment.yaml](deployments/gpuid/overlays/http/patch-deployment.yaml)   \n* `postgres` - [deployments/gpuid/overlays/postgres/patch-deployment.yaml](deployments/gpuid/overlays/postgres/patch-deployment.yaml)\n* `s3` - [deployments/gpuid/overlays/s3/patch-deployment.yaml](deployments/gpuid/overlays/s3/patch-deployment.yaml)\n\n1. **Apply the configuration**\n\n\u003e Substitute for the desired backend.\n\n```shell\nkubectl apply -k deployments/gpuid/overlays/stdout\n```\n\n1. **Verify deployment**\n\nMake sure the exporter pod is running:\n\n```shell\nkubectl -n gpuid get pods -l app=gpuid\n```\n\nAnd review its logs: \n\n```shell\nkubectl -n gpuid logs -l app=gpuid --tail=-1\n```\n\n### Monitoring and Observability\n\n`gpuid` emits structured logs in JSON format with contextual information:\n\nSince these logs are in JSON, you can filter them with `jq` for specific information, for example, error events:\n\n```shell\nkubectl -n gpuid logs -l app=gpuid --tail=-1 \\\n  | jq -r 'select(.level == \"ERROR\") | \"\\(.time) \\(.msg) \\(.error)\"'\n```\n\nOr only the serial reading events: \n\n```shell\nkubectl -n gpuid logs -l app=gpuid --tail=-1 \\\n  | jq -r 'select(.msg == \"gpu serial number reading\") \n  | \"\\(.chassis) \\(.node) \\(.machine) \\(.gpu)\"'\n```\n\nOnce deployed, you can use these new labels: \n\n```shell\nkubectl get nodes -l nodeGroup=customer-gpu -o json \\\n| jq -r '\n    [ .items[]\n      | {chassis: (.metadata.labels[\"gpuid.github.com/chassis\"] // \"na\")}\n    ]\n    | group_by(.chassis)\n    | map({(.[0].chassis): length})\n    | add\n'\n{\n  \"1821025191506\": 9,\n  \"1821225190819\": 7,\n  \"1821225192095\": 9,\n  \"1821325191344\": 9\n}\n```\n\n### 
Cleanup\n\n```shell\nkubectl delete -k deployments/gpuid/overlays/s3\n```\n\n## Metrics and Monitoring\n\nThe `gpuid` service exposes Prometheus-compatible metrics on the `:8080/metrics` endpoint:\n\n- `gpuid_export_success_total{exporter_type, node, pod}`: Successful export operations\n- `gpuid_export_failure_total{exporter_type, node, pod, error_type}`: Failed export operations\n\n## Exported Data Schema\n\nGPU serial number readings are exported in a consistent schema across all backends:\n\n- `cluster`: Kubernetes cluster identifier where the GPUs were observed\n- `node`: Kubernetes node name where GPU was discovered\n- `machine`: VM instance ID or physical machine identifier  \n- `source`: Namespace/Pod name that provided the GPU information\n- `gpu`: GPU serial number from nvidia-smi\n- `read_time`: Timestamp when the reading was taken (RFC3339 format)\n\n### HTTP Post Content\n\nWhen using HTTP exporter, the content includes the JSON serialized record: \n\n```json\n{\n  \"cluster\": \"production-cluster\",\n  \"node\": \"gpu-node-01\", \n  \"machine\": \"i-1234567890abcdef0\",\n  \"source\": \"gpu-operator/nvidia-device-plugin-abc123\",\n  \"gpu\": \"1234567890\",\n  \"time\": \"2025-09-10T10:30:45Z\"\n}\n```\n\n### PostgreSQL Schema\n\nWhen using the PostgreSQL exporter, data is stored in the following table structure:\n\n```sql\nCREATE TABLE serials (\n    id BIGSERIAL PRIMARY KEY,\n    cluster VARCHAR(255) NOT NULL,\n    node VARCHAR(255) NOT NULL, \n    machine VARCHAR(255) NOT NULL,\n    source VARCHAR(255) NOT NULL,\n    chassis VARCHAR(255) NOT NULL,\n    gpu VARCHAR(255) NOT NULL,\n    read_time TIMESTAMP WITH TIME ZONE NOT NULL,\n    created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),\n    UNIQUE(cluster, node, machine, source, chassis, gpu, read_time)\n);\n\n-- Optimized indexes for common query patterns\nCREATE INDEX idx_serials_cluster ON serials (cluster);\nCREATE INDEX idx_serials_node ON serials (node);\nCREATE INDEX 
idx_serials_read_time ON serials (read_time);\nCREATE INDEX idx_serials_created_at ON serials (created_at);\n```\n\nA few example queries:\n\nGPUs that have been used in more than one machine:\n\n```sql\nSELECT \n    gpu, \n    COUNT(DISTINCT machine) AS machines_per_gpu\nFROM serials\nGROUP BY gpu\nHAVING COUNT(DISTINCT machine) \u003e 1\nORDER BY gpu;\n```\n\nGPUs that moved across clusters:\n\n```sql\nSELECT \n    gpu,\n    COUNT(DISTINCT cluster) AS clusters_seen_in\nFROM serials\nGROUP BY gpu\nHAVING COUNT(DISTINCT cluster) \u003e 1\nORDER BY clusters_seen_in DESC;\n```\n\nNumber of unique GPUs per day:\n\n```sql\nSELECT \n    DATE(read_time) AS day,\n    COUNT(DISTINCT gpu) AS unique_gpus\nFROM serials\nGROUP BY day\nORDER BY day;\n```\n\n### S3 Object Structure\n\nThe S3 exporter organizes data with time-based partitioning:\n\n```\ns3://bucket-name/prefix/\n├── year=2025/month=09/day=10/hour=10/\n│   ├── cluster=prod/node=gpu-node-01/20250910-103045-uuid.json\n│   └── cluster=prod/node=gpu-node-02/20250910-103112-uuid.json\n└── year=2025/month=09/day=10/hour=11/\n    └── cluster=prod/node=gpu-node-01/20250910-110215-uuid.json\n```\n\n## Security and Validation\n\nThe `gpuid` container images are built with SLSA (Supply-chain Levels for Software Artifacts) build attestations.\n\n### Manual Verification \n\nNavigate to https://github.com/mchmarny/gpuid/attestations and pick the version you want to verify. 
The subject digest at the bottom should match the digest of the image you are deploying.\n\n### Using CLIs\n\n\u003e Replace the image reference below with the digest of the version you end up using.\n\n```shell\nexport IMAGE=ghcr.io/mchmarny/gpuid:latest\n```\n\n#### GitHub CLI\n\nTo verify the attestation on this image using GitHub CLI: \n\n```shell\ngh attestation verify \"oci://$IMAGE\" \\\n  --repo mchmarny/gpuid \\\n  --predicate-type https://slsa.dev/provenance/v1 \\\n  --limit 1\n```\n\n#### Cosign CLI\n\n```shell\ncosign verify-attestation \\\n    --type https://slsa.dev/provenance/v1 \\\n    --certificate-github-workflow-repository 'mchmarny/gpuid' \\\n    --certificate-identity-regexp 'https://github.com/mchmarny/gpuid/*' \\\n    --certificate-oidc-issuer 'https://token.actions.githubusercontent.com' \\\n    $IMAGE\n```\n\n### In-Cluster Policy Enforcement\n\nTo ensure only verified images are deployed in your cluster:\n\n1. Install Sigstore Policy Controller (if not already installed):\n\n```shell\nkubectl create namespace cosign-system\nhelm repo add sigstore https://sigstore.github.io/helm-charts\nhelm repo update\nhelm install policy-controller -n cosign-system sigstore/policy-controller\n```\n\n2. Enable Sigstore policy validation:\n\n```shell\nkubectl label namespace gpuid policy.sigstore.dev/include=true\n```\n\n3. Apply the image policy:\n\n```shell\nkubectl apply -f deployments/policy/slsa-attestation.yaml\n```\n\n4. Test the admission policy:\n\n```shell\nkubectl -n gpuid run test --image=$IMAGE\n```\n\n## Disclaimer\n\nThis is my personal project and it does not represent my employer. While I do my best to ensure that everything works, I take no responsibility for issues caused by this code.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmchmarny%2Fgpuid","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmchmarny%2Fgpuid","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmchmarny%2Fgpuid/lists"}