{"id":50708332,"url":"https://github.com/manishklach/thermal-observatory","last_synced_at":"2026-06-09T13:30:39.343Z","repository":{"id":356476121,"uuid":"1232692811","full_name":"manishklach/thermal-observatory","owner":"manishklach","description":"A generic thermal observability framework for CPU, GPU, board, and platform telemetry across vendor APIs, kernel interfaces, and runtime correlation layers.","archived":false,"fork":false,"pushed_at":"2026-05-08T07:45:16.000Z","size":33,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-08T09:29:58.771Z","etag":null,"topics":["amd","arm64","cuda","linux","nvidia","nvml","observability","rocm","telemetry","thermal-framework","thermal-monitoring","x86-64"],"latest_commit_sha":null,"homepage":null,"language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/manishklach.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-08T07:11:32.000Z","updated_at":"2026-05-08T07:45:21.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/manishklach/thermal-observatory","commit_stats":null,"previous_names":["manishklach/thermal-observatory"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/manishklach/thermal-observatory","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fthermal-observatory","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fthermal-observatory/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fthermal-observatory/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fthermal-observatory/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/manishklach","download_url":"https://codeload.github.com/manishklach/thermal-observatory/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manishklach%2Fthermal-observatory/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34110009,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-09T02:00:06.510Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["amd","arm64","cuda","linux","nvidia","nvml","observability","rocm","telemetry","thermal-framework","thermal-monitoring","x86-64"],"created_at":"2026-06-09T13:30:38.023Z","updated_at":"2026-06-09T13:30:39.333Z","avatar_url":"https://github.com/manishklach.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Thermal Observatory\n\n`thermal-observatory` is a hardware-aware thermal observability framework for CPU, GPU, board, and platform telemetry.\n\nCurrent release: `v0.1.0`\n\nIt is meant to become a generic framework, not just a wrapper around one operating-system path or one vendor stack. The current implementation focus is Linux because that is where the lowest-level server telemetry interfaces are easiest to access, but the repository is structured as a framework that can grow into a broader cross-platform and cross-vendor system.\n\nToday the repo covers:\n\n- CPUs: `x86_64`, `arm64`\n- GPUs: NVIDIA, AMD\n- Host interfaces: `hwmon`, `thermal_zone`, `powercap`/RAPL, vendor sysfs\n- Vendor interfaces: NVML, ROCm SMI, CUDA runtime correlation\n- Datacenter interfaces: IPMI scaffold, Redfish scaffold, DCGM scaffold\n- Experimental privileged path: kernel module scaffold for future deep collectors\n\nThe goal is not to replace vendor tools. The goal is to provide one repository and one normalized API that can:\n\n- discover thermal and power-adjacent interfaces on a host\n- collect detailed thermal telemetry with provenance\n- correlate runtime device identities with vendor telemetry\n- expose one snapshot model for higher-level tooling\n- keep risky and privileged paths clearly separated from stable userspace collectors\n\n## Why This Exists\n\nThermal data is fragmented across:\n\n- generic kernel interfaces\n- architecture-specific CPU paths\n- vendor GPU libraries\n- platform firmware and BMC surfaces\n\nIn practice that means engineers end up stitching together `nvidia-smi`, `rocm-smi`, `sensors`, ad hoc `sysfs` reads, and platform-specific scripts. This repo aims to become the clean integration layer on top of those sources.\n\n## Scope\n\nThis repo is intentionally split into two layers:\n\n1. Stable userspace collectors for interfaces that are already supported and safe to read on production systems.\n2. Experimental kernel work for deeper visibility such as direct MSR-assisted reads or future BMC/IPMI hooks.\n\nNothing here is structured as an LKML submission. This is a GitHub-oriented research/engineering repo.\n\n## Layout\n\n```text\ninclude/                 Public snapshot model and API\nsrc/                     Userspace collectors and output formatting\nsrc/cpu/                 x86 and arm64 CPU collectors\nsrc/gpu/                 NVIDIA NVML/CUDA and AMD ROCm collectors\nsrc/platform/            Generic Linux sysfs and platform helpers\nsrc/format/              Text and JSON rendering\nkernel/                  Experimental kernel module\nscripts/                 Zero-build helper scripts\nexamples/                Validation and heatload examples\ndocs/                    Design and architecture docs\n```\n\n## Coverage Matrix\n\n| Component | Primary path | Fallback path |\n| --- | --- | --- |\n| x86 CPU temperature | `coretemp` hwmon, `thermal_zone` | MSR when permitted |\n| x86 package energy/power | `powercap` RAPL | raw MSR |\n| arm64 CPU temperature | `thermal_zone`, vendor hwmon | SCMI-specific paths |\n| arm64 frequency | `cpufreq` | none |\n| NVIDIA GPU telemetry | NVML | `nvidia-smi` script fallback |\n| NVIDIA runtime correlation | CUDA runtime | PCI/UUID matching via NVML |\n| NVIDIA fleet integration | DCGM scaffold | NVML-only mode |\n| AMD GPU | ROCm SMI | `amdgpu` hwmon |\n| Chassis / board sensors | `hwmon`, IPMI scaffold, Redfish scaffold | none |\n\n## Architecture Principles\n\n- Authoritative source first: use vendor or kernel-supported APIs before raw register scraping.\n- Runtime correlation second: CUDA and future ROCm runtime helpers are there to map execution contexts to telemetry, not replace thermal APIs.\n- Snapshot-first design: collectors populate one shared model.\n- Capability bits matter: the output should say what the framework truly observed.\n- Experimental paths stay isolated until they are validated on real hardware.\n\n## Build\n\nUserspace:\n\n```bash\nmake\n```\n\nKernel module:\n\n```bash\nmake -C kernel\n```\n\nCUDA heatload example:\n\n```bash\nmake cuda-example\n```\n\n## Run\n\nSingle snapshot:\n\n```bash\n./thermal_monitor\n```\n\nJSON:\n\n```bash\n./thermal_monitor --json\n```\n\nWatch mode:\n\n```bash\n./thermal_monitor --watch --interval 2\n```\n\nQuick no-build script:\n\n```bash\n./scripts/thermal_quick.sh\n```\n\nCUDA heatload validation:\n\n```bash\n./examples/cuda_heatload 16777216 4000\n```\n\nRun the heatload in one terminal and the monitor in another to watch temperature, power, clock, and throttle-reason changes as the GPU warms up.\n\n## NVIDIA Path\n\nFor NVIDIA, the framework now has two separate roles:\n\n1. `NVML` is the primary telemetry collector.\n   It is the authoritative source here for:\n   - GPU die temperature\n   - memory temperature when exposed\n   - power draw and enforced limit\n   - clocks\n   - utilization\n   - throttle reasons\n   - PCI bus identity\n\n2. `CUDA runtime` is a correlation layer.\n   It is used for:\n   - mapping CUDA ordinal to the NVML device\n   - reporting compute capability\n   - reporting SM count\n   - reporting total global memory\n   - reporting CUDA driver/runtime versions\n\nThat separation is intentional. CUDA is not the thermal API; NVML is.\n\n## AMD Path\n\nFor AMD, the framework prefers:\n\n- `ROCm SMI` for richer telemetry\n- `amdgpu` `hwmon` for fallback when ROCm user libraries are unavailable\n\nThe next comparable addition on the AMD side is a runtime-correlation layer similar to the new CUDA path.\n\n## Datacenter Path\n\nThe framework now has the beginning of a datacenter telemetry layer:\n\n- `IPMI` scaffold via `ipmitool sdr elist all`\n- `Redfish` scaffold via `TM_REDFISH_SAMPLE`\n- `DCGM` scaffold via `dcgmi`\n\nThis is the start of the “silicon plus environment” model:\n\n- GPU temperatures and throttle reasons explain what the accelerator is doing\n- board, fan, and PSU telemetry explain whether the node or room is contributing\n- DCGM is the natural NVIDIA fleet-side integration point\n\nThe immediate value is schema and integration-point clarity. The next value is real correlation across those layers.\n\n## Output Model\n\nThe public API in [include/thermal_monitor.h](/C:/Users/ManishKL/Documents/Playground/thermal-observatory/include/thermal_monitor.h) is the center of the repo. It currently models:\n\n- CPU packages and cores\n- ARM thermal clusters\n- NVIDIA GPU telemetry plus CUDA metadata\n- AMD GPU telemetry\n- generic `hwmon` sensors\n- generic thermal zones\n\nThe JSON output now emits full structured sections for:\n\n- CPU packages and per-core values\n- ARM clusters\n- NVIDIA GPUs with CUDA correlation metadata\n- AMD GPUs\n- `hwmon` sensors\n- thermal zones\n- board sensors\n- fan sensors\n- PSU sensors\n- capability mask plus capability names\n- summary counts\n\nThe current schema version is `0.3.0`. Metrics now carry per-metric provenance in the JSON output with:\n\n- `value`\n- `unit`\n- `source`\n- `timestamp_ns`\n- `error`\n\nSee the synthetic schema example in [samples/synthetic-linux-x86-mock-snapshot.json](/C:/Users/ManishKL/Documents/Playground/thermal-observatory/samples/synthetic-linux-x86-mock-snapshot.json).\nSee the datacenter direction note in [docs/datacenter-telemetry.md](/C:/Users/ManishKL/Documents/Playground/thermal-observatory/docs/datacenter-telemetry.md).\nSee the long-form writeup in [docs/blog-why-thermal-observatory.md](/C:/Users/ManishKL/Documents/Playground/thermal-observatory/docs/blog-why-thermal-observatory.md).\n\n## Prometheus\n\nThe repo now supports Prometheus-oriented output in two ways:\n\n1. stdout mode:\n\n```bash\n./thermal_monitor --prometheus\n```\n\n2. textfile collector mode:\n\n```bash\n./thermal_monitor --prometheus-textfile /var/lib/node_exporter/thermal.prom\n```\n\nMetric families include:\n\n- `thermal_gpu_temperature_celsius`\n- `thermal_gpu_power_watts`\n- `thermal_gpu_throttle_reason`\n- `thermal_cpu_package_temperature_celsius`\n- `thermal_board_sensor_value`\n- `thermal_fan_rpm`\n- `thermal_psu_power_watts`\n\n## Testability\n\nLinux sysfs-based collectors now support `TM_SYSROOT`, which allows the repo to run against mocked fixture trees instead of live `/sys` paths.\n\nThat currently covers:\n\n- generic thermal zones\n- generic `hwmon`\n- x86 `coretemp`\n- x86 `powercap` RAPL\n- arm64 thermal zones and `cpufreq`\n- AMD `amdgpu` hwmon fallback\n\nFixture scaffolding lives under [tests/](/C:/Users/ManishKL/Documents/Playground/thermal-observatory/tests), with the initial mocked tree at [tests/fixtures/linux_x86_mock](/C:/Users/ManishKL/Documents/Playground/thermal-observatory/tests/fixtures/linux_x86_mock).\n\nExample fixture run on Linux:\n\n```bash\nexport TM_SYSROOT=$PWD/tests/fixtures/linux_x86_mock\n./thermal_monitor --json \u003e output.json\npython3 tests/check_json_schema.py output.json\n```\n\n## Validation Strategy\n\nRecommended validation matrix:\n\n- `x86_64 + NVIDIA`\n- `x86_64 + AMD`\n- `arm64 + NVIDIA`\n- `arm64 + AMD`\n\nFor each system:\n\n1. Compare framework output to vendor tools.\n2. Run a controlled heatload.\n3. Observe thermal ramps, power changes, clocks, and throttle transitions.\n4. Record gaps in metric availability rather than masking them.\n\n## Roadmap\n\nNear-term:\n\n- make the userspace collectors compile and run cleanly on real Linux hosts\n- add ROCm runtime correlation similar to CUDA\n- add Prometheus textfile export\n- add stronger test fixtures and sample captures\n- add CI around fixture-backed Linux collector tests\n- harden IPMI, Redfish, and DCGM collectors with real platform validation\n\nLater:\n\n- DCGM integration\n- BMC/IPMI userspace collector\n- validated MSR-assisted x86 collector path\n- safer kernel deep-collector design\n- possible non-Linux backends if a clean abstraction emerges\n\n## Current Status\n\nThis repo is now best described as a usable `v0.1.0` alpha. The architecture is stable enough for experimentation and integration work:\n\n- structured JSON with per-metric provenance\n- Prometheus textfile export\n- fixture-backed Linux collector tests\n- NVIDIA telemetry plus CUDA correlation\n- early datacenter platform telemetry scaffolding\n\nWhat it still needs most is real Linux hardware validation and hardening of the datacenter collectors.\n\n## Notes\n\n- x86 MSR-backed reads may require `modprobe msr` and root.\n- NVML requires the NVIDIA driver stack.\n- CUDA correlation requires the CUDA runtime to be installed and discoverable.\n- ROCm SMI requires the ROCm stack.\n- The kernel module is experimental and should be treated as a research path, not production-hardening.\n\nSee [docs/design.md](/C:/Users/ManishKL/Documents/Playground/thermal-observatory/docs/design.md), [docs/architecture.md](/C:/Users/ManishKL/Documents/Playground/thermal-observatory/docs/architecture.md), and [docs/review-notes.md](/C:/Users/ManishKL/Documents/Playground/thermal-observatory/docs/review-notes.md).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmanishklach%2Fthermal-observatory","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmanishklach%2Fthermal-observatory","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmanishklach%2Fthermal-observatory/lists"}