{"id":50275425,"url":"https://github.com/cuuper22/gpu_stack-","last_synced_at":"2026-05-27T20:01:56.262Z","repository":{"id":357258121,"uuid":"1214180551","full_name":"Cuuper22/gpu_stack-","owner":"Cuuper22","description":"SymPy-backed dependency graph for GPU training systems, from device physics and kernels to clusters, thermals, and economics.","archived":false,"fork":false,"pushed_at":"2026-05-12T01:50:58.000Z","size":3443,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-12T03:27:40.871Z","etag":null,"topics":["ai-infrastructure","gpu-training","performance-modeling","sympy","systems-modeling"],"latest_commit_sha":null,"homepage":"https://cuuper22.github.io/gpu_stack-/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Cuuper22.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-18T08:14:07.000Z","updated_at":"2026-05-12T01:51:01.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/Cuuper22/gpu_stack-","commit_stats":null,"previous_names":["cuuper22/gpu_stack-"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/Cuuper22/gpu_stack-","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Cuuper22%2Fgpu_stack-","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Cuuper22%2Fgpu_stack-/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Cuuper22%2Fgpu_stack-/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Cuuper22%2Fgpu_stack-/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Cuuper22","download_url":"https://codeload.github.com/Cuuper22/gpu_stack-/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Cuuper22%2Fgpu_stack-/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33581559,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-27T02:00:06.184Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-infrastructure","gpu-training","performance-modeling","sympy","systems-modeling"],"created_at":"2026-05-27T20:01:55.237Z","updated_at":"2026-05-27T20:01:56.247Z","avatar_url":"https://github.com/Cuuper22.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# gpu_stack\n\n![A wide visual map of the training stack descending from datacenters through GPU systems, lithography, atoms, and particle-like root assumptions.](docs/assets/readme-hero.png)\n\n**Website:** \u003chttps://cuuper22.github.io/gpu_stack-/\u003e  \n**Repository:** \u003chttps://github.com/Cuuper22/gpu_stack-\u003e\n\n`gpu_stack` started as a curiosity project in the overlap between my AI work and my physics brain.\n\nThe question was simple enough to be annoying: if frontier training is supposedly \"more GPUs, more data, more money,\" where does that sentence actually bottom out?\n\nNot rhetorically. Physically.\n\nA token passes through model architecture, kernels, collectives, memory bandwidth, transistor switching, lithography, materials, thermals, power delivery, and eventually a cost line item that someone has to pay. The stack is usually explained in slices. I wanted the uncomfortable version where the slices have to talk to each other.\n\nSo `gpu_stack` is a SymPy-backed symbolic model of the GPU training stack. It is not a polished numerical simulator. It is a graph of equations, constraints, approximations, scenario values, and exposed missing assumptions. The point is not to hide the unknowns. The point is to make them visible enough that they can be attacked.\n\nIf that sounds like a weird amount of effort to understand GPU training, yes. That is more or less how the project happened.\n\n## The Shape Of The Stack\n\n![Dependency cone from datacenter economics down through GPU systems, transistor physics, lithography, atoms, nucleons, quarks, and equations.](docs/assets/readme-equation-cone.svg)\n\n`gpu_stack` treats the training stack like one inspectable dependency cone.\n\nAt the wide end are questions people actually ask:\n\n- What sets `econ.cost.per_token`?\n- Why did `training.tokens_per_second` move?\n- How much site power disappears into cooling?\n- Which missing assumptions matter most downstream?\n\nAt the narrow end are the things the model refuses to pretend away: process geometry, pulse fluence, imaging-medium composition, gate constraints, source-plasma behavior, proton and neutron counts, valence quark roots, and universal constants.\n\nMost tooling stops at the first satisfying number. `gpu_stack` keeps asking: what is that number made of?\n\nThe answer can be an equation, a sourced scenario value, a universal constant, or a root input. Root inputs are not a shame pile. They are visible modeling debt, which is much better than hidden modeling debt wearing a lab coat.\n\n## The Central Idea\n\nThe core object is a registry-backed equation graph.\n\nEvery scope self-registers on import. Variables carry identity, units, descriptions, scope metadata, symbolic assumptions, and back-references for graph traversal. Equations define relations between variables. Constants are reserved for universal physics constants. Everything else, including clocks, voltages, tensor shapes, optimizer hyperparameters, GPU counts, tariffs, and facility assumptions, remains a variable.\n\nThat choice matters.\n\nA variable with no defining value relation is a root input. Some roots should eventually be decomposed into lower-level physics. Some should remain scenario boundaries. Some require sourced calibration before the model is allowed to assign them.\n\nThis is why root count alone is not the score. Decomposing one vague root into several primitive roots can make the count rise while making the model more honest.\n\n## What The Graph Knows Right Now\n\nFresh local `stats` output reports:\n\n```text\nRegistry stats:\n  systems        16\n  variables      1517\n  constants      24\n  equations      959\n  root_inputs    619\n  leaves         253\n\nCoverage:\n  non_constant_variables         1493\n  with_sp_units                  1428\n  with_references                1324\n  equations                      959\n  equations_with_references      878\n  equations_with_unit_check      799\n```\n\nThe model spans:\n\n| Layer | What lives there |\n|---|---|\n| Physical roots | lithography source structure, imaging-medium composition, process geometry, local thermal behavior, semiconductor transport, MOSFET behavior, interconnect physics, CMOS logic, noise |\n| Memory | SRAM, DRAM, flip-flops, register file, shared memory, Tensor Memory, L1, L2, HBM capacity and bandwidth |\n| Numeric formats | IEEE formats, low-bit precision, microscaling, stochastic rounding |\n| Parallelism | data, tensor, pipeline, expert, context, and FSDP style sharding |\n| Model architecture | attention, embeddings, FFN, MoE, positions, KV cache, transformer parameter and token math |\n| Arithmetic and kernels | ALU, FMA, Tensor Core MMA, roofline, GEMM, attention kernels, occupancy |\n| Communication | NVLink, InfiniBand, Spectrum-X-style scale-out, collectives, alpha-beta costs |\n| Training | compute time, communication time, bubbles, MFU, tokens per second |\n| Cluster and facility | nodes, racks, bisection, storage, reliability, power, cooling, PUE |\n| Economics | capex, opex, amortization, power cost, run cost, cost per token |\n\nMFU means Model FLOPs Utilization. HBM means High Bandwidth Memory. PUE means Power Usage Effectiveness. The README should not assume the reader was born knowing datacenter abbreviations. Sadly, many datacenter docs do.\n\n## Try It Without Believing Me\n\nInstall in editable mode:\n\n```bash\npython -m pip install -e \".[dev]\"\n```\n\nRun the quick health check:\n\n```bash\npython -m gpu_stack.cli stats\n```\n\nRun the verifier while iterating:\n\n```bash\npython -m gpu_stack.cli verify --profile fast\npython -B -m gpu_stack.cli verify --profile fast --read-only\n```\n\nBefore broader graph edits, use the full verifier:\n\n```bash\npython -m gpu_stack.cli verify --profile full\n```\n\nThe installed entry point is also available as:\n\n```bash\ngpu-stack stats\ngpu-stack verify --profile fast\n```\n\n## See One Output As A Cone\n\nStart with a target such as `econ.cost.per_token`.\n\n```python\nimport gpu_stack\nfrom gpu_stack import Registry, subgraph\n\ntarget = Registry.variables[\"econ.cost.per_token\"]\ncone = subgraph(target, direction=\"dependencies\")\n\nprint(target.name)\nprint(f\"{len(cone)} variables upstream\")\nprint(\"first few roots:\")\n\nfor var in sorted(v for v in cone if v.is_root_input)[:12]:\n    print(\"  \", var.name, f\"[{var.units}]\")\n```\n\nThe exact count is not the important part. The posture is. Every cost number has an ancestry, and every unresolved ancestor is named.\n\n## Root Debt\n\n`root-debt` ranks unresolved root inputs by downstream blast radius.\n\n```bash\npython -m gpu_stack.cli root-debt --families --limit 5\n```\n\nObserved summary:\n\n```text\nRoot-debt family ranking:\n  total_roots        619\n  include_constraints False\n  grouped_roots      619\n  family_count       151\n  shown              5\n\ntotal_weight  root_count  family                                      boundary_category  primitive_boundary\n        3014          15  physical.lithography.medium                 primitive-root     True\n        2185          11  physical.lithography                        primitive-root     True\n        1943           8  physical.lithography.source_plasma_drive    primitive-root     True\n        1866          18  physical.mosfet                             primitive-root     True\n        1293           8  physical.process                            primitive-root     True\n```\n\nThis is one of the more useful commands because it prevents the project from drifting into \"add equations wherever it feels cool.\" The graph can tell which unknowns are currently expensive.\n\n## Scenario Reports\n\nPresets can evaluate named targets and return structured artifacts.\n\n```python\nfrom gpu_stack.presets import scenarios\n\nreport = scenarios.dense_training_cost_fixture.evaluate_targets([\n    (\"tokens_per_second\", \"training.tokens_per_second\"),\n    (\"job_dc_power\", \"econ.job.dc_power\"),\n    (\"run_power_cost\", \"econ.run.power_cost\"),\n    (\"cost_per_token\", \"econ.cost.per_token\"),\n])\n\nprint(report.status)\nfor target in report.targets:\n    print(target.label, target.status, target.missing_count)\n```\n\nThe CLI equivalent:\n\n```bash\npython -m gpu_stack.cli scenario-report scenarios.dense_training_cost_fixture --json\n```\n\nObserved summary:\n\n```json\n{\n  \"preset\": \"dense_training_cost_fixture\",\n  \"status\": \"ok\",\n  \"assignment_count\": 30,\n  \"target_count\": 4,\n  \"ok_count\": 4,\n  \"error_count\": 0,\n  \"issue_count\": 0,\n  \"ok_target_labels\": [\n    \"tokens_per_second\",\n    \"job_dc_power\",\n    \"run_power_cost\",\n    \"cost_per_token\"\n  ]\n}\n```\n\nRepresentative resolved values:\n\n```text\ntraining.tokens_per_sec = 6666666.66666667\necon.job.dc_power       = 5200.0\necon.run.power_cost     = 0.00078\necon.cost.per_token     = 3.000078e-06\n```\n\nThat fixture is synthetic. It is a deterministic test anchor, not vendor truth, historical data, or a price recommendation. The distinction matters. Fake authority is how technical debt gets a haircut and calls itself strategy.\n\n## Resolver Workflows\n\nResolve a target with explicit assignments:\n\n```bash\npython -m gpu_stack.cli resolve physical.gate.elmore_delay \\\n  --assign physical.gate.r_on=1 \\\n  --assign physical.gate.fanout=1 \\\n  --assign physical.gate.c_input=1 \\\n  --assign physical.interconnect.c_total=1 \\\n  --assign physical.interconnect.r_per_length=0 \\\n  --assign physical.interconnect.c_per_length=1 \\\n  --assign physical.wire_length=1 \\\n  --assign physical.clock_frequency=0.1 \\\n  --constraints\n```\n\nFor stricter runs, pair `--constraints` with `--fail-on-violated-constraints`. Invalid assignments report named feasibility relations before returning nonzero.\n\nScenario-audit surfaces are also available:\n\n```bash\npython -m gpu_stack.cli scenario-audit --json\npython -m gpu_stack.cli scenario-audit --missing-families\n```\n\n## The Next-Work Compass\n\nThe project now has a small continuation compass built from graph evidence:\n\n```bash\npython -m gpu_stack.cli next-work\n```\n\nObserved summary:\n\n```text\nNext work:\n  graph evidence: variables=1517 equations=959 root_inputs=619\n\nTop 3 highest impact:\n  1. Close the sourced Pythia cost frontier\n  2. Pay down the heaviest root-debt family\n  3. Finish metadata coverage before widening scenarios\n\n4 best implementations:\n  1. Registry import graph is currently coherent\n  2. Pythia sourced pack resolves the non-cost targets\n  3. EUV tin120 assumption pack is cleanly bounded\n  4. Dense cost fixture still exercises the full rollup\n```\n\nCaveat: `next-work` currently supports `--json`, but not `--limit`.\n\n## Design Rules\n\nThese rules keep the package honest:\n\n1. Only universal physics constants are `Constant`s.\n2. Everything else is a `Variable`, including clocks, voltages, tensor shapes, GPU counts, tariffs, and optimizer hyperparameters.\n3. Every scope self-registers on import.\n4. `gpu_stack.scopes.SCOPE_MODULES` is the authoritative load order.\n5. The project is symbolic first. It is a graph of definitions, constraints, approximations, variants, iterative updates, and stochastic relations.\n6. A root input is visible modeling debt. It should be decomposed, sourced, or intentionally left as a scenario boundary.\n\n## What This Is Good For Now\n\n- Inspecting symbolic dependencies across hardware, software, thermal, and economic layers.\n- Writing and checking new equations in a single registry.\n- Ranking unresolved roots by downstream blast radius.\n- Resolving selected scenario targets with variant selection, equation traces, missing-family reporting, constraints, and approximation-validity feedback.\n- Exporting structured `ScenarioReport` and `ScenarioTargetReport` artifacts.\n- Auditing sourced scenario packs.\n- Demonstrating how training throughput and cost metrics reduce to lower-level assumptions.\n\n## What This Is Not Yet\n\nThis is the part where the README earns the numbers above.\n\n`gpu_stack` is not yet a calibrated training-cost oracle. It does not solve simultaneous systems. It does not optimize over scenario choices. It does not automatically switch relations when an approximation validity check is symbolic or violated. It does not fill missing physical or economic quantities with convenient defaults and call that wisdom.\n\nThe resolver is intentionally conservative. It propagates one selected defining relation per variable. Unassigned symbolic boundaries are reported as `missing`. Constraints and approximation-validity checks are surfaced instead of treated as decorative comments.\n\nCalibration presets are still skeletal. Some presets are exact composition fixtures. Some are regression anchors. Some are synthetic dense-training cost fixtures. They are useful because they are explicit, not because they are universal.\n\n## Current Snapshot\n\n| Signal | Value |\n|---|---:|\n| Systems | 16 |\n| Variables | 1517 |\n| Constants | 24 |\n| Equations | 959 |\n| Root inputs | 619 |\n| Leaves | 253 |\n| Cycles | 0 |\n| Topological order length | 1517 |\n| Hard audit failures | 0 |\n| Non-constant variables with `sp_units` | 1428 |\n| Non-constant variables with references | 1324 |\n| Equations with references | 878 |\n| Equations with unit checks | 799 |\n| Root-debt families | 151 |\n| Package version | 0.23.0 |\n\nTest counts can move as the model grows. Recheck locally with:\n\n```bash\npython -m pytest --collect-only -q\n```\n\n## Future Visual Demos\n\nThe long-term README should not just explain the graph. It should let the reader see it.\n\nPlanned visual-first demos:\n\n- A live dependency-cone browser for `econ.cost.per_token`, `training.tokens_per_second`, and `thermal.dc.pue`.\n- A root-debt heatmap where unresolved assumptions glow by downstream blast radius.\n- A layer slider that walks from quark-count roots to lithography to transistor delay to GPU peak FLOPs to training step time.\n- A scenario trace view that shows which equations fired, which constraints were checked, and which roots stayed missing.\n\nFor browser 3D work, shipped assets should be GLB or glTF 2.0, optimized after export, with normalized transforms, meaningful hierarchy names, reused materials, explicit pivots, and texture budgets tied to actual screen use. If the equations become spatial demos, the assets should be as disciplined as the equations.\n\n## Core Types\n\n- `Variable`: identity, units, description, scope, symbol assumptions, metadata, and dependency back-references.\n- `Constant`: an immutable `Variable` with a fixed numeric value. This should stay rare.\n- `Equation`: a relation over variables.\n- `Inequality`: a feasibility constraint.\n- `Approximation`: a relation with a validity regime.\n- `PiecewiseEquation`, `DifferentialEquation`, `IterativeEquation`, `StochasticRelation`: richer relation types for the parts of reality that refuse to be one clean line.\n- `System`: a scope-level collection of variables and equations.\n- `Registry`: the global lookup surface.\n- `Preset`: scenario assignments, variants, and target evaluation support.\n\n## Inspect The Registry In Python\n\n```python\nimport gpu_stack\nfrom gpu_stack import Registry, find_cycles, topological_sort\n\nprint(Registry.stats())\nprint(find_cycles())\nprint(len(topological_sort()))\n```\n\nRebuild after a registry reset:\n\n```python\nimport gpu_stack\nfrom gpu_stack import Registry\n\nRegistry.reset()\nstats = gpu_stack.bootstrap()\nprint(stats)\n```\n\nInspect defining equations:\n\n```python\nfrom gpu_stack import Registry\n\npeak_gpu = Registry.variables[\"gpu.peak_flops\"]\nfor eq in peak_gpu.defining_equations:\n    print(eq.name)\n    print(eq.as_sympy())\n    print(eq.description)\n```\n\nSubstitute numeric values into one equation:\n\n```python\nimport sympy as sp\nfrom gpu_stack import Registry\n\nnode_eq = Registry.equations[\"cluster.eq.node_peak_flops\"]\nrack_eq = Registry.equations[\"cluster.eq.rack_peak_flops\"]\n\nnode_peak = node_eq.evaluate_rhs({\n    Registry.variables[\"cluster.node.n_gpus\"].symbol: 8,\n    Registry.variables[\"gpu.peak_flops\"].symbol: sp.Float(15e15),\n})\n\nrack_peak = rack_eq.evaluate_rhs({\n    Registry.variables[\"cluster.rack.n_nodes\"].symbol: 9,\n    Registry.variables[\"cluster.node.peak_flops\"].symbol: node_peak,\n})\n\nprint(sp.N(rack_peak))\n# 1.08e18\n```\n\nExport a graph slice:\n\n```python\nfrom gpu_stack import Registry, subgraph, to_dot\n\nroot = Registry.variables[\"econ.cost.per_token\"]\ncone = sorted(subgraph(root, direction=\"dependencies\"), key=lambda v: v.name)\ndot_text = to_dot(cone)\nprint(dot_text[:400])\n```\n\n## Repository Layout\n\n```text\n.\n├── README.md\n├── PRODUCT.md\n├── DESIGN.md\n├── pyproject.toml\n├── docs/\n│   ├── assets/\n│   └── readme_fragments/\n├── tests/\n└── gpu_stack/\n    ├── __init__.py\n    ├── constants.py\n    ├── demo.py\n    ├── next_work.py\n    ├── core/\n    ├── presets/\n    └── scopes/\n```\n\n## Project Status Docs\n\nThe README is the front door. The moving project ledger lives here:\n\n- [`./IMPROVEMENT_MAP.md`](./IMPROVEMENT_MAP.md)\n- [`./ROADMAP.md`](./ROADMAP.md)\n- [`./HANDOFF.md`](./HANDOFF.md)\n- [`./CHANGELOG.md`](./CHANGELOG.md)\n- [`./SESSION_STATE.md`](./SESSION_STATE.md)\n- [`./VISIBLE_BACKLOG.md`](./VISIBLE_BACKLOG.md)\n- [`./AGENT_DIARY.md`](./AGENT_DIARY.md)\n- [`./rest_breaks/README.md`](./rest_breaks/README.md)\n\nThe diary and break-room files are not part of the package API. They are there because long-running work needs memory, and apparently so do the agents doing it.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcuuper22%2Fgpu_stack-","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcuuper22%2Fgpu_stack-","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcuuper22%2Fgpu_stack-/lists"}