{"id":50916216,"url":"https://github.com/borodark/zed","last_synced_at":"2026-06-16T15:30:32.666Z","repository":{"id":351080603,"uuid":"1208949283","full_name":"borodark/zed","owner":"borodark","description":"Declarative BEAM deployment on FreeBSD/illumos. ZFS properties as state store. No etcd, no YAML.","archived":false,"fork":false,"pushed_at":"2026-05-02T03:53:34.000Z","size":433,"stargazers_count":8,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-02T05:33:59.736Z","etag":null,"topics":["deployment","elixir","elixir-lang","erlang","freebsd","freebsd-jails","zfs"],"latest_commit_sha":null,"homepage":"","language":"Elixir","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/borodark.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-13T00:17:02.000Z","updated_at":"2026-04-27T20:44:52.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/borodark/zed","commit_stats":null,"previous_names":["borodark/zed"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/borodark/zed","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/borodark%2Fzed","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/borodark%2Fzed/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/borodark%2Fzed/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/borodark%2Fzed/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/borodark","download_url":"https://codeload.github.com/borodark/zed/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/borodark%2Fzed/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34412784,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-16T02:00:06.860Z","response_time":126,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deployment","elixir","elixir-lang","erlang","freebsd","freebsd-jails","zfs"],"created_at":"2026-06-16T15:30:30.732Z","updated_at":"2026-06-16T15:30:32.658Z","avatar_url":"https://github.com/borodark.png","language":"Elixir","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Zed — ZFS + Elixir Deploy\n\nDeclarative BEAM application deployment on FreeBSD and illumos, using ZFS as the state store and rollback mechanism.\n\n## The Idea\n\nZFS user properties (`com.zed:version=1.4.2`) are a built-in, replicated key-value store that travels with snapshots and `zfs send/receive`. No external state store required — **the deployment state IS the filesystem metadata.**\n\n## Why\n\n**~85% of companies with servers run ≤50 nodes.** They don't need Kubernetes. They need something that works.\n\n| Traditional Stack | Zed |\n|-------------------|-----|\n| etcd/consul cluster (3-5 nodes) | ZFS properties (zero infra) |\n| Terraform state in S3 | State IS the filesystem |\n| Ansible/Chef/Puppet | Elixir DSL, compile-time validated |\n| Container runtime + orchestrator | FreeBSD jails (kernel feature) |\n| 10+ tools to learn | One tool, ~2000 lines |\n\n```\nRollback with K8s:        Rollback with Zed:\n  kubectl rollout undo      zfs rollback tank/app@v1\n  (hope state matches)      (data + state, atomic, O(1))\n```\n\nZed trades global coordination for local simplicity. Each host is authoritative for its own state. That's a feature when your failure domain is per-host anyway.\n\n## Features\n\n- **DSL** — Elixir macros for declaring infrastructure\n- **Convergence** — diff → plan → apply → verify\n- **Instant Rollback** — `zfs rollback` is O(1) and atomic\n- **Jails** — FreeBSD jail.conf.d generation\n- **Multi-Host** — Erlang distribution + `:rpc.call`\n- **Replication** — `zfs send/receive` moves state with data\n\n## Quick Example\n\n```elixir\ndefmodule MyInfra.Prod do\n  use Zed.DSL\n\n  deploy :prod, pool: \"tank\" do\n    dataset \"apps/myapp\" do\n      compression :lz4\n    end\n\n    app :myapp do\n      dataset \"apps/myapp\"\n      version \"1.0.0\"\n      cookie {:env, \"RELEASE_COOKIE\"}\n    end\n\n    snapshots do\n      before_deploy true\n      keep 5\n    end\n  end\nend\n\n# Use it\nMyInfra.Prod.diff()       # Show what would change\nMyInfra.Prod.converge()   # Apply changes\nMyInfra.Prod.status()     # Read state from ZFS\nMyInfra.Prod.rollback(\"@latest\")  # Instant rollback\n```\n\n## Multi-Host Deployment\n\n```elixir\n# Start agents on each host\nZed.Agent.start_link()\n\n# From controller, connect and deploy\nZed.Cluster.connect(:\"zed@host2\")\nZed.Cluster.converge_all(ir)\n\n# Coordinated deploy with automatic rollback on failure\nZed.Cluster.converge_coordinated(ir)\n```\n\n## GPU node abstraction — current progress\n\nThe vision below is intact. The infrastructure for the *runtime side*\n(driving the GPU from BEAM) shipped in May 2026 in the sibling\n[`nx_vulkan`](https://github.com/borodark/nx_vulkan) repository; the *deploy side* (zed's\ndeclarative DSL for GPU clusters) is still on the roadmap — see \"Road\nto Production\" below.\n\n### What shipped (in `nx_vulkan@main`, May 2026)\n\n| Capability | Where | Status |\n|---|---|---|\n| Vulkan compute backend (no CUDA, no Metal) | `nx_vulkan/lib/nx_vulkan/native.ex` + spirit | ✅ Cross-platform validated on Linux RTX 3060 Ti + FreeBSD GT 750M + GT 650M (178/178 tests) |\n| Long-lived per-machine GPU node GenServer | `Nx.Vulkan.Node` + `with_node/2` | ✅ |\n| Persistent `vkPipelineCache` (disk, header-validated) | `Nx.Vulkan.PipelineCache` | ✅ 4× cold-start speedup |\n| Runtime shader synthesis from per-family spec | `Nx.Vulkan.Synthesis` + `ShaderTemplate` | ✅ \u003c200 ms cold path; 6 hand-written + 3 synthesized chain shader families |\n| MCMC integration (NUTS leapfrog, persistent buffers, EXLA fallback) | `pymc/exmc@main` `Exmc.NUTS.Vulkan.*` | ✅ |\n| Per-shader suspect tracking (W6 Phase 1) | `Exmc.NUTS.Vulkan.SuspectTracker` | ✅ Eviction policy + cross-shader sliding window |\n\n### What zed needs to add (deploy side)\n\nThe runtime substrate exists. Turning it into the declarative\n`deploy :gpu_cluster` block below requires zed-specific work that\nhasn't started yet:\n\n```elixir\ndeploy :gpu_cluster, pool: \"tank\" do\n  node :workstation do\n    gpu \"RTX 4090\", vram: 24\n  end\n\n  model :llama70b do\n    dataset \"models/llama-70b\"\n    requires vram: 48\n  end\n\n  job :finetune do\n    model :llama70b\n    checkpoint_every \"1 epoch\"  # checkpoint = zfs snapshot\n  end\nend\n```\n\nThe mapping from `nx_vulkan`'s capabilities into a zed deploy spec is:\n\n| Vision DSL block | What zed must build | Estimated effort |\n|---|---|---|\n| `node :workstation do gpu ... end` | Inventory primitive that calls `nvidia-smi` / `pciconf` to enumerate GPU(s); reflect into ZFS properties (`com.zed:gpu.vendor`, `com.zed:gpu.vram_mb`) on the host. | 1 week |\n| `model :llama70b do requires vram: 48 end` | Scheduler that matches model VRAM requirements against host `com.zed:gpu.vram_mb`. Refuses to deploy if no host has enough VRAM. Pure Elixir, no new infrastructure. | 1 week |\n| `model do dataset \"models/...\" end` | Already covered by zed's existing dataset primitive — model files are just ZFS datasets. Zero new code. | 0 |\n| `job :finetune do checkpoint_every \"1 epoch\" end` | Hooks into a training loop callback. Triggers `zfs snapshot dataset@epoch-N`. Probably a behavior the user-app implements; zed provides the snapshot primitive (already exists). | 1 week |\n| GPU node lifecycle (start `Nx.Vulkan.Node` under app supervisor, restart on driver crash, persist cache on shutdown) | Agent verb that reads `com.zed:gpu.driver` and starts the right OTP application. Standard zed agent pattern. | 1 week |\n| mDNS service discovery (`_exmc_gpu._tcp.local`) | Coordinate with `nx_vulkan` Phase 3 work — both projects plan to use `mdns_lite`. Need a service-name convention. | 2-3 weeks (joint with nx_vulkan) |\n\n```\nModel versioning?    zfs snapshot\nModel distribution?  zfs send/receive\nExperiment tracking? ZFS properties (com.zed:loss=0.0023)\nCheckpoint/resume?   Snapshots travel to any node\nRollback bad run?    zfs rollback (O(1))\nGPU dispatch?        Nx.Vulkan.Node.with_node/2  ← shipped\nPer-host inventory?  zed agent reads PCIe + /dev/nvidia*  ← TODO\n```\n\nSee [docs/gpu-cluster.md](docs/gpu-cluster.md) for the original vision.\n\n## Road to Production\n\nHonest assessment of what's missing before zed should be trusted with\nproduction workloads. Categorized by risk to a deployment, not by\nchronological order. Each line is a real deficit, not a polish item.\n\n### P0 — must fix before anyone runs zed in prod\n\n- [ ] **Convergence engine end-to-end on a real deploy.** A1-A5a are\n      individual layers; the *combined* `Module.converge()` on a\n      multi-host deploy with ZFS + Bastille + cluster has been\n      live-tested only on the dev machines, not on a clean prod-shaped\n      target. **Effort: 1-2 weeks live-burn.**\n- [ ] **Health checks wired to convergence.** `app :foo do health\n      :http, url: \"...\" end` exists in the spec; the executor does not\n      yet wait on health checks before declaring success. Critical:\n      without this, a \"successful\" deploy can leave the app crashed.\n      **Effort: 1 week.**\n- [ ] **Rollback under partial failure.** If a multi-host deploy\n      succeeds on hosts A+B and fails on C, `Zed.Cluster.converge_coordinated`\n      is supposed to roll all three back. The path exists but hasn't\n      been chaos-tested under realistic failure modes (network\n      partition during apply, ZFS pool full, jail.conf syntax error\n      mid-apply). **Effort: 2 weeks chaos-test + harden.**\n- [ ] **Secrets at rest.** A1 produces encrypted `\u003cbase\u003e/zed/secrets`\n      with fingerprint-stamped properties. The pipeline that gets\n      secrets *into* the deploying app's env is partly designed\n      ([`docs/SECRETS_DESIGN.md`](docs/SECRETS_DESIGN.md)) but not\n      fully shipped — current deploys rely on env files placed by the\n      operator. **Effort: 2 weeks to ship the agent-side decrypt path.**\n- [ ] **Erlang-distribution security**. Cluster RPC currently uses\n      cookie auth + Unix sockets between zedweb/zedops. Production\n      deployments need either TLS distribution or a hardened\n      `epmd_proxy`. The `getpeereid` NIF covers local IPC; cross-host\n      cookies on the open network do not. **Effort: 1 week.**\n\n### P1 — should fix before scaling beyond a single operator\n\n- [ ] **No CI/CD integration.** No GitHub Actions / Forgejo / etc.\n      runner that runs `mix test` + `mix test --include zfs_live` on\n      every push. Currently the only verification is the operator\n      running the live tests by hand. **Effort: 2 days.**\n- [ ] **No telemetry / observability beyond log files.** No\n      `:telemetry` events on convergence steps, no Prometheus/StatsD\n      hooks. `LiveDashboard` is wired in zedweb but the converger\n      itself is opaque. **Effort: 1 week.**\n- [ ] **No upgrade strategies.** A `Module.converge()` either replaces\n      a service entirely or doesn't. No rolling upgrade, no\n      blue-green, no canary. For a small fleet (\u003c10 hosts) this is\n      fine; beyond that an operator wants finer control. **Effort: 2-3\n      weeks for rolling; another 2 for blue-green.**\n- [ ] **DSL coverage is shallow.** The DSL handles `dataset`, `app`,\n      `jail`, `snapshots`. It doesn't handle: nested deploys,\n      conditional resources (`if env == :prod`), resource hooks\n      (`before_deploy`, `after_deploy`), depends_on graphs. Current\n      workaround is multiple deploy modules. **Effort: 1 week per\n      hook, 2-3 weeks for the dependency graph.**\n- [ ] **No supported-version policy.** OTP 26+ / Elixir 1.17+ is the\n      stated minimum, but the live-test rig pins OTP 27 + Elixir 1.18\n      and there's no LTS commitment. Production needs a written\n      promise about what zed will and won't break across point\n      releases. **Effort: 1 day to write the policy.**\n\n### P2 — nice to have, not blockers for first prod use\n\n- [ ] **mDNS discovery for multi-host deploys.** Currently `Zed.Cluster.connect`\n      takes an explicit node name. mDNS would auto-discover. Coordinated\n      with `nx_vulkan` Phase 3 (see \"GPU node abstraction\" above).\n      **Effort: 2-3 weeks joint.**\n- [ ] **Web UI for non-Erlang operators.** The Phoenix LiveView admin\n      foundation (A2a/A2b/A3/A4) ships; the actual *deploy* UI on top\n      of it (form for editing `Module.converge` parameters,\n      visual diff before apply) doesn't yet. The `zed` command-line is\n      the only deploy interface today. **Effort: 3-4 weeks.**\n- [ ] **No security review.** No external audit; no fuzz testing of\n      the DSL parser; no formal threat model for the Bastille adapter\n      privilege boundary. The `getpeereid` boundary is small and\n      reviewable, but no one outside the dev team has reviewed it.\n      **Effort: 1-2 weeks for an internal pen-test sprint; budget\n      $5-15K for an external audit.**\n- [ ] **Documentation gap for non-FreeBSD users.** README claims\n      \"FreeBSD or illumos (Linux for dev/test only)\". A user wanting to\n      try zed on Ubuntu currently has no guidance — the dev-loop docs\n      assume FreeBSD primitives (Bastille, ZFS-on-root, doas).\n      **Effort: 1 week to write a Linux quickstart.**\n- [ ] **Larger test fleet.** Current dev runs on two FreeBSD Macs +\n      one Linux box. Production validation needs ≥5 hosts, mixed\n      hardware, real network failures. The Spirit project's CI ran on\n      a 12-node cluster; zed has nothing comparable yet. **Effort: 1-2\n      months including hardware acquisition.**\n\n### What zed *won't* do (deliberate scope discipline)\n\n- ❌ **Linux as a first-class deployment target.** Linux is supported\n      for dev/test only. ZFS-on-Linux works but isn't the design center.\n- ❌ **Container orchestration.** Kubernetes / Docker / Podman are out\n      of scope. Zed deploys mix releases into FreeBSD jails or illumos\n      zones. Containers exist; this isn't them.\n- ❌ **Single-host high availability.** Zed is per-host authoritative.\n      For HA you run multiple hosts and let zed coordinate — but each\n      host is its own root of trust. Quorum protocols (Raft, Paxos)\n      are not on the roadmap.\n- ❌ **Cross-cloud abstraction.** No AWS / GCP / Azure terraform-style\n      provider layer. Zed manages BEAM applications on hosts you\n      already have. How those hosts came into existence is your\n      problem.\n\n## Installation\n\n```sh\ngit clone \u003crepo\u003e\ncd zed\nmix deps.get\nmix compile          # builds priv/peer_cred.so via elixir_make\n\n# Run tests\nmix test                                       # 216 unit/integration tests\nZED_TEST_DATASET=\u003cpool\u003e/zed-test \\\n  doas mix test --include zfs_live             # + 24 ZFS-on-FreeBSD tests\nmix test --include bastille_live               # + 7 Bastille-on-FreeBSD tests\n```\n\n## Requirements\n\n- FreeBSD or illumos (Linux for dev/test only)\n- ZFS pool with a delegated test subtree (any name; pass via `ZED_TEST_DATASET`)\n- Erlang/OTP 26+, Elixir 1.17+\n- C compiler for the `peer_cred` NIF (`cc` from FreeBSD base; `gcc`/`clang` on Linux)\n\n## Iteration Arc\n\nThe roadmap lives in [`specs/iteration-plan.md`](specs/iteration-plan.md); each `A*` layer has a per-iteration spec under [`specs/`](specs/). Headline status:\n\n| # | Layer | Status | Notes |\n|---|-------|--------|-------|\n| A0 | DSL slot validation | ✅ Done | Compile-time `storage:` mode check |\n| A1 | `Zed.Bootstrap` (init / status / **rotate** / verify / export-pubkey) | ✅ Done | Encrypted `\u003cbase\u003e/zed/secrets`, fingerprint-stamped ZFS properties, archived rotation history |\n| A2a | Phoenix LiveView admin foundation | ✅ Done | Password login + 8h session + dashboard |\n| A2b | QR admin first-login | ✅ Done | `Zed.QR` + `Zed.Admin.OTT` (single-use, rate-limited, audit-logged) |\n| A3 | Passkey (WebAuthn) auth | ✅ Done | `wax_`-backed; Chrome desktop + Safari iOS + Chrome Android |\n| A4 | SSH-key challenge auth | ✅ Done | `ssh-keygen -Y sign` flow + login script |\n| A5.1 | Bastille jail adapter | ✅ Done | 540 LOC; live-verified after seven real-world bugs ([blog](http://www.dataalienist.com/blog-lie-at-exit-zero.html)) |\n| A5a | **Privilege boundary** (zedweb / zedops split) | ✅ Done | Two `mix release` targets, Unix-socket transport, `getpeereid(2)` NIF, capability-scoped doas, `host-bring-up.sh` |\n| B0 | `zedz` mobile QR scanner | Planned | Fork of probnik with `zed_admin` payload handler |\n\nLayers C (NAS-adjacent: SMB + Time Machine) and D (Probnik Vault + Shamir) are shelved per the iteration plan; unshelve only on explicit decision.\n\n## Architecture\n\n```\nDSL (macros) → IR (validated) → Converge (diff→plan→execute) → ZFS\n                                       ↓\n                              Agent ←──:rpc.call──→ Cluster\n\nAfter A5a:\n   zedweb (no privilege)         zedops (capability-scoped doas)\n   ────────                      ────────\n   Phoenix endpoint              Zed.Ops.Socket   ── Unix socket\n   OpsClient.Pool ──────►        (peer-cred check on accept)\n                                 Zed.Ops.Bastille.Handler\n                                 Runner.System  ──► doas bastille …\n```\n\n## Documentation\n\n**Specs (the plan)**\n- [specs/iteration-plan.md](specs/iteration-plan.md) — full roadmap, decisions log, layer rollup\n- [specs/a5-bastille-plan.md](specs/a5-bastille-plan.md) — Bastille adapter design (A5)\n- [specs/a5a-privilege-boundary.md](specs/a5a-privilege-boundary.md) — privilege boundary spec (A5a)\n- [specs/b0-zedz-plan.md](specs/b0-zedz-plan.md) — mobile companion (B0)\n- [specs/qr-schema.md](specs/qr-schema.md) — QR payload term shapes\n\n**Operational**\n- [docs/doas.conf.zedops](docs/doas.conf.zedops) — production doas template (capability-scoped)\n- [docs/SECRETS_DESIGN.md](docs/SECRETS_DESIGN.md) — secrets pipeline design\n- [docs/MULTI_HOST_TEST.md](docs/MULTI_HOST_TEST.md) — multi-host test setup\n- [scripts/host-bring-up.sh](scripts/host-bring-up.sh) — idempotent FreeBSD setup\n- [scripts/verify-bastille-host.sh](scripts/verify-bastille-host.sh) — readiness checker\n- [scripts/a5a-live-runbook.md](scripts/a5a-live-runbook.md) — Mac Pro live-test runbook\n\n**Background**\n- [docs/BLOG_ZED_MANIFESTO.md](docs/BLOG_ZED_MANIFESTO.md) — the manifesto\n- [docs/gpu-cluster.md](docs/gpu-cluster.md) — GPU cluster vision\n- [docs/pitches.md](docs/pitches.md) — why ZFS properties replace etcd\n- [docs/market.md](docs/market.md) — market analysis\n- [docs/elixirforum-update-1.md](docs/elixirforum-update-1.md) — community progress note\n\n**Project meta**\n- [CONTRIBUTING.md](CONTRIBUTING.md) — how to contribute\n- [CLAUDE.md](CLAUDE.md) — project context and architecture\n\n## Integration with `nx_vulkan`\n\nZed and [`nx_vulkan`](https://github.com/borodark/nx_vulkan) are sibling repos, not coupled at the Mix dependency level. The deployment pattern:\n\n1. Zed orchestrates BEAM nodes (start, supervise, health-check, rollback).\n2. Each node's own `mix.exs` lists `nx_vulkan` (and `exmc`, etc.) as Hex deps — zed doesn't import `nx_vulkan` itself.\n3. The deployed application's supervisor starts `Nx.Vulkan.Node` (the long-lived GPU-node GenServer) under its own tree.\n4. Zed treats it identically to any other OTP application — deploys it, supervises it, doesn't need to know about Vulkan APIs.\n\nPractical compatibility holds today: both pin OTP 27 / Elixir 1.18, share the NAS git server, and have no conflicting global state. See [`specs/nx-vulkan-execution.md`](specs/nx-vulkan-execution.md) for the full integration story (and the historical execution plan).\n\nOpen coordination work (Phase 3 of `nx_vulkan/PLAN_GPU_NODE.md`): both projects plan to use `mdns_lite` for service discovery. Once the multi-client GPU node lands, the two need to agree on service-name conventions (`_zed._tcp.local` vs `_exmc_gpu._tcp.local`) so they don't collide on the local-link advertisement bus.\n\n## Status\n\nPre-1.0, design-iterating, single-maintainer. The iteration plan is being walked one layer at a time with live FreeBSD verification after each landed merge. Issues / PRs are welcome but expect short discussion before sizable changes — the design surface is still being negotiated.\n\n## License\n\nApache License 2.0 — see [LICENSE](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fborodark%2Fzed","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fborodark%2Fzed","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fborodark%2Fzed/lists"}