{"id":50749464,"url":"https://github.com/musharna/jobd","last_synced_at":"2026-06-11T00:02:02.884Z","repository":{"id":361747515,"uuid":"1255513733","full_name":"musharna/jobd","owner":"musharna","description":"Self-hostable GPU-aware job broker for your own machines, with native MCP/agent integration","archived":false,"fork":false,"pushed_at":"2026-06-09T02:23:26.000Z","size":471,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-09T04:18:55.715Z","etag":null,"topics":["distributed-systems","fastapi","gpu","homelab","job-queue","job-scheduler","mcp","self-hosted","tailscale"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/jobd/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/musharna.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-31T23:14:59.000Z","updated_at":"2026-06-09T02:23:29.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/musharna/jobd","commit_stats":null,"previous_names":["musharna/jobd"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/musharna/jobd","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/musharna%2Fjobd","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/musharna%2Fjobd/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mush
arna%2Fjobd/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/musharna%2Fjobd/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/musharna","download_url":"https://codeload.github.com/musharna/jobd/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/musharna%2Fjobd/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34175887,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-10T02:00:07.152Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["distributed-systems","fastapi","gpu","homelab","job-queue","job-scheduler","mcp","self-hosted","tailscale"],"created_at":"2026-06-11T00:02:02.116Z","updated_at":"2026-06-11T00:02:02.880Z","avatar_url":"https://github.com/musharna.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# jobd\n\n[![CI](https://github.com/musharna/jobd/actions/workflows/ci.yml/badge.svg)](https://github.com/musharna/jobd/actions/workflows/ci.yml)\n[![PyPI](https://img.shields.io/pypi/v/jobd)](https://pypi.org/project/jobd/)\n![Python](https://img.shields.io/pypi/pyversions/jobd)\n[![Glama](https://glama.ai/mcp/servers/musharna/jobd/badges/score.svg)](https://glama.ai/mcp/servers/musharna/jobd)\n[![License: 
MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)\n\n**A self-hostable, GPU-aware job broker for your own machines — with native MCP/agent integration.**\n\n\u003e Like [task-spooler](https://manpages.ubuntu.com/manpages/noble/man1/tsp.1.html), but across more than one machine — and VRAM-aware.\n\n\u003c/div\u003e\n\nmcp-name: io.github.musharna/jobd\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/musharna/jobd/main/docs/assets/demo.svg\" alt=\"jobd in action: submit a GPU job, watch it route to a worker with free VRAM and stream back, then inspect the full lifecycle\" width=\"100%\"\u003e\n\u003c/p\u003e\n\nYou have a couple of boxes with GPUs — a workstation, a server, maybe a laptop — wired together over [Tailscale](https://tailscale.com/) or a LAN. You want to fire off training runs, data pipelines, and long batch jobs from anywhere, have them land on whichever machine actually has the VRAM free, survive across sessions, and get preempted cleanly when something more important shows up. You don't have a cloud, a Kubernetes cluster, or a Slurm install, and you don't want one.\n\njobd is that missing piece: a small broker that turns a handful of personal machines into a single queue. Think _SkyPilot / Modal, for people without a cloud_ — except the fleet is the hardware you already own, and an LLM agent can drive it directly.\n\n```bash\n# from any machine on your tailnet:\njob submit --project myproj --gpu --vram-required 16 --wait -- python train.py\n# → routed to whichever worker has ≥16 GB VRAM free, streamed back to your terminal\n```\n\n## Why it exists\n\nMost schedulers assume a datacenter. The lightweight ones that don't (a bare `nohup`, a tmux session, an ssh-and-pray script) give you nothing: no queue, no VRAM-aware routing, no preemption, no record of what ran where. 
jobd fills the gap between \"ssh in and run it\" and \"stand up Slurm\":\n\n- **VRAM-fit routing.** The broker matches each job against live worker capacity (free VRAM / RAM / CPUs, capability tags, arch/OS) and dispatches to a worker that actually fits — instead of you guessing which box is free.\n- **Preempt + checkpoint.** A higher-priority job can preempt a running one: the worker sends `SIGTERM`, the workload gets a grace window to checkpoint, then `SIGKILL`. A preempted job reaches a terminal `preempted` state with a durable checkpoint to resume from — it isn't silently re-run. (See [docs/preemption.md](docs/preemption.md).)\n- **Survives sessions.** Submit, close your laptop, check back tomorrow. Jobs live in the broker, not your shell.\n- **Agent-native.** Ships a first-class [MCP](https://modelcontextprotocol.io/) server so an LLM agent (Claude Code, etc.) can submit, monitor, and babysit jobs as tool calls — the thing most schedulers bolt on as an afterthought, if at all.\n- **Yours.** One broker process you run on a machine you own. No accounts, no egress, no per-GPU-hour billing. 
Tailnet-bound by default.\n\n## Why not just use…?\n\n| Tool                                                                           | What it gives you                                           | Why jobd instead                                                                                               |\n| ------------------------------------------------------------------------------ | ----------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------- |\n| **`nohup` / `tmux` / ssh-and-pray**                                            | Runs a command on one box                                   | No queue, no VRAM-aware routing, no preemption, no record of what ran where                                    |\n| **[task-spooler](https://manpages.ubuntu.com/manpages/noble/man1/tsp.1.html)** | A real job queue — on a single machine                      | jobd queues across _all_ your machines and routes by live VRAM/CPU fit                                         |\n| **Slurm**                                                                      | Datacenter-grade scheduling                                 | Heavy to stand up and operate for 2–3 personal boxes; jobd is one process + a poller per host                  |\n| **SkyPilot / Modal / dstack**                                                  | Provision and run on clouds (SkyPilot also on-prem via SSH) | jobd targets hardware you _already own_, with no cloud/K8s assumptions and a much smaller footprint            |\n| **Ray**                                                                        | A distributed-compute framework                             | jobd is a job _queue_, not a programming model — submit any command, no code changes, GPU-fit routing built in |\n\nClosest in spirit are task-spooler (single-node) and on-prem SkyPilot (heavier, cloud-shaped). 
jobd's niche is the 2–3-GPU homelab: multi-machine VRAM-fit routing + preempt/checkpoint + a native agent interface, with nothing to stand up.\n\n## Architecture\n\n```mermaid\nflowchart TD\n    CLI[\"job CLI\"]:::client --\u003e B\n    MCP[\"jobd-mcp\u003cbr/\u003eMCP tools\"]:::client --\u003e B\n    API[\"HTTP · SSE\"]:::client --\u003e B\n    B[\"\u003cb\u003ejobd broker\u003c/b\u003e — FastAPI\u003cbr/\u003equeue · matcher · priorities · SQLite\"]:::broker\n    B \u003c--\u003e|poll · dispatch| WA[\"worker A\u003cbr/\u003e24 GB GPU\"]:::worker\n    B \u003c--\u003e|poll · dispatch| WB[\"worker B\u003cbr/\u003e8 GB GPU\"]:::worker\n    B \u003c--\u003e|poll · dispatch| WC[\"worker C\u003cbr/\u003eCPU-only\"]:::worker\n    classDef client fill:#1f2937,stroke:#4b5563,color:#e5e7eb;\n    classDef broker fill:#0e7490,stroke:#155e75,color:#ecfeff;\n    classDef worker fill:#14532d,stroke:#166534,color:#dcfce7;\n```\n\nWorkers **poll** the broker (pull model — no inbound connection to a worker); the broker matches each job against live capacity and hands it back on the poll. One broker process, one poller per host.\n\n- **Broker** — a FastAPI + SQLite service. Holds the queue, runs the matcher, resolves per-project priorities and defaults, exposes a small HTTP API and an SSE stream. Single source of truth.\n- **Workers** — lightweight polling agents, one per host. Each advertises live capacity via heartbeat, claims jobs it can run, executes them (`shell=False`, no shell-injection surface), streams logs back, and honors preemption signals.\n- **Clients** — the `job` CLI, the `jobd-mcp` MCP server, or anything that speaks the HTTP API.\n\n## Install\n\n```bash\npip install jobd               # broker + CLI\npip install \"jobd[mcp]\"        # adds the MCP server\npip install \"jobd[worker]\"     # adds the worker daemon (jobd-worker)\n```\n\nRequires Python ≥ 3.11. 
Everything ships in the one `jobd` package: the broker (`jobd`), the CLI (`job`), the MCP server (`jobd-mcp`), and the worker (`jobd-worker`). The worker's runtime deps (httpx, psutil, pyyaml, nvidia-ml-py) live behind the `[worker]` extra since they're only needed on machines that actually run jobs. `scripts/install-worker.sh` sets a worker up under `~/jobd-worker` with its own venv and a generated config.\n\n## Quickstart (single host)\n\n```bash\n# 1. start the broker (binds 127.0.0.1:8765 by default)\nJOBD_ALLOW_NO_AUTH=1 jobd          # no-auth is fine for a loopback-only broker\n\n# 2. in another shell, install + start a worker pointed at it\npip install \"jobd[worker]\"\nJOBD_URL=http://127.0.0.1:8765 JOBD_WORKER_HOST=local jobd-worker\n\n# 3. submit a job and wait for it\njob submit --project demo --wait -- echo hello\njob list\njob logs \u003cid\u003e\n```\n\nFor a real multi-host deployment (Docker broker + systemd workers, Tailscale binding, shared auth token), see **[docs/security.md](docs/security.md)** and the templates in `docker-compose.yml` and `scripts/` (broker compose, `install-worker.sh`, `job-worker.service`). Day-2 operations (health, draining a worker, upgrades, token rotation, backups) are in **[docs/runbook.md](docs/runbook.md)**.\n\n## Supported platforms\n\nPython 3.11+ everywhere.\n\n| Component                              | Linux   | macOS       | Windows              |\n| -------------------------------------- | ------- | ----------- | -------------------- |\n| **Broker** (`jobd`)                    | ✅      | ✅          | ✅ (WSL recommended) |\n| **CLI** (`job`) / **MCP** (`jobd-mcp`) | ✅      | ✅          | ✅                   |\n| **Worker** (`jobd-worker`)             | ✅ full | ⚠️ degraded | ⚠️ degraded          |\n\nThe **worker** runs best on Linux with a systemd user instance: memory caps, process reaping, and preemption use `systemd-run --user` scopes and cgroups. 
On non-systemd hosts the worker still executes jobs, but silently drops those guarantees — fine for a single trusted box, not for hard resource isolation. GPU features need NVIDIA + `nvidia-ml-py`. The broker, CLI, and MCP server are pure-Python and portable.\n\n## CLI\n\n```\njob submit -p PROJ [--gpu] [--vram-required N] [--needs TAG]... [--count N | --sweep K=v1,v2]... [--wait] -- CMD...\njob list [--state STATE] [--project P] [--array A\u003cid\u003e]   # queue + recent jobs\njob status ID | A\u003cid\u003e [--watch]             # one job, or an array's aggregate\njob logs ID [-n BYTES]                      # tail captured output\njob wait ID                                 # block until terminal\njob cancel ID  /  job preempt ID            # stop a job\njob workers                                 # fleet snapshot + health\njob projects list | set NAME PRI | nudge NAME DELTA\njob audit [--project P] [--since 24h]       # event history\n```\n\n`job submit --explain` dry-runs the resolution (priority, profile, project defaults, host pin) and prints the effective config without enqueuing anything.\n\n### Job arrays\n\nSubmit N jobs from one template with `--count N`. Each member is a normal job — it routes, runs, preempts, and checkpoints independently — and `{i}` in the command is replaced by the member's 0-based index:\n\n```bash\njob submit -p train --count 8 -- python train.py --fold {i}\n# → Submitted array A42: 8 jobs (ids 42..49)\n\njob list --array A42         # the members, with their index annotations\njob status A42               # aggregate: state tally + per-member rollup\n```\n\nThe array is identified as `A\u003cid\u003e` (the first member's job id). `job status A42` exits non-zero if any member ended in a non-completed terminal state, so it composes with shell `\u0026\u0026`.\n\nFor a grid search, use `--sweep KEY=v1,v2,v3` (repeatable) instead of `--count`. 
The broker fans out the cartesian product of all axes, substituting `{KEY}` per member; `{i}` (the flat member index) is also available:\n\n```bash\njob submit -p train --sweep lr=0.1,0.01 --sweep seed=1,2,3 \\\n  -- python train.py --lr {lr} --seed {seed} --out run-{i}\n# → Submitted array A50: 6 jobs (ids 50..55)   # 2 × 3 = 6 members\n```\n\n`--sweep` and `--count` are mutually exclusive, the product is capped at 1000 members, and `i` is reserved as an axis key. Substitution is a literal `{key}` replace (not `str.format`), so JSON literals and shell braces in the command pass through untouched.\n\n## MCP / agent integration\n\njobd ships an MCP server (`jobd-mcp`) exposing the queue as nine tools — `jobd_submit`, `jobd_status`, `jobd_logs`, `jobd_list`, `jobd_cancel`, `jobd_preempt`, `jobd_workers`, `jobd_job_get`, `jobd_worker_delete`. Point your MCP client at it:\n\n```json\n{\n  \"mcpServers\": {\n    \"jobd\": {\n      \"command\": \"jobd-mcp\",\n      \"env\": {\n        \"JOBD_URL\": \"http://127.0.0.1:8765\",\n        \"JOBD_API_TOKEN\": \"\u003cyour-token\u003e\"\n      }\n    }\n  }\n}\n```\n\n`JOBD_API_TOKEN` must match the broker's token, or every call returns 401. Omit it only when the broker runs with `JOBD_ALLOW_NO_AUTH=1`.\n\nNow an agent can \"run this overnight,\" check on it next session, and route GPU work through the broker instead of colliding on a shared card. The `examples/claude-code-hooks/` directory has optional [Claude Code](https://docs.claude.com/en/docs/claude-code) hooks that _nudge_ (or hard-block) an agent toward submitting heavy commands through jobd — including a VRAM-aware GPU guard with `# NO_GPU` / `# CONCURRENT_OK` / `# VRAM=NGB` override markers.\n\n## Configuration\n\nThree optional YAML files under `JOBD_CONFIG_DIR` (defaults shipped in `config/`):\n\n- **`projects.yaml`** — per-project base priority and submit defaults (preemptibility, wall/idle timeouts, host pins, capability requirements). 
See [docs/plans/projects-yaml.md](docs/plans/projects-yaml.md) for the full resolution model.\n- **`profiles.yaml`** — named resource bundles (`--profile gpu-train-large`) the matcher uses to size a job.\n- **`classifier.yaml`** — rules that auto-suggest a profile from the command string.\n\nAll three are optional; with none present, every job runs at the global default priority.\n\n## Concurrency (multislotting)\n\nBy default each worker runs **one job at a time** (`JOBD_WORKER_MAX_CONCURRENT_JOBS=1`). Raise it to let a worker bin-pack several jobs that fit side by side:\n\n```bash\nJOBD_WORKER_MAX_CONCURRENT_JOBS=3 jobd-worker\n```\n\nThe matcher is resource-aware, so this is **not** blind N-up oversubscription. Each in-flight job reserves its `vram_gb` / `ram_gb` / `cpus` footprint, and the worker's heartbeat advertises only what's left (`free_vram = raw − Σ in-flight`). The broker won't place a job that doesn't fit the remaining headroom. The practical payoff: **a CPU-only job and a GPU job run at the same time** — the CPU job reserves 0 VRAM, so it never blocks the GPU slot, and vice-versa. Two GPU jobs co-run only if both fit live VRAM (the `/next-job` admission gate is the final safety net against an overstated ad).\n\n`job workers` reports each worker's slot usage — `running` jobs out of `max_concurrent` — alongside the live resource ad:\n\n```jsonc\n// job workers\n{ \"host\": \"desktop\", \"state\": \"online\", \"running\": 2, \"max_concurrent\": 3,\n  \"free_vram_gb\": 9.1, \"idle_cpus\": 6, ... }\n```\n\nSet the limit per worker from its environment (systemd unit, shell, or `worker.yaml` env) — it's a worker-local knob, not a broker setting.\n\n## Retention\n\nBy default jobd **keeps every job record and `.log` file forever** — history is never lost. 
On a long-running broker, opt into pruning:\n\n```bash\nJOBD_JOB_RETENTION_DAYS=30 jobd   # delete terminal jobs + their logs after 30 days\n```\n\nThe sweeper deletes jobs in a terminal state whose `finished_at` is older than the horizon, unlinks their per-job `.log`, and emits a `jobs_pruned` event. Freed SQLite pages are reused under WAL, so the DB file stays bounded without a global-locking `VACUUM`. The default (`0`) keeps everything; pruning old terminal parents is safe for any still-pending dependents.\n\n## Security\n\nThe broker has **no TCP-layer auth beyond a shared bearer token**, so it is meant to run on a trusted network (loopback or a Tailscale tailnet), never on a public interface. Two stacked controls:\n\n1. **Interface binding** — `JOBD_HOST` must be `127.0.0.1` or a Tailscale CGNAT address (`100.64.0.0/10`), never `0.0.0.0`. A CI lint (`tests/test_deploy_lint.py`) enforces this on the Docker deployment.\n2. **Bearer token** — set `JOBD_API_TOKEN` (≥32 random bytes) on every broker/worker/CLI/MCP host. The broker refuses to start without it unless you explicitly set `JOBD_ALLOW_NO_AUTH=1`. **`JOBD_ALLOW_NO_AUTH=1` is for a loopback-only broker (`JOBD_HOST=127.0.0.1`) — for local dev/tests.** Combined with a non-loopback `JOBD_HOST` it exposes an unauthenticated RCE endpoint to your whole tailnet; the broker logs a startup warning if you do this. Don't.\n\nFull threat model, env-var reference, and token rotation: **[docs/security.md](docs/security.md)**.\n\n## License\n\nMIT — see [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmusharna%2Fjobd","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmusharna%2Fjobd","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmusharna%2Fjobd/lists"}