https://github.com/lrdinsu/workron
A distributed job scheduler built in Go from scratch. Built incrementally to explore concurrency and distributed systems patterns.
concurrency distributed-systems go golang heartbeat job-scheduler task-queue worker-pool
- Host: GitHub
- URL: https://github.com/lrdinsu/workron
- Owner: lrdinsu
- License: mit
- Created: 2026-03-10T00:12:37.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2026-03-29T23:29:06.000Z (3 months ago)
- Last Synced: 2026-03-30T01:35:12.710Z (3 months ago)
- Topics: concurrency, distributed-systems, go, golang, heartbeat, job-scheduler, task-queue, worker-pool
- Language: Go
- Homepage: https://lrdinsu.github.io/posts/designing-distributed-job-scheduler-go/
- Size: 104 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Workron
A distributed job scheduler written in Go, designed for ML and batch workloads.
---
## Overview
Workron is a distributed job scheduler that accepts jobs via a REST API and executes them across concurrent workers. It supports two deployment modes: a single-process standalone mode where the scheduler and workers share memory, and a distributed mode where the scheduler and workers run as separate binaries communicating over HTTP.
Jobs can declare dependencies on other jobs, forming a DAG (directed acyclic graph). The scheduler validates the dependency graph at submission time, rejecting cycles, and only makes downstream jobs available for execution once all their upstream dependencies have completed.
Jobs are persisted to SQLite or PostgreSQL, so in-flight and completed work survives a full scheduler restart. PostgreSQL uses `FOR UPDATE SKIP LOCKED` for safe concurrent job claiming across multiple connections, preparing for multi-scheduler deployments. An in-memory store is also available for development and testing.
Workers register with the scheduler on startup, reporting their resource capacity (VRAM, memory) and execution address. Workers send periodic heartbeats while processing jobs. A background reaper on the scheduler detects stale job heartbeats and re-queues orphaned jobs, and marks workers with stale heartbeats as offline, ensuring no work is silently lost when a worker crashes.
For distributed training and other multi-worker workloads, Workron supports gang scheduling: a single job request creates N coordinated tasks that must all be placed on suitable workers before any of them start. A background admission cycle reserves workers atomically, so tasks never start partially. When one task fails or its worker dies mid-run, the scheduler enters coordinated drain: running siblings receive a preempt signal on their next heartbeat, send SIGTERM to their child processes, optionally upload a checkpoint, and acknowledge back. Once all in-flight tasks have drained, the gang returns to `blocked` atomically and is re-admitted on the next tick. No task is left running when its peers are already dead.
If you are curious about the design decisions and trade-offs behind this project, I wrote about the journey here:
- [Before the Code: Designing a Distributed Job Scheduler in Go](https://lrdinsu.github.io/posts/designing-distributed-job-scheduler-go/)
- [Building the Concurrent Monolith: Atomic Job Claiming in Go](https://lrdinsu.github.io/posts/building-concurrent-monolith-atomic-job-claiming-go/)
- [Splitting and Surviving Failures: HTTP Workers and Heartbeat Detection in Go](https://lrdinsu.github.io/posts/splitting-and-surviving-failures-workron/)
- [Surviving the Crash: Adding SQLite Persistence Without Touching Business Logic](https://lrdinsu.github.io/posts/persisting-jobs-with-sqlite-workron/)
- [DAG Dependencies: Teaching a Job Scheduler to Wait](https://lrdinsu.github.io/posts/dag-dependencies-workron/)
- [Making the Invisible Visible: Structured Logging, Metrics, and Request Tracing](https://lrdinsu.github.io/posts/observability-slog-prometheus-workron/)
- [All or Nothing: Gang Scheduling in Workron](https://lrdinsu.github.io/posts/gang-scheduling-workron/)
---
## Features
- **REST API:** Submit, monitor, and manage jobs over HTTP
- **DAG pipelines:** Jobs declare upstream dependencies, validated at submission with cycle detection; downstream jobs run only after all dependencies complete
- **Pluggable storage:** In-memory for development, SQLite for single-node persistence, PostgreSQL for concurrent multi-connection access
- **Atomic job claiming:** Mutex in memory, `UPDATE ... RETURNING` in SQLite, `FOR UPDATE SKIP LOCKED` in PostgreSQL
- **Automatic retry:** Failed jobs are re-queued up to `MaxRetries` times before being marked permanently failed
- **Two deployment modes:** Standalone (single process) or distributed (separate scheduler + worker binaries over HTTP)
- **Heartbeat-based failure detection:** Workers send a heartbeat every 5 seconds; the scheduler re-queues orphaned jobs after 30 seconds of silence
- **Graceful shutdown:** Workers finish their current job before exiting
- **Structured logging:** JSON output via `log/slog` with typed fields, log levels, and per-component logger injection
- **Prometheus metrics:** Counters for job lifecycle events, histograms for execution duration and queue wait, gauges for queue state via custom collector
- **Request ID tracing:** UUID per HTTP request, `X-Request-ID` header, request-scoped logger via context
- **Multi-scheduler coordination:** Multiple instances share one PostgreSQL database; advisory locks ensure only one reaper runs at a time
- **Health endpoint:** `GET /health` returns instance ID, uptime, and status for load balancer checks
- **Job resource requirements:** Jobs declare VRAM and memory needs, with priority and queue assignment
- **Worker registration:** Workers register with resource capacity, execution address, and tags; stale workers marked offline automatically
- **Gang scheduling:** Submit N coordinated tasks that all reserve workers atomically before any start; background admission cycle places largest gangs first with capacity accounting across running and reserved jobs; gang env vars (`GANG_ID`, `GANG_SIZE`, `GANG_INDEX`, `GANG_PEERS`) injected at claim time
- **Gang preemption:** When one gang task fails while siblings run, the scheduler drains the whole gang: running siblings receive a preempt action on their next heartbeat, send SIGTERM with a grace window, then SIGKILL if needed. The gang returns to `blocked` atomically and is re-admitted. A 45-second timeout force-drains workers that stop heartbeating mid-drain. The trigger task keeps its claim-time retry increment; innocent siblings are refunded so preemption does not consume their retry budget.
- **Checkpoint and resume:** Workers can `POST /jobs/{id}/checkpoint` opaque bytes while preempting. The scheduler surfaces them as `CHECKPOINT_DATA` (base64) in the job's env on the next claim, so re-admitted tasks can resume from their last saved state instead of starting over.
- **Kubernetes deployment:** Kustomize manifests for a local kind cluster (multi-replica scheduler with PostgreSQL advisory-lock coordination, worker `Deployment`, in-cluster Postgres `StatefulSet`, init containers for ordering, liveness/readiness probes, NodePort access). `make k8s-up` brings up the full stack; `make k8s-demo` runs the gang-preemption flow against real in-cluster pods. See [Running on Kubernetes](#running-on-kubernetes).
**Planned: Scheduling Intelligence**
- [ ] Priority-based preemption across gangs and single jobs
- [ ] Queue resource quotas with cross-queue borrowing
**Planned: Execution Semantics**
- [ ] Job cancellation (with cascading cancel for DAGs)
- [ ] Configurable retry backoff and per-job timeouts
- [ ] Backpressure and concurrency control
---
## Architecture
### Job Lifecycle
```
Submit ──► pending ──► running ──► done
              ▲           │
              └── retry ──┘  (if attempts < max_retries)
                          │
                          ▼
                       failed  (if retries exhausted)

With dependencies:
Submit ──► blocked ──► pending ──► running ──► done
           (waits for all          (normal lifecycle)
            deps to be done)

Gang scheduling (gang_size > 1):
Submit ──► blocked ──► reserved ──► running ──► done
           (all N      (admission   (each worker
            tasks       cycle        claims its
            start       places all   pre-assigned
            blocked)    N at once)   task)

Gang preemption (a task fails while siblings run):
running ──► preempting ──► preempted ──► blocked ──► (re-admitted)
            (drain         (worker ack   (all siblings
             started;       via POST      drained; gang
             SIGTERM sent   /preempted,   returns to
             via heartbeat  or reaper     admission pool)
             response)      force-drains
                            after 45s)
                                │
                                ▼
                              failed
                        (if any sibling has
                         attempts >= max_retries
                         or is already done/failed
                         at drain completion)
```
### Standalone Mode
Everything runs in a single process. Workers access the job store directly through shared memory, protected by a mutex.
```
┌──────────────────────────────────────────────┐
│              Single Go Process               │
│                                              │
│  REST API ──► Job Store ◄── Workers          │
│  (HTTP)    (mem/SQLite/PG)  (goroutines)     │
│              [blocked]      Worker 1         │
│              [pending]      Worker 2         │
│              [running]      Worker 3         │
│              [done]                          │
│                                              │
│  Reaper (background goroutine)               │
│   ├─ scans running jobs every 10s            │
│   ├─ re-queues jobs with stale heartbeats    │
│   └─ unblocks ready downstream jobs          │
└──────────────────────────────────────────────┘
```
### Distributed Mode
The scheduler and workers run as separate binaries. Workers poll the scheduler over HTTP to claim jobs, send heartbeats, and report results. Workers can run on different machines. Multiple scheduler instances can share one PostgreSQL database for high availability.
```
┌──────────────────────┐         ┌──────────────────────┐
│      Scheduler       │         │       Workers        │
│                      │  HTTP   │                      │
│  REST API            │◄───────►│  Worker Process A    │
│  Job Store           │         │  Worker Process B    │
│  (mem/SQLite/PG)     │         │  Worker Process C    │
│  Reaper              │         │                      │
│                      │         │  Sends heartbeats    │
│  Validates DAGs      │         │  every 5s while      │
│  Unblocks ready      │         │  processing a job    │
│  jobs on completion  │         │                      │
└──────────────────────┘         └──────────────────────┘
    Source of truth                    Any machine
```
Both modes use the same `JobSource` interface, so the worker code is identical regardless of whether it talks to a local store or a remote scheduler.
---
## Getting Started
### Prerequisites
- Go 1.22+
- Docker (for PostgreSQL only)
### Installation
```bash
git clone https://github.com/lrdinsu/workron.git
cd workron
go mod tidy
```
### Standalone Mode
Run the scheduler and workers in a single process:
```bash
# In-memory store
make run-standalone
# With SQLite persistence
make run-standalone-sqlite
# With PostgreSQL persistence (requires: make run-postgres)
make run-standalone-postgres
```
### Distributed Mode
Start the scheduler and workers separately:
```bash
# Terminal 1: start the scheduler (with SQLite)
make run-scheduler-sqlite
# Terminal 2: start remote workers
make run-worker
# Terminal 3: submit jobs
curl -X POST http://localhost:8080/jobs -d '{"command":"echo hello"}'
```
### PostgreSQL Setup
```bash
# Copy the example env file and adjust credentials if needed
cp .env.example .env
# Start PostgreSQL via Docker Compose
make run-postgres
# Run PostgreSQL compliance tests
make test-postgres
# Stop PostgreSQL
make stop-postgres
```
### CLI Flags
**Scheduler** (`cmd/scheduler`)
| Flag | Default | Description |
|------|---------|-------------|
| `--mode` | `scheduler` | `scheduler` (HTTP API only) or `standalone` (API + local workers) |
| `--port` | `8080` | Port for the REST API |
| `--workers` | `3` | Number of local workers (standalone mode only) |
| `--db-driver` | `memory` | Storage backend: `memory`, `sqlite`, `postgres` |
| `--db-url` | `""` | Database connection string (SQLite file path or PostgreSQL URL) |
**Worker** (`cmd/worker`)
| Flag | Default | Description |
|------|---------|-------------|
| `--scheduler` | `http://localhost:8080` | Scheduler base URL |
| `--workers` | `3` | Number of concurrent worker goroutines |
| `--worker-id` | auto-generated UUID | Worker identifier; passed to `GET /jobs/next?worker_id=` to claim gang-reserved tasks |
---
## API
| Method | Endpoint | Description |
|--------|----------|-------------|
| `POST` | `/jobs` | Submit a job, or a gang when `gang_size > 1` (with optional dependencies, resources, priority) |
| `GET` | `/jobs` | List all jobs |
| `GET` | `/jobs/{id}` | Get job status |
| `GET` | `/jobs/next` | Claim next pending job; pass `?worker_id=` to also claim gang-reserved tasks for that worker |
| `POST` | `/jobs/{id}/done` | Report job completed |
| `POST` | `/jobs/{id}/fail` | Report job failed (scheduler decides retry vs permanent failure; triggers gang drain when siblings are running) |
| `POST` | `/jobs/{id}/heartbeat` | Worker heartbeat for running or preempting job; response carries `{"action":"preempt","preemption_epoch":N}` when the scheduler wants the worker to stop |
| `POST` | `/jobs/{id}/preempted` | Worker acknowledgement that a preempted process has exited (requires `?epoch=N` matching current `PreemptionEpoch`) |
| `POST` | `/jobs/{id}/checkpoint` | Upload opaque checkpoint bytes for a preempting job (requires `?epoch=N`) |
| `GET` | `/jobs/{id}/checkpoint` | Fetch last stored checkpoint bytes for a job (204 if none) |
| `GET` | `/gangs/{gang_id}` | List all tasks belonging to a gang |
| `POST` | `/workers/register` | Register a worker with resource capacity |
| `POST` | `/workers/{id}/heartbeat` | Worker liveness heartbeat |
| `GET` | `/workers` | List all registered workers |
| `GET` | `/health` | Health check (instance ID, uptime, status) |
| `GET` | `/metrics` | Prometheus metrics |
### Job endpoints
#### Submit a job
```bash
curl -X POST http://localhost:8080/jobs \
-H "Content-Type: application/json" \
-d '{"command": "echo hello"}'
```
Response (`201 Created`):
```json
{
"id": "job-1",
"command": "echo hello",
"status": "pending",
"created_at": "2026-03-13T12:00:00Z",
"max_retries": 3,
"attempts": 0
}
```
#### Submit a job with dependencies
```bash
curl -X POST http://localhost:8080/jobs \
-H "Content-Type: application/json" \
-d '{"command": "echo step2", "depends_on": ["job-1"]}'
```
Response (`201 Created`):
```json
{
"id": "job-2",
"command": "echo step2",
"status": "blocked",
"created_at": "2026-03-13T12:00:01Z",
"max_retries": 3,
"attempts": 0,
"depends_on": ["job-1"]
}
```
The job starts as `blocked` and transitions to `pending` automatically once all dependencies reach `done`. Returns `400` if any dependency ID does not exist or if the dependency graph would contain a cycle.
#### Submit a job with resource requirements
```bash
curl -X POST http://localhost:8080/jobs \
-H "Content-Type: application/json" \
-d '{"command": "python train.py", "resources": {"vram_mb": 16384, "memory_mb": 32768}, "priority": 5, "queue_name": "training"}'
```
Jobs can declare VRAM and memory requirements, a priority level (higher is more important), and a queue name. These fields are stored and returned on the job, ready for resource-aware scheduling in a future PR.
#### Submit a gang (multi-worker job)
```bash
curl -X POST http://localhost:8080/jobs \
-H "Content-Type: application/json" \
-d '{"command": "torchrun train.py", "gang_size": 4, "resources": {"vram_mb": 16384}}'
```
Response (`201 Created`):
```json
{
"gang_id": "gang-a1b2c3",
"tasks": ["job-1", "job-2", "job-3", "job-4"]
}
```
All four tasks start as `blocked`. A background admission cycle places them on four distinct workers with enough VRAM each, transitions them to `reserved`, and each worker claims its pre-assigned task via `GET /jobs/next?worker_id=`. At claim time, the scheduler injects `GANG_ID`, `GANG_SIZE`, `GANG_INDEX`, and `GANG_PEERS` environment variables so the worker processes can form their collective communication ring.
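A task process could pick these variables up roughly as follows. This is a sketch under stated assumptions: the README does not specify the format of `GANG_PEERS`, so the comma-separated list used here is a guess, and `GangInfo` and `gangFromEnv` are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// GangInfo holds the values the scheduler injects at claim time.
type GangInfo struct {
	ID    string
	Size  int
	Index int
	Peers []string
}

// gangFromEnv reads the GANG_* variables through lookup, which is
// os.Getenv in a real task and a map-backed function in tests.
// ASSUMPTION: GANG_PEERS is a comma-separated list; the actual format
// is not documented here and may differ.
func gangFromEnv(lookup func(string) string) (GangInfo, error) {
	size, err := strconv.Atoi(lookup("GANG_SIZE"))
	if err != nil {
		return GangInfo{}, fmt.Errorf("GANG_SIZE: %w", err)
	}
	index, err := strconv.Atoi(lookup("GANG_INDEX"))
	if err != nil {
		return GangInfo{}, fmt.Errorf("GANG_INDEX: %w", err)
	}
	info := GangInfo{ID: lookup("GANG_ID"), Size: size, Index: index}
	if peers := lookup("GANG_PEERS"); peers != "" {
		info.Peers = strings.Split(peers, ",")
	}
	return info, nil
}

func main() {
	info, err := gangFromEnv(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "not running as a gang task:", err)
		os.Exit(1)
	}
	fmt.Printf("task %d of %d in gang %s, peers: %v\n",
		info.Index, info.Size, info.ID, info.Peers)
}
```

Passing the lookup function in, rather than calling `os.Getenv` directly, keeps the parsing testable without touching the process environment.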
#### Submit a pipeline
```bash
JOB1=$(curl -s -X POST http://localhost:8080/jobs \
-d '{"command":"echo step1"}' | jq -r .id)
JOB2=$(curl -s -X POST http://localhost:8080/jobs \
-d "{\"command\":\"echo step2\", \"depends_on\":[\"$JOB1\"]}" | jq -r .id)
JOB3=$(curl -s -X POST http://localhost:8080/jobs \
-d "{\"command\":\"echo step3\", \"depends_on\":[\"$JOB2\"]}" | jq -r .id)
# step1 runs immediately, step2 waits for step1, step3 waits for step2
```
#### Other job endpoints
```bash
curl http://localhost:8080/jobs/{id} # Get job status
curl http://localhost:8080/jobs # List all jobs
curl http://localhost:8080/jobs/next # Claim next job (200 or 204 No Content)
curl -X POST http://localhost:8080/jobs/{id}/done # Report done
curl -X POST http://localhost:8080/jobs/{id}/fail # Report failed
curl -X POST http://localhost:8080/jobs/{id}/heartbeat # Send heartbeat
```
### Worker endpoints
#### Register a worker
```bash
curl -X POST http://localhost:8080/workers/register \
-H "Content-Type: application/json" \
-d '{"id": "worker-1", "exec_addr": "192.168.1.10:9000", "resources": {"vram_mb": 24576, "memory_mb": 65536}, "tags": ["gpu", "a100"]}'
```
Response (`201 Created`):
```json
{
"id": "worker-1",
"exec_addr": "192.168.1.10:9000",
"resources": {"vram_mb": 24576, "memory_mb": 65536},
"tags": ["gpu", "a100"],
"status": "active",
"last_heartbeat": "2026-04-05T12:00:00Z",
"registered_at": "2026-04-05T12:00:00Z"
}
```
#### Worker heartbeat
```bash
curl -X POST http://localhost:8080/workers/worker-1/heartbeat
```
Returns `200` on success, or `404` if the worker is not found.
#### List workers
```bash
curl http://localhost:8080/workers
```
Returns all registered workers with their current status and resource capacity.
### Health and metrics
#### Health check
```bash
curl http://localhost:8080/health
```
Response (`200 OK`):
```json
{
"instance_id": "a1b2c3d4",
"uptime": "2h15m30s",
"status": "ok"
}
```
Each scheduler instance generates a unique short ID at startup. Useful for load balancer health checks and identifying which instance you're talking to in multi-scheduler deployments.
#### Prometheus metrics
```bash
curl http://localhost:8080/metrics
```
Returns Prometheus-compatible metrics including `workron_jobs_submitted_total`, `workron_jobs_completed_total`, `workron_job_execution_duration_seconds`, `workron_jobs_pending`, `workron_reaper_leader`, and more.
---
## Job Dependencies (DAG)
Jobs can declare dependencies on other jobs using the `depends_on` field. This creates a directed acyclic graph (DAG) where downstream jobs only execute after all their upstream dependencies complete.
**How it works:**
- A job with `depends_on` starts in `blocked` status instead of `pending`
- When a job completes (`done`), the scheduler checks all `blocked` jobs and transitions any whose dependencies are fully satisfied to `pending`
- At submission time, the scheduler validates that all referenced job IDs exist and that the new dependency would not create a cycle (using DFS-based cycle detection)
- The reaper also checks for unblockable jobs on each tick as a safety net
**What happens when a dependency fails?**
Currently, if a dependency fails permanently, downstream jobs remain `blocked` indefinitely. This is a known limitation, and a future improvement would cascade the failure or provide a way to manually unblock or cancel downstream jobs.
---
## Gang Scheduling
Distributed training jobs (PyTorch with NCCL, JAX on TPUs, and similar collective-communication workloads) need N workers starting simultaneously. If only some start, the rest hang waiting for peers that never join. Workron solves this with reservation-based gang scheduling.
**How it works:**
- Submitting a job with `gang_size > 1` creates N tasks sharing a `gang_id`. All N start in `blocked` status. No worker can claim them through the normal path.
- A background admission cycle runs every 5 seconds. On each tick it:
  1. Rolls back reservations older than 30 seconds (workers that died before claiming).
  2. Finds gangs where every task is `blocked`.
  3. Sorts candidates by size (largest first), then priority, then submission time.
  4. Computes per-worker available capacity, counting both `running` and `reserved` jobs as used.
  5. Atomically transitions each gang's tasks to `reserved`, one per worker, and subtracts the placed capacity before considering the next gang.
- Workers claim their assigned task by passing `?worker_id=` to `GET /jobs/next`. The scheduler returns only tasks reserved for that specific worker.
- At claim time, the scheduler injects `GANG_ID`, `GANG_SIZE`, `GANG_INDEX`, and `GANG_PEERS` environment variables into the job's env map, so worker processes can discover each other and form their communication ring.
- When a gang task fails while any sibling is still running, `PreemptGang` enters coordinated drain (see the [Gang Preemption](#gang-preemption) section below). When all siblings are already blocked/reserved/pending, the legacy `FailGang` path propagates the failure to those siblings directly.
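The ordering and placement steps of the admission cycle can be sketched like this. It is a simplified in-memory model under assumptions: only VRAM is tracked, `FreeMB` is taken to be capacity already net of `running` and `reserved` jobs, and the `Gang`, `Worker`, and `admit` names are invented for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// Gang is a candidate whose tasks are all blocked.
type Gang struct {
	ID        string
	Size      int // number of tasks; each needs its own worker
	Priority  int // higher is more important
	Submitted int // submission order; lower is earlier
	VRAMMB    int // per-task VRAM requirement
}

// Worker carries remaining capacity after running and reserved jobs.
type Worker struct {
	ID     string
	FreeMB int
}

// admit places gangs all-or-nothing and returns gang ID -> worker IDs.
// It reorders gangs and decrements workers' FreeMB in place, so capacity
// taken by one gang is unavailable to the next.
func admit(gangs []Gang, workers []Worker) map[string][]string {
	sort.SliceStable(gangs, func(i, j int) bool {
		a, b := gangs[i], gangs[j]
		if a.Size != b.Size {
			return a.Size > b.Size // largest first
		}
		if a.Priority != b.Priority {
			return a.Priority > b.Priority
		}
		return a.Submitted < b.Submitted
	})
	placed := make(map[string][]string)
	for _, g := range gangs {
		var picks []int
		for i := range workers {
			if workers[i].FreeMB >= g.VRAMMB {
				picks = append(picks, i)
				if len(picks) == g.Size {
					break
				}
			}
		}
		if len(picks) < g.Size {
			continue // not enough distinct workers: reserve nothing
		}
		for _, i := range picks {
			workers[i].FreeMB -= g.VRAMMB
			placed[g.ID] = append(placed[g.ID], workers[i].ID)
		}
	}
	return placed
}

func main() {
	workers := []Worker{{"w1", 24576}, {"w2", 24576}, {"w3", 24576}}
	gangs := []Gang{
		{ID: "small", Size: 2, Priority: 9, Submitted: 1, VRAMMB: 16384},
		{ID: "big", Size: 3, Priority: 1, Submitted: 2, VRAMMB: 16384},
	}
	fmt.Println(admit(gangs, workers))
}
```

In this run the three-task gang is placed first because size outranks priority in the sort, and the two-task gang then waits for a later tick because no worker has 16384 MB left.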
**Multi-instance safety:** the admission cycle uses a separate PostgreSQL advisory lock (different lock ID from the reaper), so only one scheduler instance runs placement at a time. The two coordination loops never block each other.
---
## Gang Preemption
A gang task failing or its worker dying mid-run used to leave siblings running forever, with no path back to a re-admissible state. Preemption closes that gap by draining the whole gang as a unit.
**How it works:**
- `POST /jobs/{id}/fail` on a gang task (or the reaper detecting a stale heartbeat) calls `PreemptGang(gangID, triggerJobID)`. Under one transaction:
  - Every `running` sibling moves to `preempting`, with a bumped `PreemptionEpoch` and a `DrainStartedAt` timestamp. The trigger task keeps its claim-time `Attempts` increment (real failure); innocent siblings have `Attempts` decremented by 1 so scheduler-driven preemption does not burn their retry budget.
  - Reserved and pending siblings skip `preempting` and go straight to `blocked`, since they have no running process to signal.
  - Blocked siblings stay blocked.
- Each worker's next heartbeat returns JSON `{"action":"preempt","preemption_epoch":N}`. The worker signals its child with SIGTERM and waits up to 15 seconds for a clean exit; if the process ignores SIGTERM, the executor escalates to SIGKILL.
- Workers can optionally `POST /jobs/{id}/checkpoint?epoch=N` with opaque bytes while still in `preempting`. The payload is stored on the job row and surfaced as `CHECKPOINT_DATA` (base64) in the env map on the next claim.
- After its process exits, the worker calls `POST /jobs/{id}/preempted?epoch=N`. The scheduler validates the epoch, transitions the task from `preempting` to `preempted`, and the worker moves on.
- A reaper pass finishes each drain:
  1. Any task stuck in `preempting` past 45 seconds (the grace window plus slack) is force-drained by calling `ForceDrainPreempting`. This covers workers that die mid-drain and will never ack.
  2. For every gang with no task still in `running` or `preempting`, `CompletePreemption` normalizes the verdict: the gang goes to `blocked` if retries remain, or to `failed` if any task has `Attempts >= MaxRetries` or is already `done`/`failed` (a partially completed gang cannot be cleanly restarted as a whole).
- A gang returned to `blocked` is picked up by the next admission tick and re-placed on fresh workers.
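The state transitions and the final verdict can be modelled in a few lines. The sketch below is an in-memory approximation, not the store code: `preemptGang` and `gangVerdict` are hypothetical stand-ins for `PreemptGang` and `CompletePreemption`, and treating the trigger task like any other running sibling (apart from the retry refund) is an assumption.

```go
package main

import "fmt"

// Task is a pared-down gang task record.
type Task struct {
	ID              string
	Status          string
	Attempts        int
	MaxRetries      int
	PreemptionEpoch int
}

// preemptGang applies the drain transitions in place.
// ASSUMPTION: a running trigger task also moves to preempting; only its
// Attempts handling differs from its siblings.
func preemptGang(tasks []*Task, triggerID string) {
	for _, t := range tasks {
		switch t.Status {
		case "running":
			t.Status = "preempting"
			t.PreemptionEpoch++
			if t.ID != triggerID && t.Attempts > 0 {
				t.Attempts-- // refund the innocent sibling's claim-time increment
			}
		case "reserved", "pending":
			t.Status = "blocked" // no process to signal
		}
	}
}

// gangVerdict reports whether the drain has finished and, if so, whether
// the gang should return to blocked or be marked failed.
func gangVerdict(tasks []*Task) (verdict string, drained bool) {
	for _, t := range tasks {
		if t.Status == "running" || t.Status == "preempting" {
			return "", false
		}
	}
	for _, t := range tasks {
		if t.Attempts >= t.MaxRetries || t.Status == "done" || t.Status == "failed" {
			return "failed", true
		}
	}
	return "blocked", true
}

func main() {
	tasks := []*Task{
		{ID: "t0", Status: "running", Attempts: 1, MaxRetries: 3},
		{ID: "t1", Status: "running", Attempts: 1, MaxRetries: 3},
		{ID: "t2", Status: "reserved", MaxRetries: 3},
	}
	preemptGang(tasks, "t0")
	for _, t := range tasks {
		fmt.Printf("%s: %s attempts=%d epoch=%d\n", t.ID, t.Status, t.Attempts, t.PreemptionEpoch)
	}
}
```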
**The failure case that used to be stuck:** a gang of three is reserved on workers A/B/C; A and B claim and start running; C dies before claiming. Before preemption, the tasks on A and B ran forever waiting for a peer that never joined. Now the loss of worker C triggers the full drain: the tasks on A and B receive the preempt signal on their next heartbeat and exit cleanly, and the gang is re-admitted with three fresh worker assignments.
**Why not just send a cancel RPC:** heartbeats already exist, workers already poll every 5 seconds, and reusing that channel means workers only ever initiate connections to the scheduler, so nothing has to reach a worker through its firewall. A worker learns about preemption within one 5-second heartbeat interval, and the 15-second SIGTERM grace window that follows still fits comfortably inside the 45-second force-drain timeout, so the child has time to exit cleanly before the hard-kill fallback.
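On the worker side, reacting to a preempt response could look roughly like this. The JSON shape comes from the heartbeat endpoint description; everything else (`shouldPreempt`, `stopWithGrace`, treating an empty body as "keep running") is an assumption for illustration, and the signal handling only applies to Unix-like systems.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// heartbeatReply mirrors the documented heartbeat response body.
type heartbeatReply struct {
	Action          string `json:"action"`
	PreemptionEpoch int    `json:"preemption_epoch"`
}

// shouldPreempt decodes a heartbeat response.
// ASSUMPTION: an empty body means "no action requested".
func shouldPreempt(body []byte) (epoch int, preempt bool, err error) {
	if len(body) == 0 {
		return 0, false, nil
	}
	var r heartbeatReply
	if err := json.Unmarshal(body, &r); err != nil {
		return 0, false, err
	}
	return r.PreemptionEpoch, r.Action == "preempt", nil
}

// stopWithGrace sends SIGTERM to a started command, waits up to grace,
// and falls back to SIGKILL. It reports whether the child exited within
// the grace window.
func stopWithGrace(cmd *exec.Cmd, grace time.Duration) bool {
	exited := make(chan struct{})
	go func() {
		cmd.Wait() // reaps the child; exit status is ignored in this sketch
		close(exited)
	}()
	cmd.Process.Signal(syscall.SIGTERM)
	select {
	case <-exited:
		return true
	case <-time.After(grace):
		cmd.Process.Kill() // SIGKILL for children that ignore SIGTERM
		<-exited
		return false
	}
}

func main() {
	cmd := exec.Command("sleep", "60")
	if err := cmd.Start(); err != nil {
		fmt.Println("start failed:", err)
		return
	}
	epoch, preempt, err := shouldPreempt([]byte(`{"action":"preempt","preemption_epoch":3}`))
	if err != nil {
		fmt.Println("bad heartbeat response:", err)
		return
	}
	if preempt {
		clean := stopWithGrace(cmd, 15*time.Second)
		fmt.Printf("epoch %d drained, clean exit: %v\n", epoch, clean)
	}
}
```

After `stopWithGrace` returns, a worker would acknowledge with `POST /jobs/{id}/preempted?epoch=N`, using the epoch it received in the heartbeat response.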
---
## Failure Detection
When a worker crashes mid-job, the scheduler detects it through missing heartbeats. A background reaper goroutine runs every 10 seconds and checks all running jobs:
- If a job's last heartbeat is older than 30 seconds (or was never set), the worker is assumed dead
- If the job has retries remaining, it is re-queued as `pending` for another worker to pick up
- If retries are exhausted, the job is marked as permanently `failed`
This ensures no job gets stuck in `running` forever, even if a worker process is killed without warning.
The reaper also checks worker heartbeats. Workers that haven't sent a heartbeat within 60 seconds are marked as `offline`. This is separate from job heartbeats: a worker might be healthy but a specific job process could have hung, or a worker might go down entirely. Both cases are detected and handled.
---
## Observability
Workron provides three layers of observability:
**Structured logging** (`log/slog`): all log output is JSON with typed fields (`job_id`, `worker_id`, `request_id`, `attempt`, `error`). Log levels distinguish normal events (Info), unusual events like reaper actions (Warn), and failures (Error). Each component receives its logger via dependency injection. Workers use `logger.With("worker_id", id)` so every log line automatically identifies the worker.
**Prometheus metrics** (`GET /metrics`): counters track job lifecycle events (`workron_jobs_submitted_total`, `workron_jobs_claimed_total`, `workron_jobs_completed_total`, `workron_jobs_failed_total`, `workron_jobs_retried_total`, `workron_jobs_reaped_total`). Gang preemption has its own counters (`workron_gangs_preempted_total`, `workron_gang_preemptions_force_drained_total`, `workron_gang_preemptions_completed_total{outcome="blocked"|"failed"}`) and a histogram (`workron_gang_preemption_drain_seconds`) that tracks drain-start to completion latency. Histograms also track execution duration and queue wait time. Gauges report current queue state (`workron_jobs_pending`, `workron_jobs_running`, `workron_jobs_blocked`) via a custom `prometheus.Collector` that queries the store on each scrape. The `workron_reaper_leader` gauge indicates whether this instance is the active reaper leader (1) or a follower (0), useful for monitoring multi-scheduler deployments.
**Request ID tracing**: every HTTP request receives a UUID, set as the `X-Request-ID` response header and included in all log lines for that request. A request-scoped child logger is created in `ServeHTTP` and propagated to handlers via context, so `request_id` appears automatically without manual threading.
---
## Persistence
Workron supports three storage backends, selectable at startup via `--db-driver`:
**In-memory store** (default): jobs live only as long as the process runs. Fast, zero dependencies, ideal for development and testing.
**SQLite store** (`--db-driver=sqlite --db-url=workron.db`): jobs persist to a single file on disk. The scheduler can crash and restart without losing any job state. Uses WAL mode for write performance and a single-connection pool to avoid SQLite's write lock contention. Uses `modernc.org/sqlite` (pure Go, no CGo).
**PostgreSQL store** (`--db-driver=postgres --db-url=postgres://...`): jobs persist to PostgreSQL with a configurable connection pool (default 10 connections). Uses `FOR UPDATE SKIP LOCKED` for atomic job claiming, allowing multiple scheduler instances to claim different jobs concurrently without blocking each other. The reaper uses `pg_try_advisory_xact_lock` so only one instance runs heartbeat timeout detection at a time. Dependencies are stored as JSONB and queried with `jsonb_array_elements_text()`. Uses `pgx/v5` for native PostgreSQL support.
All three backends implement the same `JobStore` interface. The server, workers, and reaper are unaware of which store they're using.
---
## Running on Kubernetes
Workron ships with a kustomize-managed Kubernetes deployment for [kind](https://kind.sigs.k8s.io/) that runs the full stack (multi-replica scheduler, worker pool, in-cluster Postgres) on a local single-node cluster. The same manifests are the starting point for cloud deployment; only the overlay changes.
### Prerequisites
- Docker (or any compatible engine, such as OrbStack or Colima)
- `kind` (`brew install kind`)
- `kubectl`, `jq`, and a POSIX-y `bash` for the demo script
### Bring it up
```bash
make k8s-up # creates the kind cluster, builds & loads images, applies manifests, waits for Ready
```
After ~2 minutes you should see:
```bash
kubectl get pods -n workron
# NAME READY STATUS RESTARTS AGE
# workron-postgres-0 1/1 Running 0 90s
# workron-scheduler-xxxxxxxxxx-aaaaa 1/1 Running 0 90s
# workron-scheduler-xxxxxxxxxx-bbbbb 1/1 Running 0 90s
# workron-worker-yyyyyyyyyy-ccccc 1/1 Running 0 90s
# workron-worker-yyyyyyyyyy-ddddd 1/1 Running 0 90s
# workron-worker-yyyyyyyyyy-eeeee 1/1 Running 0 90s
curl http://localhost:30080/healthz
# 200 OK
curl http://localhost:30080/health | jq
# { "instance_id": "...", "uptime": "1m12s", "status": "ok" }
```
The scheduler's NodePort Service is mapped to host port `30080` via the kind config, so `localhost:30080` reaches the API directly.
### Run the gang-preemption demo
```bash
make k8s-demo
```
This script (`scripts/k8s-demo.sh`) submits a 3-task gang of `demo:sleep 60` workloads, waits for in-cluster worker pods to claim them, fails one task to trigger gang preemption, watches the siblings drain through `preempting` → `preempted` while emitting synthetic checkpoints, then waits for re-admission. The "resume proof" stage greps the worker pod logs for the `"checkpoint_data_present":true` line that confirms the next claim received the previously-saved bytes via the `CHECKPOINT_DATA` env var.

End-to-end runtime: ~50 seconds. The demo is driven by real worker Deployment pods polling the scheduler; `curl` is only used to submit the gang, trigger the failure, and observe state, never to claim jobs on the workers' behalf.
### Recommended terminal layout for recording
For a clean view of what the system is doing, split into four panes:
```
┌─────────────────────────────┬─────────────────────────────┐
│ kubectl get pods -n workron │ make k8s-logs               │
│   -w                        │ (both schedulers tail)      │
├─────────────────────────────┼─────────────────────────────┤
│ kubectl logs -l             │ make k8s-demo               │
│   app=workron-worker -f     │                             │
│   -c worker --prefix=true   │                             │
└─────────────────────────────┴─────────────────────────────┘
```
The bottom-right pane drives, the other three observe.
### What's actually running in-cluster
- **Postgres**: a `StatefulSet` with a `PersistentVolumeClaim` for durable demo data. This is a local demo database, not production-grade. For a real deployment you would swap it for managed Postgres (RDS, Cloud SQL, or similar) and store the connection string in your platform's secret manager. The credentials in `deploy/k8s/overlays/local/` are deliberately hardcoded dev values; replace them via External Secrets Operator, sealed-secrets, or SOPS for any environment that isn't your laptop.
- **Scheduler**: a 2-replica `Deployment` with `/healthz` liveness and `/readyz` readiness probes. The two replicas coordinate through `pg_try_advisory_xact_lock`: only one runs the reaper or admits a gang per tick, the other handles HTTP traffic. `kubectl logs -l app=workron-scheduler` shows both replicas interleaved; you can watch the lock holder change over time.
- **Workers**: a 3-replica `Deployment`. Each worker uses its pod name as its registered worker ID (via `metadata.name` from the Downward API) so logs and the `/workers` endpoint correlate cleanly. The worker has no HTTP listener, so probes are intentionally absent; the scheduler's worker-heartbeat ledger is the source of truth for liveness, not Kubernetes pod state.
- **Init containers**: scheduler pods block on `pg_isready`, worker pods block on the scheduler's `/healthz`. This is what eliminates the startup race that would otherwise cause `RESTARTS=3` while pods waited for their dependency to come up.
### Tear it down
```bash
make k8s-down # deletes the kind cluster, including the local PV
```
`kind delete cluster` removes everything, so don't store anything you care about in the demo Postgres. If you want to keep state across `make k8s-up` cycles, leave the cluster up between runs.
### Cloud deployment
The base manifests under `deploy/k8s/base/` are cloud-agnostic. The `deploy/k8s/overlays/local/` overlay carries everything that's specific to a kind environment: the NodePort Service, the dev Secret, image tags pinned to `:dev`. For EKS or GKE you would write a sibling overlay (`deploy/k8s/overlays/aws/` or similar) that:
- Replaces the dev Secret with an External Secrets Operator `ExternalSecret` referencing AWS Secrets Manager / Google Secret Manager.
- Removes the in-cluster Postgres `StatefulSet` and points the scheduler at a managed database (RDS/Cloud SQL).
- Swaps the NodePort Service for a `LoadBalancer` (or keeps it ClusterIP behind an Ingress + cert-manager).
- Pulls images from a real registry (ECR/GCR) instead of `kind load docker-image`.
None of this requires changes to the Go code or the base manifests. It's pure overlay work, which is the whole reason the deployment is structured as base + overlay rather than a single rendered manifest. HPA on a custom queue-depth metric, Helm chart packaging, and Prometheus Operator integration are explicit non-goals for v1; each can land later as separate small PRs.
---
## Project Structure
```
workron/
├── cmd/
│   ├── scheduler/
│   │   ├── main.go       # Scheduler entry point (standalone or distributed)
│   │   └── Dockerfile    # Multi-stage build, distroless runtime
│   └── worker/
│       ├── main.go       # Standalone worker entry point
│       └── Dockerfile    # Multi-stage build, alpine runtime
├── internal/
│   ├── metrics/
│   │   ├── metrics.go        # Prometheus counters, histograms, registration
│   │   ├── collector.go      # Custom gauge collector (queries store on scrape)
│   │   └── metrics_test.go
│   ├── store/
│   │   ├── store.go            # JobStore, WorkerStore, GangStore, Pinger interfaces, Job/Worker structs
│   │   ├── memory.go           # In-memory store implementation
│   │   ├── memory_test.go
│   │   ├── sqlite.go           # SQLite store implementation
│   │   ├── sqlite_test.go
│   │   ├── postgres.go         # PostgreSQL store implementation
│   │   ├── postgres_test.go    # PG tests (build tag: postgres)
│   │   ├── store_test.go       # Shared compliance tests for all backends
│   │   ├── dag.go              # Cycle detection + dependency validation
│   │   └── dag_test.go
│   ├── scheduler/
│   │   ├── server.go                   # HTTP handlers (jobs, gangs, workers, health, healthz, readyz)
│   │   ├── server_test.go
│   │   ├── gang.go                     # Gang admission cycle + placement logic
│   │   ├── gang_test.go                # Unit tests for placement and capacity
│   │   ├── gang_integration_test.go    # End-to-end gang lifecycle tests
│   │   ├── reaper.go                   # Background heartbeat timeout checker (gang-aware)
│   │   └── reaper_test.go
│   └── worker/
│       ├── worker.go           # Poll and execute loop, demo: prefix handling, checkpoint emission on preempt
│       ├── worker_test.go
│       ├── executor.go         # Runs shell commands via os/exec (context-cancelable, env injection)
│       ├── executor_test.go
│       ├── client.go           # HTTP client for talking to scheduler
│       └── client_test.go
├── deploy/
│   ├── kind-config.yaml    # kind cluster definition (NodePort 30080 host mapping)
│   └── k8s/
│       ├── base/           # Cloud-agnostic Kustomize base
│       │   ├── kustomization.yaml
│       │   ├── namespace.yaml
│       │   ├── postgres-secret.yaml
│       │   ├── postgres-service.yaml        # Headless + ClusterIP
│       │   ├── postgres-statefulset.yaml    # 1 replica, PVC, pg_isready probe
│       │   ├── scheduler-configmap.yaml
│       │   ├── scheduler-deployment.yaml    # 2 replicas, /healthz + /readyz, init container waits for Postgres
│       │   ├── worker-configmap.yaml
│       │   └── worker-deployment.yaml       # 3 replicas, init container waits for scheduler
│       └── overlays/
│           └── local/      # kind-specific overlay
│               ├── kustomization.yaml
│               ├── postgres-secret-patch.yaml      # Dev password
│               └── scheduler-service-patch.yaml    # NodePort 30080
├── scripts/
│   └── k8s-demo.sh    # Gang-preemption + checkpoint demo against the kind cluster
├── docs/
│   ├── k8s-demo.cast    # asciinema recording of the demo
│   └── k8s-demo.gif     # Rendered GIF embedded in the README
├── docker-compose.yml    # Local PostgreSQL for development
├── .env.example          # Environment variable template
├── .dockerignore         # Build-context exclusions for the Dockerfiles
├── Makefile              # Includes k8s-up / k8s-down / k8s-demo / k8s-logs / k8s-build / k8s-load
├── .gitignore
├── go.mod
├── go.sum
└── README.md
```
---
## Tech Stack
| Component | Choice |
|-----------|--------|
| Language | Go 1.26 |
| HTTP | `net/http` (stdlib only) |
| Job execution | `os/exec` (stdlib) |
| Logging | `log/slog` (stdlib, JSON output) |
| Metrics | `prometheus/client_golang` |
| Storage | In-memory, SQLite, or PostgreSQL |
| SQLite driver | `modernc.org/sqlite` (pure Go, no CGo) |
| PostgreSQL driver | `jackc/pgx/v5` (native pgxpool, no database/sql) |
| ID generation | `google/uuid` (UUID v4) |
| Local dev | Docker Compose (PostgreSQL 16) |
| Container images | Multi-stage; distroless runtime (scheduler), alpine runtime (worker) |
| Kubernetes | kind, kustomize (base + overlay), `pg_isready` / `/healthz` init containers, NodePort access |
---
## Key Technical Decisions
**PostgreSQL advisory locks over Raft for reaper coordination.** When multiple scheduler instances share a database, only one should run the reaper (heartbeat timeout scan) at a time. Instead of adding a full consensus protocol like Raft, the reaper tries to acquire a transaction-scoped advisory lock (`pg_try_advisory_xact_lock`) on each tick. If another instance holds it, the tick is skipped. If the lock holder crashes, PostgreSQL automatically releases the lock when the connection drops. This gives leader election for free: no external coordination service, no Raft state machine, and no split-brain risk. PostgreSQL is already the shared coordination layer, so using it for this is simpler and sufficient.
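The per-tick lock attempt described above might look roughly like the following. This is a sketch under stated assumptions rather than the project's code: it uses the standard library's `database/sql` instead of the native pgxpool the project uses (to stay dependency-free), and `reaperLockKey` and `runExclusive` are hypothetical names.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// reaperLockKey is a hypothetical, application-chosen advisory lock key.
const reaperLockKey int64 = 42

// tryLockSQL returns true if this transaction obtained the lock and false
// immediately (without waiting) if another transaction holds it.
const tryLockSQL = `SELECT pg_try_advisory_xact_lock($1)`

// runExclusive runs fn inside a transaction only if this instance wins the
// advisory lock. The lock is released when the transaction ends, so there
// is no explicit unlock call.
func runExclusive(ctx context.Context, db *sql.DB, key int64, fn func(*sql.Tx) error) (bool, error) {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return false, err
	}
	// Rollback after a successful Commit is a harmless no-op; on every
	// other path it ends the transaction and frees the lock.
	defer tx.Rollback()

	var acquired bool
	if err := tx.QueryRowContext(ctx, tryLockSQL, key).Scan(&acquired); err != nil {
		return false, err
	}
	if !acquired {
		return false, nil // another instance owns this tick; skip it
	}
	if err := fn(tx); err != nil {
		return false, err
	}
	return true, tx.Commit()
}

func main() {
	// No database driver is registered in this sketch, so main only prints
	// the statement. With a real *sql.DB, a reaper tick would call
	// runExclusive(ctx, db, reaperLockKey, reapFn).
	fmt.Println(tryLockSQL, "key:", reaperLockKey)
}
```

Because the work runs inside the same transaction that holds the lock, a tick that fails partway rolls back and frees the lock in one step.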
**`FOR UPDATE SKIP LOCKED` over application-level locking.** Multiple workers (or multiple schedulers) claiming jobs concurrently need atomicity. The naive approach, reading a pending job and then updating it, leaves a race window in which two workers can read the same job. Application-level distributed locks (Redis, ZooKeeper) add operational complexity. PostgreSQL's `FOR UPDATE SKIP LOCKED` solves this at the database level: when one transaction locks a row, other transactions skip it and move to the next row. Claimers never block on each other's rows, there are no double-claims, and there are no external dependencies. This is the same pattern used by production job queues like Graphile Worker.
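A claim query in this style could be sketched as below. The table and column names (`jobs`, `status`, `worker_id`, `created_at`) and the `claimNext` helper are assumptions for illustration, and the sketch uses `database/sql` rather than the project's pgx pool; the real schema and query may differ.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
)

// claimSQL picks the oldest pending job that no other transaction has
// locked, marks it running, and returns its ID, all in one statement.
// Schema names here are hypothetical.
const claimSQL = `
UPDATE jobs
SET status = 'running', worker_id = $1
WHERE id = (
    SELECT id
    FROM jobs
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id`

// claimNext returns the claimed job ID, or ok=false when no unlocked
// pending job exists.
func claimNext(ctx context.Context, db *sql.DB, workerID string) (jobID string, ok bool, err error) {
	err = db.QueryRowContext(ctx, claimSQL, workerID).Scan(&jobID)
	if errors.Is(err, sql.ErrNoRows) {
		return "", false, nil // queue empty, or every pending row is locked
	}
	if err != nil {
		return "", false, err
	}
	return jobID, true, nil
}

func main() {
	// No driver is registered in this sketch, so main only prints the query.
	fmt.Println(claimSQL)
}
```

Two workers running this concurrently lock different rows: the second one skips the row the first has locked instead of waiting on it.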
**GPU-aware bin-packing over simple tag filtering.** For ML workload scheduling, simple tag matching ("this worker has a GPU") doesn't prevent over-commitment. If a worker has 24GB VRAM and one job is using 16GB, a naive tag filter would still assign a second 16GB job to the same worker. Resource accounting tracks allocated vs. available capacity per worker, and first-fit-decreasing bin-packing places the largest pending job on the smallest worker that still fits. Over-commitment is a real scheduling problem in ML infrastructure, and this kind of capacity accounting is the same idea behind Kubernetes resource requests.
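The placement rule above can be sketched in a few lines. This is an illustrative version only: the `Worker` and `Job` types, the `place` function, and the single VRAM dimension in whole gigabytes are assumptions, and the project's real placement logic may track more resources.

```go
package main

import (
	"fmt"
	"sort"
)

// Worker tracks total and already-allocated VRAM in GB (hypothetical shape).
type Worker struct {
	ID      string
	TotalGB int
	UsedGB  int
}

// Free reports the capacity that is still unallocated.
func (w Worker) Free() int { return w.TotalGB - w.UsedGB }

// Job declares how much VRAM it needs (hypothetical shape).
type Job struct {
	ID     string
	NeedGB int
}

// place sorts jobs largest-first and gives each one to the worker with the
// least free capacity that still fits it. Jobs that fit nowhere are left
// out of the result. The inputs are copied, not mutated.
func place(jobs []Job, workers []Worker) map[string]string {
	js := append([]Job(nil), jobs...)
	ws := append([]Worker(nil), workers...)
	sort.SliceStable(js, func(a, b int) bool { return js[a].NeedGB > js[b].NeedGB })

	assigned := make(map[string]string)
	for _, j := range js {
		best := -1
		for i := range ws {
			free := ws[i].Free()
			if free < j.NeedGB {
				continue // would over-commit this worker
			}
			if best == -1 || free < ws[best].Free() {
				best = i
			}
		}
		if best == -1 {
			continue // stays pending until capacity frees up
		}
		ws[best].UsedGB += j.NeedGB // account for the allocation
		assigned[j.ID] = ws[best].ID
	}
	return assigned
}

func main() {
	workers := []Worker{
		{ID: "gpu-a", TotalGB: 24, UsedGB: 16}, // 8GB free
		{ID: "gpu-b", TotalGB: 24},             // 24GB free
	}
	jobs := []Job{{ID: "small", NeedGB: 8}, {ID: "big", NeedGB: 16}}
	fmt.Println(place(jobs, workers))
}
```

In the example, the 16GB job cannot land on `gpu-a` even though that worker "has a GPU", which is exactly the over-commitment a tag filter would miss.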
**Pull-based scheduling with server-side selection over push-based assignment.** Workers poll `GET /jobs/next?worker_id=xxx` and the scheduler picks the best job for that worker based on its registered resources. The alternative is for the scheduler to push jobs to workers, which requires the scheduler to track worker availability in real time and handle push failures. Pull-based is simpler: workers ask when they're ready, the scheduler has a consistent view of what's running where (from the database), and there's no push failure mode to handle.
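From the worker's side, one poll of that endpoint might look like the sketch below. Only the `GET /jobs/next?worker_id=...` route comes from the text above; the JSON shape of `Job`, the use of `204 No Content` for an empty queue, and the `fetchNext` and `runDemo` helpers are assumptions, and the scheduler here is a stand-in `httptest` server.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"sync/atomic"
)

// Job is a hypothetical response shape; the real payload may differ.
type Job struct {
	ID      string `json:"id"`
	Command string `json:"command"`
}

// fetchNext asks the scheduler for work. It assumes 204 No Content means
// "nothing to run right now" and returns (nil, nil) in that case.
func fetchNext(ctx context.Context, c *http.Client, baseURL, workerID string) (*Job, error) {
	u := baseURL + "/jobs/next?worker_id=" + url.QueryEscape(workerID)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
	if err != nil {
		return nil, err
	}
	resp, err := c.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusNoContent:
		return nil, nil
	case http.StatusOK:
		var j Job
		if err := json.NewDecoder(resp.Body).Decode(&j); err != nil {
			return nil, err
		}
		return &j, nil
	default:
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
}

// runDemo starts a stand-in scheduler that hands out one job and then
// reports an empty queue, and polls it twice.
func runDemo() (first, second *Job, err error) {
	var served atomic.Bool
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/jobs/next" || r.URL.Query().Get("worker_id") == "" {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		if served.Swap(true) {
			w.WriteHeader(http.StatusNoContent)
			return
		}
		json.NewEncoder(w).Encode(Job{ID: "job-1", Command: "echo hi"})
	}))
	defer srv.Close()

	ctx := context.Background()
	if first, err = fetchNext(ctx, srv.Client(), srv.URL, "worker-a"); err != nil {
		return nil, nil, err
	}
	second, err = fetchNext(ctx, srv.Client(), srv.URL, "worker-a")
	return first, second, err
}

func main() {
	first, second, err := runDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("first poll got job:", first.ID)
	fmt.Println("second poll empty:", second == nil)
}
```

A real worker would wrap `fetchNext` in a loop that sleeps briefly on an empty response, so a failed or empty poll costs nothing beyond the next retry.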
**Transaction-scoped advisory locks over session-scoped.** `pg_try_advisory_xact_lock` releases automatically when the transaction ends, whether by commit or rollback, so the lock is freed even if the application crashes before it could unlock explicitly. Session-scoped locks (`pg_advisory_lock`) persist until explicitly released or the connection closes, and with connection pooling (pgxpool) a connection returned to the pool might carry a stale lock into a different goroutine. Transaction-scoped locks eliminate this entire class of bugs.
---
## Contributing
This is a personal learning project and not yet ready for production use. Feedback and suggestions are welcome; feel free to open an issue.
---
## License
MIT