https://github.com/tensorzero/durable
Durable execution in Postgres
- Host: GitHub
- URL: https://github.com/tensorzero/durable
- Owner: tensorzero
- License: apache-2.0
- Created: 2025-12-01T19:38:30.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2026-01-30T22:53:07.000Z (about 2 months ago)
- Last Synced: 2026-01-31T12:31:16.993Z (about 2 months ago)
- Language: Rust
- Size: 569 KB
- Stars: 3
- Watchers: 0
- Forks: 0
- Open Issues: 12
Metadata Files:
- Readme: README.md
- License: LICENSE
- Notice: NOTICE
# durable
A Rust SDK for building durable, fault-tolerant workflows using PostgreSQL.
This project is derived from [absurd](https://github.com/earendil-works/absurd).
It is experimental software to be used in TensorZero.
Use at your own risk.
## Overview
`durable` enables you to write long-running tasks that can:
- **Checkpoint progress** - Steps are persisted, so tasks resume where they left off after crashes
- **Sleep and wait** - Suspend execution for durations or until specific times
- **Await events** - Pause until external events arrive (with optional timeouts)
- **Retry on failure** - Configurable retry strategies with exponential backoff
- **Scale horizontally** - Multiple workers can process tasks concurrently
Unlike durable execution SDKs that suspend a task by throwing an exception (as is common in Python and TypeScript), this SDK signals suspension through Rust's `Result` type, which keeps the control flow explicit, idiomatic, and type-checked.
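To make the idea of `Result`-based suspension concrete, here is a minimal, self-contained sketch. The `Flow` enum, `sleep_until`, and `run_task` are hypothetical stand-ins invented for illustration, not the crate's actual types or functions. The point is only that a suspension request travels through the `Err` channel and is propagated with `?`, alongside ordinary failures:

```rust
/// Hypothetical stand-in for control-flow signals (not the crate's type).
#[derive(Debug, PartialEq)]
enum Flow {
    /// The task asked to pause; a runtime would reschedule it later.
    Suspend { wake_at: u64 },
    /// The task failed with a real error.
    Failed(String),
}

/// Returns Ok once `now` has reached `wake_at`, otherwise asks to suspend.
fn sleep_until(now: u64, wake_at: u64) -> Result<(), Flow> {
    if now >= wake_at {
        Ok(())
    } else {
        Err(Flow::Suspend { wake_at })
    }
}

/// A toy task body: `?` hands both failures and suspensions to the caller.
fn run_task(now: u64, query: &str) -> Result<&'static str, Flow> {
    if query.is_empty() {
        return Err(Flow::Failed("empty query".to_string()));
    }
    sleep_until(now, 100)?;
    Ok("done")
}
```

In this sketch, calling `run_task(50, "q")` yields `Err(Flow::Suspend { wake_at: 100 })`, while `run_task(100, "q")` yields `Ok("done")`. The caller decides what to do with each case by matching on the value.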
## Why Durable Execution?
Traditional background job systems execute tasks once and hope for the best. Durable execution is different - it provides **guaranteed progress** even when things go wrong:
- **Crash recovery** - If your process dies mid-workflow, tasks resume exactly where they left off. No lost progress, no duplicate work.
- **Long-running workflows** - Execute workflows that span hours or days. Sleep for a week waiting for a subscription to renew, then continue.
- **External event coordination** - Wait for webhooks, human approvals, or other services. The task suspends until the event arrives.
- **Reliable retries** - Transient failures (network issues, rate limits) are automatically retried with configurable backoff.
- **Exactly-once semantics** - Checkpointed steps don't re-execute on retry. Combined with idempotency keys, achieve exactly-once side effects.
Use durable execution when your workflow is too important to fail silently, too long to hold in memory, or too complex for simple retries.
## Installation
Add to your `Cargo.toml`:
```toml
[dependencies]
durable = "0.1"
```
## Quick Start
```rust
use std::borrow::Cow;

use durable::{Durable, MIGRATOR, Task, TaskContext, TaskResult, WorkerOptions, async_trait};
use serde::{Deserialize, Serialize};

// Define your task parameters and output
#[derive(Serialize, Deserialize)]
struct ResearchParams {
    query: String,
}

#[derive(Serialize, Deserialize)]
struct ResearchResult {
    summary: String,
    sources: Vec<String>,
}

// Implement the Task trait
struct ResearchTask;

#[async_trait]
impl Task for ResearchTask {
    fn name() -> Cow<'static, str> { Cow::Borrowed("research") }
    type Params = ResearchParams;
    type Output = ResearchResult;

    async fn run(params: Self::Params, mut ctx: TaskContext) -> TaskResult<Self::Output> {
        // Phase 1: Find relevant sources (checkpointed)
        // If the task crashes after this step, it won't re-run on retry
        let sources: Vec<String> = ctx.step("find-sources", (), |_, _| async {
            // Search logic here...
            Ok(vec![
                "https://example.com/article1".into(),
                "https://example.com/article2".into(),
            ])
        }).await?;

        // Phase 2: Analyze the sources (checkpointed)
        let analysis: String = ctx.step("analyze", (), |_, _| async {
            // Analysis logic here...
            Ok("Key findings from sources...".into())
        }).await?;

        // Phase 3: Generate summary (checkpointed)
        let summary: String = ctx.step("summarize", params, |params, _| async {
            // Summarization logic here...
            Ok(format!("Research summary for '{}': {}", params.query, analysis))
        }).await?;

        Ok(ResearchResult { summary, sources })
    }
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Create the client
    let client = Durable::builder()
        .database_url("postgres://localhost/myapp")
        .queue_name("research")
        .build()
        .await?;

    // Run migrations (idempotent - safe to call on every startup)
    MIGRATOR.run(client.pool()).await?;

    // Create the queue (idempotent - safe to call on every startup)
    client.create_queue(None).await?;

    // Register your task
    client.register::<ResearchTask>().await?;

    // Spawn a task
    let result = client.spawn::<ResearchTask>(ResearchParams {
        query: "distributed systems consensus algorithms".into(),
    }).await?;
    println!("Spawned task: {}", result.task_id);

    // Start a worker to process tasks
    let worker = client.start_worker(WorkerOptions::default()).await;

    // Wait for shutdown signal
    tokio::signal::ctrl_c().await?;
    worker.shutdown().await;
    Ok(())
}
```
## Core Concepts
### Tasks
Tasks are defined by implementing the [`Task`] trait:
```rust
#[async_trait]
impl Task for MyTask {
    fn name() -> Cow<'static, str> { Cow::Borrowed("my-task") } // Unique identifier
    type Params = MyParams; // Input (JSON-serializable)
    type Output = MyOutput; // Output (JSON-serializable)

    async fn run(params: Self::Params, mut ctx: TaskContext) -> TaskResult<Self::Output> {
        // Your task logic here
    }
}
```
### User Errors
Return user errors with structured data using `TaskError::user()`:
```rust
// With structured data (message extracted from "message" field if present)
Err(TaskError::user(json!({"message": "Not found", "code": 404})))
// With any serializable type
Err(TaskError::user(MyError { code: 404, details: "..." }))
// Simple string message
Err(TaskError::user_message("Something went wrong"))
```
The error data is serialized to JSON and stored in the database for debugging and analysis.
### TaskContext
The [`TaskContext`] provides methods for durable execution:
- **`step(name, params, closure)`** - Execute a checkpointed operation. The closure receives `(params, state)`. If a step with the same name and params completed in a previous run, the cached result is returned instead of running the closure again.
- **`spawn::<T>(name, params, options)`** - Spawn a subtask of task type `T` and return a handle.
- **`spawn_by_name(name, task_name, params, options)`** - Spawn a subtask by task name (dynamic version).
- **`join(handle)`** - Wait for a subtask to complete and get its result.
- **`sleep_for(name, duration)`** - Suspend the task for a duration.
- **`await_event(name, timeout)`** - Wait for an external event.
- **`emit_event(name, payload)`** - Emit an event to wake waiting tasks.
- **`heartbeat(duration)`** - Extend the task lease for long operations.
- **`rand()`** - Generate a durable random value in [0, 1). Checkpointed.
- **`now()`** - Get the current time as a durable checkpoint.
- **`uuid7()`** - Generate a durable UUIDv7. Checkpointed.
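The caching behavior of `step` can be modeled with a small in-memory sketch. This is a toy, not the crate's implementation: `Checkpoints` and `executions_across_two_runs` are hypothetical names, the store is a `HashMap` rather than Postgres, and for brevity it keys on the step name only (the real `step` also takes params into account, as described above).

```rust
use std::collections::HashMap;

/// Toy checkpoint store: maps a step name to its saved result.
/// In a real system this state would be persisted, not held in memory.
#[derive(Default)]
struct Checkpoints {
    saved: HashMap<String, String>,
}

impl Checkpoints {
    /// Runs `work` only if no result is saved under `name`;
    /// otherwise returns the saved result without calling `work`.
    fn step<F>(&mut self, name: &str, work: F) -> String
    where
        F: FnOnce() -> String,
    {
        if let Some(cached) = self.saved.get(name) {
            return cached.clone();
        }
        let result = work();
        self.saved.insert(name.to_string(), result.clone());
        result
    }
}

/// Simulates two runs of the same task against one store and
/// returns how many times the step body actually executed.
fn executions_across_two_runs() -> u32 {
    let mut store = Checkpoints::default();
    let mut calls = 0;
    for _run in 0..2 {
        let value = store.step("find-sources", || {
            calls += 1;
            "https://example.com/article1".to_string()
        });
        assert_eq!(value, "https://example.com/article1");
    }
    calls
}
```

The second run sees the same value as the first, but the closure body is skipped, which is why helpers such as `rand()`, `now()`, and `uuid7()` need to be checkpointed to stay stable across retries.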
### Checkpointing
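Steps provide "at-least-once" execution: a step's closure can run more than once if a worker crashes after the side effect happens but before the checkpoint is saved. To achieve "exactly-once" semantics for side effects, use the `task_id` as an idempotency key: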
```rust
ctx.step("charge-payment", ctx.task_id, |task_id, state| async {
    let idempotency_key = format!("{}:charge", task_id);
    stripe::charge(amount, &idempotency_key).await
}).await?;
```
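Why an idempotency key turns a repeated step into a single side effect can be shown with a toy provider. Everything here is hypothetical (`Provider`, `charge`, `charges_after_retry`); it does not model any real payment API, only the deduplication idea:

```rust
use std::collections::HashSet;

/// Toy payment provider that deduplicates on an idempotency key.
#[derive(Default)]
struct Provider {
    seen_keys: HashSet<String>,
    charges_made: u32,
}

impl Provider {
    /// Charges once per key; repeated calls with the same key are no-ops.
    fn charge(&mut self, idempotency_key: &str) {
        // `insert` returns true only when the key was not present before.
        if self.seen_keys.insert(idempotency_key.to_string()) {
            self.charges_made += 1;
        }
    }
}

/// Simulates a step that runs twice because the worker crashed after the
/// charge went through but before the checkpoint was saved.
fn charges_after_retry(task_id: &str) -> u32 {
    let mut provider = Provider::default();
    let key = format!("{}:charge", task_id);
    provider.charge(&key); // first attempt, checkpoint lost
    provider.charge(&key); // retry of the same step
    provider.charges_made
}
```

Because the key is derived from the task ID, both attempts present the same key and the provider records a single charge.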
### Events
Tasks can wait for and emit events:
```rust
// In one task: wait for an event
let shipment: ShipmentEvent = ctx.await_event(
    &format!("packed:{}", order_id),
    Some(Duration::from_secs(7 * 24 * 3600)), // 7 day timeout
).await?;

// From another task or service: emit the event
client.emit_event(
    &format!("packed:{}", order_id),
    &ShipmentEvent { tracking: "1Z999".into() },
    None,
).await?;
```
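As a rough mental model of the wait/emit pairing above, the following sketch uses an in-memory map keyed by event name. `EventBus` and `Await` are hypothetical names, timeouts are left out, and how the crate treats repeated emits for the same name is not modeled here:

```rust
use std::collections::HashMap;

/// Outcome of asking for an event in this toy model.
#[derive(Debug, PartialEq)]
enum Await {
    /// The event has been emitted; here is its payload.
    Ready(String),
    /// Nothing emitted yet; a real task would suspend at this point.
    Suspended,
}

/// Toy event store keyed by event name.
#[derive(Default)]
struct EventBus {
    payloads: HashMap<String, String>,
}

impl EventBus {
    /// Records a payload under the event name.
    fn emit(&mut self, name: &str, payload: &str) {
        self.payloads.insert(name.to_string(), payload.to_string());
    }

    /// Returns the payload if already emitted, otherwise signals suspension.
    fn await_event(&self, name: &str) -> Await {
        match self.payloads.get(name) {
            Some(payload) => Await::Ready(payload.clone()),
            None => Await::Suspended,
        }
    }
}
```

A waiter that asks for `"packed:42"` before anything is emitted gets `Await::Suspended`; after `emit("packed:42", "1Z999")`, the same call returns the payload.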
### Subtasks (Spawn/Join)
Tasks can spawn subtasks and wait for their results using `spawn()` and `join()`:
```rust
async fn run(params: Self::Params, mut ctx: TaskContext) -> TaskResult<Self::Output> {
    // Spawn subtasks (runs on same queue).
    // `ProcessItem` is a placeholder for your subtask type.
    let h1 = ctx.spawn::<ProcessItem>("item-1", Item { id: 1 }, Default::default()).await?;
    let h2 = ctx.spawn::<ProcessItem>("item-2", Item { id: 2 }, SpawnOptions {
        max_attempts: Some(3),
        ..Default::default()
    }).await?;

    // Do local work while subtasks run...
    let local = ctx.step("local-work", (), |_params, _state| async { Ok(compute()) }).await?;

    // Wait for subtask results
    let r1: ItemResult = ctx.join(h1).await?;
    let r2: ItemResult = ctx.join(h2).await?;
    Ok(Output { local, children: vec![r1, r2] })
}
```
**Key behaviors:**
- **Checkpointed** - Spawns and joins are cached. If the parent retries, it gets the same subtask handles and results.
- **Cascade cancellation** - When a parent fails or is cancelled, all its subtasks are automatically cancelled.
- **Error propagation** - If a subtask fails, `join()` returns an error that the parent can handle.
- **Same queue** - Subtasks run on the same queue as their parent.
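Cascade cancellation can be pictured as a walk over the parent-to-subtask tree. The sketch below is a toy model with a hypothetical `TaskTree` type; it assumes cancellation reaches grandchildren as well as direct subtasks, which the list above does not spell out:

```rust
use std::collections::HashMap;

/// Toy task tree: each task id maps to the ids of the subtasks it spawned.
struct TaskTree {
    children: HashMap<u32, Vec<u32>>,
}

impl TaskTree {
    /// Returns `root` plus every descendant, sorted by id: everything a
    /// cascading cancel of `root` would touch in this model.
    fn cancel_cascade(&self, root: u32) -> Vec<u32> {
        let mut cancelled = Vec::new();
        let mut stack = vec![root];
        while let Some(id) = stack.pop() {
            cancelled.push(id);
            if let Some(kids) = self.children.get(&id) {
                stack.extend(kids.iter().copied());
            }
        }
        cancelled.sort_unstable();
        cancelled
    }
}
```

With task 1 as the parent of tasks 2 and 3, and task 3 as the parent of task 4, cancelling task 1 returns `[1, 2, 3, 4]`, while unrelated tasks are left alone.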
### Event-Based Coordination
For coordination between independent tasks (not parent-child), use events:
```rust
// Task A: Waits for a signal from Task B
let approval: ApprovalPayload = ctx.await_event(
    &format!("approved:{}", request_id),
    Some(Duration::from_secs(24 * 3600)), // 24 hour timeout
).await?;

// Task B (or external service): Sends the signal
client.emit_event(
    &format!("approved:{}", request_id),
    &ApprovalPayload { approved_by: "admin".into() },
    None,
).await?;
```
### Transactional Spawning
You can atomically enqueue a task as part of a larger database transaction. This ensures that either both your write and the task spawn succeed, or neither does:
```rust
let mut tx = client.pool().begin().await?;

// Your application write
sqlx::query("INSERT INTO orders (id, status) VALUES ($1, $2)")
    .bind(order_id)
    .bind("pending")
    .execute(&mut *tx)
    .await?;

// Enqueue task in the same transaction.
// `ProcessOrderTask` is a placeholder for your task type.
client.spawn_with::<ProcessOrderTask>(&mut *tx, ProcessOrderParams { order_id }).await?;

tx.commit().await?;
// Both succeed or both fail - atomic
```
This is useful when you need to guarantee that a task is only enqueued if related data was successfully persisted. The `_with` variants accept any SQLx executor:
- `spawn_with(executor, params)` - Spawn with default options
- `spawn_with_options_with(executor, params, options)` - Spawn with custom options
- `spawn_by_name_with(executor, task_name, params, options)` - Dynamic spawn by name
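The all-or-nothing guarantee comes from the database transaction, not from anything special in the spawn call. A toy in-memory model shows the shape of it; `Db`, `Tx`, and `place_order` are hypothetical names and no real database is involved:

```rust
/// Toy database: committed order rows and committed queue entries.
#[derive(Default)]
struct Db {
    orders: Vec<u32>,
    queue: Vec<u32>,
}

/// Toy transaction: writes are staged and only applied on commit.
/// Dropping a `Tx` without calling `commit` discards all staged writes.
#[derive(Default)]
struct Tx {
    staged_orders: Vec<u32>,
    staged_tasks: Vec<u32>,
}

impl Tx {
    fn insert_order(&mut self, order_id: u32) {
        self.staged_orders.push(order_id);
    }

    fn spawn_task(&mut self, order_id: u32) {
        self.staged_tasks.push(order_id);
    }

    /// Applies every staged write at once.
    fn commit(self, db: &mut Db) {
        db.orders.extend(self.staged_orders);
        db.queue.extend(self.staged_tasks);
    }
}

/// Stages an order write and a task spawn, then commits unless told to fail.
fn place_order(db: &mut Db, order_id: u32, fail_before_commit: bool) -> Result<(), String> {
    let mut tx = Tx::default();
    tx.insert_order(order_id);
    tx.spawn_task(order_id);
    if fail_before_commit {
        // `tx` is dropped here, so neither write becomes visible.
        return Err("simulated failure".to_string());
    }
    tx.commit(db);
    Ok(())
}
```

A failed call leaves both `orders` and `queue` empty; a successful call adds the order and its task together, so a worker never sees a task for an order that was not persisted.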
## API Overview
### Client
| Type | Description |
|------|-------------|
| [`Durable`] | Main client for spawning tasks and managing queues |
| [`DurableBuilder`] | Builder for configuring the client |
| [`Worker`] | Background worker that processes tasks |
### Task Definition
| Type | Description |
|------|-------------|
| [`Task`] | Trait for defining task types |
| [`TaskContext`] | Context passed to task execution |
| [`TaskResult`] | Result type alias for task returns |
| [`TaskError`] | Error type with control flow signals and user errors |
| [`TaskError::user()`] | Helper to create user errors with JSON data |
| [`TaskError::user_message()`] | Helper to create string user errors |
| [`TaskHandle`] | Handle to a spawned subtask (returned by `ctx.spawn()`) |
### Configuration
| Type | Description |
|------|-------------|
| [`SpawnOptions`] | Options for spawning tasks (retries, headers, queue) |
| [`WorkerOptions`] | Options for worker configuration (concurrency, timeouts) |
| [`RetryStrategy`] | Retry behavior: `None`, `Fixed`, or `Exponential` |
| [`CancellationPolicy`] | Auto-cancel tasks based on delay or duration |
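To give a feel for how the three retry behaviors differ, here is a self-contained sketch of a delay calculation. `RetrySketch` and its fields (`delay`, `base`, `factor`, `max`) are assumptions made for illustration; the crate's `RetryStrategy` may use different fields and a different formula:

```rust
use std::time::Duration;

/// Hypothetical retry policy mirroring the three variant names above.
enum RetrySketch {
    None,
    Fixed { delay: Duration },
    Exponential { base: Duration, factor: u32, max: Duration },
}

impl RetrySketch {
    /// Delay before retry number `attempt` (1 = first retry), or `None`
    /// when the policy does not retry at all.
    fn delay_for(&self, attempt: u32) -> Option<Duration> {
        match self {
            RetrySketch::None => None,
            RetrySketch::Fixed { delay } => Some(*delay),
            RetrySketch::Exponential { base, factor, max } => {
                // base * factor^(attempt - 1), saturating, then capped at `max`.
                let exponent = attempt.saturating_sub(1);
                let multiplier = factor.saturating_pow(exponent);
                Some(base.saturating_mul(multiplier).min(*max))
            }
        }
    }
}
```

With a base of 1 second, a factor of 2, and a cap of 10 seconds, this sketch yields 1 s, 2 s, 4 s, 8 s, and then 10 s for every later retry. The cap keeps a long-failing task from drifting to unbounded delays.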
### Results
| Type | Description |
|------|-------------|
| [`SpawnResult`] | Returned when spawning a task (task_id, run_id, attempt) |
| [`ControlFlow`] | Signals for suspension and cancellation |
## Environment Variables
- `DURABLE_DATABASE_URL` - Default PostgreSQL connection string (if not provided to builder)
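The precedence described above (an explicit builder value first, the environment variable as a fallback) can be sketched as follows. The function names are hypothetical and this is not the builder's actual code:

```rust
/// Picks the explicit URL if given, otherwise the environment fallback.
/// `env_value` is passed in so the logic can be exercised without
/// touching the process environment.
fn resolve_database_url(explicit: Option<&str>, env_value: Option<String>) -> Option<String> {
    explicit.map(str::to_string).or(env_value)
}

/// Convenience wrapper that reads the real environment variable.
fn resolve_from_env(explicit: Option<&str>) -> Option<String> {
    resolve_database_url(explicit, std::env::var("DURABLE_DATABASE_URL").ok())
}
```

If neither source provides a value, the sketch returns `None`, leaving it to the caller to report a configuration error.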
## Benchmarks
Performance benchmarks run automatically on every push to `main` using [Criterion](https://github.com/bheisler/criterion.rs). Results are published to GitHub Pages:
**[View Benchmark Results](https://tensorzero.github.io/durable/dev/bench/)**
To run benchmarks locally:
```bash
cargo bench
```
## License
Licensed under Apache-2.0. See the LICENSE file.