https://github.com/alexpota/jobguard

PostgreSQL durability for Redis-backed job queues (Bull, BullMQ, Bee-Queue)
background-jobs bee-queue bull bullmq durability fault-tolerance job-queue nodejs postgresql queue-persistence redis typescript
Host: GitHub
URL: https://github.com/alexpota/jobguard
Owner: alexpota
License: mit
Created: 2025-10-01T10:24:21.000Z (4 months ago)
Default Branch: main
Last Pushed: 2025-10-18T12:26:45.000Z (4 months ago)
Last Synced: 2025-10-19T07:11:56.147Z (4 months ago)
Topics: background-jobs, bee-queue, bull, bullmq, durability, fault-tolerance, job-queue, nodejs, postgresql, queue-persistence, redis, typescript
Language: TypeScript
Homepage:
Size: 1.21 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: .github/CONTRIBUTING.md
- License: LICENSE
- Code of conduct: .github/CODE_OF_CONDUCT.md
- Security: .github/SECURITY.md

# JobGuard

[![npm](https://img.shields.io/npm/v/jobguard?logo=npm)](https://www.npmjs.com/package/jobguard)

[![node](https://img.shields.io/node/v/jobguard)](https://nodejs.org/)

[![TypeScript](https://img.shields.io/badge/TypeScript-5.0+-blue?logo=typescript)](https://www.typescriptlang.org/)

[![CI](https://github.com/alexpota/jobguard/workflows/CI/badge.svg)](https://github.com/alexpota/jobguard/actions)

[![coverage](https://img.shields.io/badge/coverage-85%25-brightgreen)](https://github.com/alexpota/jobguard)

[![License](https://img.shields.io/npm/l/jobguard)](https://opensource.org/licenses/MIT)

[![downloads](https://img.shields.io/npm/dm/jobguard)](https://www.npmjs.com/package/jobguard)

PostgreSQL durability for Redis-backed job queues (Bull, BullMQ, Bee-Queue) with minimal integration.

## Quick Start

### Installation

```bash

npm install jobguard pg

```

### Basic Usage

```typescript

import Bull from 'bull';

import { JobGuard } from 'jobguard';

// Create your queue as usual

const queue = new Bull('my-queue', 'redis://localhost:6379');

// Add JobGuard for durability

const jobGuard = await JobGuard.create(queue, {

  postgres: 'postgresql://localhost:5432/mydb',

});

// Use your queue normally - JobGuard works transparently

await queue.add('email', { to: 'user@example.com' });

// Gracefully shutdown when done

process.on('SIGTERM', async () => {

  await jobGuard.shutdown();

  await queue.close();

});

```

## 🎬 Demo

![JobGuard Stress Test](./assets/demo.gif)

✅ **10,000 jobs • 60 workers • Redis crash at peak load • Zero jobs lost**

[▶️ Run the interactive demo yourself →](./demo#readme)

## Features

- 🔒 **Drop-In Integration**: Wraps existing queues without modifying your queue code

- 🔄 **Automatic Recovery**: Client-side reconciliation detects and recovers stuck jobs

- 💓 **Heartbeat Support**: Long-running jobs signal liveness for accurate stuck detection

- 📊 **Multi-Queue Support**: Works with Bull, BullMQ, and Bee-Queue

- ⚡ **Low Overhead**: <5ms per job operation, minimal memory footprint

- 🛡️ **Fault Tolerant**: Circuit breaker pattern protects against PostgreSQL failures

- 🎯 **Type Safe**: Full TypeScript support with strict typing

## Table of Contents

- [Quick Start](#quick-start)

- [Demo](#-demo)

- [Features](#features)

- [Why JobGuard?](#why-jobguard)

- [Database Setup](#database-setup)

- [Configuration](#configuration)

- [Advanced Usage](#advanced-usage)

- [API Reference](#api-reference)

- [Queue Library Support](#queue-library-support)

- [How It Works](#how-it-works)

- [Performance](#performance-considerations)

- [Error Handling](#error-handling)

- [Known Limitations](#known-limitations)

- [Security](#security)

- [Requirements](#requirements)

- [FAQ](#faq)

- [License](#license)

- [Contributing](#contributing)

## Why JobGuard?

Redis-backed queues are fast but **volatile**. When Redis crashes or restarts, you lose:

- ❌ Jobs currently being processed

- ❌ Jobs waiting in the queue

- ❌ Job history and audit trail

- ❌ Ability to recover stuck jobs

**JobGuard solves this** by adding PostgreSQL durability as a safety net, without changing your existing queue code.

### The Problem: Speed vs Safety Trade-off

Most teams face this dilemma:

| Option | Result |

|--------|--------|

| Use Redis-only queues (Bull/BullMQ/Bee-Queue) | ⚡ Fast but lose jobs on crash |

| Use PostgreSQL-only queues | 🛡️ Safe but sacrifice Redis speed |

| Configure Redis AOF persistence | ⚠️ Can still lose data, plus complex setup |

### The Solution: Best of Both Worlds

JobGuard lets you keep Redis speed **and** get PostgreSQL safety:

```typescript

// Your existing queue

const queue = new Bull('my-queue', 'redis://localhost:6379');

// Add JobGuard (just 3 lines)

const jobGuard = await JobGuard.create(queue, {

  postgres: 'postgresql://localhost:5432/mydb',

});

// That's it! Your queue now has 100% durability

```

### Stress Test Results

**Benchmark** (10,000 jobs, 60 workers, Redis crash at peak load):

- 🎯 **Zero jobs lost** - 100% recovery after crash

- 🛡️ **100% durability** - Every job persisted to PostgreSQL

- ⏱️ **55 seconds** - Full stress test with crash recovery

- 📊 **60 concurrent workers** - Proven scalability under load

[▶️ Run the interactive stress test yourself](./demo#readme)

## Database Setup

**One-time setup:** Create the JobGuard table in your PostgreSQL database.

### Option 1: Using psql (Recommended)

```bash

psql -d mydb -f node_modules/jobguard/schema/001_initial.sql

```

### Option 2: Programmatically

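```typescript
import { Pool } from 'pg';
import { readFileSync } from 'fs';
import { join } from 'path';

const pool = new Pool({ connectionString: 'postgresql://localhost:5432/mydb' });

// Resolve from the project root: `__dirname` is not defined in ES modules
// and would otherwise point at the directory of the current file.
const schema = readFileSync(
  join(process.cwd(), 'node_modules/jobguard/schema/001_initial.sql'),
  'utf8'
);

await pool.query(schema);
await pool.end(); // Close the pool used for this one-time setup
```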

### Option 3: Add to Your Existing Migrations

Copy `node_modules/jobguard/schema/001_initial.sql` into your project's migration system (Knex, TypeORM, Prisma, etc.).

## Configuration

### Full Configuration Example

```typescript

const jobGuard = await JobGuard.create(queue, {

  // PostgreSQL connection (required)

  postgres: {

    host: 'localhost',

    port: 5432,

    database: 'mydb',

    user: 'postgres',

    password: 'secret',

    max: 10, // Connection pool size

    ssl: false,

  },

  // Or use connection string

  // postgres: 'postgresql://localhost:5432/mydb',

  // Reconciliation settings (optional)

  reconciliation: {

    enabled: true,

    intervalMs: 30000, // Check every 30 seconds

    stuckThresholdMs: 300000, // 5 minutes (minimum: 60000ms)

    maxAttempts: 3,

    batchSize: 100,

    adaptiveScheduling: true, // Adjust interval based on load

    rateLimitPerSecond: 20, // Max jobs to re-enqueue per second (default: 20)

  },

  // Logging settings (optional)

  logging: {

    enabled: true,

    level: 'info', // 'debug' | 'info' | 'warn' | 'error'

    prefix: '[JobGuard]',

  },

  // Persistence settings (optional)

  persistence: {

    retentionDays: 7, // Keep completed jobs for 7 days

    cleanupEnabled: true,

    cleanupIntervalMs: 3600000, // Cleanup every hour

  },

});

```
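
The 60000ms minimum noted on `stuckThresholdMs` can be checked in your own code before the options reach `JobGuard.create`. The sketch below is illustrative only: `resolveReconciliation`, `exampleValues`, and the choice to throw a `RangeError` are assumptions of this example, not JobGuard's API, and apart from `rateLimitPerSecond` the values are copied from the example above rather than taken from documented defaults.

```typescript
// Shape of the reconciliation block shown in the example above.
interface ReconciliationSettings {
  enabled: boolean;
  intervalMs: number;
  stuckThresholdMs: number;
  maxAttempts: number;
  batchSize: number;
  adaptiveScheduling: boolean;
  rateLimitPerSecond: number;
}

// Minimum documented in the example comment (60000ms).
const MIN_STUCK_THRESHOLD_MS = 60_000;

// Values copied from the full configuration example (assumed, not official defaults).
const exampleValues: ReconciliationSettings = {
  enabled: true,
  intervalMs: 30_000,
  stuckThresholdMs: 300_000,
  maxAttempts: 3,
  batchSize: 100,
  adaptiveScheduling: true,
  rateLimitPerSecond: 20,
};

// Hypothetical helper: merge overrides and reject a threshold below the minimum.
function resolveReconciliation(
  overrides: Partial<ReconciliationSettings> = {}
): ReconciliationSettings {
  const merged: ReconciliationSettings = { ...exampleValues, ...overrides };
  if (merged.stuckThresholdMs < MIN_STUCK_THRESHOLD_MS) {
    throw new RangeError(
      `stuckThresholdMs must be at least ${MIN_STUCK_THRESHOLD_MS}ms`
    );
  }
  return merged;
}
```

The result can be passed as the `reconciliation` option, so a misconfigured threshold fails at startup rather than at the first reconciliation pass.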

## Advanced Usage

### Force Reconciliation

Trigger immediate reconciliation:

```typescript

await jobGuard.forceReconciliation();

```

### Get Queue Statistics

```typescript

const stats = await jobGuard.getStats();

console.log(`

  Queue: ${stats.queueName}

  Pending: ${stats.pending}

  Processing: ${stats.processing}

  Completed: ${stats.completed}

  Failed: ${stats.failed}

  Stuck: ${stats.stuck}

  Total: ${stats.total}

`);

```

### Multiple Queues

```typescript

const emailQueue = new Bull('emails', redisUrl);

const emailGuard = await JobGuard.create(emailQueue, { postgres: postgresUrl });

const paymentQueue = new Bull('payments', redisUrl);

const paymentGuard = await JobGuard.create(paymentQueue, { postgres: postgresUrl });

// Each queue is tracked independently

```

### Heartbeat for Long-Running Jobs

**Problem**: For jobs with dynamic or long execution times (e.g., 20 seconds to 2 hours), a fixed `stuckThresholdMs` can cause false positives or slow recovery.

**Solution**: Use heartbeats to signal that a job is still alive, regardless of how long it runs.

```typescript

import { Queue, Worker } from 'bullmq';

import { JobGuard } from 'jobguard';

const queue = new Queue('data-sync', { connection: { host: 'localhost' } });

const jobGuard = await JobGuard.create(queue, {

  postgres: postgresUrl,

  reconciliation: {

    stuckThresholdMs: 300000, // 5 minutes - short threshold works with heartbeats!

  },

});

// Worker: Update heartbeat every 30 seconds during long-running jobs

const worker = new Worker('data-sync', async (job) => {

  const heartbeatInterval = setInterval(async () => {

    await jobGuard.updateHeartbeat(job.id!);

  }, 30000); // Update every 30 seconds

  try {

    // Your long-running job logic (`largeDataset` and `processItem` are placeholders)

    for (let i = 0; i < largeDataset.length; i++) {

      await processItem(largeDataset[i]);

      // Heartbeat automatically updates in the background

    }

  } finally {

    clearInterval(heartbeatInterval);

  }

}, { connection: { host: 'localhost' } });

```

**How it works**:

- `updateHeartbeat(jobId)` updates the `last_heartbeat` timestamp in PostgreSQL

- Stuck detection uses `COALESCE(last_heartbeat, updated_at)` - falls back to `updated_at` if no heartbeat

- With regular heartbeats, jobs can run for hours without being marked stuck

- If a worker crashes mid-heartbeat, the job is detected as stuck within `stuckThresholdMs` (fast recovery!)
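
The `COALESCE(last_heartbeat, updated_at)` rule can be expressed in a few lines of TypeScript. This is a minimal in-memory sketch, not JobGuard's implementation: the `TrackedJob` shape and both function names are hypothetical, and which job statuses are eligible for stuck detection is not modeled here.

```typescript
// Hypothetical in-memory view of a tracked job; only the two timestamp
// columns are taken from the description above.
interface TrackedJob {
  id: string;
  updatedAt: Date;
  lastHeartbeat: Date | null;
}

// Mirrors COALESCE(last_heartbeat, updated_at): prefer the heartbeat and
// fall back to the last update when no heartbeat was ever recorded.
function lastSignOfLife(job: TrackedJob): Date {
  return job.lastHeartbeat ?? job.updatedAt;
}

// A job counts as stuck once its last sign of life is older than the threshold.
function isStuck(job: TrackedJob, now: Date, stuckThresholdMs: number): boolean {
  return now.getTime() - lastSignOfLife(job).getTime() > stuckThresholdMs;
}
```

With a 5-minute threshold, a job that started an hour ago but sent a heartbeat 30 seconds ago is not stuck, while a job with no heartbeat and an hour-old `updatedAt` is.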

**Benefits**:

- ✅ Fast recovery (5 minutes) for crashed jobs

- ✅ No false positives for long-running jobs

- ✅ Works with dynamic job durations (20 sec to 2 hours)

- ✅ Backward compatible (jobs without heartbeats fall back to `updated_at`)

## API Reference

### `JobGuard.create(queue, config)`

Creates and initializes a new JobGuard instance.

**Parameters:**

- `queue` **(required)** - Bull, BullMQ, or Bee-Queue instance

- `config` **(required)** - Configuration object

**Returns:** `Promise<JobGuard>`

**Example:**

```typescript

const jobGuard = await JobGuard.create(queue, {

  postgres: 'postgresql://localhost:5432/mydb'

});

```

### `jobGuard.getStats()`

Retrieves current queue statistics from PostgreSQL.

**Returns:** `Promise<JobStats>`

**JobStats interface:**

```typescript

{

  queueName: string;

  pending: number;

  processing: number;

  completed: number;

  failed: number;

  stuck: number;

  dead: number;

  total: number;

}

```

### `jobGuard.forceReconciliation()`

Manually triggers immediate reconciliation of stuck jobs.

**Returns:** `Promise`

### `jobGuard.updateHeartbeat(jobId)`

Updates the heartbeat timestamp for a processing job to indicate it's still alive.

**Parameters:**

- `jobId` **(required)** - The job ID to update (string or number)

**Returns:** `Promise`

**Example:**

```typescript

// In your worker process

const worker = new Worker('my-queue', async (job) => {

  const heartbeat = setInterval(async () => {

    await jobGuard.updateHeartbeat(job.id!);

  }, 30000); // Every 30 seconds

  try {

    await longRunningTask(job.data);

  } finally {

    clearInterval(heartbeat);

  }

});

```

**Notes:**

- Only updates heartbeat for jobs in `processing` status

- Silently fails if job is not found or not processing (doesn't throw)

- Recommended heartbeat interval: 30-60 seconds for most workloads

### `jobGuard.shutdown()`

Gracefully shuts down JobGuard, stopping reconciliation and closing database connections.

**Returns:** `Promise`

**Example:**

```typescript

process.on('SIGTERM', async () => {

  await jobGuard.shutdown();

  await queue.close();

});

```

### Configuration Types

For full TypeScript type definitions and configuration options, see:

- [Configuration Types](./src/types/config.ts)

- [Job Types](./src/types/job.ts)

## Queue Library Support

### Bull

```typescript

import Bull from 'bull';

import { JobGuard } from 'jobguard';

const queue = new Bull('my-queue', 'redis://localhost:6379');

const guard = await JobGuard.create(queue, { postgres: postgresUrl });

```

### BullMQ

```typescript

import { Queue } from 'bullmq';

import { JobGuard } from 'jobguard';

const queue = new Queue('my-queue', { connection: { host: 'localhost' } });

const guard = await JobGuard.create(queue, { postgres: postgresUrl });

```

### Bee-Queue

```typescript

import Queue from 'bee-queue';

import { JobGuard } from 'jobguard';

const queue = new Queue('my-queue', { redis: { host: 'localhost' } });

const guard = await JobGuard.create(queue, { postgres: postgresUrl });

```

## How It Works

JobGuard provides durability through three mechanisms:

1. **Job Tracking**: Intercepts job creation and tracks jobs in PostgreSQL

2. **Event Monitoring**: Listens to queue events to update job status

3. **Reconciliation**: Periodically checks for stuck jobs and re-enqueues them

### Architecture

![JobGuard Architecture](./assets/architecture.svg)

**How it works:**

1. **Queue Adapter** intercepts `queue.add()` and writes to both Redis (fast) and PostgreSQL (durable)

2. **Event Monitor** listens to queue events and updates job status in PostgreSQL

3. **Worker** (optional) sends heartbeats to PostgreSQL to signal long-running jobs are still alive

4. **Reconciler** runs every 30 seconds to detect stuck jobs (using heartbeat or last update time) and re-enqueue them to Redis
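
To make the reconciler step concrete, here is a simplified in-memory sketch of a single reconciliation pass. It is an illustration under stated assumptions, not JobGuard's code: `JobRecord`, `reconcileOnce`, the `reEnqueue` callback, and the rule that a job becomes `dead` once `maxAttempts` is reached are all hypothetical. The real reconciler works against PostgreSQL and Redis.

```typescript
type Status = 'pending' | 'processing' | 'completed' | 'failed' | 'dead';

// Hypothetical record shape for this sketch.
interface JobRecord {
  id: string;
  status: Status;
  attempts: number;
  lastSeenMs: number; // heartbeat or last update, as epoch milliseconds
}

interface ReconcileOptions {
  stuckThresholdMs: number;
  maxAttempts: number;
  batchSize: number;
}

// One pass: pick up to `batchSize` stale active jobs, re-enqueue those with
// attempts left, and mark the rest as dead (assumed semantics).
function reconcileOnce(
  jobs: JobRecord[],
  nowMs: number,
  opts: ReconcileOptions,
  reEnqueue: (id: string) => void
): { requeued: string[]; dead: string[] } {
  const requeued: string[] = [];
  const dead: string[] = [];
  const stuck = jobs
    .filter((j) => j.status === 'pending' || j.status === 'processing')
    .filter((j) => nowMs - j.lastSeenMs > opts.stuckThresholdMs)
    .slice(0, opts.batchSize);

  for (const job of stuck) {
    if (job.attempts >= opts.maxAttempts) {
      job.status = 'dead';
      dead.push(job.id);
      continue;
    }
    job.attempts += 1;
    job.status = 'pending';
    job.lastSeenMs = nowMs; // restart the staleness clock
    reEnqueue(job.id);
    requeued.push(job.id);
  }
  return { requeued, dead };
}
```

Passing the clock and the re-enqueue function as arguments keeps the pass deterministic, which makes this kind of logic easy to unit test.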

## Performance Considerations

- **Overhead**: <5ms per job operation

- **Memory**: <50MB for tracking 10,000 jobs

- **Database**: Uses connection pooling (default: 10 connections)

- **Reconciliation**: Adaptive scheduling reduces load during idle periods
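
This README does not describe how adaptive scheduling chooses its interval. One common approach, shown here purely as an assumption, is to back off while passes find nothing and to return to the base interval as soon as stuck jobs appear. `AdaptivePolicy` and `nextIntervalMs` are hypothetical names for this sketch.

```typescript
// Hypothetical policy: not a JobGuard configuration object.
interface AdaptivePolicy {
  baseIntervalMs: number; // interval used while there is work to do
  maxIntervalMs: number; // upper bound while the queue is idle
  backoffFactor: number; // multiplier applied after an idle pass
}

// Returns the delay before the next reconciliation pass.
function nextIntervalMs(
  currentIntervalMs: number,
  stuckJobsFound: number,
  policy: AdaptivePolicy
): number {
  if (stuckJobsFound > 0) {
    return policy.baseIntervalMs; // activity detected: check frequently again
  }
  return Math.min(currentIntervalMs * policy.backoffFactor, policy.maxIntervalMs);
}
```

With a 30-second base, a factor of 2, and a 4-minute cap, idle passes would be spaced 30s, 60s, 120s, then 240s apart.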

## Error Handling

JobGuard uses a circuit breaker to prevent cascading failures:

```typescript

import { CircuitBreakerOpenError } from 'jobguard';

try {

  await jobGuard.getStats();

} catch (error) {

  if (error instanceof CircuitBreakerOpenError) {

    console.error('PostgreSQL is unavailable, circuit breaker is open');

  }

}

```

When PostgreSQL is unavailable, JobGuard logs errors but allows your queue to continue operating normally. Jobs will be reconciled once PostgreSQL recovers.
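
For readers unfamiliar with the pattern, the following is a generic circuit breaker sketch. It is not JobGuard's internal implementation: `SketchCircuitBreaker`, its thresholds, and its method names are hypothetical, and the clock is passed in explicitly to keep the example deterministic.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class SketchCircuitBreaker {
  private failures = 0;
  private openedAtMs: number | null = null;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;

  constructor(failureThreshold: number, resetTimeoutMs: number) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
  }

  // closed: calls flow; open: calls are rejected; half-open: a trial call is allowed.
  state(nowMs: number): BreakerState {
    if (this.openedAtMs === null) return 'closed';
    return nowMs - this.openedAtMs >= this.resetTimeoutMs ? 'half-open' : 'open';
  }

  canRequest(nowMs: number): boolean {
    return this.state(nowMs) !== 'open';
  }

  // A success closes the breaker and clears the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.openedAtMs = null;
  }

  // Enough consecutive failures open the breaker; a failed trial call
  // in the half-open state re-opens it from `nowMs`.
  recordFailure(nowMs: number): void {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) {
      this.openedAtMs = nowMs;
    }
  }
}
```

A caller would check `canRequest` before each database operation, skip the operation while the breaker is open, and report the outcome with `recordSuccess` or `recordFailure`.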

## Known Limitations

### Race Condition Scenarios

While JobGuard provides strong durability guarantees, some edge-case race conditions are **inherent to distributed systems** and cannot be completely eliminated:

#### 1. Worker Crash During Job Processing

**Scenario**: Worker processes a job successfully → crashes before sending completion event → reconciler re-enqueues the job

**Impact**: Job may be processed twice

**Mitigation**:

- Implement idempotent job handlers in your application

- Use database transactions or unique constraints for non-idempotent operations

- Monitor duplicate processing via PostgreSQL job history
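
A minimal sketch of an idempotent handler, assuming each job carries a stable key such as an order ID. `handleOnce` and `processedKeys` are hypothetical names, and the in-memory `Set` only illustrates the check. A real deployment would need a durable store, for example a table with a unique constraint as suggested above.

```typescript
// Stand-in for a durable store of already-processed keys (NOT durable itself).
const processedKeys = new Set<string>();

// Runs `effect` at most once per idempotency key.
// Returns true if the effect ran, false if the delivery was a duplicate.
function handleOnce<T>(
  idempotencyKey: string,
  data: T,
  effect: (data: T) => void
): boolean {
  if (processedKeys.has(idempotencyKey)) {
    return false; // duplicate delivery after a re-enqueue: skip the side effect
  }
  effect(data);
  processedKeys.add(idempotencyKey); // record only after the effect succeeded
  return true;
}
```

Because the key is recorded only after the effect succeeds, a crash between the two steps can still cause one repeat. Closing that gap requires performing the effect and recording the key in the same database transaction.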

#### 2. Bee-Queue Duplicate Jobs

**Scenario**: Bee-Queue generates new job IDs when re-enqueueing stuck jobs (architectural limitation)

**Impact**: Two job records exist in PostgreSQL (old marked 'failed', new marked 'pending')

**Why this happens**: Unlike Bull/BullMQ, Bee-Queue doesn't support custom job IDs

**Mitigation**:

- The old job is marked as 'failed' to prevent conflict with partial index constraint

- Only one job will be active in Redis at any time

- Consider using Bull or BullMQ if this is a concern

#### 3. Very Short-Lived Jobs

**Scenario**: Job completes in <100ms before event listeners attach

**Impact**: Job may be marked as 'stuck' initially, then corrected

**Mitigation**:

- Use `stuckThresholdMs: 300000` (5 minutes) to avoid false positives

- Very short jobs complete before reconciliation runs anyway

### Configuration Constraints

- **Minimum `stuckThresholdMs`**: 60,000ms (60 seconds) - prevents marking healthy jobs as stuck

- **Rate limiting**: Reconciliation re-enqueues at 20 jobs/second by default (configurable via `rateLimitPerSecond`)

- **Error message truncation**: Error messages are truncated to 5,000 characters and sanitized for security
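
As a rough illustration of what the rate limit means for recovery time, the sketch below spaces re-enqueue operations evenly at a given rate. `reEnqueueOffsetsMs` is a hypothetical helper, and how JobGuard actually paces its re-enqueues is not specified here.

```typescript
// Returns, for each of `count` jobs, the delay in milliseconds from the start
// of a reconciliation pass at which it would be re-enqueued, assuming evenly
// spaced operations at `ratePerSecond`.
function reEnqueueOffsetsMs(count: number, ratePerSecond: number): number[] {
  if (ratePerSecond <= 0) {
    throw new RangeError('ratePerSecond must be positive');
  }
  const spacingMs = 1000 / ratePerSecond;
  return Array.from({ length: count }, (_, i) => Math.round(i * spacingMs));
}
```

At the default of 20 jobs per second the spacing is 50ms, so a full batch of 100 stuck jobs would take roughly 5 seconds to re-enqueue under this model.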

### Multi-Instance Reconciliation

**⚠️ Not Supported**: Running multiple JobGuard instances with reconciliation enabled for the same queue can cause duplicate re-enqueue attempts.

**Best Practice**: Only enable reconciliation (`reconciliation.enabled: true`) on **one** instance per queue:

```typescript
// Worker processes: reconciliation disabled
const workerGuard = await JobGuard.create(queue, {
  postgres: postgresUrl,
  reconciliation: { enabled: false },
});

// Single orchestrator process: reconciliation enabled
const orchestratorGuard = await JobGuard.create(queue, {
  postgres: postgresUrl,
  reconciliation: { enabled: true },
});
```

### Performance Trade-offs

- **PostgreSQL overhead**: Each job operation adds ~5ms latency

- **Reconciliation impact**: Checking 10,000 stuck jobs takes ~2-5 seconds

- **Memory usage**: ~50MB for tracking 10,000 jobs

## Security

### Reporting Vulnerabilities

🔒 **Please do NOT open public issues for security vulnerabilities.**

If you discover a security issue, please **[create a private security advisory](https://github.com/alexpota/jobguard/security/advisories/new)**.

We will respond within 48 hours and work with you to address the issue.

### Best Practices

**Production Deployment:**

- ✅ Use SSL/TLS for PostgreSQL connections (`ssl: true`)

- ✅ Store connection strings in environment variables, not code

- ✅ Use least-privilege database user with only required permissions:

  ```sql

  GRANT SELECT, INSERT, UPDATE, DELETE ON jobguard_jobs TO jobguard_user;

  ```

- ✅ Rotate database credentials regularly

- ✅ Set appropriate `max_connections` for your PostgreSQL instance

- ✅ Enable PostgreSQL audit logging for compliance requirements
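
The first two practices can be combined by building the `postgres` option from environment variables with SSL enabled. This is a sketch under assumptions: `postgresFromEnv` is a hypothetical helper, and the variable names (`PGHOST`, `PGPORT`, `PGDATABASE`, `PGUSER`, `PGPASSWORD`) are this example's choice rather than anything JobGuard reads on its own. The returned fields match the object form shown in the configuration example.

```typescript
// Field names follow the `postgres` object in the configuration example.
interface PostgresSettings {
  host: string;
  port: number;
  database: string;
  user: string;
  password: string;
  max: number;
  ssl: boolean;
}

// Hypothetical helper: read connection settings from an environment map.
function postgresFromEnv(env: Record<string, string | undefined>): PostgresSettings {
  const required = (name: string): string => {
    const value = env[name];
    if (!value) {
      throw new Error(`Missing environment variable ${name}`);
    }
    return value;
  };

  return {
    host: required('PGHOST'),
    port: Number(env['PGPORT'] ?? 5432),
    database: required('PGDATABASE'),
    user: required('PGUSER'),
    password: required('PGPASSWORD'),
    max: 10, // pool size from the configuration example
    ssl: true, // encrypt the connection in production
  };
}
```

In an application this would be called as `postgresFromEnv(process.env)` and the result passed as the `postgres` option, so credentials never appear in source control.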

**What JobGuard Does NOT Do:**

- ❌ JobGuard does not encrypt job data at rest (use PostgreSQL encryption)

- ❌ JobGuard does not implement authentication (secure your PostgreSQL)

- ❌ JobGuard does not sanitize job data (validate in your application)

## Requirements

- **Node.js**: 22.0+ (LTS)

- **PostgreSQL**: 14+ (for B-tree deduplication)

- **Queue Library**: Bull 4.12+, BullMQ 5.1+, or Bee-Queue 1.7+

## FAQ

### Why PostgreSQL only? Can I use MySQL/MongoDB?

**No** - JobGuard currently requires PostgreSQL 14+.

JobGuard uses PostgreSQL-specific features that are difficult to replicate in other databases:

| Feature | Why It Matters | Other Databases |

|---------|----------------|-----------------|

| **JSONB** | Fast job data storage and queries without deserialization | MySQL JSON is slower; MongoDB has native JSON but lacks other features |

| **Partial Indexes** | Only indexes active jobs - reduces storage and improves performance | MySQL has limited support; MongoDB supports but lacks transactional guarantees |

| **ACID Transactions** | Guarantees zero data loss during writes | MongoDB added in 4.0 but still limited; MySQL supports but lacks JSONB |

| **Advanced Indexes** | B-tree deduplication (PostgreSQL 14+) reduces index size by ~40% | Not available in MySQL/MongoDB |

**Could other databases be supported?**

Supporting MySQL or MongoDB would require:

- Abstract database layer (adds complexity and maintenance burden)

- Different schema implementations for each database

- Performance compromises (MySQL's JSON is measurably slower than JSONB)

- Extensive testing across multiple database versions

This significantly increases complexity for a feature that most users don't need. PostgreSQL is widely adopted in the Node.js ecosystem and provides the best combination of performance, reliability, and features for job durability.

**What if my team uses MySQL/MongoDB?**

You have three options:

1. **Add PostgreSQL for job tracking only** - JobGuard uses a single table with minimal overhead. Many teams run PostgreSQL alongside their primary database specifically for features like job durability.

2. **Use PostgreSQL-only alternatives** - [Graphile Worker](https://github.com/graphile/worker) and [pg-boss](https://github.com/timgit/pg-boss) are PostgreSQL-native job queues (no Redis).

3. **Request MySQL support** - If there's significant demand, MySQL support may be considered in the future. [Open an issue](https://github.com/alexpota/jobguard/issues) to discuss your use case.

### Why not just use Redis persistence (RDB/AOF)?

Redis persistence has limitations that JobGuard addresses:

**Redis AOF with `appendfsync everysec` (recommended setting):**

- Can lose up to 1 second of data on crash

- Does not detect stuck jobs (worker crashes mid-processing)

- Requires manual recovery after Redis restarts

**Redis AOF with `appendfsync always` (100% durable):**

- Significantly slower (every write waits for disk fsync)

- Still doesn't detect stuck jobs

- Still requires manual intervention for recovery

**JobGuard provides:**

- Zero data loss (PostgreSQL ACID guarantees)

- Automatic stuck job detection and re-enqueueing

- Full job history and audit trail

- Minimal performance impact (~5ms overhead per job)

You can use Redis persistence AND JobGuard together for defense in depth, but JobGuard provides features that Redis persistence alone cannot.

## License

MIT

## Contributing

Contributions are welcome! See [CONTRIBUTING.md](./.github/CONTRIBUTING.md) for development setup, testing, and code guidelines.

---

**Built by [Alex Potapenko](https://github.com/alexpota) • [Report Issues](https://github.com/alexpota/jobguard/issues)**