https://github.com/tilework-tech/nori-premortem

Push your machines to the max - diagnose machine issues before the crash

Host: GitHub
URL: https://github.com/tilework-tech/nori-premortem
Owner: tilework-tech
License: apache-2.0
Created: 2025-11-13T16:44:22.000Z (3 months ago)
Default Branch: main
Last Pushed: 2025-11-20T23:38:58.000Z (2 months ago)
Last Synced: 2025-11-21T00:11:13.413Z (2 months ago)
Language: TypeScript
Homepage: https://tilework.tech/products/nori-premortem
Size: 87.9 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Nori Premortem

![Node Version](https://img.shields.io/badge/node-%3E%3D20-brightgreen)

A system monitoring daemon that intelligently diagnoses machine issues before critical failure using Claude AI.

## Installation

```bash
npm install -g nori-premortem@latest
# Add your settings to config.json (see Configuration below)
nori-premortem --config config.json
```

## Why Premortem?

When a machine dies, you are often left with no real idea of what happened or why, because the machine takes everything with it. Traditional monitoring rarely captures meaningful diagnostics: it makes strong assumptions up front about possible sources of failure and cannot adjust dynamically to in-stream information. You can figure out that your system OOM'd, but you won't easily figure out why or, more importantly, where in your code the problem came from.

**Premortem** spawns a Claude agent the moment issues arise, analyzing the system in real-time and streaming diagnostics to a safe backend. Instead of metric graphs, engineers get AI-powered root cause analysis.

## Configuration

Create your configuration file from the example template:

```bash
cp defaultConfig.example.json defaultConfig.json
# Edit defaultConfig.json with your webhookUrl, anthropicApiKey, and desired thresholds
```

Example configuration:

```json
{
  "webhookUrl": "https://your-server.com/webhook-endpoint",
  "anthropicApiKey": "sk-ant-your-api-key-here",
  "pollingInterval": 10000,
  "thresholds": {
    "memoryPercent": 90,
    "diskPercent": 85,
    "cpuPercent": 80
  },
  "agentConfig": {
    "customPrompt": "You are diagnosing system performance issues. Focus on memory usage, disk space, CPU utilization, and process behavior."
  },
  "heartbeat": {
    "url": "https://your-server.com/heartbeat-endpoint",
    "interval": 60000,
    "processName": "my-process"
  }
}
```

### Configuration Options

- **webhookUrl** (required): HTTP endpoint to receive diagnostic output
  - Must accept POST requests with JSON payloads containing Claude SDK message objects
  - Messages are grouped by `session_id` field
  - Each message follows the format: `{type: string, session_id: string, ...other_fields}`
- **anthropicApiKey** (required): Your Anthropic API key for Claude
- **pollingInterval** (optional, default: 10000): Milliseconds between system checks
- **thresholds** (required): At least one threshold must be configured
  - **memoryPercent**: Trigger when memory usage exceeds this percentage (uses "available" memory, not "used", to avoid false alerts from Linux buffer/cache)
  - **diskPercent**: Trigger when disk usage exceeds this percentage
  - **cpuPercent**: Trigger when CPU usage exceeds this percentage
- **agentConfig** (optional): Claude agent configuration
  - **customPrompt**: Additional context prepended to diagnostic prompt (default: null)
  - Note: Model, allowed tools, and max turns are controlled by SDK defaults and not user-configurable
- **heartbeat** (optional): Health check configuration
  - **url**: Endpoint to receive periodic heartbeat signals
  - **interval** (default: 60000): Milliseconds between heartbeat signals
  - **processName**: Process name to monitor and report in heartbeat
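
The rules above (two required strings, at least one threshold, a default polling interval) can be checked before starting the daemon. The following is a minimal sketch, not the project's own validation code: `validateConfig` and `pollingIntervalOf` are hypothetical helper names, and only the option names and defaults come from the list above.

```typescript
// Hypothetical helpers; the option names and defaults mirror the list above,
// but the project's real validation logic may differ.
const THRESHOLD_KEYS = ["memoryPercent", "diskPercent", "cpuPercent"] as const;

// Returns a list of problems; an empty list means these checks passed.
function validateConfig(raw: Record<string, unknown>): string[] {
  const errors: string[] = [];

  // webhookUrl and anthropicApiKey are documented as required.
  for (const key of ["webhookUrl", "anthropicApiKey"]) {
    const value = raw[key];
    if (typeof value !== "string" || value.length === 0) {
      errors.push(`${key} is required`);
    }
  }

  // At least one numeric threshold must be configured.
  const thresholds = (raw["thresholds"] ?? {}) as Record<string, unknown>;
  const configured = THRESHOLD_KEYS.filter((k) => typeof thresholds[k] === "number");
  if (configured.length === 0) {
    errors.push("thresholds must set at least one of memoryPercent, diskPercent, cpuPercent");
  }
  return errors;
}

// Applies the documented default (10000 ms) for the optional pollingInterval.
function pollingIntervalOf(raw: Record<string, unknown>): number {
  const value = raw["pollingInterval"];
  return typeof value === "number" ? value : 10000;
}
```

Running a check like this against a parsed config file surfaces a missing threshold or key before the daemon's own fail-fast validation does.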

## Usage

Running premortem will:

1. Validate the Anthropic API key with a test query (fail-fast if invalid)
2. Create the archive directory at `~/.premortem-logs` if it doesn't exist
3. Validate the archive directory is writable (fail-fast if not)
4. Start monitoring system metrics
5. When a threshold is breached, spawn a Claude agent with system context
6. Stream all agent output to your webhook endpoint
7. Save complete session transcripts to `~/.premortem-logs/agent-{sessionId}.jsonl`
8. Reset after the agent completes, ready to trigger again
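
Steps 2, 3, and 7 amount to a small amount of filesystem work. The sketch below shows one way it could look using Node's built-in `fs` module. The function names are hypothetical and the project may implement this differently; only the directory location and the `agent-{sessionId}.jsonl` naming come from the steps above.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Archive location documented above.
const defaultArchiveDir = path.join(os.homedir(), ".premortem-logs");

// Hypothetical helper: create the directory if missing, then fail fast
// (by throwing) when it is not writable.
function ensureWritableDir(dir: string): void {
  fs.mkdirSync(dir, { recursive: true });
  fs.accessSync(dir, fs.constants.W_OK);
}

// Builds the transcript path using the documented file-name pattern.
function transcriptPath(dir: string, sessionId: string): string {
  return path.join(dir, `agent-${sessionId}.jsonl`);
}

// Appends one message as a single JSON line (the JSONL convention).
function appendMessage(dir: string, sessionId: string, message: unknown): void {
  fs.appendFileSync(transcriptPath(dir, sessionId), JSON.stringify(message) + "\n");
}
```

Calling `ensureWritableDir(defaultArchiveDir)` at startup and `appendMessage(...)` once per agent message would produce one transcript file per session.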

Stop the daemon with `Ctrl+C`.

## Webhook Integration

Premortem streams diagnostic data to any HTTP endpoint that accepts POST requests. This allows integration with existing monitoring infrastructure, logging systems, or custom backends.

### Webhook Endpoint Requirements

The configured webhook endpoint must:
- Accept POST requests with raw Claude SDK message payloads
- Handle messages grouped by `session_id` field
- Be highly available (premortem uses fire-and-forget delivery with no retry logic)
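
To make the fire-and-forget requirement concrete, here is a minimal sketch of what such delivery could look like on the sending side. It is an illustration, not the project's code: `sendFireAndForget` and `PostFn` are hypothetical names, and the default sender assumes the global `fetch` available in Node 20.

```typescript
// Hypothetical sender signature, injectable so the behavior can be tested offline.
type PostFn = (url: string, body: string) => Promise<unknown>;

// Default sender: a plain JSON POST using the global fetch (Node >= 20).
const httpPost: PostFn = (url, body) =>
  fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body,
  });

// Sends one message without awaiting the result. Failures are swallowed and
// never retried, which is why the receiving endpoint has to be reliable.
function sendFireAndForget(url: string, message: unknown, post: PostFn = httpPost): void {
  post(url, JSON.stringify(message)).catch(() => {
    // Intentionally ignored: no retry, no queueing.
  });
}
```

A message that fails to arrive is simply lost on the wire, so an endpoint that is briefly down will have gaps in its copy of the transcript.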

### Message Format

Messages are sent as raw Claude SDK output, one message per POST:

```json
{
  "type": "assistant",
  "session_id": "session-abc123",
  "message": {
    "role": "assistant",
    "content": "Analyzing system metrics..."
  }
}
```

The `session_id` field groups messages into a single diagnostic transcript artifact on the backend.
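
On the receiving side, grouping could be as simple as the sketch below. It assumes only the `type` and `session_id` fields shown above; the `SdkMessage` interface and the `groupBySession` function are hypothetical names for illustration.

```typescript
// Minimal message shape: only `type` and `session_id` are relied on here.
interface SdkMessage {
  type: string;
  session_id: string;
  [key: string]: unknown;
}

// Groups messages into per-session transcripts, preserving arrival order.
function groupBySession(messages: SdkMessage[]): Map<string, SdkMessage[]> {
  const sessions = new Map<string, SdkMessage[]>();
  for (const message of messages) {
    const transcript = sessions.get(message.session_id) ?? [];
    transcript.push(message);
    sessions.set(message.session_id, transcript);
  }
  return sessions;
}
```

A backend would typically call something like this as POSTs arrive, or persist each message keyed by `session_id` and assemble the transcript later.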

## Architecture

```
Daemon (monitoring loop)
↓ (threshold breach detected)
Agent SDK (Claude diagnostics)
↓ (immediate streaming)
Webhook Endpoint (your server)
```

**Key Design Decisions:**

1. **First-breach-only**: When multiple thresholds breach, only the first (memory > disk > cpu) triggers
2. **Reset on completion**: Agent finish resets daemon state, allowing new breaches to trigger
3. **Fire-and-forget webhooks**: No retries - webhook endpoint must be reliable
4. **API key in config**: anthropicApiKey stored in config file, set to env before SDK calls
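
The first two decisions can be pictured as a small gate in front of the agent. The sketch below is an illustration under stated assumptions, not the daemon's actual code: all names are hypothetical, the memory percentage is assumed to be derived as `(total - available) / total`, and the sketch assumes no new agent starts while one is still running.

```typescript
type Metric = "memory" | "disk" | "cpu";

interface Usage { memoryPercent: number; diskPercent: number; cpuPercent: number }
interface Thresholds { memoryPercent?: number; diskPercent?: number; cpuPercent?: number }

// Assumed derivation: memory usage computed from "available" rather than "used"
// memory, so reclaimable buffer/cache does not count against the threshold.
function memoryPercentFromAvailable(totalBytes: number, availableBytes: number): number {
  return ((totalBytes - availableBytes) / totalBytes) * 100;
}

// Returns the first breached metric in priority order memory > disk > cpu.
function firstBreach(usage: Usage, thresholds: Thresholds): Metric | null {
  const checks: Array<[Metric, number, number | undefined]> = [
    ["memory", usage.memoryPercent, thresholds.memoryPercent],
    ["disk", usage.diskPercent, thresholds.diskPercent],
    ["cpu", usage.cpuPercent, thresholds.cpuPercent],
  ];
  for (const [metric, value, limit] of checks) {
    if (limit !== undefined && value > limit) return metric;
  }
  return null;
}

// Reports a breach only while no agent is running; agentFinished() re-arms it.
class BreachGate {
  private agentRunning = false;

  poll(usage: Usage, thresholds: Thresholds): Metric | null {
    if (this.agentRunning) return null;
    const breach = firstBreach(usage, thresholds);
    if (breach !== null) this.agentRunning = true;
    return breach;
  }

  agentFinished(): void {
    this.agentRunning = false;
  }
}
```

With this shape, each polling tick calls `poll(...)`, a non-null result spawns the agent, and the agent's completion handler calls `agentFinished()`.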

## Development

Run tests:

```bash
npm test
```

Watch mode:

```bash
npm run test:watch
```

Build:

```bash
npm run build
```

## Troubleshooting

**Daemon not starting:**

- Check that `anthropicApiKey` is valid in config
- Verify webhook URL is reachable
- Confirm the archive directory `~/.premortem-logs` can be created and is writable (the daemon fails fast if it is not)

**No agent triggering:**

- Check threshold values - may need to lower them for testing
- Review daemon logs for system metrics

**Webhook not receiving data:**

- Test webhook endpoint separately
- Check firewall/network settings
- Remember: no retries, so endpoint must be reliable

## License

MIT
