https://github.com/seatgeek/nomad-crashloop-detector
detect Nomad allocation crash-loops, by consuming the allocation stream from nomad-firehose
- Host: GitHub
- URL: https://github.com/seatgeek/nomad-crashloop-detector
- Owner: seatgeek
- License: bsd-3-clause
- Created: 2017-07-12T14:30:29.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2017-07-12T14:52:56.000Z (almost 9 years ago)
- Last Synced: 2024-06-20T03:34:53.351Z (about 2 years ago)
- Topics: devops, hashicorp, nomad, rabbitmq
- Language: Go
- Homepage:
- Size: 15.6 KB
- Stars: 5
- Watchers: 5
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# nomad-crashloop-detector
`nomad-crashloop-detector` is a tool meant to detect allocation crash-loops, by consuming the allocation stream from [nomad-firehose](https://github.com/seatgeek/nomad-firehose) in RabbitMQ.
## Running
Build artifacts for Linux, macOS (darwin) and Windows are available in the [GitHub releases tab](https://github.com/seatgeek/nomad-crashloop-detector/releases).
A Docker image is also provided at [seatgeek/nomad-crashloop-detector](https://hub.docker.com/r/seatgeek/nomad-crashloop-detector/tags/).
## Requirements
- Go 1.8
## Building
To build a binary, run the following:
```shell
# get this repo
go get github.com/seatgeek/nomad-crashloop-detector
# go to the repo directory
cd $GOPATH/src/github.com/seatgeek/nomad-crashloop-detector
# build the `nomad-crashloop-detector` binary
make build
```
This will create a `nomad-crashloop-detector` binary in your `$GOPATH/bin` directory.
## Configuration
Any `NOMAD_*` environment variable supported by the native `nomad` CLI tool is also supported by this tool.
- `$AMQP_CONNECTION` uses the same format as `$SINK_AMQP_CONNECTION`, but is the connection used to consume the stream from `nomad-firehose`
- `$AMQP_QUEUE` is the RabbitMQ queue to consume the `nomad-firehose` stream from
- `$RESTART_COUNT` is how many restarts to allow within `$RESTART_INTERVAL` (example: `5`)
- `$RESTART_INTERVAL` is the time frame within which `$RESTART_COUNT` allocation restarts must happen to trigger a notification (example: `5m`)
- `$NOTIFICATION_INTERVAL` is how often a notification should be sent for a crash-looping allocation (example: `5m`)
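The interaction between `$RESTART_COUNT` and `$RESTART_INTERVAL` can be pictured as a sliding window over restart timestamps. The following is a minimal sketch, not the project's actual implementation: `restartWindow`, `observe` and `crashLooping` are hypothetical names, and it assumes that reaching `$RESTART_COUNT` restarts inside the window is what triggers a notification, and that restarts arrive in chronological order.

```go
package main

import (
	"fmt"
	"time"
)

// restartWindow is a hypothetical helper (not taken from the project source).
// It remembers recent restart timestamps for a single allocation.
type restartWindow struct {
	count    int           // plays the role of $RESTART_COUNT
	interval time.Duration // plays the role of $RESTART_INTERVAL
	events   []time.Time
}

// observe records a restart at time t, drops restarts that fall outside the
// interval ending at t, and reports whether at least `count` restarts remain.
func (w *restartWindow) observe(t time.Time) bool {
	w.events = append(w.events, t)
	cutoff := t.Add(-w.interval)
	kept := w.events[:0]
	for _, e := range w.events {
		if e.After(cutoff) {
			kept = append(kept, e)
		}
	}
	w.events = kept
	return len(w.events) >= w.count
}

// crashLooping replays RFC 3339 restart timestamps through a window and
// reports whether the last restart reaches the threshold.
func crashLooping(count int, interval string, restarts []string) (bool, error) {
	d, err := time.ParseDuration(interval) // accepts values such as "5m"
	if err != nil {
		return false, err
	}
	w := &restartWindow{count: count, interval: d}
	looping := false
	for _, s := range restarts {
		t, err := time.Parse(time.RFC3339Nano, s)
		if err != nil {
			return false, err
		}
		looping = w.observe(t)
	}
	return looping, nil
}

func main() {
	// Two restarts about 31 seconds apart, with a threshold of 2 within 5m.
	looping, err := crashLooping(2, "5m", []string{
		"2017-07-12T15:56:15.401013209+02:00",
		"2017-07-12T15:56:46.677608921+02:00",
	})
	fmt.Println(looping, err)
}
```

A throttle for `$NOTIFICATION_INTERVAL` could be layered on top by remembering when the last notification for an allocation was sent; that part is left out of the sketch.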
## Sinks
The sink type is configured using the `$SINK_TYPE` environment variable. Valid values are: `stdout`, `kinesis` and `amqp`.
The `amqp` sink is configured using `$SINK_AMQP_CONNECTION` (`amqp://guest:guest@127.0.0.1:5672/`), `$SINK_AMQP_EXCHANGE` and `$SINK_AMQP_ROUTING_KEY` environment variables.
The `kinesis` sink is configured using `$SINK_KINESIS_STREAM_NAME` and `$SINK_KINESIS_PARTITION_KEY` environment variables.
The `stdout` sink does not have any configuration; it simply writes the JSON to stdout for debugging.
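To illustrate how a `$SINK_TYPE` value might map to a sink, here is a small sketch. The `Sink` interface, `stdoutSink` and `newSink` are hypothetical names invented for this example and may not match the project's code. Only the `stdout` sink is implemented, and falling back to `stdout` when `$SINK_TYPE` is unset is a choice made for the sketch, not documented behaviour.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// Sink is a hypothetical interface for this sketch.
type Sink interface {
	Put(payload interface{}) error
}

// stdoutSink writes each payload as one line of JSON to w.
type stdoutSink struct{ w io.Writer }

func (s stdoutSink) Put(payload interface{}) error {
	return json.NewEncoder(s.w).Encode(payload)
}

// newSink maps a $SINK_TYPE value to a Sink, rejecting unknown values.
func newSink(sinkType string) (Sink, error) {
	switch sinkType {
	case "stdout":
		return stdoutSink{w: os.Stdout}, nil
	case "amqp", "kinesis":
		return nil, fmt.Errorf("sink type %q is valid but not implemented in this sketch", sinkType)
	default:
		return nil, fmt.Errorf("invalid SINK_TYPE %q (valid values: stdout, kinesis, amqp)", sinkType)
	}
}

func main() {
	sinkType := os.Getenv("SINK_TYPE")
	if sinkType == "" {
		sinkType = "stdout" // default chosen for this sketch only
	}
	sink, err := newSink(sinkType)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := sink.Put(map[string]string{"JobID": "job"}); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```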
## Example
Assuming the following setup:
- `nomad` exchange (type=topic)
- `nomad.crash-loop-in` queue which is bound to `nomad` exchange with routing key `allocations`
- `nomad.crash-loop-out` queue which is bound to `nomad` exchange with routing key `crash-loop`
Running `nomad-firehose`:
```sh
SINK_TYPE=amqp \
SINK_AMQP_CONNECTION="amqp://guest:guest@127.0.0.1:5672/" \
SINK_AMQP_EXCHANGE=nomad \
SINK_AMQP_ROUTING_KEY=allocations \
nomad-firehose allocations
```
Running `nomad-crashloop-detector`:
```sh
RESTART_COUNT=2 \
RESTART_INTERVAL=5m \
NOTIFICATION_INTERVAL=5m \
SINK_TYPE=amqp \
SINK_AMQP_CONNECTION="amqp://guest:guest@127.0.0.1:5672/" \
SINK_AMQP_EXCHANGE=nomad \
SINK_AMQP_ROUTING_KEY=crash-loop \
AMQP_CONNECTION=$SINK_AMQP_CONNECTION \
AMQP_QUEUE=nomad.crash-loop-in \
nomad-crashloop-detector
```
With this setup, `nomad-firehose` sends all Nomad allocation changes to the `nomad` exchange, which forwards the messages to the `nomad.crash-loop-in` queue.
`nomad-crashloop-detector` consumes the messages in `nomad.crash-loop-in` and, when the restart threshold is reached, publishes an AMQP message to the `nomad` exchange with routing key `crash-loop`, which routes the message to `nomad.crash-loop-out`.
## Example crash-loop payload
```json
{
  "LastEvent": {
    "Name": "job.task[0]",
    "AllocationID": "fd4deb1f-405b-93a6-3eb4-a84e0670049d",
    "DesiredStatus": "run",
    "DesiredDescription": "",
    "ClientStatus": "running",
    "ClientDescription": "",
    "JobID": "job",
    "GroupName": "group",
    "TaskName": "task",
    "EvalID": "db0064ab-a44d-e450-4f66-2cabbec536bb",
    "TaskState": "pending",
    "TaskFailed": false,
    "TaskStartedAt": "2017-07-12T13:56:30.932498912Z",
    "TaskFinishedAt": "0001-01-01T00:00:00Z",
    "TaskEvent": {
      "Type": "Restarting",
      "Time": 1499867806677609000,
      "FailsTask": false,
      "RestartReason": "Restart within policy",
      "SetupError": "",
      "DriverError": "",
      "DriverMessage": "",
      "ExitCode": 0,
      "Signal": 0,
      "Message": "",
      "KillReason": "",
      "KillTimeout": 0,
      "KillError": "",
      "StartDelay": 17425840945,
      "DownloadError": "",
      "ValidationError": "",
      "DiskLimit": 0,
      "DiskSize": 0,
      "FailedSibling": "",
      "VaultError": "",
      "TaskSignalReason": "",
      "TaskSignal": ""
    }
  },
  "EventLog": [
    "2017-07-12T15:56:15.401013209+02:00",
    "2017-07-12T15:56:46.677608921+02:00"
  ]
}
```
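A consumer of the crash-loop queue could decode this payload along the following lines. This is a sketch under assumptions: `crashLoopPayload`, `decodePayload` and `summarize` are hypothetical names, the struct covers only a subset of the fields shown above, and `EventLog` is assumed to always hold RFC 3339 timestamps as in the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// crashLoopPayload mirrors a subset of the example payload (hypothetical type).
type crashLoopPayload struct {
	LastEvent struct {
		AllocationID string
		JobID        string
		GroupName    string
		TaskName     string
		TaskEvent    struct {
			Type          string
			RestartReason string
		}
	}
	EventLog []time.Time // time.Time decodes RFC 3339 strings from JSON
}

// decodePayload parses a crash-loop message body.
func decodePayload(data []byte) (*crashLoopPayload, error) {
	var p crashLoopPayload
	if err := json.Unmarshal(data, &p); err != nil {
		return nil, err
	}
	return &p, nil
}

// summarize returns a one-line description of a crash-loop payload.
func summarize(data string) (string, error) {
	p, err := decodePayload([]byte(data))
	if err != nil {
		return "", err
	}
	if len(p.EventLog) == 0 {
		return "", fmt.Errorf("payload has an empty EventLog")
	}
	// Time between the first and last recorded restart, rounded to seconds.
	span := p.EventLog[len(p.EventLog)-1].Sub(p.EventLog[0]).Round(time.Second)
	return fmt.Sprintf("%s/%s/%s: %d restarts in %s",
		p.LastEvent.JobID, p.LastEvent.GroupName, p.LastEvent.TaskName,
		len(p.EventLog), span), nil
}

func main() {
	body := `{
	  "LastEvent": {"JobID": "job", "GroupName": "group", "TaskName": "task"},
	  "EventLog": [
	    "2017-07-12T15:56:15.401013209+02:00",
	    "2017-07-12T15:56:46.677608921+02:00"
	  ]
	}`
	summary, err := summarize(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(summary)
}
```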