https://github.com/ccnmtl/hound
Alert off graphite metrics
- Host: GitHub
- URL: https://github.com/ccnmtl/hound
- Owner: ccnmtl
- Created: 2014-04-08T20:10:33.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2025-07-09T16:03:22.000Z (6 months ago)
- Last Synced: 2025-07-10T01:08:29.526Z (6 months ago)
- Language: Go
- Size: 861 KB
- Stars: 6
- Watchers: 5
- Forks: 2
- Open Issues: 8
Metadata Files:
- Readme: README.md
README
[Build Status](https://travis-ci.org/ccnmtl/hound)
[Coverage Status](https://coveralls.io/github/ccnmtl/hound?branch=master)
## Hound
This is a simple service that watches a number of Graphite metrics and
sends alert emails when they cross a threshold.
It automatically backs off on failing metrics. You'll get an email
when the metric first fails, another 5 minutes later, another 30
minutes after that, then 1 hour, 2 hours, 4 hours, and 8 hours after
each previous one, and every 24 hours thereafter. Finally, you will
get an email when the metric has recovered.
### Docker image
There is a Docker image for running Hound and Postfix. See , and
for more information.
### Without Docker
#### Dependencies
1. Obviously enough, Hound needs a running Graphite server, accessible over
the network.
2. In addition, an SMTP host (without authentication or encryption) is needed
to send the emails out.
### Configuration
There are a couple example configs in the `examples/` directory.
* `CheckInterval` is how many minutes to wait between checks.
* `GlobalThrottle` is the maximum number of alerts that Hound will send in a
cycle. For example, if there's a major network outage and all the metrics start
failing, you want the flood of alerts to stop once you've figured that out. Once
this threshold is passed, Hound sends just one more message saying how many
metrics are failing.
The rest of the values in this file should be self-explanatory.
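As a rough illustration, the two settings above might look something like this
in a config file (a minimal sketch: the key names come from this README, but
the values and the overall file layout are assumptions, so check the files in
`examples/` for the real format):

```json
{
    "CheckInterval": 5,
    "GlobalThrottle": 10
}
```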
The alert configuration is set in `config.json` by default; the config file
path is passed as an argument to `hound` in `run_nohup.sh`.
Each Alert has the following fields (an example entry follows the list):
* `Name`: obvious.
* `Type`: defaults to 'Alert'. You can also set it to 'Notice'. The
convention is that an Alert means "something is broken and a human
needs to fix it NOW" while a Notice is just "you should know about
this" and is either not directly actionable or not urgent. E.g.,
high CPU load is probably a Notice, while a high request error rate
would warrant an Alert.
* `Metric`: the actual Graphite metric being checked. This can be as
complicated as you like and use the full suite of Graphite
functions.
* `Threshold`: fairly obvious. Format it as a float. The comparison is
inclusive (">=" or "<="), i.e., the alert triggers as soon as the metric
reaches or crosses the threshold.
* `Direction`: "above" or "below". Specifies whether a failure is when
the metric crosses above or below the threshold, respectively.
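Putting those fields together, a single alert entry might look roughly like
this (a hedged sketch: the field names come from this README, but the values,
the metric expression, and whatever structure wraps the entries in
`config.json` are assumptions, so treat the configs in `examples/` as the
authoritative reference):

```json
{
    "Name": "web server 5xx errors",
    "Type": "Alert",
    "Metric": "sumSeries(stats.web.errors.5xx)",
    "Threshold": 10.0,
    "Direction": "above"
}
```

With `Direction` set to "above", this entry would trigger once the metric
reaches or exceeds 10.0.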