https://github.com/m-lab/rebot
- Host: GitHub
- URL: https://github.com/m-lab/rebot
- Owner: m-lab
- License: apache-2.0
- Created: 2016-06-27T20:33:04.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2023-08-22T23:10:35.000Z (over 1 year ago)
- Last Synced: 2024-10-29T14:44:56.069Z (about 2 months ago)
- Topics: machine, platform, reboot
- Language: Go
- Size: 179 KB
- Stars: 0
- Watchers: 11
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: history/history.go
- License: LICENSE.md
[![GoDoc](https://godoc.org/github.com/m-lab/rebot?status.svg)](https://godoc.org/github.com/m-lab/rebot) [![Build Status](https://travis-ci.org/m-lab/rebot.svg?branch=master)](https://travis-ci.org/m-lab/rebot) [![Coverage Status](https://coveralls.io/repos/github/m-lab/rebot/badge.svg?branch=master)](https://coveralls.io/github/m-lab/rebot?branch=master) [![Go Report Card](https://goreportcard.com/badge/github.com/m-lab/rebot)](https://goreportcard.com/report/github.com/m-lab/rebot)
ReBot
======
The rebot tool identifies machines on the M-Lab infrastructure that are not
reachable anymore and should be rebooted (according to various criteria) and
attempts to reboot them through iDRAC.

Criteria for reboot candidates
---

This is the list of criteria ReBot will check to determine if a machine needs
to be rebooted.

- machine is offline - port 806 down for the last 15m
- machine is not lame-ducked - lame_duck_node is not 1
- site and machine are not in GMX maintenance - gmx_machine_maintenance and gmx_site_maintenance are not 1
- switch is online - probe_success{instance=~"s1.*", module="icmp"} has not been 0 for the last 15m
- there are no NDT tests running - rate(inotify_extension_create_total{ext=".s2c_snaplog"}[15m]) is 0 or not present
- metrics are actually being collected for all probes (i.e. prometheus was up)
  - count_over_time(probe_success{service="ssh806", module="ssh_v4_online"}[15m]) >= 14

Additionally, ReBot checks the following:

- the machine has not been rebooted already in the last 24hrs
- no more than 5 machines should be rebooted together at any time
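
The two additional checks above can be sketched in Go as a small filter over the candidate list. This is an illustrative sketch only, not ReBot's actual code: the function `selectForReboot`, the `lastReboot` map, and the machine names are hypothetical. It also assumes one possible reading of the 5-machine limit, namely that no machine is rebooted when more than 5 remain eligible; truncating the list to 5 would be another reasonable reading.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// Limits taken from the README text; the constant names are hypothetical.
	minRebootInterval = 24 * time.Hour // a machine is rebooted at most once per 24hrs
	maxReboots        = 5              // no more than 5 machines rebooted together
)

// selectForReboot drops candidates that were rebooted within the last 24 hours.
// If more than maxReboots machines remain, it returns nil (an assumption of
// this sketch: a large batch is treated as a signal not to reboot anything).
func selectForReboot(candidates []string, lastReboot map[string]time.Time, now time.Time) []string {
	var eligible []string
	for _, machine := range candidates {
		// Skip machines with a recorded reboot that is too recent.
		if t, ok := lastReboot[machine]; ok && now.Sub(t) < minRebootInterval {
			continue
		}
		eligible = append(eligible, machine)
	}
	if len(eligible) > maxReboots {
		return nil
	}
	return eligible
}

// exampleSelection runs the filter on made-up data with a fixed clock.
func exampleSelection() []string {
	now := time.Date(2024, 1, 2, 12, 0, 0, 0, time.UTC)
	candidates := []string{"mlab1.abc01", "mlab2.abc01", "mlab3.abc01"}
	lastReboot := map[string]time.Time{
		"mlab2.abc01": now.Add(-2 * time.Hour),  // too recent, skipped
		"mlab3.abc01": now.Add(-30 * time.Hour), // older than 24h, eligible
	}
	return selectForReboot(candidates, lastReboot, now)
}

func main() {
	fmt.Println(exampleSelection()) // prints [mlab1.abc01 mlab3.abc01]
}
```

Passing the current time in as a parameter, rather than calling `time.Now()` inside the function, keeps the filter deterministic and easy to test.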