{"id":15699815,"url":"https://github.com/jacobbednarz/operability_guidelines","last_synced_at":"2026-01-08T16:06:15.205Z","repository":{"id":143283017,"uuid":"57083115","full_name":"jacobbednarz/operability_guidelines","owner":"jacobbednarz","description":":clipboard: Guidelines to define operability standards ","archived":false,"fork":false,"pushed_at":"2016-06-19T23:58:56.000Z","size":16,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-02-05T16:17:24.130Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jacobbednarz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-04-25T23:34:04.000Z","updated_at":"2020-11-06T13:45:55.000Z","dependencies_parsed_at":"2023-07-31T21:00:49.526Z","dependency_job_id":null,"html_url":"https://github.com/jacobbednarz/operability_guidelines","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jacobbednarz%2Foperability_guidelines","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jacobbednarz%2Foperability_guidelines/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jacobbednarz%2Foperability_guidelines/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jacobbednarz%2Foperability_guidelines/manifests","owner_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/owners/jacobbednarz","download_url":"https://codeload.github.com/jacobbednarz/operability_guidelines/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246330657,"owners_count":20760156,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-03T19:41:49.141Z","updated_at":"2026-01-08T16:06:15.174Z","avatar_url":"https://github.com/jacobbednarz.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"Defining an operability guideline is a way to ensure that all your production\nsystems meet the minimum requirements to be supported within teams. Without this\nguideline, different teams will have varying standards for what is required to\nmake a system \"production\" ready.\n\n## Tests\n\n- **Does the project have an automated test suite?**\n\n  Having an automated test suite (for example, one that CI runs on every push)\n  lets you make changes with confidence that they will not break the existing\n  application. With a decent amount of test coverage, you can quickly iterate\n  on changes and spend less time manually reviewing changes post deploy.\n\n## Logging\n\n- **Are the logs stored somewhere other than on the host itself?**\n\n  Logs should never be stored on a single host. Instead, you should mount (and\n  write to) an external volume or have the data streamed to a centralised\n  logging platform. 
This becomes very beneficial if a host should die\n  unexpectedly and you need to dig through the black box to determine the\n  cause.\n\n- **Are you using centralised logging?**\n\n  Having all the logs in a single system for searching or parsing makes\n  investigation and monitoring easy for all involved. If you need to jump\n  between multiple logging services to gather some data, it can add overhead to\n  the investigation.\n\n- **Do you have saved search queries for common issues?**\n\n  When starting from scratch this won't be in place; however, over time you\n  should be able to build dashboards or save reusable queries to pull out data\n  related to commonly encountered issues or data sets. This reduces the time it\n  takes to identify the problem and means you do not have to worry about small\n  syntax issues in your queries.\n\n## Monitoring\n\n- **Is there a place where metrics (such as requests per minute, response times,\n  etc) are aggregated?**\n\n  Being able to visualise changes in patterns is extremely helpful when tracking\n  down when an issue started or what the impact on a particular component will\n  be.\n\n- **Do you have visual thresholds in place for your collected metrics?**\n\n  Using visual markers in your monitoring systems allows someone with very\n  little context to know where the thresholds are for a component. This can be\n  in the form of a line or shaded area on the metric and can speed up incident\n  response times if someone can quickly identify that something has burst\n  through a defined threshold and needs investigation.\n\n## Alerting\n\n- **Do you have automated escalation policies in place for your on call\n  rotation?**\n\n  We are all human, and being on call doesn't mean you gain the super power of\n  staying awake for days on end. 
If you support an\n  application or service outside of business hours, your on call policy should\n  include an automated escalation after _n_ calls/SMS/emails go\n  unanswered. This ensures that even if a (potentially fatigued) human lets\n  an alert go unanswered, someone else will be able to respond.\n\n## Security\n\n- **Are you only exposing the bare minimum that is required to operate your\n  service?**\n\n  A key to securing your infrastructure is ensuring that your attack surface is\n  as small as possible. By doing this, you limit the number of potential attack\n  vectors available and it gives you the opportunity to focus on those exposed\n  surfaces.\n\n- **Is re-rolling your secrets automated?**\n\n  If your application uses secrets, it is a good idea to automate the process of\n  rolling these secrets so that you are able to do it at a moment's notice and\n  not have to be concerned about where else the secret may be used.\n\n## Version control\n\n- **Are your instances managed using configuration management?**\n\n  Configuration management tools allow you to describe how your instances\n  should look and function. They also give you the ability to quickly roll out a\n  series of changes to the whole fleet without needing to worry about\n  inconsistencies between them.\n\n- **Is your infrastructure in code?**\n\n  While configuration management looks after the individual instances,\n  Infrastructure As Code also extends to the orchestration of how all the moving\n  parts fit together. 
Taking this approach, you define things such as (but not\n  limited to):\n\n    - Instance size\n    - Which load balancer it is allowed to talk to\n    - How many of these instances you intend to create\n    - Which instances are networked together\n\n  Having your infrastructure in code allows you to quickly rebuild and get back\n  to the golden source in the event anything goes wrong.\n\n## Automation\n\n- **Are tasks automated where possible?**\n\n  Whenever a task requires a human to intervene, there is a possibility that\n  there will also be human error. To minimise the risk of human error, you can\n  write automated scripts that a person can trigger or that are triggered\n  automatically by system events\n  ([ChatOps](https://speakerdeck.com/jnewland/chatops-at-github) is a great\n  example of this).\n\n## Resiliency\n\n- **Have you created a [resiliency\n  matrix](https://speakerdeck.com/sirupsen/dockercon-2015-resilient-routing-and-discovery)\n  and addressed single points of failure?**\n\n  A resiliency matrix will allow you to identify parts of your infrastructure\n  that are too tightly coupled and work on ways to eliminate SPOFs and degrade\n  gracefully during an outage of a single component.\n\n- **Do you have multiple (load balanced) instances in operation?**\n\n  Regardless of whether it is a user facing service or a database\n  server, it is important to introduce at least 2 (but commonly 3) instances so\n  in the event of an isolated failure, you will still have capacity to continue\n  on. 
It's also important to ensure these instances are load balanced and health\n  checks are being performed so should any of the instances return unhealthy,\n  they can be ejected from the pool.\n\n- **Do you perform canary testing or staged rollouts for your deployments?**\n\n  Whether you are changing a part of the application or your infrastructure, you\n  should be performing small and incremental rollouts to ensure your changes do\n  not introduce regressions. Many teams will also automate this step and perform\n  comparative analysis against the stable rollout so that if a particular\n  metric (such as error rate) varies too much, the change will be rolled back to\n  a known good state.\n\n- **Do you regularly firedrill your disaster recovery or failover processes?**\n\n  Having a disaster recovery process in place is a good start; however, if you do\n  not regularly test that it still works flawlessly as your application changes,\n  there is a chance that when you actually need it, the process will not work or\n  will partially fail, leaving you in limbo. Firedrilling processes often and\n  randomly will help you combat any snowflakes or resiliency holes in your system.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjacobbednarz%2Foperability_guidelines","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjacobbednarz%2Foperability_guidelines","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjacobbednarz%2Foperability_guidelines/lists"}