{"id":13838399,"url":"https://github.com/Shopify/semian","last_synced_at":"2025-07-10T21:32:43.955Z","repository":{"id":21075943,"uuid":"24375548","full_name":"Shopify/semian","owner":"Shopify","description":":monkey: Resiliency toolkit for Ruby for failing fast","archived":false,"fork":false,"pushed_at":"2024-11-05T15:42:43.000Z","size":1362,"stargazers_count":1352,"open_issues_count":48,"forks_count":80,"subscribers_count":488,"default_branch":"main","last_synced_at":"2024-11-19T05:21:15.227Z","etag":null,"topics":["bulkheads","circuit-breaker","resiliency","ruby","webscale"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Shopify.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2014-09-23T14:47:19.000Z","updated_at":"2024-11-16T01:45:29.000Z","dependencies_parsed_at":"2023-01-13T21:17:45.502Z","dependency_job_id":"dc0b8e89-c35a-4592-941f-190346cd4248","html_url":"https://github.com/Shopify/semian","commit_stats":null,"previous_names":[],"tags_count":78,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Shopify%2Fsemian","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Shopify%2Fsemian/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Shopify%2Fsemian/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Shopify%2Fsemian/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Shopify","down
load_url":"https://codeload.github.com/Shopify/semian/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225657445,"owners_count":17503556,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bulkheads","circuit-breaker","resiliency","ruby","webscale"],"created_at":"2024-08-04T15:01:55.287Z","updated_at":"2025-07-10T21:32:43.948Z","avatar_url":"https://github.com/Shopify.png","language":"Ruby","readme":"## Semian ![Build Status](https://github.com/Shopify/semian/actions/workflows/test.yml/badge.svg)\n\n![](http://i.imgur.com/7Vn2ibF.png)\n\nSemian is a library for controlling access to slow or unresponsive external\nservices to avoid cascading failures.\n\nWhen services are down they typically fail fast with errors like `ECONNREFUSED`\nand `ECONNRESET` which can be rescued in code. However, slow resources fail\nslowly. The thread serving the request blocks until it hits the timeout for the\nslow resource. During that time, the thread is doing nothing useful and thus the\nslow resource has caused a cascading failure by occupying workers and therefore\nlosing capacity. **Semian is a library for failing fast in these situations,\nallowing you to handle errors gracefully.** Semian does this by intercepting\nresource access through heuristic patterns inspired by [Hystrix][hystrix] and\n[Release It][release-it]:\n\n- [**Circuit breaker**](#circuit-breaker). A pattern for limiting the\n  amount of requests to a dependency that is having issues.\n- [**Bulkheading**](#bulkheading). 
Controlling the concurrent access to\n  a single resource, access is coordinated server-wide with [SysV\n  semaphores][sysv].\n\nResource drivers are monkey-patched to be aware of Semian, these are called\n[Semian Adapters](#adapters). Thus, every time resource access is requested\nSemian is queried for status on the resource first. If Semian, through the\npatterns above, deems the resource to be unavailable it will raise an exception.\n**The ultimate outcome of Semian is always an exception that can then be rescued\nfor a graceful fallback**. Instead of waiting for the timeout, Semian raises\nstraight away.\n\nIf you are already rescuing exceptions for failing resources and timeouts,\nSemian is mostly a drop-in library with a little configuration that will make\nyour code more resilient to slow resource access. But, [do you even need\nSemian?](#do-i-need-semian)\n\nFor an overview of building resilient Ruby applications, start by reading [the\nShopify blog post on Toxiproxy and Semian][resiliency-blog-post]. For more in\ndepth information on Semian, see [Understanding Semian](#understanding-semian).\nSemian is an extraction from [Shopify][shopify] where it's been running\nsuccessfully in production since October, 2014.\n\nThe other component to your Ruby resiliency kit is [Toxiproxy][toxiproxy] to\nwrite automated resiliency tests.\n\n# Usage\n\nInstall by adding the gem to your `Gemfile` and require the [adapters](#adapters) you need:\n\n```ruby\ngem 'semian', require: %w(semian semian/mysql2 semian/redis)\n```\n\nWe recommend this pattern of requiring adapters directly from the `Gemfile`.\nThis ensures Semian adapters are loaded as early as possible and also\nprotects your application during boot. Please see the [adapter configuration\nsection](#configuration) on how to configure adapters.\n\n## Adapters\n\nSemian works by intercepting resource access. 
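Since the exception Semian raises inherits from the driver's base exception class (detailed below), rescuing the base class yields the graceful fallback. A minimal self-contained sketch of that pattern, using stand-in error classes rather than the real gems:

```ruby
# Stand-in classes for illustration only: in a real app these would be
# e.g. Redis::BaseConnectionError and Semian's CircuitOpenError, which
# inherits from it.
class DriverError < StandardError; end
class CircuitOpenError < DriverError; end

def fetch_session
  # Imagine a driver call here; Semian may raise CircuitOpenError
  # instantly instead of letting the call wait on a timeout.
  raise CircuitOpenError, "circuit open for redis_sessions"
rescue DriverError
  nil # graceful fallback: treat the visitor as signed out
end

fetch_session # => nil
```

One `rescue` clause covers both a genuinely failing driver and Semian failing fast on its behalf.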
Every time access is requested, Semian is queried, and it raises an exception
if the resource is unavailable according to the circuit breaker or bulkheads.
This is done by monkey-patching the resource driver. **The exception raised by
the adapter always inherits from the base exception class of the driver**,
meaning you can always simply rescue the base class and catch both Semian and
driver errors in the same `rescue` for fallbacks.

The following adapters ship with Semian and are tested heavily in production;
the version listed is that of the public gem with the same name:

- [`semian/mysql2`][mysql-semian-adapter] (~> 0.3.16)
- [`semian/redis`][redis-semian-adapter] (~> 3.2.1)
- [`semian/net_http`][nethttp-semian-adapter]
- [`semian/activerecord_trilogy_adapter`][activerecord-trilogy-semian-adapter]
- [`semian-postgres`][postgres-semian-adapter]

### Creating Adapters

To create a Semian adapter you must implement the following:

1. [`include Semian::Adapter`][semian-adapter]. Use the helpers to wrap the
   resource. This takes care of situations such as monitoring, nested
   resources, unsupported platforms, creating the Semian resource if it doesn't
   already exist, and so on.
2. `#semian_identifier`. Returns a symbol that uniquely identifies the
   resource, for example `:redis_master` or `:mysql_shard_1`. This is usually
   assembled from a `name` attribute on the Semian configuration hash, but
   could also be `<host>:<port>`.
3. `connect`. The name of this method varies by driver. You must override the
   driver's connect method with one that wraps the connect call in
   `Semian::Resource#acquire`. You should do this at the lowest possible level.
4. `query`. Same as `connect`, but for queries on the resource.
5. Define the exceptions `ResourceBusyError` and `CircuitOpenError`. These are
   raised when the request is rejected early because the resource is out of
   tickets or because the circuit breaker is open (see [Understanding
   Semian](#understanding-semian)). They should inherit from the base exception
   class of the raw driver, for example `Mysql2::Error` or
   `Redis::BaseConnectionError` for the MySQL and Redis drivers. This makes it
   easy to `rescue` and handle them gracefully in application code by rescuing
   the base class.

The best resource for this is the [already implemented adapters](#adapters).

### Configuration

There are some global configuration options that can be set for Semian:

```ruby
# Maximum size of the LRU cache (default: 500)
# Note: Setting this to 0 enables aggressive garbage collection.
Semian.maximum_lru_size = 0

# Minimum time in seconds a resource should be resident in the LRU cache (default: 300s)
Semian.minimum_lru_time = 60
```

Note: `minimum_lru_time` is a stronger guarantee than `maximum_lru_size`. That
is, if a resource has been updated more recently than `minimum_lru_time`, it
will not be garbage collected, even if that causes the LRU cache to grow larger
than `maximum_lru_size`.

When instantiating a resource, it needs to be configured for Semian. This is
done by passing `semian` as an argument when initializing the client. Examples
for the built-in adapters:

```ruby
# Mysql2 client
# In Rails this means having a semian key in database.yml for each database.
client = Mysql2::Client.new(host: "localhost", username: "root", semian: {
  name: "master",
  tickets: 8, # See the Understanding Semian section on picking these values
  success_threshold: 2,
  error_threshold: 3,
  error_timeout: 10
})

# Redis client
client = Redis.new(semian: {
  name: "inventory",
  tickets: 4,
  success_threshold: 2,
  error_threshold: 4,
  error_timeout: 20
})
```

#### Thread Safety

Semian's circuit breaker implementation is thread-safe by default as of
`v0.7.0`. If you'd like to disable thread safety for performance reasons, pass
`thread_safety_disabled: true` in the resource options.

Bulkheads should be disabled (pass `bulkhead: false`) in a threaded environment
(e.g. Puma or Sidekiq), but can safely be enabled in non-threaded environments
(e.g. Resque and Unicorn). As described in this document, circuit breakers
alone should be adequate in most environments with reasonably low timeouts.

Internally, Semian uses `SEM_UNDO` for several SysV semaphore operations:

- Acquire
- Worker registration
- Semaphore metadata state lock

The intention behind `SEM_UNDO` is that a semaphore operation is automatically
undone when the process exits. This is true even if the process exits
abnormally (crashes, receives a `SIGKILL`, etc.) because it is handled by the
operating system and not the process itself.

If, however, a thread performs a semop, the `SEM_UNDO` is attached to its
parent process. This means that the operation _will not_ be undone when the
thread exits. This can result in the following unfavorable behavior when using
threads:

- Threads acquire a resource but are killed, and the resource ticket is never
  released. For a process, the ticket would be released by `SEM_UNDO`, but
  since it's a thread there is the potential for ticket starvation. This can
  result in deadlock on the resource.
- Threads register workers on a resource but are killed and never unregistered.
  For a process, the worker count would be automatically decremented by
  `SEM_UNDO`, but for threads the worker count will continue to increment, only
  being undone when the parent process dies. This can cause the number of
  tickets to dramatically exceed the quota.
- If a thread acquires the semaphore metadata lock and dies before releasing
  it, Semian will deadlock on anything attempting to acquire the metadata lock
  until the thread's parent process exits. This can prevent the ticket count
  from being updated.

Moreover, a strategy that utilizes `SEM_UNDO` is not compatible with a strategy
that adjusts the semaphore's tickets manually. In order to support threads,
operations that currently use `SEM_UNDO` would need to use no semaphore flag,
and the calling process would be responsible for ensuring that threads are
appropriately cleaned up. It is still possible to implement this, but it would
likely require an in-memory semaphore managed by the parent process of the
threads. PRs welcome for this functionality.

#### Quotas

You may also set quotas per worker:

```ruby
client = Redis.new(semian: {
  name: "inventory",
  quota: 0.51,
  success_threshold: 2,
  error_threshold: 4,
  error_timeout: 20
})
```

With the above example you no longer need to care about the number of tickets;
the ticket count is computed as a proportion of the number of active workers.

In this case, we'd allow 51% of the workers on a particular host to connect to
this Redis resource. As long as the workers are in their own processes, they
will be registered automatically.
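The arithmetic implied here can be sketched as follows (a reading of the documented behavior, not Semian's internal code): the ticket count is the ceiling of the quota times the number of registered workers, so there is always at least one ticket.

```ruby
# Hypothetical helper mirroring the documented quota behavior:
# tickets = ceil(quota * registered_workers), minimum 1.
def tickets_for(quota, workers)
  (quota * workers).ceil
end

tickets_for(0.51, 1)  # => 1
tickets_for(0.51, 10) # => 6
tickets_for(0.51, 40) # => 21
```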
The quota sets the bulkhead threshold based on the number of registered
workers, whenever a new worker registers.

This is ideal for environments with non-uniform worker distribution, and it
eliminates the need to manually calculate and adjust ticket counts.

**Note**:

- You must pass **exactly** one of the options `tickets` or `quota`.
- The number of available tickets is the ceiling of the quota ratio times the
  number of workers.
- So, with one worker, there will always be a minimum of 1 ticket.
- Workers in different processes will automatically unregister when the process
  exits.
- If you have a small number of workers (say, exactly 2), the bulkhead may be
  too sensitive when using quotas.
- If you use a forking web server (like Unicorn) you should call
  `Semian.unregister_all_resources` before/after forking.

#### Net::HTTP

For the `Net::HTTP` Semian adapter, since many external libraries may create
HTTP connections on the user's behalf, the parameters are instead provided by
associating a callback with `Semian::NetHTTP`, perhaps in an initialization
file.

##### Naming and Options

To give Semian parameters, assign a `proc` to
`Semian::NetHTTP.semian_configuration`. It takes two parameters, `host` and
`port` (like `127.0.0.1`, `443` or `github_com`, `80`), and returns a `Hash`
with configuration parameters as follows. The `proc` is used as a callback to
initialize the configuration options, similar to other adapters.

```ruby
SEMIAN_PARAMETERS = { tickets: 1,
                      success_threshold: 1,
                      error_threshold: 3,
                      error_timeout: 10 }
Semian::NetHTTP.semian_configuration = proc do |host, port|
  # Let's make it only active for github.com
  if host == "github.com" && port.to_i == 80
    SEMIAN_PARAMETERS.merge(name: "github.com_80")
  else
    nil
  end
end

# Called from within the API:
# semian_options = Semian::NetHTTP.semian_configuration("github.com", 80)
# semian_identifier = "nethttp_#{semian_options[:name]}"
```

The `name` should be chosen carefully since it identifies the resource being
protected. The `semian_options` passed apply to that resource. Semian creates
the `semian_identifier` from the `name` to look up and store changes in the
circuit breaker and bulkhead states and to associate successes, failures, and
errors with the protected resource.

We require that `semian_configuration` be **set only once** over the lifetime
of the library.

If you need to return different values for the same `host`/`port` pair, you
**must** include the `dynamic: true` option. Returning different values for
the same `host`/`port` values without setting the `dynamic` option can lead to
undesirable behavior.

A common example of dynamic options is the use of a thread-local variable, such
as `ActiveSupport::CurrentAttributes`, for requests to a service acting as a
proxy.

```ruby
SEMIAN_PARAMETERS = {
  # ...
  dynamic: true,
}

class CurrentSemianSubResource < ActiveSupport::CurrentAttributes
  attribute :sub_name
end

Semian::NetHTTP.semian_configuration = proc do |host, port|
  name = "#{host}_#{port}"
  if (sub_resource_name = CurrentSemianSubResource.sub_name)
    name << "_#{sub_resource_name}"
  end
  SEMIAN_PARAMETERS.merge(name: name)
end

# Two requests to example.com can use two different Semian resources,
# as long as `CurrentSemianSubResource.sub_name` is set accordingly:
# CurrentSemianSubResource.set(sub_name: "sub_resource_1") { Net::HTTP.get_response(URI("http://example.com")) }
# and:
# CurrentSemianSubResource.set(sub_name: "sub_resource_2") { Net::HTTP.get_response(URI("http://example.com")) }
```

For most purposes, `"#{host}_#{port}"` is a good default `name`. Custom `name`
formats can be useful for grouping related subdomains as one resource, so that
they all contribute to the same circuit breaker and bulkhead state and fail
together.

A return value of `nil` from `semian_configuration` means Semian is disabled
for that HTTP endpoint.
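For example, a `Hash` keyed by host can serve as the whitelist (a hypothetical configuration; in a real initializer the `proc` would be assigned to `Semian::NetHTTP.semian_configuration`, shown here as a plain `proc` so the sketch stands alone):

```ruby
# Hosts absent from the Hash yield nil, i.e. Semian stays disabled
# for those endpoints.
SEMIAN_BY_HOST = {
  "github.com" => { name: "github_com_80", tickets: 1,
                    success_threshold: 1, error_threshold: 3,
                    error_timeout: 10 },
}

semian_configuration = proc { |host, _port| SEMIAN_BY_HOST[host] }

semian_configuration.call("github.com", 80)  # => the parameter Hash above
semian_configuration.call("example.com", 80) # => nil (Semian disabled)
```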
This works well since the result of a failed `Hash` lookup is also `nil`. This
behavior lets the adapter default to whitelisting, although it can be changed
to blacklisting, or Semian can be disabled entirely, by varying when the
assigned closure returns `nil`.

##### Additional Exceptions

Since this adapter can be used in combination with many external libraries
that raise additional exceptions, Semian provides functionality to expand the
exceptions tracked as part of its circuit breaker. This may be necessary for
libraries that introduce new exceptions or re-raise them. Add exceptions, and
reset to the [`default`][nethttp-default-errors] list, using the following:

```ruby
# assert_equal(Semian::NetHTTP.exceptions, Semian::NetHTTP::DEFAULT_ERRORS)
Semian::NetHTTP.exceptions += [::OpenSSL::SSL::SSLError]

Semian::NetHTTP.reset_exceptions
# assert_equal(Semian::NetHTTP.exceptions, Semian::NetHTTP::DEFAULT_ERRORS)
```

##### Mark Unsuccessful Responses as Failures

Unsuccessful responses (e.g. 5xx responses) do not raise exceptions, and as
such are not marked as failures by default. Set the
`open_circuit_server_errors` Semian configuration parameter to mark
unsuccessful responses as failures, as seen below:

```ruby
SEMIAN_PARAMETERS = { tickets: 1,
                      success_threshold: 1,
                      error_threshold: 3,
                      error_timeout: 10,
                      open_circuit_server_errors: true }
```

#### Active Record

Semian supports the Active Record `trilogy` adapter. It can be configured in
`database.yml`:

```yml
semian: &semian
  success_threshold: 2
  error_threshold: 3
  error_timeout: 4
  half_open_resource_timeout: 1
  bulkhead: false # Disable bulkhead for Puma: https://github.com/shopify/semian#thread-safety
  name: semian_identifier_name

default: &default
  adapter: trilogy
  username: root
  password:
  host: localhost
  read_timeout: 2
  write_timeout: 1
  connect_timeout: 1
  semian:
    <<: *semian
```

Example cases for `activerecord-trilogy-adapter` can be run using
`BUNDLE_GEMFILE=gemfiles/activerecord_trilogy_adapter.gemfile bundle exec rake examples:activerecord_trilogy_adapter`.

# Understanding Semian

Semian is a library with heuristics for failing fast. This section explains in
depth how Semian works and which situations it's applicable for. First we
explain the category of problems Semian is meant to solve, then we dive into
how Semian works to solve them.

## Do I need Semian?

Semian is not a trivial library to understand; it introduces complexity and
thus should be introduced with care. Remember, all Semian does is raise
exceptions based on heuristics. It is paramount that you understand Semian
before including it in production, as you may otherwise be surprised by its
behaviour.

Applications that benefit from Semian are those working on eliminating SPOFs
(Single Points of Failure), and specifically those running into a wall with
slow resources. But it is by no means a magic wand that solves all your latency
problems just by being added to your `Gemfile`. This section describes the
types of problems Semian solves.

If your application is multithreaded or evented (i.e. not Resque and Unicorn),
these problems are not as pressing. You can still get use out of Semian,
however.

### Real World Example

This is better illustrated with a real-world example from Shopify. When you are
browsing a store while signed in, Shopify stores your session in Redis. If
Redis becomes unavailable, the driver starts throwing exceptions. We rescue
these exceptions and simply disable all customer sign-in functionality on the
store until Redis is back online.

This is great if querying the resource fails instantly, because it means we
fail in just a single roundtrip of ~1ms. But if the resource is unresponsive or
slow, this can take as long as our timeout, which is easily 200ms. This means
every request, even if it does rescue the exception, now takes an extra 200ms.
Because every request takes that long, our capacity is also significantly
degraded. These problems are explained in depth in the next two sections.

With Semian, the slow resource would fail instantly (after a small amount of
convergence time), preventing your response time from spiking and preserving
the capacity of the cluster.

If this sounds familiar to you, Semian is what you need to be resilient to
latency. You may not need the graceful fallback depending on your application,
in which case Semian will just produce an error (e.g. an `HTTP 500`) faster.

We will now examine the two problems in detail.

#### In-depth analysis of real world example

If a single resource is slow, every single request is going to suffer. We saw
this in the example before. Let's illustrate this more clearly in the following
Rails example, where the user session is stored in Redis:

```ruby
def index
  @user = fetch_user
  @posts = Post.all
end

private

def fetch_user
  User.find(session[:user_id])
rescue Redis::CannotConnectError
  nil
end
```

Our code is resilient to a failure of the session layer; it doesn't `HTTP 500`
if the session store is unavailable (this can be tested with
[Toxiproxy][toxiproxy]). If the `User` and `Post` data store is unavailable,
the server will send back `HTTP 500`. We accept that, because it's our primary
data store. This could be prevented with a caching tier or something else out
of scope.

This code has two flaws, however:

1. **What happens if the session storage is consistently slow?** I.e. the
   majority of requests take, say, more than half the timeout time (when they
   should only take ~1ms)?
2. **What happens if the session storage is unavailable and is not responding
   at all?** I.e. we hit timeouts on every request.

These two problems in turn have two related problems associated with them:
response time and capacity.

#### Response time

Requests that attempt to access a down session storage are all gracefully
handled: `@user` will simply be `nil`, which the code handles. There is still a
major impact on users, however, as every request to the storage has to time
out. This causes the average response time of every page that accesses it to
go up by however long your timeout is. The increase is proportional to your
worst-case timeout and the number of attempts to hit the resource on each page.
This is the problem Semian solves: by using heuristics to fail these requests
early, it provides a much better user experience during downtime.

#### Capacity loss

When your single-threaded worker is waiting for a resource to return, it's
effectively doing nothing when it could be serving fast requests. To use the
example from before, perhaps some actions do not access the session storage at
all. These requests pile up behind the now-slow requests that are trying to
access that layer, because those are failing slowly. Essentially, your capacity
degrades significantly because your average response time goes up (as explained
in the previous section). Capacity loss simply follows from an increase in
response time. The higher your timeout and the slower your resource, the more
capacity you lose.

#### Timeouts aren't enough

It should be clear by now that timeouts aren't enough. Consistent timeouts
increase the average response time, which causes a bad user experience and
ultimately compromises the performance of the entire system. Even if the
timeout is as low as ~250ms (just enough to allow a single TCP retransmit),
there's a large loss of capacity and, for many applications, a 100-300%
increase in average response time. This is the problem Semian solves by
failing fast.

## How does Semian work?

Semian consists of two parts: **Circuit Breaker** and **Bulkheading**. To
understand Semian, and especially how to configure it, we must understand
these patterns and their implementation.

Semian can be disabled entirely via the environment variable
`SEMIAN_DISABLED=1`.

### Circuit Breaker

The circuit breaker pattern is based on a simple observation: if we hit a
timeout or any other error for a given service one or more times, we're likely
to hit it again for some amount of time. Instead of hitting the timeout
repeatedly, we can mark the resource as dead for some amount of time, during
which we raise an exception instantly on any call to it. This is called the
[circuit breaker pattern][cbp].

![](http://cdn.shopify.com/s/files/1/0070/7032/files/image01_grande.png)

When we perform a Remote Procedure Call (RPC), it will first check the
circuit. If the circuit is rejecting requests because of too many failures
reported by the driver, it will throw an exception immediately. Otherwise the
circuit will call the driver. If the driver fails to get data back from the
data store, it will notify the circuit. The circuit counts the errors so that
if too many have happened recently, it starts rejecting requests immediately
instead of waiting for the driver to time out. The exception is then raised
back to the original caller. If the driver's request was successful, it
returns the data back to the calling method and notifies the circuit that it
made a successful call.

The state of the circuit breaker is local to the worker and is not shared
across all workers on a server.

#### Circuit Breaker Configuration

The following configuration parameters are available for circuit breakers in
Semian:

- **circuit_breaker**. Enable or disable the circuit breaker. Defaults to
  `true` if not set.
- **error_threshold**. The number of errors a worker must encounter within
  `error_threshold_timeout` before opening the circuit, that is, before it
  starts rejecting requests instantly.
- **error_threshold_timeout**. The window of time in seconds during which
  `error_threshold` errors must occur to open the circuit. Defaults to
  `error_timeout` seconds if not set.
- **error_timeout**. The amount of time in seconds until the resource is
  queried again.
- **error_threshold_timeout_enabled**. If set to `false`, disables the time
  window for evicting old exceptions. `error_timeout` is still used and will
  reset the circuit. Defaults to `true` if not set.
- **success_threshold**. The number of successes on the circuit until it
  closes again, that is, until it starts accepting all requests to the
  circuit.
- **half_open_resource_timeout**. Timeout for the resource in seconds when the
  circuit is half-open (supported for MySQL, Net::HTTP and Redis).
- **lumping_interval**. If provided, errors within this timeframe (in seconds)
  are lumped together and recorded as one.

It is possible to disable the circuit breaker with the environment variable
`SEMIAN_CIRCUIT_BREAKER_DISABLED=1`.

For more information about configuring these parameters, please read
[this post](https://shopify.engineering/circuit-breaker-misconfigured).

### Bulkheading

For some applications, circuit breakers are not enough. This is best
illustrated with an example. Imagine the timeout for our data store isn't as
low as 200ms, but is actually 10 seconds. For example, you might have a
relational data store where, for some customers, 10s queries are
(unfortunately) legitimate. Reducing the time of worst-case queries requires a
lot of effort, and dropping the query immediately could leave some customers
unable to access certain functionality. High timeouts are especially critical
in a non-threaded environment, where blocking IO means a worker is effectively
doing nothing.

In this case, circuit breakers aren't sufficient. Assuming the circuit is
shared across all processes on a server, it will still take at least 10s
before the circuit opens. In that time every worker is blocked (see also the
"Defense line" section for an in-depth explanation of the co-operation between
circuit breakers and bulkheads), meaning we're at reduced capacity for at
least 20s, with the last 10s of timeouts occurring just before the circuit
opens. We thought of a number of potential solutions to this problem: stricter
timeouts, grouping timeouts by section of our application, timeouts per
statement. But they all still revolved around timeouts, and those are
extremely hard to get right.

Instead of thinking about timeouts, we took inspiration from Hystrix by
Netflix and the book Release It (the resiliency bible), and looked at our
services as connection pools. On a server with `W` workers, only a certain
number of them are expected to be talking to a single data store at once.
Let's say we've determined from our monitoring that there's a 10% chance a
given worker is talking to `mysql_shard_0` at any point in time under normal
traffic. The probability that five workers are talking to it at the same time
is 0.001%. If we only allow five workers to talk to a resource at any given
point in time, and accept the 0.001% false positive rate, we can fail the
sixth worker attempting to check out a connection instantly. This means that
while the five workers are waiting for a timeout, all the other `W - 5`
workers on the node instantly fail to check out the connection and open their
circuits. Our capacity is only degraded by a relatively small amount.

We call this limiting primitive "tickets". In this case, the resource access
is limited to 5 tickets (see Configuration).
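The 0.001% figure can be checked directly: it is the chance that five given workers are all talking to the resource at once, each independently with probability 0.1 (a simplified model; real worker activity is correlated):

```ruby
# Simplified independence model: each of five given workers has a
# 10% chance of talking to the shard at any instant.
p_busy = 0.1
p_all_five = p_busy**5 # ~1.0e-05

format("%.3f%%", p_all_five * 100) # => "0.001%"
```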
The timeout value specifies the\nmaximum amount of time to block if no ticket is available.\n\n```mermaid\n\ngraph TD;\n    Start[Start]\n    CheckConnection{Request Connection to Resource}\n    AllocateTicket[Allocate Ticket if Available]\n    BlockTimeout[Block until Timeout or Ticket Available]\n    AccessResource[Access Resource]\n    ReleaseTicket[Release Ticket]\n    FailRequest[Fail Request]\n    OpenCircuit[Open Circuit Breaker]\n\n    Start --\u003e CheckConnection\n    CheckConnection --\u003e|Ticket Available| AllocateTicket\n    AllocateTicket --\u003e AccessResource\n    AccessResource --\u003e ReleaseTicket\n    ReleaseTicket --\u003e CheckConnection\n\n    CheckConnection --\u003e|No Ticket Available| BlockTimeout\n    BlockTimeout --\u003e|Timeout| FailRequest\n    BlockTimeout --\u003e|Ticket Available| AccessResource\n\n    FailRequest --\u003e OpenCircuit\n    OpenCircuit --\u003e CheckConnection\n\n```\n\nHow do we limit the access to a resource for all workers on a server when the\nworkers do not directly share memory? This is implemented with [SysV\nsemaphores][sysv] to provide server-wide access control.\n\n#### Bulkhead Configuration\n\nThere are two configuration values. It's not easy to choose good values and we're\nstill experimenting with ways to figure out optimal ticket numbers. Generally\nsomething below half the number of workers on the server for endpoints that are\nqueried frequently has worked well for us.\n\n- **bulkhead**. Enable or Disable Bulkhead. Defaults to `true` if not set.\n- **tickets**. Number of workers that can concurrently access a resource.\n- **timeout**. Time to wait in seconds to acquire a ticket if there are no tickets left.\n  We recommend this to be `0` unless you have very few workers running (i.e.\n  less than ~5).\n\nIt is possible to disable Bulkhead with environment variable\n`SEMIAN_BULKHEAD_DISABLED=1`.\n\nNote that there are system-wide limitations on how many tickets can be allocated\non a system. 
`cat /proc/sys/kernel/sem` will tell you.\n\n\u003e System-wide limit on the number of semaphore sets. On Linux\n\u003e systems before version 3.19, the default value for this limit\n\u003e was 128. Since Linux 3.19, the default value is 32,000. On\n\u003e Linux, this limit can be read and modified via the fourth\n\u003e field of `/proc/sys/kernel/sem`.\n\n#### Bulkhead debugging on Linux\n\nNote: It is often helpful to examine the actual IPC resources on the system. Semian\nprovides an easy way to get the semaphore key:\n\n```\nirb\u003e require 'semian'\nirb\u003e puts Semian::Resource.new(:your_resource_name, tickets: 42).key # do this from a dev machine\n\"0x48af51ea\"\n```\n\nThis key can then be used to easily inspect the semaphore on any host machine:\n\n```\nipcs -si $(ipcs -s | grep 0x48af51ea | awk '{print $2}')\n```\n\nWhich should output something like:\n\n```\nSemaphore Array semid=5570729\nuid=8192         gid=8192        cuid=8192       cgid=8192\nmode=0660, access_perms=0660\nnsems = 4\notime = Thu Mar 30 15:06:16 2017\nctime = Mon Mar 13 20:25:36 2017\nsemnum     value      ncount     zcount     pid\n0          1          0          0          48\n1          42         0          0          48\n2          42         0          0          27\n3          31         0          0          48\n```\n\nIn the above example, we can see each of the semaphores. Looking at the enum code\nin `ext/semian/sysv_semaphores.h` we can see that:\n\n- 0: is the Semian meta lock (mutex) protecting updates to the other resources. 
It's currently free.\n- 1: is the number of available tickets - currently no tickets are in use, because it matches 2\n- 2: is the configured (maximum) number of tickets\n- 3: is the number of registered workers (processes) that would be considered if using the quota strategy.\n\n## Defense line\n\nThe finished defense line for resource access with circuit breakers and\nbulkheads then looks like this:\n\n![](http://cdn.shopify.com/s/files/1/0070/7032/files/image02_grande.png)\n\nThe RPC first checks the circuit; if the circuit is open, it raises the\nexception straight away, which triggers the fallback (the default fallback is\na 500 response). Otherwise, it will try Semian, which fails instantly if too many\nworkers are already querying the resource. Finally, the driver will query the\ndata store. If the data store succeeds, the driver will return the data back to\nthe RPC. If the data store is slow or fails, this is our last line of defense\nagainst a misbehaving resource. The driver will raise an exception after trying\nto connect with a timeout or after an immediate failure. These driver actions\nwill affect the circuit and Semian, which can make future calls fail faster.\n\nA useful way to think about the co-operation between bulkheads and circuit\nbreakers is to visualize a failure scenario, graphing capacity as a\nfunction of time. If an incident strikes that makes the server unresponsive\nwith a `20s` timeout on the client and you only have circuit breakers\nenabled--you will lose capacity until all workers have tripped their circuit\nbreakers. The slope of this line will depend on the amount of traffic to the now\nunavailable service. If the slope is steep (i.e. high traffic), you'll lose\ncapacity more quickly. The higher the client driver timeout, the longer you'll lose\ncapacity for. 
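To make the worst case concrete, here is a toy calculation (not Semian code; the `20s` client timeout is from the scenario above, and the three-failure threshold is an assumed circuit breaker configuration):

```ruby
# Toy model of how long one worker stays stuck before its circuit opens,
# when every request to the dead resource hangs for the full client timeout.
error_threshold = 3   # failures before the circuit opens (assumed config)
client_timeout  = 20  # seconds each hung request takes

seconds_until_open = error_threshold * client_timeout
puts seconds_until_open # => 60
```

With only circuit breakers, every worker independently pays this price before its circuit opens; with bulkheads limited to, say, 5 tickets, at most five workers are stuck at once while the rest fail fast.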
In the example below we have the circuit breakers configured to\nopen after 3 failures:\n\n![resiliency- circuit breakers](https://cloud.githubusercontent.com/assets/97400/22405538/53229758-e612-11e6-81b2-824f873c3fb7.png)\n\nIf we imagine the same scenario but with _only_ bulkheads, configured to have\ntickets for 50% of workers at any given time, we'll see the following\nflat-lining scenario:\n\n![resiliency- bulkheads](https://cloud.githubusercontent.com/assets/97400/22405542/6832a372-e612-11e6-88c4-2452b64b3121.png)\n\nCircuit breakers have the nice property of regaining 100% capacity. Bulkheads\nhave the desirable property of guaranteeing a minimum capacity. If we add\nthe two graphs, marrying bulkheads and circuit breakers, we have a\nplummy outcome:\n\n![resiliency- circuit breakers bulkheads](https://cloud.githubusercontent.com/assets/97400/22405550/a25749c2-e612-11e6-8bc8-5fe29e212b3b.png)\n\nThis means that if the slope or client timeout is sufficiently low, bulkheads\nwill provide little value and are likely not necessary.\n\n## Failing gracefully\n\nOK, great, we've got a way to fail fast with slow resources; how does that make\nmy application more resilient?\n\nFailing fast is only half the battle. It's up to you what you do with these\nerrors. In the [session example](#real-world-example), we handle them gracefully by\nsigning people out and disabling all session-related functionality until the data\nstore is back online. 
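That session fallback might be sketched like this (a hedged illustration with made-up class and error names, not Semian's API; a real adapter would raise a driver-specific error such as `Redis::BaseConnectionError`):

```ruby
# Illustrative only: a session store whose circuit is currently open.
class DownSessionStore
  class CircuitOpen < StandardError; end

  def get(_session_id)
    raise CircuitOpen, "session store unavailable"
  end
end

# Treat an open circuit as "no session": the visitor is signed out and
# session-dependent features are disabled, but the page still renders.
def current_session(store, session_id)
  store.get(session_id)
rescue DownSessionStore::CircuitOpen
  nil
end

puts current_session(DownSessionStore.new, "abc123").inspect # => nil
```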
However, not rescuing the exception and simply sending\n`HTTP 500` back to the client faster will still help with [capacity\nloss](#capacity-loss).\n\n### Exceptions inherit from base class\n\nIt's important to understand that the exceptions raised by [Semian\nAdapters](#adapters) inherit from the base class of the driver itself, meaning\nthat if you do something like:\n\n```ruby\ndef posts\n  Post.all\nrescue Mysql2::Error\n  []\nend\n```\n\nExceptions raised by Semian's `Mysql2` adapter will also get caught.\n\n### Patterns\n\nWe do not recommend mindlessly sprinkling `rescue`s all over the place. Instead,\nwrite decorators around secondary data stores (e.g. sessions)\nthat provide resiliency for free. For example, if we stored the tags associated\nwith products in a secondary data store, it could look something like this:\n\n```ruby\n# Resilient decorator for storing a Set in Redis.\nclass RedisSet\n  def initialize(key)\n    @key = key\n  end\n\n  def get\n    redis.smembers(@key)\n  rescue Redis::BaseConnectionError\n    []\n  end\n\n  private\n\n  def redis\n    @redis ||= Redis.new\n  end\nend\n\nclass Product\n  # This will simply return an empty array in the case of a Redis outage.\n  def tags\n    tags_set.get\n  end\n\n  private\n\n  def tags_set\n    @tags_set ||= RedisSet.new(\"product:tags:#{self.id}\")\n  end\nend\n```\n\nThese decorators can be resiliency tested with [Toxiproxy][toxiproxy]. You can\nprovide fallbacks around your primary data store as well. In our case, we simply\nreturn `HTTP 500` unless the page is cached, because these pages aren't worth\nmuch without data from their primary data store.\n\n## Monitoring\n\nWith [`Semian::Instrumentable`][semian-instrumentable] clients can monitor\nSemian internals. 
For example, to instrument events with\n[`statsd-instrument`][statsd-instrument]:\n\n```ruby\n# `event` is `success`, `busy`, `circuit_open`, `state_change`, or `lru_hash_gc`.\n# `resource` is the `Semian::Resource` object (or an `LRUHash` object for `lru_hash_gc`).\n# `scope` is `connection` or `query` (others can be instrumented too from the adapter) (nil for `lru_hash_gc`).\n# `adapter` is the name of the adapter (mysql2, redis, ...) (a payload hash for `lru_hash_gc`).\nSemian.subscribe do |event, resource, scope, adapter|\n  case event\n  when :success, :busy, :circuit_open, :state_change\n    StatsD.increment(\"semian.#{event}\", tags: {\n      resource: resource.name,\n      adapter: adapter,\n      type: scope,\n    })\n  else\n    StatsD.increment(\"semian.#{event}\")\n  end\nend\n```\n\n# FAQ\n\n**How does Semian work with containers?** Semian uses [SysV semaphores][sysv] to\ncoordinate access to a resource. The semaphore is only shared within the\n[IPC namespace][namespaces]. Unless you are running many workers inside every container,\nthis leaves the bulkheading pattern effectively useless. We recommend sharing\nthe IPC namespace between all containers on your host for the best ticket\neconomy. If you are using Docker, this can be done with the [--ipc\nflag](https://docs.docker.com/engine/reference/run/#ipc-settings---ipc).\n\n**Why isn't resource access shared across the entire cluster?** This implies a\ncoordination data store. Semian would have to be resilient to failures of this\ndata store as well, and fall back to other primitives. While it's nice to have\nall workers share the same view of the world, this greatly increases the\ncomplexity of the implementation, which is not favourable for resiliency code.\n\n**Why isn't the circuit breaker implemented as a host-wide mechanism?** No good\nreason. Patches welcome!\n\n**Why is there no fallback mechanism in Semian?** Read the [Failing\nGracefully](#failing-gracefully) section. 
In short, exceptions are exactly this.\nWe did not want to put an extra level of abstraction on top of this. In the\nfirst internal implementation this was the case, but we later moved away from\nit.\n\n**Why does it not use normal Ruby semaphores?** To work properly, the access\ncontrol needs to be performed across many workers. With MRI that means having\nmultiple processes, not threads. Thus we need a primitive outside of the\ninterpreter. For other Ruby implementations a driver that uses Ruby semaphores\ncould be used (and would be accepted as a PR).\n\n**Why are there three semaphores in the semaphore set for each resource?** This\nhas to do with being able to resize the number of tickets for a resource online.\n\n**Can I change the number of tickets freely?** Yes, the logic for this isn't\ntrivial, but it works well.\n\n**What is the performance overhead of Semian?** Extremely minimal in comparison\nto going to the network. Don't worry about it unless you're instrumenting\nnon-IO.\n\n# Developing Semian\n\nSemian requires a Linux environment for **Bulkheading**.\nWe provide a [docker-compose](https://docs.docker.com/compose/) file\nthat runs MySQL, Redis, Toxiproxy and Ruby in containers.\nUse the steps below to work on Semian from a macOS environment.\n\n## Prerequisites\n\n```bash\n# install docker-for-desktop\n$ brew install --cask docker\n\n# install latest docker-compose\n$ brew install docker-compose\n\n# install visual-studio-code (optional)\n$ brew install --cask visual-studio-code\n\n# clone Semian\n$ git clone https://github.com/Shopify/semian.git\n$ cd semian\n```\n\n## Visual Studio Code\n\n- Open Semian in vscode\n- Install the recommended extensions (one-off requirement)\n- Click `Reopen in Container` (first boot might take about a minute)\n\nSee https://code.visualstudio.com/docs/remote/containers for more details\n\nIf you make any changes to `.devcontainer/` you'll need to recreate the containers:\n\n- Select `Rebuild Container` from the command 
palette\n\nRunning Tests:\n\n- `$ bundle exec rake` Run with `SKIP_FLAKY_TESTS=true` to skip flaky tests (CI runs all tests)\n\n### Interactive Test Debugging\n\nTo use the interactive debugger in vscode:\n\n- Open Semian in vscode\n- Create an `.env` file (if it doesn't exist)\n- Set a `DEBUG` ENV variable (e.g. `DEBUG=true`)\n- Under the `.vscode/` subdirectory, create a `launch.json` file, and include the following:\n\n```json\n{\n  \"configurations\": [\n    {\n      \"type\": \"rdbg\",\n      \"name\": \"Attach to Ruby rdbg\",\n      \"request\": \"attach\",\n      \"debugPort\": \"12345\"\n    }\n  ]\n}\n```\n\n- For any line you would like to break on in your `_test.rb` file (under `test/`), include the following snippet near the line of interest:\n\n```rb\nrequire \"debug\"\nbinding.break if ENV[\"DEBUG\"]\n```\n\n**Note:** unless you are using a vscode extension such as [Dev Container](https://code.visualstudio.com/docs/devcontainers/tutorial), **do not use the built-in vscode breakpoints -- they will not work!**\n\n- Start up the test container\n\n```shell\n$ docker-compose -f .devcontainer/docker-compose.yml --profile test up -d\n```\n\n- When the process indicates that it is waiting for the debugger connection, go to the `Run and Debug` tab and execute the `Attach to Ruby rdbg` debugger\n\n- Use the vscode debugging tools (such as step in, step out, pause, resume) as normal\n\n## Everything else\n\nTest Semian in containers:\n\n- `$ docker-compose -f .devcontainer/docker-compose.yml up -d`\n- `$ docker exec -it semian bash`\n\nIf you make any changes to `.devcontainer/` you'll need to recreate the containers:\n\n- `$ docker-compose -f .devcontainer/docker-compose.yml up -d --force-recreate`\n\nRun tests in containers:\n\n```shell\n$ docker-compose -f ./.devcontainer/docker-compose.yml --profile test run --rm test\n```\n\nRunning Tests:\n\n- `$ bundle exec rake` Run with `SKIP_FLAKY_TESTS=true` to skip flaky tests (CI 
runs all tests)\n\n### Running tests in batches\n\n- _TEST_WORKERS_ - Total number of workers or batches; used to determine how many batches run in parallel. _Default: 1_\n- _TEST_WORKER_NUM_ - Specifies which batch to run; a value between 1 and _TEST_WORKERS_. _Default: 1_\n\n```shell\n$ bundle exec rake test:parallel TEST_WORKERS=5 TEST_WORKER_NUM=1\n```\n\n### Debug\n\nBuild the Semian native extension with debug information.\n\n```shell\n$ bundle exec rake clean --trace\n$ export DEBUG=1\n$ bundle exec rake build\n$ bundle install\n```\n\n[hystrix]: https://github.com/Netflix/Hystrix\n[release-it]: https://pragprog.com/titles/mnee2/release-it-second-edition/\n[shopify]: http://www.shopify.com/\n[mysql-semian-adapter]: lib/semian/mysql2.rb\n[postgres-semian-adapter]: https://github.com/mschoenlaub/semian-postgres\n[redis-semian-adapter]: lib/semian/redis.rb\n[activerecord-trilogy-semian-adapter]: lib/semian/activerecord_trilogy_adapter.rb\n[semian-adapter]: lib/semian/adapter.rb\n[nethttp-semian-adapter]: lib/semian/net_http.rb\n[nethttp-default-errors]: lib/semian/net_http.rb#L35-L45\n[semian-instrumentable]: lib/semian/instrumentable.rb\n[statsd-instrument]: http://github.com/shopify/statsd-instrument\n[resiliency-blog-post]: https://shopify.engineering/building-and-testing-resilient-ruby-on-rails-applications\n[toxiproxy]: https://github.com/Shopify/toxiproxy\n[sysv]: http://man7.org/linux/man-pages/man7/svipc.7.html\n[cbp]: https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern\n[namespaces]: http://man7.org/linux/man-pages/man7/namespaces.7.html\n