{"id":49355045,"url":"https://github.com/thatsme/phi_accrual","last_synced_at":"2026-04-27T13:02:07.998Z","repository":{"id":353022510,"uuid":"1215802769","full_name":"thatsme/phi_accrual","owner":"thatsme","description":"Source-agnostic φ-accrual failure detector for Elixir/OTP. Observability-grade, telemetry-first, EWMA-based. Emits a continuous suspicion   value per monitored node; thresholding and policy are consumer concerns.","archived":false,"fork":false,"pushed_at":"2026-04-20T09:44:10.000Z","size":36,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-22T07:03:08.779Z","etag":null,"topics":["beam","distributed-systems","elixir","failure-detection","observability","otp","phi-accrual","telemetry"],"latest_commit_sha":null,"homepage":"https://hexdocs.pm/phi_accrual","language":"Elixir","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thatsme.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-20T09:14:13.000Z","updated_at":"2026-04-20T10:58:35.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/thatsme/phi_accrual","commit_stats":null,"previous_names":["thatsme/phi_accrual"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/thatsme/phi_accrual","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thatsme%2Fphi_accrual","tags_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/thatsme%2Fphi_accrual/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thatsme%2Fphi_accrual/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thatsme%2Fphi_accrual/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thatsme","download_url":"https://codeload.github.com/thatsme/phi_accrual/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thatsme%2Fphi_accrual/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32337274,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-26T23:26:28.701Z","status":"online","status_checked_at":"2026-04-27T02:00:06.769Z","response_time":128,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beam","distributed-systems","elixir","failure-detection","observability","otp","phi-accrual","telemetry"],"created_at":"2026-04-27T13:02:04.005Z","updated_at":"2026-04-27T13:02:07.942Z","avatar_url":"https://github.com/thatsme.png","language":"Elixir","funding_links":[],"categories":[],"sub_categories":[],"readme":"# phi_accrual\n\nA source-agnostic φ-accrual failure detector for Elixir/OTP, built on\nHayashibara et al. 
2004 with a dual-α EWMA estimator, head-of-line and\nlocal-pause awareness, and a telemetry-first API.\n\n\u003e ⚠️ **Alpha — `v0.1.x`.** The API and configuration surface may change\n\u003e before `v1.0`. The **telemetry event schema is already stable** (see\n\u003e [Versioning](#versioning)), but everything else is subject to tuning\n\u003e based on real-deployment feedback. Production use at your own risk;\n\u003e please open issues as you find rough edges.\n\n\u003e **Observability-grade, not decision-grade.** Designed for dashboards,\n\u003e alerting, and operator intuition — not for automated routing, quorum,\n\u003e or correctness decisions. See [limitations](#limitations) for why.\n\n## Quick start\n\n```elixir\n# mix.exs\ndef deps do\n  [{:phi_accrual, \"~\u003e 0.1\"}]\nend\n```\n\nThe application auto-starts. Feed in heartbeat arrivals from anywhere\nyour code already receives cross-node traffic, and read out φ on demand:\n\n```elixir\n# Call this whenever you receive evidence that a peer is alive —\n# a GenServer reply, a :pg broadcast, an :rpc response, a custom ping.\n# First call for an unknown node auto-tracks it with defaults.\nPhiAccrual.observe(:\"peer@host\")\n\n# Query φ at any time.\nPhiAccrual.phi(:\"peer@host\")\n#=\u003e {:ok, 0.42, :steady}\n```\n\nThat's the whole core loop: **feed in arrivals, read out φ.** Everything\nbelow is about making it useful in production — reference heartbeat\nsources if you have none of your own, telemetry wiring for Prometheus,\nthresholding with hysteresis, and honest limitations.\n\n## What it does\n\nGiven a stream of heartbeat arrivals from a remote node, the detector\nmaintains an EWMA estimate of the inter-arrival distribution (mean and\nvariance, independently smoothed) and emits a continuous suspicion\nvalue φ. 
φ is calibrated so that `φ ≈ -log₁₀(P(arrival still pending))`:\n\n| φ value | Rough meaning                                          |\n| ------- | ------------------------------------------------------ |\n| 1       | 1-in-10 chance the heartbeat is merely late            |\n| 3       | 1-in-1000 chance it is merely late                     |\n| 8       | 1-in-100 000 000 chance it is late; very likely down   |\n\n**Thresholding is a consumer concern.** The detector does not decide\nwhether a node is up or down; it publishes φ, and you (or the optional\n`PhiAccrual.Threshold` module) decide what crosses what line.\n\n## Why another failure detector?\n\nThe Elixir/OTP ecosystem has plenty of cluster-management libraries\n(`libcluster`, `swarm`, `horde`, `partisan`), but all of them use\nbinary up/down detectors or entangle detection with membership.\n`phi_accrual` is the thing that goes alongside them: a pure detector,\nunopinionated about who sends heartbeats, what the topology looks like,\nor what to do when φ gets high.\n\n## Usage — bring your own signal\n\nAnything that arrives from a remote node is evidence of liveness. 
If\nyour app already has cross-node traffic, call `observe/2` from the\nreceive path — no extra network cost:\n\n```elixir\ndefmodule MyApp.Chatter do\n  use GenServer\n\n  def handle_info({:reply_from, node}, state) do\n    PhiAccrual.observe(node)\n    {:noreply, state}\n  end\nend\n```\n\nThen pattern-match on `phi/1` to handle every result state:\n\n```elixir\ncase PhiAccrual.phi(:\"node_a@host\") do\n  {:ok, phi, :steady}        -\u003e # warm estimator, normal\n  {:ok, phi, :recovering}    -\u003e # warm estimator, absorbing a recent gap\n  {:insufficient_data, n}    -\u003e # still in bootstrap, `n` samples remaining\n  {:stale, elapsed_ms}       -\u003e # no arrival for \u003e stale_after_ms\n  {:error, :not_tracked}     -\u003e # never observed\nend\n```\n\nCall `PhiAccrual.track(node, opts)` **before** your first `observe` if\nyou need custom per-node estimator options; otherwise the first\n`observe` auto-tracks with defaults.\n\n## Usage — reference source\n\nIf you have no existing cross-node chatter, enable the bundled\n`DistributionPing` source in config:\n\n```elixir\n# config/runtime.exs\nconfig :phi_accrual,\n  distribution_ping: [interval_ms: 1_000, auto_track: true]\n```\n\nEach node then pings every peer every `interval_ms` over BEAM\ndistribution. Cheap per-ping, but cluster cost is O(N²) —\nat 50 nodes and 1 s interval that's 2 500 pings/second of distribution\ntraffic.\n\n**This source inherits HoL blocking** — see\n[limitations](#limitations). The v2 `UdpSource` will escape it.\n\n## What happens when a node fails\n\nSuppose `:node_a@host` has been heartbeating every ~1 s for a few\nminutes. Its estimator has mean ≈ 1 000 ms, σ ≈ 50 ms, and φ hovers\naround 0.3 (the median for an on-schedule arrival).\n\nThen the node goes dark. Here is the timeline, using the default\noptions and a threshold instance configured at `suspect_at: 4.0`,\n`recover_at: 3.0`:\n\n```\nt=0s    last heartbeat arrives. 
φ ≈ 0.3.\n        → [:phi_accrual, :sample, :observed]  (interval_ms: ~1000)\n\nt=1s    no new heartbeat. φ ≈ 0.3 (still on-schedule).\n        → [:phi_accrual, :phi, :computed]  (periodic gauge tick)\n\nt=2s    φ ≈ 3.5. starting to get suspicious.\n        → [:phi_accrual, :phi, :computed]\n\nt=3s    φ crosses 4.0.\n        → [:phi_accrual, :phi, :computed]\n        → [:phi_accrual, :threshold, :suspected]\n\nt=10s   φ very high. state still :steady (stale_after_ms default 60 s).\n        → [:phi_accrual, :phi, :computed]\n\nt=60s   elapsed \u003e stale_after_ms.\n        → [:phi_accrual, :phi, :computed]  (state: :stale)\n```\n\nIf `:node_a@host` comes back at t=15s and resumes heartbeating, the\nfirst-arrival interval of 15 000 ms exceeds `recovering_threshold_ms`\n(default 10 000). The state transitions to `:recovering` for the next\n3 samples while the EWMA absorbs the outlier. Once φ drops below 3.0:\n\n```\nt=15s   first heartbeat after outage. interval = 15 000 ms.\n        → [:phi_accrual, :sample, :observed]\n        state becomes :recovering.\n\nt=16s   next heartbeat. φ has fallen sharply (elapsed is small).\n        → [:phi_accrual, :phi, :computed]  (state: :recovering)\n        → [:phi_accrual, :threshold, :recovered]    (φ crossed 3.0 downward)\n\nt=19s   three samples since the outlier.\n        → state returns to :steady.\n```\n\n**Nowhere in this flow does the library decide the node is \"down.\"**\nIt just publishes φ and state labels; the `Threshold` module (or your\nown consumer) decides what to do. 
That separation is why the detector\ncan be wired to a dashboard, an alert, and an automated-routing policy\nsimultaneously with different thresholds.\n\n## Telemetry schema (v1.x stable)\n\nEvent names, measurement keys, and metadata keys are a contract.\n**Breaking changes only in v2.**\n\n```\n[:phi_accrual, :sample, :observed]\n  measurements: %{interval_ms}\n  metadata:     %{node, local_pause?}\n\n[:phi_accrual, :phi, :computed]                  # periodic gauge stream\n  measurements: %{phi, elapsed_ms}\n  metadata:     %{node, state, local_pause?, confidence}\n    # state ∈ [:steady, :recovering, :insufficient_data, :stale]\n\n[:phi_accrual, :local_pause, :start]             # rising edge\n  metadata:     %{kind}                          # :long_gc | :long_schedule | :busy_dist_port\n[:phi_accrual, :local_pause, :stop]              # falling edge\n\n[:phi_accrual, :overload, :shed]\n  measurements: %{mailbox_len}\n  metadata:     %{node}\n\n[:phi_accrual, :source, :started]\n  metadata:     %{source, interval_ms}\n\n[:phi_accrual, :threshold, :suspected]           # emitted by Threshold module\n[:phi_accrual, :threshold, :recovered]\n  measurements: %{phi}\n  metadata:     %{node, instance, threshold, confidence, detector_state}\n```\n\nPipe these to Prometheus via `telemetry_metrics_prometheus`, to logs,\nor to your own alerting (see next section).\n\n## Wiring telemetry to Prometheus\n\nPull in [`telemetry_metrics_prometheus`](https://hex.pm/packages/telemetry_metrics_prometheus)\n(or your preferred `telemetry_metrics` reporter) and declare the\nmetrics you care about:\n\n```elixir\n# mix.exs — add dependency\n{:telemetry_metrics_prometheus, \"~\u003e 1.1\"}\n\n# In your supervision tree\nchildren = [\n  {TelemetryMetricsPrometheus,\n   metrics: [\n     # φ as a gauge: one series per (node, state, confidence) combination.\n     Telemetry.Metrics.last_value(\n       \"phi_accrual.phi.computed.phi\",\n       event_name: [:phi_accrual, :phi, :computed],\n       measurement: 
:phi,\n       tags: [:node, :state, :confidence]\n     ),\n\n     # Counter of every heartbeat observed.\n     Telemetry.Metrics.counter(\n       \"phi_accrual.sample.observed.count\",\n       event_name: [:phi_accrual, :sample, :observed],\n       tags: [:node]\n     ),\n\n     # Local-pause events — correlate noise in φ with GC / HoL.\n     Telemetry.Metrics.counter(\n       \"phi_accrual.local_pause.start.count\",\n       event_name: [:phi_accrual, :local_pause, :start],\n       tags: [:kind]\n     ),\n\n     # Overload shedding — if this is ever non-zero in steady state,\n     # tune α instead of raising :shed_threshold.\n     Telemetry.Metrics.counter(\n       \"phi_accrual.overload.shed.count\",\n       event_name: [:phi_accrual, :overload, :shed],\n       tags: [:node]\n     ),\n\n     # Discrete alert events from the Threshold module.\n     Telemetry.Metrics.counter(\n       \"phi_accrual.threshold.suspected.count\",\n       event_name: [:phi_accrual, :threshold, :suspected],\n       tags: [:node, :instance]\n     ),\n     Telemetry.Metrics.counter(\n       \"phi_accrual.threshold.recovered.count\",\n       event_name: [:phi_accrual, :threshold, :recovered],\n       tags: [:node, :instance]\n     )\n   ]}\n]\n```\n\nFor ad-hoc logging, attach a handler directly:\n\n```elixir\n:telemetry.attach_many(\n  \"phi-accrual-logger\",\n  [\n    [:phi_accrual, :threshold, :suspected],\n    [:phi_accrual, :threshold, :recovered]\n  ],\n  \u0026MyApp.PhiLogger.handle/4,\n  nil\n)\n\ndefmodule MyApp.PhiLogger do\n  require Logger\n\n  def handle([:phi_accrual, :threshold, kind], %{phi: phi}, %{node: node}, _) do\n    Logger.warning(\"node=#{node} #{kind} phi=#{Float.round(phi, 2)}\")\n  end\nend\n```\n\n## Thresholding (optional)\n\n`PhiAccrual.Threshold` converts the φ gauge stream into discrete\n`:suspected` / `:recovered` events with hysteresis:\n\n```elixir\n# In your supervision tree\nchildren = [\n  {PhiAccrual.Threshold, name: :dash, suspect_at: 4.0, recover_at: 
3.0},\n  {PhiAccrual.Threshold, name: :route, suspect_at: 8.0, recover_at: 7.0}\n]\n```\n\nMultiple instances coexist — one for dashboards at φ=4, another for\nautomated routing at φ=8. Skip the module entirely if you want to roll\nyour own.\n\n## Configuration\n\n```elixir\n# config/runtime.exs\nconfig :phi_accrual,\n  # enable the node-global :erlang.system_monitor hook (default: true).\n  # Disable if another library already subscribes.\n  pause_monitor: true,\n\n  # back-pressure threshold — observe/2 sheds samples when mailbox\n  # exceeds this count and emits [:overload, :shed] telemetry.\n  shed_threshold: 10_000,\n\n  # bundled reference source — off by default, opt in:\n  distribution_ping: [interval_ms: 1_000, auto_track: true]\n```\n\nPer-node estimator options (passed to `PhiAccrual.track/2`):\n\n| Option                       | Default  | Notes                                         |\n| ---------------------------- | -------- | --------------------------------------------- |\n| `:alpha_mean`                | `0.125`  | EWMA smoothing for mean                       |\n| `:alpha_var`                 | `0.125`  | EWMA smoothing for variance (tune lower)      |\n| `:min_std_dev_ms`            | `50.0`   | Floor on σ — prevents singular distribution   |\n| `:min_samples`               | `8`      | Bootstrap gate before φ is reported           |\n| `:stale_after_ms`            | `60_000` | Elapsed past which state becomes `:stale`     |\n| `:recovering_threshold_ms`   | `10_000` | Large-gap detection for `:recovering` tag     |\n| `:recovering_grace_samples`  | `3`      | Samples the `:recovering` tag persists for    |\n| `:initial_interval_ms`       | `1_000`  | Prior mean before any observation             |\n| `:initial_std_dev_ms`        | `500`    | Prior σ (variance = σ²)                       |\n\n## Limitations\n\nRead these before wiring φ to anything that takes irreversible action.\n\n**Head-of-line blocking (primary v1 caveat).** 
`DistributionPing` and\nany source that travels over BEAM distribution shares a TCP socket\nwith user traffic. A large GenServer reply or `:pg` broadcast can\ndelay heartbeats for arbitrary periods. `PauseMonitor` subscribes to\n`:busy_dist_port` so you can *observe* this (pause telemetry +\n`confidence: false` on φ events), but the underlying problem cannot be\nfixed by this library while the source is distribution-based. The v2\n`UdpSource` solves it by using a dedicated socket.\n\n**Local-pause suppression is best-effort.** `:erlang.system_monitor`\nfires on `:long_gc`, `:long_schedule`, and `:busy_dist_port`. The\nmonitor marks φ output with `local_pause?: true` and\n`confidence: false` for a short lockout window after any event. It\ndoes **not** freeze φ or widen the variance — we decided the silent-\ndetector failure mode is worse than noisy φ. Consumers are expected to\nfilter on the confidence flag (the `Threshold` module passes it\nthrough in metadata).\n\n**Gaussian assumption misbehaves under bimodal distributions.** BEAM\nGC produces intermittent large pauses that, combined with normal\nintervals, yield a bimodal inter-arrival distribution. A Gaussian EWMA\nis a poor fit and will over-alert. Correlate φ with\n`:erlang.statistics(:garbage_collection)` before acting on high φ. A\nnon-parametric estimator (Satzger or a two-component mixture) is a v2\nconsideration once we have real traces from deployments.\n\n**One `:erlang.system_monitor` per node.** Only one subscription can\nexist. If another library installs its own, enabling both will cause\none to silently win. Disable `pause_monitor` in config and feed pause\nstate to `PhiAccrual.PauseMonitor.put_state/1` yourself if you need\ncoexistence.\n\n## Testing strategy\n\nFailure detectors are hard to test against wall-clock. 
This project:\n\n* Uses [`StreamData`](https://hex.pm/packages/stream_data) for\n  property-based tests of estimator math (`test/phi_accrual/core_test.exs`).\n* Injects clocks into `PhiAccrual.Estimator` via the `:clock_fn`\n  option — no `Process.sleep` in unit tests.\n* Integration tests against live distribution (`:peer`-based,\n  multi-node) are planned for v2 alongside the `UdpSource` work.\n\n## Versioning\n\nv1.x is **telemetry-schema-stable**: event names, measurement keys,\nand metadata keys will not change until v2. Per-node option defaults\nmay be tuned within v1.x.\n\n## Roadmap\n\n### v1 (shipped)\n\n- Dual-α EWMA estimator with bootstrap / stale / recovering states\n- `PauseMonitor` with `:busy_dist_port` tracking\n- Per-node estimator GenServer + `DynamicSupervisor` + `Registry`\n- Overload shedding with telemetry\n- Bring-your-own-signal API + `DistributionPing` reference source\n- Optional `Threshold` module with hysteresis\n- Committed telemetry event schema\n\n### v2 (planned)\n\n- `UdpSource` — dedicated UDP socket for heartbeats, escapes HoL,\n  makes the detector decision-grade\n- Evidence-based evaluation of non-parametric / mixture estimators\n- `:peer`-based multi-node integration tests\n- Optional `phi_accrual_libcluster` companion package\n\n### Related ideas\n\nThis library is the first of three composable primitives:\nφ-accrual → HLC + causal broadcast → SWIM-Lifeguard standalone.\n\n## License\n\nApache-2.0. See LICENSE.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthatsme%2Fphi_accrual","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthatsme%2Fphi_accrual","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthatsme%2Fphi_accrual/lists"}