{"id":15026600,"url":"https://github.com/bitwalker/swarm","last_synced_at":"2025-05-14T14:07:27.020Z","repository":{"id":44415127,"uuid":"64172108","full_name":"bitwalker/swarm","owner":"bitwalker","description":"Easy clustering, registration, and distribution of worker processes for Erlang/Elixir","archived":false,"fork":false,"pushed_at":"2022-06-22T11:00:39.000Z","size":522,"stargazers_count":1225,"open_issues_count":44,"forks_count":104,"subscribers_count":29,"default_branch":"master","last_synced_at":"2025-05-10T13:58:12.709Z","etag":null,"topics":["clustering","elixir","erlang-distribution","process-registry"],"latest_commit_sha":null,"homepage":"","language":"Elixir","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bitwalker.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-07-25T22:29:49.000Z","updated_at":"2025-05-01T18:10:55.000Z","dependencies_parsed_at":"2022-08-12T11:10:39.628Z","dependency_job_id":null,"html_url":"https://github.com/bitwalker/swarm","commit_stats":null,"previous_names":[],"tags_count":34,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitwalker%2Fswarm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitwalker%2Fswarm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitwalker%2Fswarm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitwalker%2Fswarm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bitwalker","download_url":"https://codeload.github.com/bitwalker/swarm/tar.gz/refs/heads/mast
er","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254159194,"owners_count":22024558,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clustering","elixir","erlang-distribution","process-registry"],"created_at":"2024-09-24T20:04:45.290Z","updated_at":"2025-05-14T14:07:26.986Z","avatar_url":"https://github.com/bitwalker.png","language":"Elixir","funding_links":[],"categories":["\u003ca name=\"Elixir\"\u003e\u003c/a\u003eElixir"],"sub_categories":[],"readme":"# Swarm\n\n[![Hex.pm Version](http://img.shields.io/hexpm/v/swarm.svg?style=flat)](https://hex.pm/packages/swarm) [![Build Status](https://travis-ci.com/bitwalker/swarm.svg?branch=master)](https://travis-ci.com/bitwalker/swarm)\n\n**NOTE**: If you are upgrading from 1.0, be aware that the autoclustering functionality has been extracted\nto its own package, which you will need to depend on if you use that feature.\nThe package is [libcluster](http://github.com/bitwalker/libcluster) and is available on\n[Hex](https://hex.pm/packages/libcluster). 
Please be sure to read over the README to make sure your\nconfig is properly updated.\n\nSwarm is a global distributed registry, offering a feature set similar to that of `gproc`,\nbut architected to handle dynamic node membership and large volumes of process registrations\nbeing created/removed in short time windows.\n\nTo be more clear, Swarm was born out of the need for a global process registry which could\nhandle large numbers of persistent processes representing devices/device connections, which\nneeded to be distributed around a cluster of Erlang nodes, and easily found. Messages need\nto be routed to those processes from anywhere in the cluster, both individually, and as groups.\nAdditionally, those processes need to be shifted around the cluster based on cluster topology\nchanges, or restarted if their owning node goes down.\n\nBefore writing Swarm, I tried both `global` and `gproc`, but the former is not very flexible, and\nboth of them require leader election, which, in the face of dynamic node membership and the sheer\nvolume of registrations, ended up causing deadlocks/timeouts during leadership contention.\n\nI also attempted to use `syn`, but because it used `mnesia` at the time, dynamic node membership as a requirement\nmeant it was dead on arrival for my use case.\n\nIn short, are you running a cluster of Erlang nodes under something like Kubernetes? If so, Swarm is\nfor you!\n\nView the docs [here](https://hexdocs.pm/swarm).\n\n**PLEASE READ**: If you are giving Swarm a spin, it is important to understand that you can concoct scenarios whereby\nthe registry appears to be out of sync temporarily. This is a side effect of an eventually consistent model and does not mean that\nSwarm is not working correctly. Rather, you need to ensure that applications you build on top of Swarm are written to embrace eventual\nconsistency, such that periods of inconsistency are tolerated. 
For the most part though, the registry replicates extremely\nquickly, so noticeable inconsistency is more of an exception than a rule, but a proper distributed system should always be designed to\ntolerate the exceptions, as they become more and more common as you scale up. If however you notice extreme inconsistency or delayed\nreplication, then it is possible it may be a bug, or performance issue, so feel free to open an issue if you are unsure, and we will gladly look into it.\n\n## Installation\n\n```elixir\ndefp deps do\n  [{:swarm, \"~\u003e 3.0\"}]\nend\n```\n\n## Features\n\n- automatic distribution of registered processes across\n  the cluster based on a consistent hashing algorithm,\n  where names are partitioned across nodes based on their hash.\n- easy [handoff of processes](#process-handoff) between one node and another, including\n  handoff of current process state.\n- can do simple registration with `{:via, :swarm, name}`\n- both an Erlang and Elixir API\n\n## Restrictions\n\n- auto-balancing of processes in the cluster requires registrations to be done via\n  `register_name/5`, which takes module/function/args params, and handles starting\n  the process for you. The MFA must return `{:ok, pid}`.\n  This is how Swarm handles process handoff between nodes, and automatic restarts when nodedown\n  events occur and the cluster topology changes.\n\n### Process handoff\n\nProcesses may be redistributed between nodes when a node joins, or leaves, a cluster. You can indicate whether the handoff should simply restart the process on the new node, start the process and then send it the handoff message containing state, or ignore the handoff and remain on its current node.\n\nProcess state can be transferred between running nodes during process redistribution by using the `{:swarm, :begin_handoff}` and `{:swarm, :end_handoff, state}` callbacks. However process state will be lost when a node hosting a distributed process terminates. 
In this scenario you must restore the state yourself.\n\n## Consistency Guarantees\n\nLike any distributed system, a choice must be made in terms of guarantees provided. You can choose between\navailability and consistency during a network partition by selecting the appropriate process distribution strategy.\n\nSwarm provides two strategies for you to use:\n\n- #### `Swarm.Distribution.Ring`\n\n  This strategy favors availability over consistency, even though it is eventually consistent, as\n  network partitions, when healed, will be resolved by asking any copies of a given name that live on\n  nodes where they don't belong to shut down.\n\n  Network partitions result in all partitions running an instance of processes created with Swarm.\n  Swarm was designed for use in an IoT platform, where process names are generally based on physical\n  device ids, and as such, the consistency issue is less of a problem. If events get routed to two\n  separate partitions, it's generally not an issue if those events are for the same device. However,\n  this is clearly not ideal in all situations. Swarm also aims to be fast, so registrations and\n  lookups must be as low latency as possible, even when the number of processes in the registry grows\n  very large. This is achieved without consensus by using a consistent hash of the name which\n  deterministically defines which node a process belongs on, and all requests to start a process on\n  that node will be serialized through that node to prevent conflicts.\n\n  This is the default strategy and requires no configuration.\n\n- #### `Swarm.Distribution.StaticQuorumRing`\n\n  A quorum is the minimum number of nodes that a distributed cluster has to obtain in order to be\n  allowed to perform an operation. 
This can be used to enforce consistent operation in a distributed\n  system.\n\n  You configure the quorum size by defining the minimum number of nodes that must be connected in the\n  cluster to allow process registration and distribution. Calls to `Swarm.register_name/5` will return `{:error, :no_node_available}` if there are fewer nodes available than the configured minimum quorum size.\n\n  In a network partition, the partition containing at least the quorum size number of nodes will\n  continue operation. Processes running on the other side of the split will be stopped and restarted\n  on the active side. This ensures that only one instance of a registered process will be running in\n  the cluster.\n\n  You must configure this strategy and its minimum quorum size using the `:static_quorum_size` setting:\n\n  ```elixir\n  config :swarm,\n    distribution_strategy: Swarm.Distribution.StaticQuorumRing,\n    static_quorum_size: 5\n  ```\n\n  The quorum size should be set to half the cluster size (rounded down), plus one node. So for a three node cluster\n  it would be two, for a five node cluster three, and for a nine node cluster five. You must *not* add more\n  than 2 x quorum size - 1 nodes to the cluster, as this would cause a network split to result in\n  both partitions continuing operation.\n\n  Processes are distributed amongst the cluster using the same consistent hash of their name as in\n  the ring strategy above.\n\n  This strategy is a good choice when you have a fixed number of nodes in the cluster.\n\n## Clustering\n\nSwarm pre-2.0 included auto-clustering functionality, but that has been split out into its own package,\n[libcluster](https://github.com/bitwalker/libcluster). Swarm works out of the box with Erlang's distribution\ntools (i.e. 
`Node.connect/1`, `:net_kernel.connect_node/1`, etc.), but if you need the auto-clustering that Swarm\npreviously provided, you will need to add `:libcluster` to your deps, and make sure it's in your applications\nlist *before* `:swarm`. Some of the configuration has changed slightly in `:libcluster`, so be sure to review\nthe docs.\n\n### Node Blacklisting/Whitelisting\n\nYou can explicitly whitelist or blacklist nodes to prevent certain nodes from being included in Swarm's consistent\nhash ring. This is done with the `node_whitelist` and `node_blacklist` settings, respectively. These settings\nmust be lists containing either literal strings or valid Elixir regex patterns as either string or regex literals.\nIf no whitelist is set, then the blacklist is used. If no blacklist is provided, the default blacklist includes\ntwo patterns, both of which ignore nodes created by Relx/ExRM/Distillery when using releases, in order\nto set up remote shells (the first) and hot upgrade scripting (the second). The patterns can be found in this repo's\n`config/config.exs` file, and you can find a quick example below:\n\n```elixir\nconfig :swarm,\n  node_whitelist: [~r/^myapp-[\\d]@.*$/]\n```\n\nThe above will only allow nodes named something like `myapp-1@somehost` to be included in Swarm's clustering. **NOTE**:\nIt is important to understand that this does not prevent nodes that fail the whitelist from connecting to the cluster; it only means that Swarm will\nnot include those nodes in its distribution algorithm, or communicate with them.\n\n## Registration/Process Grouping\n\nSwarm is intended to be used by registering processes *before* they are created, and letting Swarm start\nthem for you on the proper node in the cluster. This is done via `Swarm.register_name/5`. You may also register\nprocesses the normal way, i.e. `GenServer.start_link({:via, :swarm, name}, ...)`. 
Swarm will manage these\nregistrations, and replicate them across the cluster, however these processes will not be moved in response\nto cluster topology changes.\n\nSwarm also offers process grouping, similar to the way `gproc` does properties. You \"join\" a process to a group\nafter it is started, (beware of doing so in `init/1` outside of a Task, or it will deadlock), with `Swarm.join/2`.\nYou can then publish messages (i.e. `cast`) with\n`Swarm.publish/2`, and/or call all processes in a group and collect results (i.e. `call`) with `Swarm.multi_call/2` or\n`Swarm.multi_call/3`. Leaving a group can be done with `Swarm.leave/2`, but will automatically be done when a process\ndies. Join/leave can be used to do pubsub like things, or perform operations over a group of related processes.\n\n## Debugging/Troubleshooting\n\nBy configuring Swarm with `debug: true` and setting Logger's log level to `:debug`, you can get much more\ninformation about what it is doing during operation to troubleshoot issues.\n\nTo dump the tracker's state, you can use `:sys.get_state(Swarm.Tracker)` or `:sys.get_status(Swarm.Tracker)`.\nThe former will dump the tracker state including what nodes it is tracking, what nodes are in the hash ring,\nand the state of the interval tree clock. The latter will dump more detailed process info, including the current\nfunction and its arguments. This is particularly useful if it appears that the tracker is stuck and not doing\nanything. If you do find such things, please gist all of these results and open an issue so that I can fix these\nissues if they arise.\n\n## Example\n\nThe following example shows a simple case where workers are dynamically created in response\nto some events under a supervisor, and we want them to be distributed across the cluster and\nbe discoverable by name from anywhere in the cluster. 
Swarm is a perfect fit for this\nsituation.\n\n```elixir\ndefmodule MyApp.Supervisor do\n  @moduledoc \"\"\"\n  This is the supervisor for the worker processes you wish to distribute\n  across the cluster. Swarm is primarily designed around the use case\n  where you are dynamically creating many workers in response to events. It\n  works with other use cases as well, but that's the ideal use case.\n  \"\"\"\n  use Supervisor\n\n  def start_link() do\n    Supervisor.start_link(__MODULE__, [], name: __MODULE__)\n  end\n\n  def init(_) do\n    children = [\n      worker(MyApp.Worker, [], restart: :temporary)\n    ]\n    supervise(children, strategy: :simple_one_for_one)\n  end\n\n  @doc \"\"\"\n  Registers a new worker, and creates the worker process\n  \"\"\"\n  def register(worker_name) do\n    {:ok, _pid} = Supervisor.start_child(__MODULE__, [worker_name])\n  end\nend\n\ndefmodule MyApp.Worker do\n  @moduledoc \"\"\"\n  This is the worker process, in this case, it simply posts on a\n  random recurring interval to stdout.\n  \"\"\"\n  use GenServer\n\n  def start_link(name) do\n    GenServer.start_link(__MODULE__, [name])\n  end\n\n  def init([name]) do\n    {:ok, {name, :rand.uniform(5_000)}, 0}\n  end\n\n  # called when a handoff has been initiated due to changes\n  # in cluster topology, valid response values are:\n  #\n  #   - `:restart`, to simply restart the process on the new node\n  #   - `{:resume, state}`, to hand off some state to the new process\n  #   - `:ignore`, to leave the process running on its current node\n  #\n  def handle_call({:swarm, :begin_handoff}, _from, {name, delay}) do\n    {:reply, {:resume, {name, delay}}, {name, delay}}\n  end\n  # called after the process has been restarted on its new node,\n  # and the old process' state is being handed off. 
This is only\n  # sent if the return to `begin_handoff` was `{:resume, state}`.\n  # **NOTE**: This is called *after* the process is successfully started,\n  # so make sure to design your processes around this caveat if you\n  # wish to hand off state like this.\n  def handle_cast({:swarm, :end_handoff, delay}, {name, _}) do\n    {:noreply, {name, delay}}\n  end\n  # called when a network split is healed and the local process\n  # should continue running, but a duplicate process on the other\n  # side of the split is handing off its state to us. You can choose\n  # to ignore the handoff state, or apply your own conflict resolution\n  # strategy\n  def handle_cast({:swarm, :resolve_conflict, _delay}, state) do\n    {:noreply, state}\n  end\n\n  def handle_info(:timeout, {name, delay}) do\n    IO.puts \"#{inspect name} says hi!\"\n    Process.send_after(self(), :timeout, delay)\n    {:noreply, {name, delay}}\n  end\n  # this message is sent when this process should die\n  # because it is being moved, use this as an opportunity\n  # to clean up\n  def handle_info({:swarm, :die}, state) do\n    {:stop, :shutdown, state}\n  end\nend\n\ndefmodule MyApp.ExampleUsage do\n  ...snip...\n\n  @doc \"\"\"\n  Starts worker and registers name in the cluster, then joins the process\n  to the `:foo` group\n  \"\"\"\n  def start_worker(name) do\n    {:ok, pid} = Swarm.register_name(name, MyApp.Supervisor, :register, [name])\n    Swarm.join(:foo, pid)\n  end\n\n  @doc \"\"\"\n  Gets the pid of the worker with the given name\n  \"\"\"\n  def get_worker(name), do: Swarm.whereis_name(name)\n\n  @doc \"\"\"\n  Gets all of the pids that are members of the `:foo` group\n  \"\"\"\n  def get_foos(), do: Swarm.members(:foo)\n\n  @doc \"\"\"\n  Call some worker by name\n  \"\"\"\n  def call_worker(name, msg), do: GenServer.call({:via, :swarm, name}, msg)\n\n  @doc \"\"\"\n  Cast to some worker by name\n  \"\"\"\n  def cast_worker(name, msg), do: GenServer.cast({:via, :swarm, name}, msg)\n\n  
@doc \"\"\"\n  Publish a message to all members of group `:foo`\n  \"\"\"\n  def publish_foos(msg), do: Swarm.publish(:foo, msg)\n\n  @doc \"\"\"\n  Call all members of group `:foo` and collect the results,\n  any failures or nil values are filtered out of the result list\n  \"\"\"\n  def call_foos(msg), do: Swarm.multi_call(:foo, msg)\n\n  ...snip...\nend\n```\n\n## License\n\nMIT\n\n## Testing\n\n`mix test` runs a variety of tests, most of them use a cluster of\nElixir nodes to test the tracker and the registry. If you want more\nverbose output during the tests, run them like this:\n\n    # SWARM_DEBUG=true mix test\n\nThis sets the log level to `:debug`, runs ExUnit with `--trace`, and\nenables GenServer tracing on the Tracker processes.\n\n### Executing the tests locally\nIn order to execute the tests locally you'll need to have\n[Erlang Port Mapper Daemon](http://erlang.org/doc/man/epmd.html) running.\n\nIf you don't have `epmd` running you can start it using the following command:\n\n    epmd -daemon\n\n\n## TODO\n\n- automated testing (some are present)\n- QuickCheck model\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbitwalker%2Fswarm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbitwalker%2Fswarm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbitwalker%2Fswarm/lists"}