{"id":13792492,"url":"https://github.com/tarantool/vshard","last_synced_at":"2025-04-05T03:04:21.554Z","repository":{"id":37731753,"uuid":"113455762","full_name":"tarantool/vshard","owner":"tarantool","description":"The new generation of sharding based on virtual buckets","archived":false,"fork":false,"pushed_at":"2025-03-12T09:31:34.000Z","size":2085,"stargazers_count":101,"open_issues_count":143,"forks_count":31,"subscribers_count":36,"default_branch":"master","last_synced_at":"2025-03-29T02:03:29.592Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Lua","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tarantool.png","metadata":{"files":{"readme":"README.md","changelog":"changelogs/0.1.20.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-12-07T13:44:20.000Z","updated_at":"2025-03-27T13:21:24.000Z","dependencies_parsed_at":"2023-10-30T23:27:41.903Z","dependency_job_id":"21fd7db4-7722-4890-85be-ed047b011cf6","html_url":"https://github.com/tarantool/vshard","commit_stats":null,"previous_names":[],"tags_count":35,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tarantool%2Fvshard","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tarantool%2Fvshard/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tarantool%2Fvshard/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tarantool%2Fvshard/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tarantool","download_url":"https://codeload.
github.com/tarantool/vshard/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247280262,"owners_count":20912967,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T22:01:12.828Z","updated_at":"2025-04-05T03:04:21.534Z","avatar_url":"https://github.com/tarantool.png","language":"Lua","funding_links":[],"categories":["Packages"],"sub_categories":["Database"],"readme":"# Sharding for Tarantool\n\n[![Tarantool][tarantool-badge]][Tarantool]\n\nSharding module for **Tarantool** based on Virtual Buckets concept.\n\n![alt text](https://github.com/tarantool/vshard/blob/master/sharding_arch.png)\n\n## Prerequisites\n\n- Tarantool version 1.9+.\n\n## Install\n\nInstall **vshard** as module `tarantoolctl rocks install https://raw.githubusercontent.com/tarantool/vshard/master/vshard-scm-1.rockspec`\n\n## Contribution\n\nIn order to contribute you might want to avoid installation into regular paths.\nYou need to fetch the source code to patch it in a local folder.\n\n* `git clone \u003cthis repo or your fork\u003e`;\n* `git submodule update --init --recursive`;\n* VShard requires Tarantool being in `PATH`. So either you install one into the\n  system or you fetch Tarantool's main repository source code, build it, and\n  add to `PATH` manually these paths: `\u003cpath to tarantool build\u003e/src` and\n  `\u003cpath to tarantool build\u003e/extra`.\n\nNow vshard should be functional. 
You can try it in the `example` folder; see its\nMakefile.\n\nYour patch should pass all the existing tests (unless it is necessary to change\nthem) and should usually come with its own test. To run the tests:\n  * `cd test`;\n  * `python test-run.py` or `./test-run.py`.\n\n## Configuration\n\nA Tarantool sharded cluster consists of the following components:\n\n- **Storage** - a storage node which stores a subset of the sharded data.\n   Each shard is deployed as a set of replicated storages (a **replicaset**).\n- **Router** - a query router which provides an interface between\n  the sharded cluster and clients.\n\nA minimal viable sharded cluster should consist of:\n\n- One or more replicasets consisting of two or more **Storage** instances;\n- One or more **Router** instances.\n\nThe number of **Storage** instances in a replicaset defines the redundancy\nfactor of the data. The recommended value is 3 or more. The number of routers\nis not limited, because routers are completely stateless. We recommend\nincreasing the number of routers when existing instances become CPU or I/O bound.\n\n**Router** and **Storage** applications perform completely different sets of\nfunctions and should be deployed to different Tarantool instances.\nAlthough it is technically possible to place the `router` application\non every Storage node, this approach is highly discouraged and should be\navoided in production deployments.\n\nAll **Storage** instances can be deployed with an absolutely identical instance\n(configuration) file. A **Storage** application automatically identifies\nthe running instance in the configuration and determines the replicaset to\nwhich it belongs. Due to a limitation of Tarantool 1.7.x, self-identification\nis currently performed by the instance name used by tarantoolctl, i.e. the name\nof the file used to start the Tarantool instance, without the `.lua` extension. Please ensure\nthat all storage nodes have globally unique instance names. 
It makes sense to\nuse some convention for naming storages in the cluster.\nFor example:\n\n- `storage_1_a` - storage node #1 for replicaset #1\n- `storage_1_b` - storage node #2 for replicaset #1\n- `storage_1_c` - storage node #3 for replicaset #1\n- `storage_2_a` - storage node #1 for replicaset #2\n- ...\n\nAll Router instances can also be deployed with an absolutely identical\ninstance (configuration) file. Instance names are not important\nfor routers because routers are stateless and know nothing about each other.\n\nAll cluster nodes must have an identical cluster topology for proper operation.\nIt is your obligation to ensure that this configuration is identical.\nWe suggest using a configuration management tool, such as Ansible or Puppet,\nto deploy the cluster.\n\nA sample cluster configuration for **Storage** and **Router** can look like:\n\n```Lua\nlocal cfg = {\n    memtx_memory = 100 * 1024 * 1024,\n    bucket_count = 10000,\n    rebalancer_disbalance_threshold = 10,\n    rebalancer_max_receiving = 100,\n    sharding = {\n        ['cbf06940-0790-498b-948d-042b62cf3d29'] = { -- replicaset #1\n            replicas = {\n                ['8a274925-a26d-47fc-9e1b-af88ce939412'] = {\n                    uri = 'storage:storage@127.0.0.1:3301',\n                    name = 'storage_1_a',\n                    master = true\n                },\n                ['3de2e3e1-9ebe-4d0d-abb1-26d301b84633'] = {\n                    uri = 'storage:storage@127.0.0.1:3302',\n                    name = 'storage_1_b'\n                }\n            },\n        }, -- replicaset #1\n        ['ac522f65-aa94-4134-9f64-51ee384f1a54'] = { -- replicaset #2\n            replicas = {\n                ['1e02ae8a-afc0-4e91-ba34-843a356b8ed7'] = {\n                    uri = 'storage:storage@127.0.0.1:3303',\n                    name = 'storage_2_a',\n                    master = true\n                },\n                ['001688c3-66f8-4a31-8e19-036c17d489c2'] = {\n                    uri = 
'storage:storage@127.0.0.1:3304',\n                    name = 'storage_2_b'\n                }\n            },\n        }, -- replicaset #2\n    }, -- sharding\n    weights = ... -- See details below.\n}\n```\n\n* `sharding` defines the logical topology of the sharded Tarantool cluster;\n* `bucket_count` is the total bucket count in a cluster. **It cannot be changed after bootstrap!**;\n* `rebalancer_disbalance_threshold` is the maximal bucket disbalance, in percent. The disbalance of each replicaset is calculated by the formula: `|etalon_bucket_count - real_bucket_count| / etalon_bucket_count * 100`;\n* `rebalancer_max_receiving` is the maximal number of buckets that a single replicaset can receive in parallel. This number must be limited because otherwise, when a new replicaset is added to a cluster, the rebalancer would send it a very large number of buckets from the existing replicasets, and applying all of them would put a heavy load on the new replicaset.\n\nExample of using `rebalancer_max_receiving`:\u003cbr\u003e\nSuppose it is equal to 100, the total bucket count is 1000, and there are\n3 replicasets with 333, 333 and 334 buckets. When a new replicaset is\nadded, each replicaset's etalon bucket count becomes 250. The new\nreplicaset does not receive 250 buckets at once - it receives 100, 100\nand 50 sequentially instead.\n\n### Replicas weight configuration\n\nA router sends all read-write requests to the master replica only (the one with `master = true` in the config). For read-only requests the sharding can use weights, if they are specified. The weights are used for failover and for sending read-only requests not only to the master replica, but to the 'nearest' available replica. 
Weights define the distances between replicas within a replicaset.\n\nYou can use weights, for example, to define the physical distance between\na router and each replica in each replicaset - in such a case read-only\nrequests are sent to the literally nearest replica.\u003cbr\u003e\nOr you can use weights to define which replicas are more powerful and can\nprocess more requests per second.\n\nThe idea is to specify a zone for each router and replica, and to fill in a matrix of relative zone weights. This allows different zones to see different weights for the same zone.\n\nTo define weights you can set the `zone` attribute for each replica in the config above. For example:\n```Lua\nlocal cfg = {\n   sharding = {\n      ['...replicaset_uuid...'] = {\n         replicas = {\n            ['...replica_uuid...'] = {\n                 ...,\n                 zone = \u003cnumber or string\u003e\n            }\n         }\n      }\n   }\n}\n```\nIn the `weights` attribute of the `vshard.router.cfg` argument you can specify relative weights for each zone pair. Example:\n```Lua\nweights = {\n    [1] = {\n        [2] = 1, -- Zone 1 routers see the weight of zone 2 as 1.\n        [3] = 2, -- Weight of zone 3 as 2.\n        [4] = 3, -- ...\n    },\n    [2] = {\n        [1] = 10,\n        [2] = 0,\n        [3] = 10,\n        [4] = 20,\n    },\n    [3] = {\n        [1] = 100,\n        [2] = 200, -- Zone 3 routers see the weight of zone 2 as 200. Note\n                   -- that it is not equal to the weight of zone 2 visible\n                   -- from zone 1.\n        [4] = 1000,\n    }\n}\n\nlocal cfg = vshard.router.cfg({weights = weights, sharding = ...})\n```\nThe last requirement for weighted routing is to specify the `zone` parameter in `vshard.router.cfg`.\n\n### Rebalancer configuration\n\nSharding has a built-in rebalancer, which periodically wakes up and moves data from one node to another, bucket by bucket. 
It takes all tuples with the same bucket id from all spaces on a node and moves them to a less loaded node.\n\nTo help the rebalancer with its work you can specify replicaset weights. These\nweights are not the same as the replica weights defined in the section\nabove. The bigger the replicaset weight, the more buckets it can store. You\ncan consider weights as the relative data amount on a replicaset. For\nexample, if one replicaset has weight 100 and another has 200, then the\nsecond will store twice as many buckets as the first one.\n\nBy default, all weights of all replicasets are equal.\n\nYou can use weights, for example, to store more data on replicasets with more memory space, or to store more data on hot replicasets. It depends on your application.\n\nAll other fields are passed to box.cfg() as is without any modifications.\n\n**Replicaset Parameters**:\n\n* `[UUID] - string` - replicaset unique identifier, generate a random one\n  using `uuidgen(1)`;\n* `replicas - table` - a map of replicas with key = replica UUID and\n  value = instance (see details below);\n* `weight - number` - rebalancing weight - the smaller it is, the fewer buckets the replicaset stores.\n\n**Instance Parameters**:\n\n- `[UUID] - string` - instance unique identifier, generate a random one using `uuidgen(1)`;\n- `uri - string` - Uniform Resource Identifier of the remote instance with **required** login and password;\n- `name - string` - identifier of the remote instance taken from its filename (it does not have to be unique, but unique names are recommended);\n- `zone - string or number` - replica zone (see weighted routing in the section 'Replicas weight configuration');\n- `master - boolean` - true, if a replica is the master in its replicaset. You can define 0 or 1 masters for each replicaset. 
The master accepts all write requests.\n\nOn routers call `vshard.router.cfg(cfg)`:\n\n```Lua\ncfg.listen = 3300\n\n-- Start the database with sharding\nvshard = require('vshard')\nvshard.router.cfg(cfg)\n```\n\nOn storages call `vshard.storage.cfg(cfg, \u003cINSTANCE_UUID\u003e)`:\n\n```Lua\n-- UUID of this instance\nlocal MY_UUID = \"de0ea826-e71d-4a82-bbf3-b04a6413e417\"\n\n-- Call a configuration provider\nlocal cfg = dofile('localcfg.lua')\n\n-- Start the database with sharding\nvshard = require('vshard')\nvshard.storage.cfg(cfg, MY_UUID)\n```\n\n`vshard.storage.cfg()` will **automatically** call `box.cfg()` and configure\nthe listen port and replication.\n\nSee `router.lua` and `storage.lua` at the root directory of this project\nfor a sample configuration.\n\n## Defining Schema\n\nThe database schema is stored on storages; routers know nothing about\nspaces and tuples.\n\nSpaces should be defined in your storage application using `box.once()`:\n\n```Lua\nbox.once(\"testapp:schema:1\", function()\n    local customer = box.schema.space.create('customer')\n    customer:format({\n        {'customer_id', 'unsigned'},\n        {'bucket_id', 'unsigned'},\n        {'name', 'string'},\n    })\n    customer:create_index('customer_id', {parts = {'customer_id'}})\n    customer:create_index('bucket_id', {parts = {'bucket_id'}, unique = false})\n\n    local account = box.schema.space.create('account')\n    account:format({\n        {'account_id', 'unsigned'},\n        {'customer_id', 'unsigned'},\n        {'bucket_id', 'unsigned'},\n        {'balance', 'unsigned'},\n        {'name', 'string'},\n    })\n    account:create_index('account_id', {parts = {'account_id'}})\n    account:create_index('customer_id', {parts = {'customer_id'}, unique = false})\n    account:create_index('bucket_id', {parts = {'bucket_id'}, unique = false})\n    box.snapshot()\nend)\n```\n\nEvery space you plan to shard must have an unsigned `bucket_id` field indexed\nby a `bucket_id` TREE index. 
Spaces without a `bucket_id` index do not\nparticipate in the sharded Tarantool cluster and can be used as regular\nspaces if needed.\n\n## Adding Data\n\nAll DML operations with data should be performed via `router`. The\nonly operation supported by `router` is `CALL` via `bucket_id`:\n\n```Lua\nresult = vshard.router.call(bucket_id, mode, func, args)\n```\n\n`vshard.router.call()` routes the call `result = func(unpack(args))` to the shard\nwhich serves `bucket_id`.\n\n`bucket_id` is just a regular number in the range 1..`bucket_count`, where\n`bucket_count` is a configuration parameter. This number can be assigned in\nan arbitrary way by the client application. A sharded Tarantool cluster uses this\nnumber as an opaque unique identifier to distribute data across replicasets. It\nis guaranteed that all records with the same `bucket_id` will be stored on the\nsame replicaset.\n\n## Router public API\n\nAll client requests should be sent to routers.\n\n#### `vshard.router.bootstrap()`\n\nPerform initial distribution of buckets across replicasets.\n\n#### `result = vshard.router.call(bucket_id, mode, func, args, opts)`\n\nCall function `func` on the shard which serves `bucket_id`.\n\n#### `netbox, err = vshard.router.route(bucket_id)`\n\nReturn the replicaset object for the specified `bucket_id`.\n\n#### `replicaset, err = vshard.router.routeall()`\n\nReturn all available replicaset objects in a map of type: `{UUID = replicaset}`.\n\n#### `bucket_id = vshard.router.bucket_id(key)`\n\nCalculate `bucket_id` using a simple built-in hash function.\n\n#### `bucket_count = vshard.router.bucket_count()`\n\nReturn the bucket count configured by `vshard.router.cfg()`.\n\n#### `vshard.router.sync(timeout)`\n\nWait until all data are synchronized on replicas.\n\n#### `info = vshard.router.info()`\n\nReturns the current router status.\n\n**Example:**\n\n```\nvshard.router.info()\n---\n- replicasets:\n  - master:\n      state: active\n      uri: storage:storage@127.0.0.1:3301\n      uuid: 
2ec29309-17b6-43df-ab07-b528e1243a79\n  - master:\n      state: active\n      uri: storage:storage@127.0.0.1:3303\n      uuid: 810d85ef-4ce4-4066-9896-3c352fec9e64\n...\n```\n\n**Parameters of `vshard.router.call()`:**\n\n* `bucket_id` - bucket identifier\n* `mode` - `read` or `write`\n* `func` - function name\n* `args` - array of arguments to func\n* `opts` - call options. Can contain only one parameter - `timeout` in seconds.\n\n**Returns:** the original return value from `func`, or nil and an error object.\nThe error object has a type attribute equal to 'ShardingError' or one of the error types from tarantool ('ClientError', 'OutOfMemory', 'SocketError' ...).\n* `ShardingError` - returned on errors specific to sharding:\n  replicaset unavailability, master absence, wrong bucket id etc. It has an\n  attribute `code` with one of the values from vshard.error.code, an optional\n  `message` with a human-readable error description, and other attributes\n  specific to the particular error code;\n* Other errors: see tarantool errors.\n\n`route()` and `routeall()` return replicaset objects. A replicaset has two methods:\n\n#### `replicaset.callro(func, args, opts)`\n\nCall a function `func` on the nearest available replica (distances are\ndefined using `replica.zone` and the `cfg.weights` matrix - see sections\nabove) with the specified arguments. 
It is recommended to call only\nread-only functions using `callro()`, because the function may be\nexecuted on a replica that is not the master.\n\n#### `replicaset.callrw(func, args, opts)`\n\nSame as `callro()`, but the call is guaranteed to be executed on a master.\n\n## Storage public API\n\n#### `vshard.storage.cfg(cfg, name)`\n\nConfigure the database and start sharding for instance `name`.\n\n- `cfg` - configuration table, see examples above.\n- `name` - unique instance name to identify the instance in cfg.sharding\n   table\n\nSee examples above.\n\n#### `vshard.storage.sync(timeout)`\n\nWait until all data are synchronized on replicas.\n\n#### `info = vshard.storage.info()`\n\nReturns the current storage status.\n\n**Example:**\n\n```\nvshard.storage.info()\n---\n- buckets:\n    2995:\n      status: active\n      id: 2995\n    2997:\n      status: active\n      id: 2997\n    2999:\n      status: active\n      id: 2999\n  replicasets:\n    2dd0a343-624e-4d3a-861d-f45efc571cd3:\n      uuid: 2dd0a343-624e-4d3a-861d-f45efc571cd3\n      master:\n        state: active\n        uri: storage:storage@127.0.0.1:3301\n        uuid: 2ec29309-17b6-43df-ab07-b528e1243a79\n    c7ad642f-2cd8-4a8c-bb4e-4999ac70bba1:\n      uuid: c7ad642f-2cd8-4a8c-bb4e-4999ac70bba1\n      master:\n        state: active\n        uri: storage:storage@127.0.0.1:3303\n        uuid: 810d85ef-4ce4-4066-9896-3c352fec9e64\n...\n```\n\n## Storage internal API\n\n#### `status, result = bucket_stat(bucket_id)`\n\nReturns information about `bucket_id`:\n\n```\nunix/:./data/storage_1_a.control\u003e vshard.storage.bucket_stat(1)\n---\n- 0\n- status: active\n  id: 1\n...\n```\n\n#### `vshard.storage.bucket_delete_garbage(bucket_id)`\n\nForce garbage collection for `bucket_id` (in case the bucket was\ntransferred to a different replicaset).\n\n#### `status, result = bucket_collect(bucket_id)`\n\nCollect all data logically stored in `bucket_id`:\n\n```\nvshard.storage.bucket_collect(1)\n---\n- 0\n- - - 514\n    - - [10, 1, 1, 
100, 'Account 10']\n      - [11, 1, 1, 100, 'Account 11']\n      - [12, 1, 1, 100, 'Account 12']\n      - [50, 5, 1, 100, 'Account 50']\n      - [51, 5, 1, 100, 'Account 51']\n      - [52, 5, 1, 100, 'Account 52']\n  - - 513\n    - - [1, 1, 'Customer 1']\n      - [5, 1, 'Customer 5']\n```\n\n#### `status = bucket_force_create(bucket_id)`\n\nForce creation of `bucket_id` on this replicaset.\nUse only for manual recovery or initial redistribution.\n\n#### `status = bucket_force_drop(bucket_id)`\n\nForce removal of `bucket_id` from this replicaset.\nUse only for manual recovery or initial redistribution.\n\n#### `status = bucket_send(bucket_id, to)`\n\nTransfer `bucket_id` from the current replicaset to a remote replicaset.\n\n**Parameters:**\n\n- `bucket_id` - bucket identifier\n- `to` - remote replicaset UUID\n\n#### `status = bucket_recv(bucket_id, from, data)`\n\nReceive `bucket_id` from a remote replicaset.\n\n**Parameters:**\n\n- `bucket_id` - bucket identifier\n- `from` - UUID of the original replicaset\n- `data` - bucket data in the same format as `bucket_collect()` returns\n\n\n\n### Sharding architecture\n#### Overview\n\nConsider a distributed Tarantool cluster that consists of subclusters called shards, each storing some part of data. Each shard, in its turn, constitutes a replicaset consisting of several replicas, one of which serves as a master node that processes all read and write requests.\n\nThe whole dataset is logically partitioned into a predefined number of virtual buckets (vbuckets), each assigned a unique number ranging from 1 to N, where N is the total number of vbuckets. The number of vbuckets is specifically chosen to be several orders of magnitude larger than the potential number of cluster nodes, even given future cluster scaling. For example, with M projected nodes the dataset may be split into 100 * M or even 1,000 * M vbuckets. 
Care should be taken when picking the number of vbuckets: if too large, it may require extra memory for storing the routing information; if too small, it may decrease the granularity of rebalancing.\n\nEach shard stores a unique subset of vbuckets, which means that a vbucket cannot belong to several shards at once, as illustrated below:\n\n```\nvb1 vb2 vb3 vb4 vb5 vb6 vb7 vb8 vb9 vb10 vb11\n    sh1         sh2        sh3       sh4\n```\n\nThis shard-to-vbucket mapping is stored in a table in one of Tarantool’s system spaces, with each shard holding only a specific part of the mapping that covers those vbuckets that were assigned to this shard.\n\nApart from the mapping table, the bucket id is also stored in a special field of every tuple of every table participating in sharding.\n\nOnce a shard receives any request (except for SELECT) from an\napplication, this shard checks the bucket id specified in the request\nagainst the table of bucket ids that belong to a given node. If the\nspecified bucket id is invalid, the request gets terminated with the\nfollowing error: “wrong bucket”. Otherwise the request is executed, and\nall the data created in the process is assigned the bucket id specified\nin the request. Note that the request should only modify the data that\nhas the same bucket id as the request itself.\n\nStoring bucket ids both in the data itself and the mapping table ensures data consistency regardless of the application logic and makes rebalancing transparent for the application. Storing the mapping table in a system space ensures sharding is performed consistently in case of a failover, as all the replicas in a shard share a common table state.\n\n#### Router\n\nOn their way from the application to the sharded cluster, all the requests pass through a separate program component called a router. 
Its main function is to hide the cluster topology from the application, namely:\n\n* the number of shards and their placement;\n* the rebalancing process;\n* the occurrence of a failover caused by the shutdown of a replica.\n\nA router can also calculate a bucket id on its own provided that the application clearly defines rules for calculating a bucket id based on the request data. To do it, a router needs to be aware of the data schema.\n\nA router is stateless and doesn’t store the cluster topology. Nor does it rebalance data.\nA router is a separate program component that can be implemented both in the storage and application layers, and its placement is application-driven.\n\nA router maintains a constant pool of connections to all the storages that is created at startup. Creating it this way helps avoid configuration errors. Once a pool is created, a router caches the current state of the \\_vbucket table to speed up the routing. In case a bucket id is moved to another storage as a result of data rebalancing or one of the shards fails over to a replica, a router updates the routing table in a way that's transparent for the application.\n\nSharding is not integrated into any centralized configuration storage system. It is assumed that the application itself handles all the interactions with such systems and passes sharding parameters. That said, the configuration can be changed dynamically - for example, when adding or deleting one or several shards:\n\n1. to add a new shard to the cluster, a system administrator first changes the configuration of all the routers and then the configuration of all the storages;\n2. the new shard becomes available to the storage layer for rebalancing;\n3. as a result of rebalancing, one of the vbuckets is moved to the new shard;\n4. 
when trying to access the vbucket, a router receives a special error code that specifies the new vbucket location.\n##### CRUD (create, replace, update, delete) operations\nCRUD operations can either be executed in a stored procedure inside a storage or initiated by the application. In any case, the application must include the operation bucket id in a request. When executing an INSERT request, the operation bucket id is stored in a newly created tuple. In other cases, the specified operation bucket id is checked against the bucket id of the tuple being modified.\n##### SELECT requests\nSince a storage is not aware of the mapping between a bucket id and a primary key, all the SELECT requests executed in stored procedures inside a storage are only executed locally. Those SELECT requests that were initiated by the application are forwarded to a router. Then, if the application has passed a bucket id, a router uses it for shard calculation.\n\n##### Calling stored procedures\nThere are several ways of calling stored procedures in cluster replicasets. Stored procedures can be called on a specific vbucket located in a replicaset or without specifying any particular vbucket. In the former case, it is necessary to differentiate between read and write procedures, as write procedures are not applicable to vbuckets that are being migrated. All the routing validity checks performed for sharded DML operations hold true for vbucket-bound stored procedures as well.\n\n#### Replicaset balancing algorithm\n\nThe main objective of balancing is to add and delete replicasets as well as to even out the load based on the physical capacities of certain replicasets.\n\nFor balancing to work, each replicaset can be assigned a weight that is proportional to its capacity for data storage. The simplest capacity metric is the percentage of vbuckets stored by a replicaset. 
Another possible metric is the total size of all sharded spaces in a replicaset.\n\nBalancing is performed as follows: all the replicasets are ordered based on the value of the M/W ratio, where M is a replicaset capacity metric and W is a replicaset weight. If the difference between the smallest and the largest value exceeds a predefined threshold, balancing is launched for the corresponding replicasets. Once done, the process can be repeated for the next pair of replicasets with different policies, depending on how much the calculated metric deviates from the cluster mean (for example, the minimum value is compared with the maximum one, then with the second largest and so on).\n\nWith this approach, assigning a zero weight to a replicaset would allow evenly distributing all of its vbuckets among other cluster nodes, and adding a new replicaset with a zero load would result in vbuckets being moved to it from other replicasets.\n\nIn the migration process, a vbucket goes through two stages at the source and the receiver. At the source, the vbucket is put into the sending state, in which it accepts all the read requests, but declines any write requests. Once the vbucket is activated at the receiver, it is marked as moved at the source and declines all requests from here on.\n\nAt the receiver, the vbucket is created in the receiving state, and then data copying starts, with all the requests getting declined. Once all the data is copied over, the vbucket is activated and starts accepting all requests.\n\nIf a node assumes the role of a master, all the vbuckets in the sending state are checked first. For such vbuckets, a request is sent to the destination replicaset, and if a vbucket is active there, it is deleted locally. 
All the vbuckets in the receiving state are simply deleted.\n\n#### Bootstrapping and restarting a storage cluster\n\nThe main problem when setting up a cluster is that the identifiers of its components (replicasets and Tarantool instances) are unknown. Adding or removing a replicaset one instance at a time creates a risk of data loss. The optimal way of setting up a cluster and adding new nodes to it is to independently create a fully functional replicaset and then add it to the cluster configuration. In this case, all the parameters necessary for updating the configuration are known prior to adding any new members to the cluster.\n\nOnce a new replicaset is fully set up in the cluster and all of its nodes are notified of the configuration changes, vbuckets are migrated to the new node.\n\nIf a replicaset master fails, it is recommended to:\n\n1. Switch one of the replicas into master mode for all the replicaset instances, which would allow the new master to process all the incoming requests.\n2. Update the configuration of all the cluster members, which would result in all connection requests being forwarded to the new master.\n\nMonitoring the master and switching instance modes can be handled by any external utility.\n\nTo perform a planned outage of a replicaset master, it is recommended to:\n\n1. Update the configuration of the master and wait for its replicas to get into sync. The master will be forwarding all the requests to a new master.\n2. Switch a new instance into master mode.\n3. Update the configuration of all the nodes.\n4. Shut down the old master.\n\nTo perform a planned outage of a cluster replicaset, it is recommended to:\n\n1. Migrate all the vbuckets to other cluster storages.\n2. Update the configuration of all the nodes.\n3. Shut down the replicaset.\n\nIn case a whole replicaset fails, some part of the dataset becomes inaccessible. Meanwhile, a router tries to reconnect to the master of the failed replicaset on a regular basis. 
This way, once the replicaset is up and running again, the cluster is automatically restored to its full working condition.\n\n## Local Development\n\n### Quick Start\n\nChange directory to `example/` and type `make` to run the development cluster:\n\n```bash\n# cd example/\n# make\ntarantoolctl start storage_1_a\nStarting instance storage_1_a...\nStarting cluster for replica 'storage_1_a'\nSuccessfully found myself in the configuration\nCalling box.cfg()...\n[CUT]\ntarantoolctl enter router_1\nconnected to unix/:./data/router_1.control\nunix/:./data/router_1.control\u003e\n\nunix/:./data/router_1.control\u003e vshard.router.info()\n---\n- replicasets:\n  - master:\n      state: active\n      uri: storage:storage@127.0.0.1:3301\n      uuid: 2ec29309-17b6-43df-ab07-b528e1243a79\n  - master:\n      state: active\n      uri: storage:storage@127.0.0.1:3303\n      uuid: 810d85ef-4ce4-4066-9896-3c352fec9e64\n...\n\nunix/:./data/router_1.control\u003e vshard.router.call(1, 'read', 'no_such_func')\n---\n- error: Procedure 'no_such_func' is not defined\n...\n\n```\n\n### Details\n\nThe repository includes a pre-configured development cluster of 1 router\nand 2 replicasets of 2 nodes each (5 Tarantool instances in total):\n\n- `router_1` - a router instance\n- `storage_1_a` - a storage instance, master of replicaset #1\n- `storage_1_b` - a storage instance, slave of replicaset #1\n- `storage_2_a` - a storage instance, master of replicaset #2\n- `storage_2_b` - a storage instance, slave of replicaset #2\n\nAll instances are managed by the `tarantoolctl` utility from the root directory\nof the project. Use `tarantoolctl start router_1` to start `router_1`,\n`tarantoolctl enter router_1` to enter the admin console and so on.\n`make start` starts all instances:\n\n```bash\n# make start\n# ps x|grep tarantoo[l]\n15373 ?        Ssl    0:00 tarantool storage_1_a.lua \u003crunning\u003e\n15379 ?        Ssl    0:00 tarantool storage_1_b.lua \u003crunning\u003e\n15403 ?        
Ssl    0:00 tarantool storage_2_a.lua \u003crunning\u003e\n15418 ?        Ssl    0:00 tarantool storage_2_b.lua \u003crunning\u003e\n15433 ?        Ssl    0:00 tarantool router_1.lua \u003crunning\u003e\n```\n\nEssential commands you need to know:\n\n* `make start` - start all Tarantool instances\n* `make stop` - stop all Tarantool instances\n* `make logcat` - show logs from all instances\n* `make enter` - enter the admin console on router\\_1\n* `make clean` - clean all persistent data\n* `make test` or `cd test \u0026\u0026 ./test-run.py` - run the test suite\n* `make` = stop + clean + start + enter\n\n## Terms and definitions\n\nThis section contains definitions of key terms used throughout the document.\n\n**Cluster** - Set of nodes that form a single group\u003cbr\u003e\n**Horizontal scaling** - Partitioning data across several servers and adding more servers as necessary\u003cbr\u003e\n**Node** - Physical or virtual server instance\u003cbr\u003e\n**Rebalancing** - Moving some part of data to new servers added to the cluster\u003cbr\u003e\n**Replicaset** - Container for storing data. Each replicaset stores a unique subset of vbuckets (one vbucket cannot belong to several replicasets at once)\u003cbr\u003e\n**Router** - Server responsible for routing requests from the system to certain cluster nodes\u003cbr\u003e\n**Sharding** - Database architecture that allows splitting data between two or more database instances by some key. 
Sharding is a special case of horizontal scaling.\u003cbr\u003e\n**Virtual bucket (vbucket)** - Sharding key that determines which replicaset stores certain data\u003cbr\u003e\n\n## See Also\n\nFeel free to contact us on the [Telegram (eng)] channel or send a pull request.\n\n* [Tarantool]\n* [Maillist]\n* [Telegram (rus)]\n\n[tarantool-badge]: https://img.shields.io/badge/Tarantool-1.9-blue.svg?style=flat\n[Tarantool]: https://tarantool.org/\n[Telegram (eng)]: http://telegram.me/tarantool\n[Telegram (rus)]: http://telegram.me/tarantoolru\n[Maillist]: https://groups.google.com/forum/#!forum/tarantool\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftarantool%2Fvshard","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftarantool%2Fvshard","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftarantool%2Fvshard/lists"}