{"id":13525046,"url":"https://github.com/instaclustr/esop","last_synced_at":"2026-07-01T15:00:44.453Z","repository":{"id":37745104,"uuid":"143301289","full_name":"instaclustr/esop","owner":"instaclustr","description":"Cloud-enabled backup and restore tool for Apache Cassandra","archived":false,"fork":false,"pushed_at":"2026-07-01T10:10:47.000Z","size":1768,"stargazers_count":54,"open_issues_count":11,"forks_count":27,"subscribers_count":5,"default_branch":"master","last_synced_at":"2026-07-01T11:26:35.090Z","etag":null,"topics":["apache","aws","azure","backup","backuping","cassandra","ceph","clouds","gcp","kubernetes","minio","netapp-public","ops","oracle","restore","restoring","s3","sstable","sstables","storage"],"latest_commit_sha":null,"homepage":"https://instaclustr.com","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/instaclustr.png","metadata":{"files":{"readme":"README.adoc","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2018-08-02T13:51:45.000Z","updated_at":"2026-07-01T10:11:03.000Z","dependencies_parsed_at":"2023-11-20T20:31:16.424Z","dependency_job_id":"6920b095-a5a7-40a2-92e8-7d38b7de6703","html_url":"https://github.com/instaclustr/esop","commit_stats":{"total_commits":252,"total_committers":12,"mean_commits":21.0,"dds":0.4563492063492064,"last_synced_commit":"fcfab5cc64801cc1834ea0998992a7686a4f0459"},"previous_names":["instaclustr/instaclustr-esop","instaclustr/cassandra-backup"],"tags_count":65,"template":false,"template_full_nam
e":null,"purl":"pkg:github/instaclustr/esop","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/instaclustr%2Fesop","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/instaclustr%2Fesop/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/instaclustr%2Fesop/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/instaclustr%2Fesop/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/instaclustr","download_url":"https://codeload.github.com/instaclustr/esop/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/instaclustr%2Fesop/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":35011257,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-07-01T02:00:05.325Z","response_time":130,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache","aws","azure","backup","backuping","cassandra","ceph","clouds","gcp","kubernetes","minio","netapp-public","ops","oracle","restore","restoring","s3","sstable","sstables","storage"],"created_at":"2024-08-01T06:01:15.552Z","updated_at":"2026-07-01T15:00:44.422Z","avatar_url":"https://github.com/instaclustr.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Instaclustr 
Esop\n\nimage:https://img.shields.io/maven-central/v/com.instaclustr/esop-core.svg?label=Maven%20Central[link=https://central.sonatype.com/search?q=esop-core]\nimage:https://img.shields.io/circleci/build/github/instaclustr/esop/master.svg[\"Instaclustr\",link=\"https://app.circleci.com/pipelines/github/instaclustr/esop?branch=master\"]\n\n_Swiss knife for Apache Cassandra backup and restore_\n\nimage::Esop.png[Esop,width=50%]\n\n- Website: https://www.instaclustr.com/\n\n- Documentation: https://www.instaclustr.com/support/documentation/\n\nThis repository is the home of https://en.wikipedia.org/wiki/Aesop[Esop], the backup and restoration tool for Cassandra from Instaclustr.\n\nEsop 2.0.0 is not compatible with any Esop 1.x.x version.\nEsop 2.0.0 changed the format of the manifest which is uploaded to a remote\nlocation; hence, as of now, Esop 2.0.0 cannot read manifests written by versions 1.x.x.\n\nIt is recommended to upgrade all your Esop instances to version 4.0.3 before upgrading them to version 4.1.0. Esop v4.1.0 depends on\nAWS S3 Client Encryption v4.0.0, which is not compatible with versions older than v3.6.0.\n\nEsop is able to perform these operations and has these features:\n\n* Backup and restore of SSTables\n* Backup and restore of commit logs\n* Restoration of data into a Cassandra schema or a different table schema\n* Backing up to and restoring from S3 (Oracle and Ceph via Object Gateway too), Azure, or GCP, or into any local destination or other storage,\nprovided it is easily implementable\n* Listing of backups and their removal (from remote location, s3, azure, gcp), global removal of backups across all nodes in a cluster\n* Periodic removal of backups (e.g. 
after 10 days)\n* Efficient upload and download—it will upload only SSTables which are not present remotely so\nany subsequent backups will upload and restores will download only the difference\n* When used in connection with https://github.com/instaclustr/icarus[Instaclustr Icarus] it is possible to back up **simultaneously**, so there\nmight be several concurrent backups which may overlap in what they back up\n* Possible to restore a whole node / cluster _from scratch_\n* In connection with Icarus, it is possible to **restore on a running cluster** so no\ndowntime is necessary\n* It takes care of details such as initial tokens, auto bootstrapping, and so on\n* Ability to throttle the bandwidth used for backup\n* Point-in-time restoration of commit logs\n* Verification of downloaded data: hashes are computed upon upload and download and they have to match, otherwise the restoration fails\n* It is possible to restore tables under different names so they do not clash with your current tables. This is ideal when you want to investigate / check data before you restore the original tables, to see what data you will have once you restore it\n* Retry of failed operations against s3 when an upload / download fails\n* Support of multiple data directories for a Cassandra node\n\nThis tool is used as a command line utility and it is meant to be executed from a shell\nor from scripts. However, this tooling is also embedded seamlessly into Instaclustr Icarus.\nThe advantage of using Icarus is that you may back up and restore your node (or whole cluster)\nremotely by calling a respective REST endpoint so Icarus can execute the respective backup or\nrestore _operation_. Icarus is designed to be run alongside a node and it talks to Cassandra via\nJMX (no need to expose JMX publicly).\n\nIn addition, this tool has to be run in the very same context/environment as a Cassandra\nnode—it needs to see the whole directory structure of a node (data dir etc.) 
as it will\nupload these files during a backup and download them on a restore. If you want to be able to\nrestore and back up remotely, use Icarus which embeds this project.\n\n## Supported Cassandra Versions\n\nSince we are talking to Cassandra via JMX, almost any Cassandra version is supported.\nWe are testing this tool with Cassandra 5.x and 4.x.\n\n## Usage\n\nThe released artifact is on https://search.maven.org/artifact/com.instaclustr/esop[Maven Central].\nYou may want to build it on your own with the standard Maven targets. After this project is built by `mvn clean install`\n(refer to \u003c\u003cbuild and tests\u003e\u003e for more details), the binary is in `target` and it is called `instaclustr-esop.jar`.\nThis binary is all you need to back up and restore. It is a command line application; invoke it without any arguments to\nsee help. You can invoke `help backup` for the `backup` command, for example.\n\n----\n$ java -jar target/esop.jar\nMissing required subcommand.\nUsage: \u003cmain class\u003e [-V] COMMAND\n  -V, --version   print version information and exit\nCommands:\n  backup             Take a snapshot of this nodes Cassandra data and upload it\n                       to remote storage. Defaults to a snapshot of all\n                       keyspaces and their column families, but may be\n                       restricted to specific keyspaces or a single\n                       column-family.\n  restore            Restore the Cassandra data on this node to a specified\n                       point-in-time.\n  commitlog-backup   Upload archived commit logs to remote storage.\n  commitlog-restore  Restores archived commit logs to node.\n----\n\nYou get detailed help by invoking the `help` subcommand like this:\n\n----\n$ java -jar target/esop.jar backup help\n----\n\n### Connecting to Cassandra Node\n\nAs already mentioned, this tool expects to be invoked alongside a node - it needs\nto be able to read/write into Cassandra data directories. 
For other operations, such as\nreading tokens, it connects to the respective node via JMX. By default, it will try to connect\nto `service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi`. It is possible to override this\nand other related settings via the command line arguments. It is also possible to connect to\nsuch nodes securely if it is necessary, and this tool also supports specifying keystore, truststore,\nuser name and password etc. For brevity, please consult the command line `help`.\n\nIf you do not want to specify credentials on the command line, you can put them into a file and\nreference it by the `--jmx-credentials` option. The content of this file is treated as a standard Java property file,\nexpecting this content:\n\n----\nusername=jmxusername\npassword=jmxpassword\nkeystorePassword=keystorepassword\ntruststorePassword=truststorepassword\n----\n\nNot all sub-commands require the connection to Cassandra to exist. As of now, a JMX connection is\nnecessary for:\n\n. backup of tables/keyspaces\n. restore of tables/keyspaces (hard linking and importing strategies)\n\nThe next release of this tool might relax these requirements so it would be possible to\nback up and restore a node which is offline.\n\nFor backup and restore of commit logs, a node does not have to be up. The same holds when you need to restore a node\n_from scratch_ or when you use \u003c\u003cIn-place restoration strategy\u003e\u003e.\n\n### Storage Location\n\nData to back up and to restore from is located in remote storage. The location is controlled by the flag\n`--storage-location`. The storage location has a very specific structure which also indicates where data will be\nuploaded. Locations consist of a storage _protocol_ and path. Please keep in mind that the protocol we are using is not a\n_real_ protocol. It is merely a mnemonic. 
Use either `s3`, `gcp`, `azure` or `file`.\n\nThe format is:\n\n`protocol://bucket/cluster/datacenter/node`\n\n* `protocol` is either `s3`, `azure`, `gcp`, or `file`.\n* `bucket` is the name of the bucket data will be uploaded to/downloaded from, for example `my-bucket`\n* `cluster` is the name of the cluster, for example, `test-cluster`\n* `datacenter` is the name of the datacenter a node belongs to, for example `datacenter1`\n* `node` is the identifier of a node. It might be e.g. `1`, or it might be equal to the node id (uuid)\n\nThe structure of a storage location is validated upon every request.\n\nIf we want to back up to S3, it would look like:\n\n`s3://cassandra-backups/test-cluster/datacenter1/1`\n\nIn S3, data for that node will be stored under key `test-cluster/datacenter1/1`. The same mechanism works for other clouds.\n\nFor `file` protocol, use `file:///data/backups/test-cluster/dc1/node1`.\nIn every case, a `file` location has to start with a full path (`file:///`, three slashes).\nFile location does not have a notion of a _bucket_, but we are using it here regardless—in the following examples, the _bucket_ will be _a_.\n\nIt does not matter if you put a slash at the end of the whole location; it will be removed.\n\n.file path resolution\n|===\n|storage location |path\n\n|file:///tmp/some/path/a/b/c/d\n|/tmp/some/path/a\n\n|file:///tmp/a/b/c/d\n|/tmp/a\n|===\n\n\n### Authentication Against a Cloud\n\nIn order to be able to download from and upload to a remote bucket, this tool needs to pick up\nsecurity credentials to do so. This varies across clouds. `file` protocol does not need any authentication.\n\n#### S3\n\nThe resolution of credentials for S3 uses the same mechanism as the official AWS S3 client.\nThe most notable fact is that if no credentials are set explicitly, it will try to resolve them from environment\nproperties of the node it runs on. 
If that node runs in AWS EC2, it will resolve them with the help of that particular instance.\n\nS3 connectors will expect to find environment properties `AWS_ACCESS_KEY_ID` and `AWS_SECRET_KEY`.\nThey will also accept `AWS_REGION`.\n\nIt is possible to connect to S3 via proxy; please consult the `--use-proxy` flag and the `--proxy-*` family of settings on the command line.\n\n#### Azure\n\nAzure module expects either `AZURE_STORAGE_CONNECTION_STRING`, or `AZURE_STORAGE_ACCOUNT` and `AZURE_STORAGE_KEY`, environment variables to be set.\nOnly one of the two options is necessary. If both are set, it will fail.\n\nEsop relies on Azure Block Blobs to store backups in Azure Blob Storage and it caps the maximum size of a block at 4 MB. If there is a need\nto store a bigger file, use the `azure.max.blob.block.size` system property. Units are bytes. Default: 4194304 which is 4 MB.\n\n#### GCP\n\nGCP module expects `GOOGLE_APPLICATION_CREDENTIALS` environment property or `google.application.credentials` to be set with the path to service account credentials.\n\n### Directory Structure of a Remote Destination\n\nCassandra data files as well as some meta-data needed for successful restoration are uploaded into a bucket\nof a supported cloud provider (e.g. S3, Azure, or GCP) or they are copied to a local directory.\n\nLet's say we are in a bucket called `my-cassandra-backups` in Azure, and we did a backup with storage location set to\n`azure://test-cluster/dc1/1e519de1-58bb-40c5-8fc7-3f0a5b0ae7ee`. 
Snapshot name we set via `--snapshot-tag` was `snapshot3` and\nschema version of that node was `f1159959-593d-33d1-9ade-712ea55b31ef`.\nThe content of that hypothetical bucket with same data will look like this:\n\n```\n.\n├── topology\n│   └── snapshot3-f1159959-593d-33d1-9ade-712ea55b31ef-1600645759830.json (1)\n└── test-cluster\n    └── dc1\n        ├── 1e519de1-58bb-40c5-8fc7-3f0a5b0ae7ee (2)\n        │   ├── data\n        │   │   ├── system\n        │   │   |     // data for this keyspace\n        │   │   ├── system_auth\n        │   │   |     // data for this keyspace\n        │   │   ├── system_schema\n        │   │   |     // data for this keyspace\n        │   │   ├── test1\n        │   │   │   ├── testtable1-52d74870fb9911eaa75583ff20369112\n        │   │   │   │   ├── 1-2620247400 (3)\n        │   │   │   │   │   ├── na-1-big-CompressionInfo.db\n        │   │   │   │   │   ├── na-1-big-Data.db\n        │   │   │   │   │   ├── na-1-big-Digest.crc32\n        │   │   │   │   │   ├── na-1-big-Filter.db\n        │   │   │   │   │   ├── na-1-big-Index.db\n        │   │   │   │   │   ├── na-1-big-Statistics.db\n        │   │   │   │   │   ├── na-1-big-Summary.db\n        │   │   │   │   │   └── na-1-big-TOC.txt\n        │   │   │   │   ├── 1-4234234234\n        │   │   │   │   │   ├── // other SSTable\n        │   │   │   │   └── schema.cql (4)\n        │   │   │   ├── testtable2-545c13b0fb9911eaadb9b998490b71f5\n        │   │   │   │     // other table\n        │   │   │   └── testtable3-55e8a720fb9911eaa2026b6b285d5a8a\n        │   │   │         // other table\n        │   │   └── test2\n        │   └── manifests (5)\n        │       └── snapshot1-f1159959-593d-33d1-9ade-712ea55b31ef-1600645216879.json\n        ├── 55d39d99-a9e1-44da-941c-3a46efed66b3\n        │      // other node\n        ├── 59b5e477-df39-4126-acd4-726c937fe8fc\n        │      // other node\n        └── e8fd8bca-e6cb-4a1a-82db-192e2b4b77a5\n\n```\n\n. 
When this tool is used in connection with Instaclustr Cassandra Sidecar, it also creates a _topology_ file.\n. Data for each node is stored under a directory of that very node. Here we used a UUID identifier, which is the host ID as Cassandra sees it, and it is unique.\nHence, it is impossible to accidentally store data for a different node as each node will have a unique UUID. It may happen\nthat over time we will have a cluster of the same name and a data center of the same name but the node id would still be different\nso no clash would occur.\n. Each SSTable is stored in its own directory.\n. `schema.cql` contains a CQL \"create\" statement of that table as it looked upon a respective snapshot. It is there for diagnostic purposes and so that we might\nalso import data by means other than this tool, as that table has to be created in the first place before importing any data into it.\n. `manifests` directory holds JSON files which list all files related to a snapshot as well as other meta information. Its content will be discussed later.\n\nThe directory where SSTable files are found, in our example for `test1.testtable1`, is `1-2620247400`. `1` means the\ngeneration, `2620247400` is the CRC checksum from `na-1-big-Digest.crc32`. Through this technique, every SSTable is\ntotally unique and it ensures that they would not clash, even if they were named the same. This CRC is\ninherently a part of the path where all files are, and a manifest file is pointing to them so we have\na unique match.\n\n#### Manifest\n\nA manifest file is uploaded with all data. 
It contains all information necessary to restore that snapshot.\n\nManifest name has this format: `snapshot3-f1159959-593d-33d1-9ade-712ea55b31ef-1600645759830.json`\n\n* `snapshot3`: name of the snapshot used during a backup\n* `f1159959-593d-33d1-9ade-712ea55b31ef`: schema version of Cassandra\n* `1600645759830`: timestamp when that snapshot/backup was taken\n\nThe content of a manifest file looks like this:\n\n```\n{\n  \"snapshot\" : {\n    \"name\" : \"snapshot3\",\n    \"keyspaces\" : {\n      \"ks1\" : {\n        \"tables\" : {\n          \"ks1t1\" : {\n            \"sstables\" : {\n              \"md-2-big\" : [ {\n                \"objectKey\" : \"data/test2/test2-9939cd004ed711ecbe182d028df13d6f/2-79610399/md-2-big-CompressionInfo.db\",\n                \"type\" : \"FILE\",\n                \"size\" : 43,\n                \"hash\" : \"f8678a952d1fadf8d3368e078318dbc6cdf5eb7666631c77b288ead7d42ed572\"\n              }, {\n                \"objectKey\" : \"data/test2/test2-9939cd004ed711ecbe182d028df13d6f/2-79610399/md-2-big-Data.db\",\n                \"type\" : \"FILE\",\n                \"size\" : 55,\n                \"hash\" : \"004a1da4ef6681c11a5119cd0fe5c2cf73adabd52d76b0b2139ab09b6e1ce2ea\"\n              }, {\n                \"objectKey\" : \"data/test2/test2-9939cd004ed711ecbe182d028df13d6f/2-79610399/md-2-big-Digest.crc32\",\n                \"type\" : \"FILE\",\n                \"size\" : 8,\n                \"hash\" : \"5ff7e315ca70052e3b8f31753d3bdc4b8ddc966d3ca9991e519eed0f558dd6a4\"\n              } ]\n            },\n            \"id\" : \"e17ff4b0e89211eab4313d37e7f4ac07\",\n            \"schemaContent\" : \"CREATE TABLE IF NOT EXISTS ks1.ks1t1 ...\"\n          },\n          \"ks1t2\" : {\n             // other table\n          }\n        }\n      },\n      \"ks2\": {\n        // other keyspace\n      }\n    }\n  },\n  \"tokens\" : [ \"-1025679257793152318\", \"-126823146888567559\", .... 
],\n  \"schemaVersion\" : \"f1159959-593d-33d1-9ade-712ea55b31ef\"\n}\n```\n\nA manifest maps all resources related to a snapshot, their size as well as type (`FILE` or `CQL_SCHEMA`). It\nholds all schema content in a respective file too, so we do not need to read/parse the schema file as it is\nalready a part of the manifest.\n\nUpon restore, this file is read into its Java model and _enriched_ by setting a path where each _manifest entry_ should be\nphysically located on disk as we need to remove part of the file where a hash is specified. It is also possible\nto filter this manifest in such a way that we might backup 5 tables, but we want to restore only 2 of them so the other\nthree tables would not be downloaded at all.\n\n#### Topology File\n\nTopology file is uploaded during a backup as well. It is uploaded into a bucket's `topology` directory in root.\nA topology file is provided not only as a reference to see what the topology was upon backup, but it also helps Instaclustr Cassandra operator\nto resolve which node it should download data for.\n\nIf we are restoring a cluster from scratch and all we have is its former hostname, we need to know what\nwas the node's id (`nodeId` below) because that id signifies which directory its data is stored in. When Instaclustr\nCassandra operator restores a cluster from scratch, it knows a name of a pod (its hostname) but it does not know the\nid to load data from. 
The storage location upon a restore looks like `s3://bucket/test-cluster/dc1/cassandra-test-cluster-dc1-west1-b-0`.\nInternally, based on a snapshot and schema, we resolve the correct topology file and we filter its content to see\nwhich node starts on that hostname so we use, in this case, `nodeId` 8619f3e2-756b-4cb1-9b5a-4f1c1aa49af6 upon restoration.\nStorage location flag is then updated to use this node, so it will look like `s3://bucket/test-cluster/dc1/8619f3e2-756b-4cb1-9b5a-4f1c1aa49af6`.\n\n```\n{\n  \"timestamp\" : 1600645216879,\n  \"clusterName\" : \"test-cluster\",\n  \"schemaVersion\" : \"f1159959-593d-33d1-9ade-712ea55b31ef\",\n  \"topology\" : [ {\n    \"hostname\" : \"cassandra-test-cluster-dc1-west1-b-0\",\n    \"cluster\" : \"test-cluster\",\n    \"dc\" : \"dc1\",\n    \"rack\" : \"west1-b\",\n    \"nodeId\" : \"8619f3e2-756b-4cb1-9b5a-4f1c1aa49af6\",\n    \"ipAddress\" : \"10.244.2.82\"\n  }, {\n    \"hostname\" : \"cassandra-test-cluster-dc1-west1-a-0\",\n    \"cluster\" : \"test-cluster\",\n    \"dc\" : \"dc1\",\n    \"rack\" : \"west1-a\",\n    \"nodeId\" : \"b7952bdc-ccae-4443-9521-908820d067c1\",\n    \"ipAddress\" : \"10.244.1.194\"\n  }, {\n    \"hostname\" : \"cassandra-test-cluster-dc1-west1-c-0\",\n    \"cluster\" : \"test-cluster\",\n    \"dc\" : \"dc1\",\n    \"rack\" : \"west1-c\",\n    \"nodeId\" : \"1e519de1-58bb-40c5-8fc7-3f0a5b0ae7ee\",\n    \"ipAddress\" : \"10.244.2.83\"\n  } ]\n}\n```\n\nA name of a topology file has this format `clusterName-snapshotName-schemaVersion-timestamp`. 
This uniquely\nidentifies a topology in time.\n\n#### Resolving Manifest and Topology File From Backup Request\n\nLet's say we have backed up a node multiple times, where some snapshot names were the same\nand the schema version was the same too. We might then have these manifests in a bucket:\n\n```\n├── snapshot3-f1159959-593d-33d1-9ade-712ea55b31ef-1600645759830.json\n└── test-cluster\n    └── dc1\n        └── 1e519de1-58bb-40c5-8fc7-3f0a5b0ae7ee\n            └── manifests (5)\n                ├─ snapshot1-f1159959-593d-33d1-9ade-712ea55b31ef-1600645216000.json\n                ├─ snapshot1-f1159959-593d-33d1-9ade-712ea55b31ef-1600645217000.json\n                ├─ snapshot1-b555c56d-a89f-4002-9f9c-0d4c78d3eca9-1600645217800.json\n                ├─ snapshot2-f1159959-593d-33d1-9ade-712ea55b31ef-1600645218000.json\n                ├─ snapshot3-f1159959-593d-33d1-9ade-712ea55b31ef-1600645219000.json\n                └─ snapshot4-f1159959-593d-33d1-9ade-712ea55b31ef-1600645220000.json\n```\n\nWhich manifest will be resolved when we use `snapshot1` as `--snapshot-tag`?\n\nIf there are multiple manifests starting with the same snapshot tag and having the same schema version,\nin this particular case, it will pick the one with timestamp `1600645217800` as the latest manifest wins.\n\nYou may specify `--snapshot-tag` as `snapshot1-f1159959-593d-33d1-9ade-712ea55b31ef` or even as the full name with timestamp.\nThe longest prefix wins and when there are multiple manifests resolved, the latest wins.\n\nIn case we have the same snapshot name but a different schema version, the snapshot name together with the schema version is needed; the snapshot name alone is not enough.\n\nBy this logic, we are preventing the situation where two operators (people) do two backups with the same\nsnapshot name against a node on the same schema version—the only information which makes these two requests unique is the timestamp.\nHowever, we may use just the same snapshot name (for practical reasons not 
recommended) and all would work just fine.\n\nThe same resolution logic holds for topology file resolution—the longest prefix wins and it has to be uniquely filtered.\n\nUpon backup, the schema version is determined by calling the respective JMX method. The user does not have to provide it themselves.\nOn the other hand, the second way to resolve the problems above during restoration is to specify the `--exact-schema-version` flag.\nWhen set, it will try to filter only manifests which were done on the same schema version as the current node runs on.\nThe last option is to use the `--schema-version` option (in connection with `--exact-schema-version`) to specify the schema version manually.\n\n#### Multiple Cassandra data directories\n\nIt is possible to work with a Cassandra node which\nhas data in multiple locations, not only in one,\nas `data_file_directories` in `cassandra.yaml` is an array.\n\nIn order to point backup or restore procedures to multiple\ndata directories, there is a flag called `--data-dir`.\nThis flag can be set multiple times - each one pointing to\na different data directory, as it is set in `cassandra.yaml`.\n\nUpon backup, files of all SSTables across all directories\nare uploaded to a remote location. However, upon restore,\nthey are not necessarily put into the same directories.\n\nFor in-place restoration strategy, SSTables are dispersed\namong all data directories in a round-robin fashion.\n\nFor hard-linking strategy, it is logically the same as for in-place: SSTables\nare again dispersed among all data directories without any significant order.\n\nFor importing strategy, Esop does not control where SSTables will be put\nat all as this is delegated to the importing mechanism of Cassandra itself, so\nthe support of multiple data directories is there out of the box.\n\n#### Backup\n\nThe anatomy of a backup is quite simple. The successful invocation of the `backup` sub-command will\ndo the following:\n\n. 
Checks if a remote bucket for the given storage provider exists, and optionally creates it if it doesn't (consult the command line help on how to achieve that). If a bucket does not\nexist and we are not allowed to create it automatically, the backup will fail.\n. Takes tokens of the respective node via JMX. Tokens are necessary for cases when we want to\nrestore into a completely empty node. If we downloaded all data but tokens were autogenerated,\nthe data that node is supposed to serve would not match the tokens that node is using.\n. Takes a snapshot of respective _entities_—either keyspaces or tables. It is not possible\nto mix keyspaces and tables; it is _either_ keyspace(s) _or_ tables. This is inherited from the\nway the Cassandra JMX API is designed. `nodetool snapshot` also permits us to specify\nentities to back up either as `ks1,ks2,ks3` or `ks1.t1,ks1.t2,ks2.t3` and we copy this behaviour here.\nThe name of the snapshot is auto-generated when not specified via the command line.\n. Creates an internal mapping of the snapshot to the files it should upload.\n. Uploads SSTables and helper files to remote storage—only files which are not uploaded yet. By doing this,\nwe will not \"over-upload\": an SSTable is an immutable construct, so there is no need to upload what is\nalready there. The backup procedure checks whether a remote file exists and uploads it only in\ncase it does not. Backup computes a \"hash\" of an SSTable and the SSTable is uploaded under such a key,\nso two SSTables cannot overwrite each other even if they are named the same, as their\nhashes do not necessarily match.\n. The actual downloading/uploading is done in parallel—the number of simultaneous uploads/downloads is controlled by the `concurrent-connections` setting which defaults to 10. It is possible\nto throttle the bandwidth so we do not use all available bandwidth for backups/restores, so that a\nnode which might still be in operation does not suffer performance-wise.\n. 
Writes meta-files to a remote storage—manifest and topology file (when Sidecar is used).\n. Clears the taken snapshot.\n\nAs of now, a node to be backed up has to be online because we need tokens, we need to take a snapshot, etc.\nand this is done via JMX. In theory, we do not need a node to be online if we take a snapshot beforehand\nand tokens are somehow provided externally; however, the current version of the tool does require it.\n\n#### Restore\n\nThis tool is seamlessly integrated into https://github.com/instaclustr/icarus[Icarus]\nwhich is able to do backup and restore in a distributed manner—cluster wide. Please refer to the documentation of Icarus\nto understand what restoration phases are and what restoration strategies one might use. The very same\nrestoration flow might be executed from the CLI; Icarus just accepts a JSON payload which is a different representation\nof the very same data structure as the one used from the command line, but the functionality is completely the same.\n\nCLI tool is not responsive to `globalRequest` flag in restoration/backup requests—only Sidecar can coordinate\ncluster-wide restoration and backup.\n\nA restoration is a more complex procedure than a backup. We have provided three _strategies_.\nYou may control which strategy is used via the command line.\n\nIn general, the restoration is about:\n\n. Downloading data from a remote location\n. Making Cassandra use these files\n\nWhile the first step is quite straightforward, the second depends on various factors which we guide the\nreader through.\n\nRestoration strategy is determined by the flag `--restoration-strategy-type` which might be\n`IN_PLACE`, `IMPORT`, or `HARDLINKS`, case-insensitive.\n\n#### In-Place Restoration Strategy\n\nIn-place strategy must be used only in case a Cassandra node is _down_, that is, the Cassandra process\ndoes not run. 
This strategy will download only SSTables (and related files) which are not present\nlocally, and it will directly download them to their respective data directories of a node. Then it will\nremove SSTables (and related files) which should not be there. As a backup is done against a _snapshot_;\nrestore is also done from a snapshot.\n\nUse this strategy if you want to:\n\n* restore from an older snapshot and your node does not run\n* restore from a snapshot and your node is completely empty—it was never run/its `data` dir is empty\n* restore a cluster/node by Cassandra Operator. This feature is already fully embedded into our\noperator offering so one can restore whole clusters very conveniently.\n\nIn more detail, in-place strategy does the following:\n\n. Checks that a remote bucket to download data from exists and errors out if it does not\n. In case `--resolve-host-id-from-topology` flag is used, it will resolve a host to restore from topology file.\n. Downloads a manifest—manifest contains the list of files which are logically related to a snapshot.\n. Filters out the files which need to be downloaded, as some files which are present locally might be\nalso a part of a taken snapshot so we would download them unnecessarily.\n. Downloads files directly into Cassandra `data` dir.\n. Deletes files from `data` dir which should not be there.\n. Cleans data in other directories—hints, saved caches, commit logs.\n. Updates `cassandra.yaml` if present with `auto_bootstrap: false` and `initial_token` with tokens from\nmanifest.\n\nIt is possible to restore not only user keyspaces and tables but system keyspaces too. This is necessary for\nthe successful restoration of a cluster/node exactly as it was before as all system tables would be same.\nNormally, system keyspaces are not restored and one has to set this explicitly by `--restore-system-keyspace` flag.\n\nIn-place strategy uses also `--restore-into-new-cluster` flag. 
If specified, it will restore only system\nkeyspaces needed for a successful restore (`system_schema`) but it will not attempt to restore anything else.\nWe do not always want to restore _everything_ because system keyspaces\ncontain details like tokens, peers with IPs, etc. and this information is very specific to each node so\nwe do not restore them. However, if we did not restore `system_schema`, the newly started node would not see\nthe restored data as there would not be any schema. By restoring `system_schema`, Cassandra will detect\nthese keyspaces and tables on the very first start.\n\nIn-place restoration might update `cassandra.yaml` file if found. This is done automatically\nupon restoration in Cassandra operator but it might be required to be done manually for other cases. By default,\n`cassandra.yaml` is not updated. The updating is enabled by setting `--update-cassandra-yaml` flag upon restore. It is\nexpected that `cassandra.yaml` is located in a directory `\\{cassandraConfigDirectory\\}/` (by default `/etc/cassandra`).\nThe Cassandra configuration directory with `cassandra.yaml` might be changed via `--config-directory` flag. There are two\noptions which are automatically changed when `cassandra.yaml` is found, in connection with this strategy:\n\n* `auto_bootstrap` - if not found, it will be appended and set to `false`. If found and set to `true`, it\nwill be replaced by `false`. If `auto_bootstrap: false` is already present, nothing happens.\n* `initial_token`—set only in case it is not present in `cassandra.yaml`. Tokens are set in order to\nhave the node we are restoring to on the same tokens as the node we took a snapshot from.\n\n#### Hard-Linking Strategy\n\nThis strategy is supposed to be executed against a _running_ node. Hard-linking strategy downloads data\nfrom a bucket to a node's local directory and it will make hardlinks from these files to Cassandra data dir\nfor that keyspace/table. 
After hardlinks are done, it will _refresh_ a respective table / keyspace\nvia JMX so Cassandra will start to read from them. Afterwards, the original files are deleted.\n\nThis strategy works for Cassandra version 3 as well as for Cassandra 4.\n\n#### Importing Strategy\n\nThis strategy is similar to hardlinking strategy — the node upon restoration can still run and serve\nother requests so a restoration process is not disruptive. _Importing_ means that it will\nimport downloaded SSTables via JMX directly so no hardlinks and refresh are necessary. Importing of\nSSTables by calling the respective JMX method was introduced in Cassandra 4 only, so this does not work\nagainst a node of version 3 or below. Keep in mind that imported SSTables are physically removed\nfrom the download directory and moved to the live Cassandra data directory.\n\n#### Restoration Phases for Hardlinking and Importing Strategy\n\nHardlinking and importing strategies consist of _phases_. Each phase is done _per node_.\n\n. Cluster health check—this phase ensures that we are restoring into a healthy cluster,\nif any of these checks fails, the restore will not proceed. We check that:\n.. A node under the restoration is in `NORMAL` state\n.. Each node in a cluster is `UP`—the failure detector (as seen from that node) does not detect any node as failed\n.. No node is in _joining_, _leaving_, or _moving_ state and all nodes are reachable\n.. All nodes are on the same schema version\n. Downloading phase—this phase will download all data necessary for the restore to happen.\n. Truncate phase—this phase will truncate all respective tables we want to restore.\n. Importing phase—for hardlinking strategy, it will make hardlinks from the download directory to\nthe live Cassandra data dir; for importing strategy, it will call a JMX method to import them.\n. 
Cleaning phase—this phase will clean up a directory where Cassandra put truncated data; it will also\ndelete the directory where downloaded SSTables are.\n\nIn a situation where we are restoring into a cluster of multiple nodes, the truncate\noperation should be executed only once against a particular node, as Cassandra will internally\ndistribute the truncating operation to all nodes in a cluster. In other words, it is enough to\ntruncate at one node only as data from all other nodes will be truncated too.\n\nThe downloading phase precedes all other phases because we want to be sure that we are truncating the data if\nand only if we have all data to restore from. If we truncated all data and the download then failed, we\ncould not restore and the node would not contain any data to serve, rendering it useless (for that table)\nwith some complicated procedure to recover the truncated data.\n\nIf any phase fails, all subsequent phases fail too. Hence if we fail to download data, from an operational\npoint of view nothing happens, as nothing was truncated and data on a running cluster were not touched.\nIf we fail to truncate, we are still good. Once we truncate and we have all data, it is\nstraightforward to import/hard-link data. This is the least invasive operation with a high\nprobability of success.\n\nIt can be decided if we want to delete downloaded as well as truncated data after a restore is finished.\nIf we plan to restore multiple times with the same data—for whatever reason—and to return back to the same snapshot,\nit is not desired to download all data all over again. We might just reuse them. 
This is controlled by flags\n`--restoration-no-download-data` and `--restoration-no-delete-downloads` respectively.\n\n#### Restoring Into Different Schemas\n\nWhen a cluster we made a backup for is on the same schema at the time we want to do a restore, all is fine.\nHowever, a database schema evolves over time, columns are added or removed and we still want to be able to restore.\nLet's look at this scenario:\n\n. create keyspace `ks1` with table `table1`\n. insert data\n. make backup\n. alter table, **add** a column\n. insert data\n. restore into snapshot made in the 3rd step\n\nClearly, the schema we are on differs from the schema back then—there is a new column which is not present in uploaded SSTables.\nHowever, this will work, resulting in a column which is new to have all values for that column as `null`. This tool does not\ntry to modify a schema itself. An operator would have to take care of this manually and such column would have to be dropped.\n\nThe opposite situation works as well:\n\n. create keyspace `ks1` with table `table1`\n. insert data\n. make backup\n. alter table, **drop** a column\n. insert data\n. restore into snapshot made in the 3rd step\n\nIf we want to restore, we have one column less from snapshot, data will be imported but that column will just not be there.\n\nAs of now, the restore is only \"forward-compatible\" on a table level. If we dropped whole table and we want to restore it,\nthis is not possible—the table has to be there already. You may recreate them by applying respective CQL create statements\nfrom the manifest before proceeding. The tool might try to create these tables beforehand as we have that CQL schema at hand, but\ncurrently it is not implemented.\n\n### Simultaneous Backups\n\nBackups are non-blocking. It means that multiple backups might be in progress. However, no file is uploaded\nin one particular moment more than once. Each backup request forms a _session_. 
A session contains _units_ to\nupload, referencing an entry in a manifest. If the second backup wants to upload the same file as the first one\nwhich is already uploading, it will just wait until the first backup is complete. Simultaneous restore is not implemented yet.\n\nThe power of simultaneous backups is fully understood in connection with Instaclustr Cassandra Sidecar as\nthat is a server-like application running for a long period of time where an operator can submit backup requests which\nmight happen at the same time (uploading of files is happening concurrently). CLI application does not profit from this feature.\n\n### Resolution of Entities to Backup/Restore\n\nThe flag `--entities` determines which database tables/keyspaces should be backed up or restored.\n\n|===\n|--entities |backup |restore\n\n|empty\n|all keyspaces and tables\n|all keyspaces and tables except `system*`\n\n|`ks1`\n|all tables in keyspace ks1\n|all tables in keyspace ks1, except system keyspace\n\n|`ks1.t1,ks2.t2`\n|table `t1` in `ks1` and table `t2` in `ks2`\n|table `t1` in `ks1` and table `t2` in `ks2`\n|===\n\nMoreover, if `--restore-system-keyspace` is set upon restore, it is possible to restore system\nkeyspaces only in case `--restoration-strategy-type` is `IN_PLACE`. Logically, we cannot restore system\nkeyspaces on a running cluster in case we use hardlinking or importing strategy. System keyspaces are\nfiltered out from entities automatically for these strategy types. However, if `IN_PLACE` strategy is used\nand flag `--restore-into-new-cluster` is specified, such strategy will pick only system keyspaces necessary for\nsuccessful bootstrapping, as it restores `system_schema` only from all system keyspaces. `system_schema` needs to\nalready contain the keyspaces and tables we are restoring. 
If we started a completely new node without restoring `system_schema`,\nit would not detect these imported keyspaces.\n\nKeep in mind that if system keyspace (`system_schema`) is not specified upon backup, it will not be uploaded;\n`--entities` need to enumerate all entities explicitly (or if it is empty, absolutely everything will be uploaded).\n\n### Backup and Restore of Commit Logs\n\nIt is possible to backup and restore commit logs too. There is a dedicated sub-command for this task.\nPlease refer to examples how to invoke it. The commit logs are simply uploaded to a remote storage\nunder node keys of the users choosing as specified in storage location property. The respective command\ndoes not derive the storage path on its own out of the box as commit logs might be uploaded even\nif a node is offline. So there might be no means to retrieve its host id via JMX, for example, but this\nmight be turned on on demand.\n\nThe example of backup (for brevity, we are showing just the sub-command):\n\n----\n$ java -jar esop.jar commitlog-backup \\\n  --storage-location=s3://myBucket/mycluster/dc1/node1 \\\n  --commit-log-dir /var/lib/cassandra/data/commitlog\n----\n\nNote that in this example, there is not any need to specify `--jmx-service` because it is not needed. JMX is needed\nfor taking snapshots, for example, but here we do not take any. Commitlog directory is specified by\n`--commit-log-dir`. It is possible to override this by specifying `--cl-archive` with the path to the commit logs\ninstead of expecting them to be under `--commit-log-dir`. This plays nicely especially with\nthe commit log archiving procedure of Cassandra. Let's say you have this in `commitlog_archiving.properties` file:\n\n----\narchive_command=/bin/ln %path /backup/%name\n----\n\nwhere `%path` is a fully qualified path of the segment to archive and `%name` is name of the commit log (these variables\nwill be automatically expanded by Cassandra). 
Then you might archive your commit logs like this:\n\n----\n$ java -jar esop.jar commitlog-backup \\\n  --storage-location=s3://myBucket/mycluster/dc1/node1 \\\n  --cl-archive=/backup\n----\n\nThe backup logic will iterate over all commit logs in `/backup` and it will try to refresh them in the remote\nstore. If they are refreshed, it means they are already uploaded. If refreshing fails, that commit log is not\nthere so it will be uploaded.\n\nYou might as well script this in such a way that a commit log would be automatically uploaded as part of\nCassandra archiving procedure, like this:\n\n----\narchive_command=/bin/bash /path/to/my/backup-script.sh %path %name\n----\n\nThe content of `backup-script.sh` might look like:\n\n----\n#!/bin/bash\n\njava -jar esop.jar commitlog-backup \\\n    --storage-location=s3://myBucket/mycluster/dc1/node1 \\\n    --commit-log=$1\n----\n\nThere is one improvement to do here: even if we do not know what the host id or dc or name of a cluster is,\nthis can be found out dynamically as part of the backup by specifying `--online` flag (if a Cassandra node is online it just archived a commit log for us).\n\n----\n#!/bin/bash\n\n# specifying --online will update s3://myBucket/mycluster/dc1/node1 to\n# s3://myBucket/real-dc/real-dc-name/68fcbda0-442f-4ca4-86ec-ec46f2a00a71 where uuid is host id.\n\njava -jar esop.jar commitlog-backup \\\n    --storage-location=s3://myBucket/mycluster/dc1/node1 \\\n    --commit-log=$1 \\\n    --online\n----\n\n### Examples of Command Line Invocation\n\nEach example shown here should be prepended with `java -jar esop.jar`. We are showing here\njust respective commands.\n\nThis command will copy over all SSTables to the remote location. It is also possible to choose a location\nin a cloud. 
A node has to be up in order to be backed up.\n\n----\n\nbackup \\\n--jmx-service 127.0.0.1:7199 \\\n--storage-location=s3://myBucket/mycluster/dc1/node1 \\\n--data-dir /my/installation/of/cassandra/data/data \\\n--entities=ks1,ks2 \\\n--snapshot-tag=mysnapshot\n----\n\nIf you want to upload SSTables into AWS, GCP, or Azure, just change protocol to either `s3`,\n`gcp`, or `azure`. The first part of the path is the bucket you want to upload files to, for `s3`,\nit would be like `s3://bucket-for-my-cluster/cluster-name/dc-name/node-id`. If you want to use a different\ncloud, just change the protocol respectively.\n\nWe also support https://docs.cloud.oracle.com/en-us/iaas/Content/Object/Tasks/s3compatibleapi.htm[Oracle cloud];\nuse `oracle://` protocol for your backup and restores.\n\nWe also support CEPH S3 Gateway, use `ceph://` protocol for your backup and restores.\n\nIf a bucket does not exist, it will be created only when `--create-missing-bucket` is specified.\nThe verification of a bucket might be skipped by flag `--skip-bucket-verification`.\nIf the verification is not skipped (which is default) and we detect that a\nbucket does not exist, the operation fails if we do not specify `--create-missing-bucket` flag.\n\n### Example of in-place `restore`\n\nThe restoration of a node is achieved by the following parameters:\n\n----\n$ restore --data-dir /my/installation/of/cassandra/data/data \\\n          --config-directory=/my/installation/of/restored-cassandra/conf \\\n          --snapshot-tag=stefansnapshot \\\n          --storage-location=s3://bucket-name/cluster-name/dc-name/node-id \\\n          --restore-system-keyspace \\\n          --update-cassandra-yaml=true\n----\n\nNotice a few things here:\n\n* `--restoration-strategy-type=IN_PLACE` is used implicitly\n* `--snapshot-tag` is specified. Normally, when snapshot name is not used upon backup, there\nis a snapshot taken of some generated name. 
You would have to check the name of a snapshot in\na backup location to specify it yourself, so it is better to specify that beforehand and just\nreference it.\n* `--update-cassandra-yaml` is set to true, this will automatically set `initial_token` in `cassandra.yaml` for the\nrestored node. If it is false, you will have to set it up yourself, copying the content of tokens file\nin backup directory, under `tokens` directory.\n* `--restore-system-keyspace` is specified, which means it will restore system keyspaces too, which is not\nnormally done. This might be specified only for IN_PLACE strategy as that strategy requires a node to be down and\nwe can manipulate system keyspaces only on such a node.\n\n### Example of Hardlinking and Importing Restoration\n\nHardlinking as well as importing restoration consists of phases. These strategies expect a Cassandra node\nto be up and fully operational. The primary goal of these strategies is to restore on a _running node_,\nso the restoration procedure does not require a node to be offline which greatly increases the availability of the whole\ncluster. 
Backup and restore will look like the following:\n\n----\n\nbackup \\\n--jmx-service 127.0.0.1:7199 \\\n--storage-location=s3://myBucket/mycluster/dc1/node1 \\\n--data-dir /my/installation/of/cassandra/data/data \\\n--entities=ks1,ks2 \\\n--snapshot-tag=mysnapshot\n----\n\nThe first restoration phase is DOWNLOAD as we need to download remote SSTables:\n\n----\nrestore \\\n--data-dir /my/installation/of/cassandra/data/data \\\n--snapshot-tag=mysnapshot \\\n--storage-location=s3://myBucket/mycluster/dc1/node1 \\\n--entities=ks1,ks2 \\\n--restoration-strategy-type=hardlinks \\\n--import-source-dir=/where/to/put/downloaded/sstables \\\n--restoration-phase-type=download  # IMPORTANT\n----\n\nThen we need to truncate `ks1` and `ks2`:\n\n----\nrestore \\\n--data-dir /my/installation/of/cassandra/data/data \\\n--snapshot-tag=mysnapshot \\\n--storage-location=s3://myBucket/mycluster/dc1/node1 \\\n--entities=ks1,ks2 \\\n--restoration-strategy-type=hardlinks \\\n--import-source-dir=/where/to/put/downloaded/sstables \\\n--restoration-phase-type=truncate  # IMPORTANT\n----\n\nOnce we truncate keyspaces, we can make hardlinks from directory where we downloaded SSTables\nto the Cassandra data directory:\n\n----\nrestore \\\n--data-dir /my/installation/of/cassandra/data/data \\\n--snapshot-tag=mysnapshot \\\n--storage-location=s3://myBucket/mycluster/dc1/node1 \\\n--entities=ks1,ks2 \\\n--restoration-strategy-type=hardlinks \\\n--import-source-dir=/where/to/put/downloaded/sstables \\\n--restoration-phase-type=import  # IMPORTANT\n----\n\nLastly we can clean up downloaded data as well as truncated data as they are not needed anymore:\n\n----\nrestore \\\n--data-dir /my/installation/of/cassandra/data/data \\\n--snapshot-tag=mysnapshot \\\n--storage-location=s3://myBucket/mycluster/dc1/node1 \\\n--entities=ks1,ks2 \\\n--restoration-strategy-type=hardlinks \\\n--import-source-dir=/where/to/put/downloaded/sstables \\\n--restoration-phase-type=cleanup  # IMPORTANT\n----\n\nIf you check this 
closely you see that the only flag we have changed is `--restoration-phase-type`\nand that is correct. All commands will look exactly the same but they will just differ on `--restoration-phase-type`.\n\nIf we wanted to do a restore via Cassandra JMX _importing_, our `--restoration-strategy-type` would be `import`.\n\n### Renaming of a table to restore to\n\nIt is possible to restore to a different table than the one you backed up. This feature is very handy for cases\nwhen you want to examine data before you actually restore them - you might put them temporarily\nto a different table to see if all is right etc. From Esop CLI, you drive this feature by flag called `--rename`.\nThis flag may be repeated as many times as you need.\n\nThis feature might be used only for hardlinks or importing strategy, not for in-place.\n\nA table has to exist before a restore action is taken. Esop does **not** create this table for you automatically\nand it is left for a user to ensure such table exists before proceeding.\n\nLet's say you have backed up a table called `tb1` in a keyspace called `ks1` but you want to restore\nit into table `tb2` in the same keyspace. Hence you need to specify `--rename=ks1.tb1=ks1.tb2`.\n\nThe `--rename` option is meant to be used along with `--entities`. 
The examples below show which combinations are invalid and which are valid.\n\nInvalid cases for the combination of `--entities` and `--rename`:\n\n----\n--entities=\"\" --rename=whatever non empty  -\u003e invalid\n--entities=ks1 --rename=whatever non empty -\u003e invalid, you can not use only a keyspace in --entities\n--entities=ks1.tb1 --rename=ks1.tb2=ks1.tb2 -\u003e invalid as \"from\" is not in entities\n--entities=ks1.tb1 --rename=ks1.tb2=ks1.tb1 -\u003e invalid as \"to\" is in entities (and from is not in entities)\n----\n\nValid cases:\n\n----\n--entities=ks1.tb1 --rename=ks1.tb1=ks1.tb2 -\u003e truncate ks1.tb2 and process just ks1.tb2, ks1.tb1 is not touched\n--entities=ks1.tb1 --rename=ks1.tb1=ks2.tb1\n--entities=ks1.tb1,ks2.tb2,ks3.tb4 --rename=ks1.tb1=ks1.tb2,ks2.tb2=ks3.tb3\n----\n\n* entities in \"to\" have to be unique across all renaming pairs, \"ks1.tb1=ks1.tb2,ks1.tb3=ks1.tb2\" is invalid\n* please keep in mind that if you are doing cross-keyspace renaming, as of now you are completely on your\nown when it comes to e.g. replication factors etc.; Esop currently does not check that replication factor\nand replication strategy in source and target keyspace match. This might be addressed in future versions.\n\nFrom Icarus point of view, you need to add a map under \"rename\" field:\n\n----\n{\n    \"rename\": {\n        \"ks1.tb1\": \"ks1.tb2\",\n        \"ks2.tb3\": \"ks2.tb4\",\n        \"ks3.tb5\": \"ks3.tb6\"\n    }\n}\n----\n\n### Skipping refreshment of remote objects\n\nBy default, Esop \"refreshes\" remote objects. Refreshment means that the last modification\ndate of a remote object will be updated to the time the backup was done. This is done because\nwe need to somehow detect if a remote file already exists or not. If it does, we do not upload it.\nIf it does not exist, we upload it. 
However, if it does exist, we need to update the modification date\nbecause there might be, for example, a retention policy on remote objects in a bucket set for\nsome period of time (for example, 14 days) and if a particular file is not touched for 14 days, it would be removed.\nThis way you might automatically implement the deletion of older backups because if there is a newer backup\nconsisting of a set of SSTables, all SSTables which were previously a part of the older backup but are not\na part of the current backup would not be touched - hence no modification date would be refreshed - so they would expire.\n\nFor cases where versioning is enabled (currently known to be an issue for S3 backups only),\nour attempt to refresh an object would create a new, versioned, file. This is not desired. Hence, we\nhave the possibility to skip refreshment, and we just detect if a file is there or not, but you would\nlose the ability to expire objects as described above.\n\nThis behavior is controlled by flag called `--skip-refreshing` on backup command. By default, when\nnot specified, it is evaluated to `false`, so skipping would not happen.\n\nCurrently, this functionality is not working for s3 protocol.\n\n### Retry of upload / download operations\n\nImagine there is a restore happening which is downloading 100 GB of data and your connectivity\nto the Internet is disrupted when it is almost done, at 80%. If you restart the whole restoration\nprocess, you do not want to download all 80 GB again. Hence, we want that if a restore is stopped\nin the middle, it will not start from scratch next time we run it and it will download only what is necessary.\n\nAs a result of these errors, a file might be corrupted, it may be incomplete on the disk\nso its loading or hard linking into Cassandra would fail. To be sure that data are not corrupted,\nthere is a hash (sha512) of that file made and it is uploaded as part of the manifest. 
Upon restore,\nif that file already exists locally, it computes the hash and compares it with the one in the manifest\nand they have to match. If they do not match, such corrupted file is deleted and the whole operation\nas such (download phase in case of import or hardlinks strategy) fails. On the next restore attempt,\nit will skip files which are already present in the download directory and download only the missing ones,\ncomputing their hashes etc.\n\nOn backup path, if a communication error happens, this is also detected and operation fails\nas such but some files might be already uploaded. On next upload, Esop checks if such file\nis already present remotely and it will skip uploading it if it is.\n\nIf upload of a file fails, Esop can _retry_. The mechanism how this happens is controlled by\nthe family of \"--retry-*\" switches on the command line. In a nutshell, your retry might be\nexponential or linear. The exponential retry will execute the same operation (e.g. uploading of a file)\nwith the pause between retries growing exponentially. Linear retry keeps the retry period constant.\n\n### Explanation of Global Requests\n\nIt looks like the phases are an unnecessary hassle to go through, but the granularity is required in case we are\nexecuting a so called _global request_. 
A global request is used in the context of Cassandra Sidecar and it does not\nhave any usage during CLI executions.\n\n### Example of `commitlog-restore`\n\nThe restoration of commit logs can be done like this:\n\n----\n$ commitlog-restore --commit-log-dir=/my/installation/of/restored-cassandra/data/commitlog \\\n                    --config-directory=/my/installation/of/restored-cassandra/conf \\\n                    --storage-location=s3://bucket-name/cluster-name/dc-name/node-id \\\n                    --commitlog-download-dir=/dir/where/commitlogs/are/downloaded \\\n                    --timestamp-end=unix_timestamp_of_last_transaction_to_replay\n----\n\nThe commit log restorations are driven by Cassandra's `commitlog_archiving.properties` file. This\ntool will generate such files into the node's `conf` directory so it will be read upon node start.\n\nAfter a node is restored in this manner, one has to *delete* `commitlog_archiving.properties` file\nin order to prevent commitlog replay by accident again if a node is restarted.\n\n----\nrestore_directories=/home/smiklosovic/dev/instaclustr-esop/target/commitlog_download_dir\nrestore_point_in_time=2020\\:01\\:13 11\\:32\\:51\nrestore_command=cp -f %from %to\n----\n\n## Listing of backups\n\nThis feature is available for file, s3, azure and gcp backups.\n\nListing of a bucket provides a better visibility into what backups there are, how many files they\nconsist of and how much space they occupy as well as how much space we would reclaim by their deletion.\n\n----\n$ java -jar esop.jar list \\\n    --storage-location=file:///backup1/cluster/datacenter1/node1 \\\n    --human-units\n\nTimestamp               Name             Files Occupied space Reclaimable space\n2021-04-27T15:38:40.284 name-of-backup-1 154   113.1 kB       10.1 kB\n2021-04-27T15:38:20.259 name-of-backup-2 138   103.0 kB       0 B\n                                         154   113.1 kB\n----\n\nListing of a backup will read all manifests there are 
for a respective node and it will compute the statistics above.\nIt is important to understand that the figure representing the number of files for a specific backup does not\nrepresent the unique files. Since an SSTable can be present in more than one backup, the sum of\nfiles per backup does not need to match the global number of files. Above we see that backup1 has 154 files and backup2\nhas 138 files but in total there are 154 files. This means that backup2 logically consists of SSTables which\nare all in backup1 and backup1 contains all SSTables in backup2 plus some new ones. Same holds for occupied space.\n\nThe figure of reclaimable space represents the number of bytes (or any human-readable size) which would be freed\nby deleting that particular backup. For example, from the above we see that by deleting backup-2, we would get\nno free space. Why? Because all SSTables in backup-2 also belong to backup-1. So we cannot just physically remove them\nbecause backup-1 would be corrupted.\n\nOn the other hand, by deletion of backup-1, we would gain 10.1 kB. Why? Because we cannot just go and delete all SSTables\nbelonging to backup-1, because backup-2 would be corrupted - it would miss SSTables. 
We can safely delete only\nthese files from backup-1 which are not in backup-2 - and that difference occupies just 10 kB.\n\nHowever, we see that in total, our data occupy 113 kB at disk even though the sum of occupied space of all backups\ndoes not match the total - because there are SSTables logically belonging to multiple backups.\n\nPlease keep in mind that this table reflects the reality as long as you do not add nor delete any backup.\n\nIf you want to use different storage location, for example, if your backups are in AWS, use \"--storage-location=s3://...\".\nThe same logic applies for Azure and GCP (`azure://` and `gcp://` respectively).\n\n|===\n|flag |explanation\n\n|--resolve-nodes\n|Resolves cluster name, data center and host id of a node Esop is connected to, otherwise\nit will try skip connecting to that node and it will expect valid --storage-location property.\n\n|--simple-format\n|prints out just names of backups instead of all statistics\n\n|--json\n|prints out a json instead of a table\n\n|--human-units\n|prints human-friendly sizes, e.g 5 kB, 1 GB etc instead of just number of bytes\n\n|--to-file\n|path to file to redirect the output of the command to, file is created when it does not exist\n\n|--from-timestamp\n|expects unix timestamp (also present in backup's name at the end), once set, it will only process backups taken since then, including.\n\n|--last-n\n|expects a postive integer to process only last (the oldest) n backups.\n|===\n\nAll `--json`, `--simple-format` and `--to-file` might be freely turned on / off on demand. 
By\ndefault, it will print a table in complex format to the standard output.\n\n`list` command is receptive to all family of `--jmx-*` settings in order to connect to a running\nCassandra node if necessary.\n\n## Removal of a backup\n\nSince we are storing each SSTable only once, ever, a deletion of a backup is not so straightforward.\n\nRemoval works for file, s3, gcp and azure protocol.\n\nWe may physically delete only an SSTable which is present in exactly one backup. If some particular SSTable\nis present in multiple backups, we might delete that backup _logically_, but we can not\ndelete that SSTable. The underlying logic computes how many backups a particular file is present in\nby scanning all manifests there are and if we specify we want to delete so and so backup, it will\nphysically remove only files which are part of that very backup and are not present anywhere\nelse.\n\nBy doing this, we are not forced to remove only the last backup (for example looking at its timestamp);\nwe can, in general, remove any backup.\n\nThe general workflow is to either list all backups and remove only the one you want, or you can\nspecify `--oldest` to delete the oldest one and you can do this repeatedly. 
If you want to\nremove all backups older than some time, you might get this information from listing the backups by\nspecifying `--from-timestamp` and then you can delete these backups one by one.\n\n----\n$ java -jar esop.jar remove-backup \\\n    --storage-location=file:///backup1/cluster/datacenter1/node1 \\\n    --backup-name=full-backup-name-from-listing-with-timestamp-etc\n----\n\nAll flags:\n\n|===\n|flag |explanation\n\n|--backup-name\n|name of manifest file to delete a backup for (minus .json)\n\n|--oldest\n|removes the oldest backup there is, the backup name does not need to be specified then\n\n|--dry\n|it will not delete files for real, good for evaluation to see what it would do before running it for real\n\n|--resolve-nodes\n|consult list command, same logic\n|===\n\n## Global removal of backups\n\nFrom the previous section, you know how to delete an individual backup. However, it would be\nnice to be able to delete, for example, all backups older than 14 days, globally. \"Globally\" means\nthat it will scan the whole local backup destination of all nodes (all dcs). 
You have the\noption to either do individual removal or global removal.\n\nFor global removal of backups older than 14 days:\n\n[source,bash]\n----\n$ esop remove-backup \\\n    --global-request \\\n    --storage-location=file:///submit/backup/Test-Cluster/dc1/ab3f1d62-1a61-4f84-a2e2-97a626940d8d \\\n    --older-than=14day\n----\n\nIt is enough to specify one node, all other nodes will be resolved automatically.\n\n`--older-than` accepts a format like \"number+unit\", for example \"1h\", \"1minute\".\n\n\nIf you want to run this in a daemon mode - meaning this operation would be run repeatedly, you need to\nexecute it like this:\n\n[source,bash]\n----\n$ esop remove-backup \\\n    --global-request \\\n    --storage-location=file:///submit/backup/Test-Cluster/dc1/ab3f1d62-1a61-4f84-a2e2-97a626940d8d \\\n    --older-than=5minute \\\n    --rate=1minute\n----\n\nThis means it will execute a backup removal every 1 minute and it will delete all backups older than 5 minutes.\nFor more realistic scenarios you might specify `--older-than=14day` and `--rate=1day`. The time for the next\nexecution will count down from the time this command was first executed.\n\nYou also have the possibility to specify datacenters to remove by `--dcs` flag (might be specified multiple times\nfor each dc separately).\n\n## Client-side encryption with AWS KMS\n\nIn order to perform the encryption of your SSTables, so they are stored in a remote AWS S3 bucket already encrypted,\nwe leverage AWS KMS client-side encryption by https://github.com/aws/amazon-s3-encryption-client-java[this library].\n\nHistorically, Esop was using AWS API of version 1, however the library which makes client-side encryption possible\nis using API of version 2. The version 1 and version 2 API can live in one project simultaneously. 
As the AWS KMS encryption\nfeature in Esop is rather new, we decided to write an additional S3 module which uses the V2 API, and\nwe left the V1 API implementation untouched in case users still prefer it. We might eventually switch to\nthe V2 API completely and drop the code using the V1 API.\n\nA user also needs to supply the ID of the KMS key to encrypt data with. Creating a KMS key is out of the scope of this document;\nhowever, keep in mind that such a key has to be symmetric.\n\nAn example of an encrypted backup is shown below:\n\n----\njava -jar esop.jar backup \\\n    --storage-location=s3://instaclustr-oss-esop-bucket \\\n    --data-dir /my/installation/of/cassandra/data/data \\\n    --entities=ks1 \\\n    --snapshot-tag=snapshot-1 \\\n    --kmsKeyId=3bbebd10-7e5f-4fad-997a-89b51040df4c\n----\n\nNotice that we also set `--kmsKeyId`, which references the KMS key in AWS to use for encryption.\n\nThe KMS key ID is also read from the system property `AWS_KMS_KEY_ID` or from the environment variable of the same name.\nThe key ID from the command line takes precedence over the system property, which takes precedence over the environment variable.\n\nIf `--storage-location` is not fully specified, Esop will try to connect to a running node via JMX to resolve\nwhich cluster and datacenter it belongs to and what node ID it has.\n\nThe upload logic for a particular SSTable file is as follows. First, we try\nto refresh the remote object to update its last modification date. The logic leading\nto it is this:\n\n* try to list the tags of a remote object / key in a bucket\n** if such a key is not found, we need to upload the file\n* if the backup is to be encrypted (`--kmsKeyId` is set), we prepare a tag\nwhich has `kmsKey` as its key and the KMS key ID as its value\n* if the tags of a remote key are not set, or if they do not contain the `kmsKey` tag,\nthat means that the remote object exists but is not encrypted. 
Hence, we\nwill need to upload it again, but encrypted this time\n* if we are not skipping the refresh, we will copy the file with the `kmsKey` tag\n\nUpon the actual upload, we check whether `kmsKeyId` is set from the command line (or system / env properties)\nand based on that we use either the encrypting or the non-encrypting S3 client. The encrypting S3\nclient wraps the non-encrypting client. If the encrypting client is used, everything\nit uploads is encrypted on the client and sent to the AWS S3 bucket\nalready encrypted.\n\nBecause of Esop's directory layout and upload logic, a bucket may already contain\na backup which was not encrypted when we decide later on\nto start encrypting. Let's cover this logic in the following example:\n\nLet's have a backup consisting of 3 SSTables: S1, S2 and S3.\n\n----\nbucket:\n    S1\n    S2  - all tables are not encrypted\n    S3\n----\n\nLater, we inserted new data which ended up in SSTables S4 and S5, so we have S1 - S5 on disk. However, now we want to encrypt. We might end up having this in a bucket:\n\n----\nbucket:\n    S1 - not encrypted\n    S2 - not encrypted\n    S3 - not encrypted\n    S4 - encrypted\n    S5 - encrypted\n----\n\nIf we did it like this, we would end up with a partly encrypted backup, which is not desired. For\nthis reason, if we see that there is already an object in the S3 bucket, we need to read its _tags_\nto see what key it was encrypted with. If it was not encrypted (it is not tagged), we know\nthat we need to upload it again, now encrypted. Hence, eventually, all SSTables of a new backup will be encrypted.\n\nIf one backup was not encrypted and another backup was, these two backups may have some\nSSTables in common. 
Imagine this scenario:\n\n----\nbucket:\n    S1 not encrypted, backup 1\n    S2 not encrypted, backup 1\n    S3 not encrypted, backup 1\n----\n\nNow that we have started to encrypt and want to take another backup, imagine that S1 and S2 were compacted into S4 and there are additional SSTables S5 and S6, which will be encrypted:\n\n----\nbucket:\n    S1 not encrypted, backup 1, compacted into S4\n    S2 not encrypted, backup 1, compacted into S4\n    S3 not encrypted, backup 1\n    S4 encrypted, backup 2 - compacted S1 and S2\n    S5 encrypted, backup 2\n    S6 encrypted, backup 2\n----\n\nWe see that we are going to back up S3, S4 (compacted S1 and S2), S5 and S6. S3 is already uploaded,\nbut it is not encrypted, so S3 will be re-uploaded and encrypted. S4, S5 and S6 are not present remotely yet, so all of them will be encrypted and uploaded.\n\nAfter doing so, we see this in the bucket:\n\n----\nbucket:\n    S1 not encrypted, backup 1, compacted into S4\n    S2 not encrypted, backup 1, compacted into S4\n    S3 encrypted, backup 1 and backup 2     // S3 is encrypted from now on\n    S4 encrypted, backup 2 - compacted S1 and S2\n    S5 encrypted, backup 2\n    S6 encrypted, backup 2\n----\n\nBackup 1 consists of SSTables S1, S2 (both non-encrypted) and S3 (encrypted). Backup 2 consists of S3 - S6, all of which are encrypted.\n\nNow, if we remove backup 1, only SSTables S1 and S2 will be removed because S3 is part of\nbackup 2 as well. Once we remove all non-encrypted backups, we are left only with backups whose SSTables are encrypted. Hence, we have converted a bucket with non-encrypted backups into one with encrypted backups only.\n\nThis logic raises these questions:\n\n* What if I already have an encrypted backup and I want to use a different KMS key?\n* What does a restore look like when my backup contains both encrypted and plaintext SSTables? What does it look like when different keys were used?\n\nThe first question is rather easy to answer. 
If you want to use a different KMS key, it is the same\nsituation as when the already uploaded object was not encrypted at all. If we detect (by reading its tags) that an already uploaded\nobject was encrypted with a KMS key different from the one we want to use now,\nwe just need to re-upload that SSTable, encrypted with the new KMS key.\nAll other logic already explained stays the same.\n\nRestoration reads the tags of a remote object to see what KMS key it was encrypted with. If the remote\nobject was stored as plaintext, no wrapping S3 encryption client is used. If the KMS key\nused is the same as the one we supplied on the command line, the already initialized S3 encrypting client is used.\nIf a particular object was encrypted with a KMS key we do not have an S3 encrypting client for yet,\nsuch a client is created dynamically as part of the restoration process and cached to be re-used\nfor the decryption of any other object using the same KMS key.\nThe net result of this logic is that a backup may consist of SSTables encrypted with\nany KMS keys, and as long as each such key exists in AWS KMS and we\ncan reference it, the SSTables will be decrypted just fine.\n\nWe *do not* encrypt Esop's manifest files. This is purely practical. If we encrypted a manifest as well,\noperators would need to decrypt a manifest downloaded from a bucket on their own with some other tool. As a manifest\ndoes not contain any sensitive information and serves solely as a metadata file to see what a particular backup\nconsists of, we chose not to encrypt it to make life easier for operators. The manifest file is the only file\nwhich is not encrypted - all other files are.\n\nWe also decided not to store kmsKeyId in a manifest. It is better if a particular object is tagged with the ID of the key\nit was encrypted with rather than storing that ID in a manifest. If we used different KMS keys over time, manifests would\nbecome obsolete and restoration of such backups would not be possible because the key had changed in the meantime. 
Tags will make\nrestoration in this scenario possible.\n\n## Logging\n\nWe are using logback. There is already a `logback.xml` embedded in the built JAR. However, if you\nwant to customize logging, feel free to provide your own `logback.xml` and point to it like this:\n\n----\njava -Dlogback.configurationFile=my-custom-logback.xml \\\n    -jar instaclustr-backup-restore.jar backup\n----\n\nYou can find the original file in `src/main/resources/logback.xml`.\n\n## Build and Test\n\nThere are end-to-end tests which can test all GCP, Azure, and S3 integrations.\n\nHere are the test groups/profiles:\n\n* azureTests\n* googleTest\n* s3Tests\n* cloudTest - runs tests which use cloud \"buckets\" for backup / restore\n\nThere is no need to create buckets in a cloud beforehand as they will be created and deleted\nautomatically as part of a test, per cloud provider.\n\nCloud tests are executed like this:\n\n----\n$ mvn clean install -PcloudTests\n----\n\nBy default, `mvn install` is invoked with `noCloudTests`, which skips all tests dealing with\nstorage providers other than `file://`.\n\nYou have to specify these system properties to run these tests successfully:\n\n----\n-Dawsaccesskeyid={your aws access key id}\n-Dawssecretaccesskey={your aws secret access key}\n-Dgoogle.application.credentials={path to google application credentials file on local disk}\n-Dazurestorageaccount={your azure storage account}\n-Dazurestoragekey={your azure storage key}\n----\n\nIn order to skip tests altogether, invoke the build like `mvn clean install -DskipTests`.\n\nYou can use the Maven wrapper script so that Maven itself is downloaded automatically. 
The build\nin that case is run as `./mvnw clean install`.\n\nIf you want to build an rpm or deb package, you need to enable the `rpm` and/or `deb` Maven profile.\n\n## Further Information\n\n- Please see https://www.instaclustr.com/support/documentation/announcements/instaclustr-open-source-project-status/ for the Instaclustr support status of this project\n- See Data Backup Documentation (https://www.instaclustr.com/support/documentation/cassandra/cassandra-cluster-operations/cluster-data-backups/)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finstaclustr%2Fesop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Finstaclustr%2Fesop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finstaclustr%2Fesop/lists"}