https://github.com/clickhouse/copier
clickhouse-copier (obsolete)
- Host: GitHub
- URL: https://github.com/clickhouse/copier
- Owner: ClickHouse
- License: apache-2.0
- Created: 2024-03-11T00:11:10.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-03-17T10:32:11.000Z (about 2 years ago)
- Last Synced: 2025-04-09T10:12:15.436Z (12 months ago)
- Language: C++
- Homepage:
- Size: 1.42 MB
- Stars: 11
- Watchers: 4
- Forks: 4
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
README
> [!NOTE]
> This tool is no longer supported, but you can use the latest available version as is.
# clickhouse-copier
Copies data from the tables in one cluster to tables in another (or the same) cluster.
To get a consistent copy, the data in the source tables and partitions should not change during the entire process.
You can run multiple `clickhouse-copier` instances on different servers to perform the same job. ClickHouse Keeper, or ZooKeeper, is used for syncing the processes.
After starting, `clickhouse-copier`:
- Connects to ClickHouse Keeper and receives:
    - Copying jobs.
    - The state of the copying jobs.
- Performs the jobs.
Each running process chooses the “closest” shard of the source cluster and copies the data into the destination cluster, resharding the data if necessary.
`clickhouse-copier` tracks the changes in ClickHouse Keeper and applies them on the fly.
To reduce network traffic, we recommend running `clickhouse-copier` on the same server where the source data is located.
## Download and Install
Download the binaries from the [final release](releases/tag/final).
## Running clickhouse-copier
The utility should be run manually:
``` bash
$ clickhouse-copier --daemon --config keeper.xml --task-path /task/path --base-dir /path/to/dir
```
Parameters:
- `daemon` — Starts `clickhouse-copier` in daemon mode.
- `config` — The path to the `keeper.xml` file with the parameters for the connection to ClickHouse Keeper.
- `task-path` — The path to the ClickHouse Keeper node. This node is used for syncing `clickhouse-copier` processes and storing tasks. Tasks are stored in `$task-path/description`.
- `task-file` — Optional path to a file with the task configuration for the initial upload to ClickHouse Keeper.
- `task-upload-force` — Forces the upload of `task-file` even if the node already exists. Default is false.
- `base-dir` — The path to logs and auxiliary files. When it starts, `clickhouse-copier` creates `clickhouse-copier_YYYYMMHHSS_<PID>` subdirectories in `$base-dir`. If this parameter is omitted, the directories are created in the directory where `clickhouse-copier` was launched.
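
When the copier is launched from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the flags listed above; the helper name `build_copier_command` is hypothetical and not part of the tool, and the sketch prints the command rather than executing it.

```python
import shlex
from typing import List, Optional


def build_copier_command(
    config: str,
    task_path: str,
    base_dir: Optional[str] = None,
    task_file: Optional[str] = None,
    daemon: bool = False,
) -> List[str]:
    """Assemble a clickhouse-copier argument list (hypothetical helper)."""
    cmd = ["clickhouse-copier"]
    if daemon:
        cmd.append("--daemon")
    cmd += ["--config", config, "--task-path", task_path]
    if task_file is not None:
        # Only needed for the initial upload of the task description.
        cmd += ["--task-file", task_file]
    if base_dir is not None:
        # When omitted, the copier uses the directory it was launched from.
        cmd += ["--base-dir", base_dir]
    return cmd


if __name__ == "__main__":
    cmd = build_copier_command(
        config="keeper.xml",
        task_path="/task/path",
        base_dir="/path/to/dir",
        daemon=True,
    )
    # Print instead of running; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.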
## Format of keeper.xml
``` xml
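<clickhouse>
    <logger>
        <level>trace</level>
        <size>100M</size>
        <count>3</count>
    </logger>

    <zookeeper>
        <node index="1">
            <host>127.0.0.1</host>
            <port>2181</port>
        </node>
    </zookeeper>
</clickhouse>
```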
```
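
If `keeper.xml` has to be generated per environment, a small script can produce it. The sketch below assumes the usual ClickHouse layout: a `clickhouse` root element with a `logger` section (`level`, `size`, `count`) and a `zookeeper` section holding one `node` with `host` and `port`. These element names and the function name `build_keeper_config` are assumptions of this sketch; check them against a working configuration before relying on it. `ET.indent` requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def build_keeper_config(
    host: str = "127.0.0.1",
    port: int = 2181,
    log_level: str = "trace",
    log_size: str = "100M",
    log_count: int = 3,
) -> str:
    """Return keeper.xml content as a string (element names are assumed)."""
    root = ET.Element("clickhouse")

    # Logging settings for the copier process.
    logger = ET.SubElement(root, "logger")
    ET.SubElement(logger, "level").text = log_level
    ET.SubElement(logger, "size").text = log_size
    ET.SubElement(logger, "count").text = str(log_count)

    # Connection to a single ClickHouse Keeper / ZooKeeper node.
    zookeeper = ET.SubElement(root, "zookeeper")
    node = ET.SubElement(zookeeper, "node", index="1")
    ET.SubElement(node, "host").text = host
    ET.SubElement(node, "port").text = str(port)

    ET.indent(root, space="    ")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_keeper_config())
```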
## Configuration of Copying Tasks
``` xml
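<clickhouse>
    <!-- Configuration of clusters, as in an ordinary server config -->
    <remote_servers>
        <source_cluster>
            <shard>
                <internal_replication>false</internal_replication>
                <replica>
                    <host>127.0.0.1</host>
                    <port>9000</port>
                </replica>
            </shard>
            ...
        </source_cluster>

        <destination_cluster>
        ...
        </destination_cluster>
    </remote_servers>

    <!-- Maximum number of simultaneously active workers -->
    <max_workers>2</max_workers>

    <!-- Settings used to fetch (pull) data from the source cluster tables -->
    <settings_pull>
        <readonly>1</readonly>
    </settings_pull>

    <!-- Settings used to insert (push) data into the destination cluster tables -->
    <settings_push>
        <readonly>0</readonly>
    </settings_push>

    <!-- Common settings for pull and push operations -->
    <settings>
        <connect_timeout>3</connect_timeout>
        <insert_distributed_sync>1</insert_distributed_sync>
    </settings>

    <tables>
        <!-- A table task; copies one table -->
        <table_hits>
            <!-- Source cluster name and the table to copy -->
            <cluster_pull>source_cluster</cluster_pull>
            <database_pull>test</database_pull>
            <table_pull>hits</table_pull>

            <!-- Destination cluster name and the table to insert into -->
            <cluster_push>destination_cluster</cluster_push>
            <database_push>test</database_push>
            <table_push>hits2</table_push>

            <!-- Engine of the destination tables -->
            <engine>
            ENGINE=ReplicatedMergeTree('/clickhouse/tables/{cluster}/{shard}/hits2', '{replica}')
            PARTITION BY toMonday(date)
            ORDER BY (CounterID, EventDate)
            </engine>

            <!-- Sharding key used to insert data into the destination cluster -->
            <sharding_key>jumpConsistentHash(intHash64(UserID), 2)</sharding_key>

            <!-- Optional expression that filters data while pulling it from the source servers -->
            <where_condition>CounterID != 0</where_condition>

            <!-- Optional list of partitions to copy -->
            <enabled_partitions>
                <partition>'2018-02-26'</partition>
                <partition>'2018-03-05'</partition>
                ...
            </enabled_partitions>
        </table_hits>

        <!-- Next table to copy -->
        <table_visits>
        ...
        </table_visits>
        ...
    </tables>
</clickhouse>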
```
`clickhouse-copier` tracks the changes in `/task/path/description` and applies them on the fly. For instance, if you change the value of `max_workers`, the number of processes running tasks will also change.
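
The sharding expression in the task configuration, `jumpConsistentHash(intHash64(UserID), 2)`, maps each row to one of two destination shards. To illustrate the idea, here is a sketch of the published jump consistent hash algorithm (Lamping and Veach). It is not taken from the ClickHouse sources and is not guaranteed to match ClickHouse's output bit for bit; `intHash64` is not reproduced, so the keys below are plain integers chosen for illustration.

```python
def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Map a 64-bit integer key to a bucket in [0, num_buckets)."""
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    key &= 0xFFFFFFFFFFFFFFFF  # treat the key as an unsigned 64-bit value
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step; the mask emulates unsigned overflow.
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        # Jump forward to the next bucket this key would move to.
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


if __name__ == "__main__":
    # Illustrative keys only; real rows would be hashed from UserID first.
    for user_key in (1, 2, 3, 42):
        print(user_key, "->", jump_consistent_hash(user_key, 2))
```

A useful property of this scheme is that growing the number of buckets from `n` to `n + 1` only moves keys into the new bucket; keys never move between the existing ones, which keeps resharding traffic low.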
## Build from sources
You don't have to. Download the binaries from the [final release](releases/tag/final).
If you still want to build it yourself, check out the repository snapshot https://github.com/ClickHouse/ClickHouse/tree/1179a70c21eeca88410a012a73a49180cc5e5e2e and follow the normal ClickHouse build procedure. The resulting `clickhouse` binary includes the copier tool.