https://github.com/clickhouse/copier

clickhouse-copier (obsolete)

Host: GitHub
URL: https://github.com/clickhouse/copier
Owner: ClickHouse
License: apache-2.0
Created: 2024-03-11T00:11:10.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-03-17T10:32:11.000Z (over 2 years ago)
Last Synced: 2025-04-09T10:12:15.436Z (over 1 year ago)
Language: C++
Homepage:
Size: 1.42 MB
Stars: 11
Watchers: 4
Forks: 4
Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md


> [!NOTE]
> This tool is no longer supported, but you can use the latest available version as is.

# clickhouse-copier

Copies data from the tables in one cluster to tables in another (or the same) cluster.

To get a consistent copy, the data in the source tables and partitions should not change during the entire process.

You can run multiple `clickhouse-copier` instances on different servers to perform the same job. ClickHouse Keeper, or ZooKeeper, is used for syncing the processes.

After starting, `clickhouse-copier`:

- Connects to ClickHouse Keeper and receives:

    - Copying jobs.
    - The state of the copying jobs.

- Performs the jobs.

Each running process chooses the “closest” shard of the source cluster and copies the data into the destination cluster, resharding the data if necessary.

`clickhouse-copier` tracks the changes in ClickHouse Keeper and applies them on the fly.

To reduce network traffic, we recommend running `clickhouse-copier` on the same server where the source data is located.

## Download and Install

Download the binaries from the [final release](releases/tag/final).

## Running clickhouse-copier

The utility should be run manually:

``` bash
$ clickhouse-copier --daemon --config keeper.xml --task-path /task/path --base-dir /path/to/dir
```

Parameters:

- `daemon` — Starts `clickhouse-copier` in daemon mode.
- `config` — The path to the `keeper.xml` file with the parameters for the connection to ClickHouse Keeper.
- `task-path` — The path to the ClickHouse Keeper node. This node is used for syncing `clickhouse-copier` processes and storing tasks. Tasks are stored in `$task-path/description`.
- `task-file` — Optional path to file with task configuration for initial upload to ClickHouse Keeper.
- `task-upload-force` — Force upload `task-file` even if node already exists. Default is false.
- `base-dir` — The path to logs and auxiliary files. When it starts, `clickhouse-copier` creates `clickhouse-copier_YYYYMMHHSS_<PID>` subdirectories in `$base-dir`. If this parameter is omitted, the directories are created in the directory where `clickhouse-copier` was launched.
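
If the copier is launched from a script, the parameters above can be assembled programmatically. The following is a minimal sketch in Python, not part of the tool: the helper name `build_copier_command` is hypothetical, and it only builds the argument list without running anything. `task-upload-force` is left out on purpose, because the text above does not say how its value is passed.

```python
import shlex
from typing import List, Optional


def build_copier_command(
    config: str,
    task_path: str,
    base_dir: Optional[str] = None,
    task_file: Optional[str] = None,
    daemon: bool = True,
) -> List[str]:
    """Assemble the argument list for clickhouse-copier (hypothetical helper).

    Only flags documented in the parameter list are used. Nothing is executed.
    """
    cmd = ["clickhouse-copier"]
    if daemon:
        cmd.append("--daemon")
    cmd += ["--config", config, "--task-path", task_path]
    if task_file is not None:
        # Optional: task configuration for the initial upload to Keeper.
        cmd += ["--task-file", task_file]
    if base_dir is not None:
        # Optional: when omitted, the copier uses its launch directory.
        cmd += ["--base-dir", base_dir]
    return cmd


command = build_copier_command("keeper.xml", "/task/path", base_dir="/path/to/dir")
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run` as is. Passing a list rather than a single string avoids shell-quoting problems with paths.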

## Format of keeper.xml

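``` xml
<clickhouse>
    <logger>
        <level>trace</level>
        <size>100M</size>
        <count>3</count>
    </logger>

    <zookeeper>
        <node index="1">
            <host>127.0.0.1</host>
            <port>2181</port>
        </node>
    </zookeeper>
</clickhouse>
```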
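
When the Keeper address differs between environments, the file can be generated instead of edited by hand. This is a minimal sketch under stated assumptions: the element names (`clickhouse`, `logger`, `level`, `size`, `count`, `zookeeper`, `node`, `host`, `port`) follow the usual ClickHouse configuration layout and should be checked against your version, and the helper name `build_keeper_config` is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_keeper_config(host, port, log_level="trace", log_size="100M", log_count=3):
    """Return keeper.xml content as a string (hypothetical helper).

    Element names are assumptions based on the usual ClickHouse config layout.
    """
    root = ET.Element("clickhouse")

    # Logging settings: level, rotation size, number of rotated files.
    logger = ET.SubElement(root, "logger")
    ET.SubElement(logger, "level").text = log_level
    ET.SubElement(logger, "size").text = log_size
    ET.SubElement(logger, "count").text = str(log_count)

    # A single ClickHouse Keeper / ZooKeeper node.
    zookeeper = ET.SubElement(root, "zookeeper")
    node = ET.SubElement(zookeeper, "node", index="1")
    ET.SubElement(node, "host").text = host
    ET.SubElement(node, "port").text = str(port)

    # ET.indent exists only on Python 3.9+; skip pretty-printing otherwise.
    if hasattr(ET, "indent"):
        ET.indent(root, space="    ")
    return ET.tostring(root, encoding="unicode")


keeper_xml = build_keeper_config("127.0.0.1", 2181)
print(keeper_xml)
```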

## Configuration of Copying Tasks

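``` xml
<clickhouse>
    <remote_servers>
        <source_cluster>
            <shard>
                <internal_replication>false</internal_replication>
                <replica>
                    <host>127.0.0.1</host>
                    <port>9000</port>
                </replica>
            </shard>
            ...
        </source_cluster>

        <destination_cluster>
        ...
        </destination_cluster>
    </remote_servers>

    <settings>
        <connect_timeout>3</connect_timeout>

        <insert_distributed_sync>1</insert_distributed_sync>
    </settings>

    <tables>
        <table_hits>
            <cluster_pull>source_cluster</cluster_pull>
            <database_pull>test</database_pull>
            <table_pull>hits</table_pull>

            <cluster_push>destination_cluster</cluster_push>
            <database_push>test</database_push>
            <table_push>hits2</table_push>

            <engine>
            ENGINE=ReplicatedMergeTree('/clickhouse/tables/{cluster}/{shard}/hits2', '{replica}')
            PARTITION BY toMonday(date)
            ORDER BY (CounterID, EventDate)
            </engine>

            <sharding_key>jumpConsistentHash(intHash64(UserID), 2)</sharding_key>

            <where_condition>CounterID != 0</where_condition>

            <enabled_partitions>
                <partition>'2018-02-26'</partition>
                <partition>'2018-03-05'</partition>
                ...
            </enabled_partitions>
        </table_hits>

        <table_visits>
        ...
        </table_visits>
        ...
    </tables>
</clickhouse>
```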
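
The sharding expression in the example, `jumpConsistentHash(intHash64(UserID), 2)`, maps each row to one of two destination shards. To illustrate the idea, here is a sketch of the published jump consistent hash algorithm (Lamping and Veach) in Python. It is not claimed to reproduce ClickHouse's output bit for bit, and `intHash64` is not reimplemented: the sample keys are plain integers standing in for already-hashed user IDs.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF  # emulate unsigned 64-bit arithmetic


def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Map a 64-bit key to a bucket in [0, num_buckets)."""
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    key &= MASK_64
    bucket, candidate = -1, 0
    while candidate < num_buckets:
        bucket = candidate
        # Linear congruential step produces the next pseudo-random value.
        key = (key * 2862933555777941757 + 1) & MASK_64
        # Jump forward; the candidate always grows by at least one.
        candidate = int((bucket + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return bucket


# Stand-in keys (assumption: real keys would come from intHash64(UserID)).
sample_keys = range(1000)
two_shards = [jump_consistent_hash(k, 2) for k in sample_keys]
three_shards = [jump_consistent_hash(k, 3) for k in sample_keys]
print({shard: two_shards.count(shard) for shard in (0, 1)})
```

The useful property for resharding is that growing from `n` to `n + 1` buckets only moves keys into the new bucket; keys never move between the existing ones.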

`clickhouse-copier` tracks the changes in `/task/path/description` and applies them on the fly. For instance, if you change the value of `max_workers`, the number of processes running tasks will also change.
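
As an illustration of such a change, the sketch below edits `max_workers` in a task description held as an XML string. It is a hedged example, not the tool's own mechanism: the helper name `set_max_workers`, the `clickhouse` root element, and the sample document are assumptions, and writing the result back to the `description` node in ClickHouse Keeper is left out.

```python
import xml.etree.ElementTree as ET


def set_max_workers(description_xml: str, max_workers: int) -> str:
    """Return the task description with max_workers set (hypothetical helper)."""
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    root = ET.fromstring(description_xml)
    element = root.find("max_workers")
    if element is None:
        # Add the element when the description does not define it yet.
        element = ET.SubElement(root, "max_workers")
    element.text = str(max_workers)
    return ET.tostring(root, encoding="unicode")


# Minimal stand-in for a task description (assumption, not a full config).
example = "<clickhouse><max_workers>2</max_workers></clickhouse>"
updated = set_max_workers(example, 4)
print(updated)
```

Note that `ET.fromstring` discards XML comments by default, so a description that relies on comments for documentation would lose them in this round trip.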

## Build from sources

You don't have to. Download the binaries from the [final release](releases/tag/final).

But if you want, use the following repository snapshot https://github.com/ClickHouse/ClickHouse/tree/1179a70c21eeca88410a012a73a49180cc5e5e2e and proceed with the normal ClickHouse build. The built `clickhouse` binary will contain the copier tool.
