{"id":20572008,"url":"https://github.com/netease/lakehouse-benchmark-ingestion","last_synced_at":"2026-03-15T11:15:37.887Z","repository":{"id":63274037,"uuid":"561644537","full_name":"NetEase/lakehouse-benchmark-ingestion","owner":"NetEase","description":"A ingestion tool for Lakehouse benchmark","archived":false,"fork":false,"pushed_at":"2022-12-06T02:11:32.000Z","size":510,"stargazers_count":4,"open_issues_count":2,"forks_count":3,"subscribers_count":13,"default_branch":"master","last_synced_at":"2025-04-14T17:07:45.405Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NetEase.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-11-04T06:41:34.000Z","updated_at":"2023-08-15T09:55:06.000Z","dependencies_parsed_at":"2023-01-23T05:45:36.351Z","dependency_job_id":null,"html_url":"https://github.com/NetEase/lakehouse-benchmark-ingestion","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NetEase%2Flakehouse-benchmark-ingestion","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NetEase%2Flakehouse-benchmark-ingestion/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NetEase%2Flakehouse-benchmark-ingestion/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NetEase%2Flakehouse-benchmark-ingestion/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NetEase","download_url":"https://codeload.github.com/NetEase/lakehouse-benc
hmark-ingestion/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248923765,"owners_count":21183953,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-16T05:18:01.755Z","updated_at":"2026-03-15T11:15:32.854Z","avatar_url":"https://github.com/NetEase.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Overview\nWelcome to lakehouse-benchmark-ingestion, the data synchronization tool of [lakehouse-benchmark](https://github.com/NetEase/lakehouse-benchmark), NetEase's open-source performance benchmark for data lakes. The tool is built on Flink-CDC and synchronizes data from a database to a data lake in real time.\n\n## Quick Start\n1. Download the release package with `wget https://github.com/NetEase/lakehouse-benchmark-ingestion/releases/download/beta-1.0-arctic-0.4/lakehouse_benchmark_ingestion.tar.gz`, then extract it with `tar -zxvf lakehouse_benchmark_ingestion.tar.gz` to obtain lakehouse-benchmark-ingestion-1.0-SNAPSHOT.jar and the conf directory.\n2. Edit ingestion-conf.yaml in the conf directory and fill in the configuration items.\n3. Start the data synchronization tool with `java -cp lakehouse-benchmark-ingestion-1.0-SNAPSHOT.jar com.netease.arctic.benchmark.ingestion.MainRunner -confDir [confDir] -sinkType [arctic/iceberg/hudi] -sinkDatabase [dbName]`.\n4. Open the Flink Web UI at `localhost:8081` to monitor the synchronization.\n\nBesides starting lakehouse-benchmark-ingestion from the release package as described in [Quick Start](#quick-start), you can also build it from source or deploy it with Docker.\n## Build from Source\n1. Clone the project code: `git clone https://github.com/NetEase/lakehouse-benchmark-ingestion.git`\n2. Build the dependencies the project requires, following [Notes on Building from Source](#notes-on-building-from-source).\n3. 
Compile the project with `mvn clean install -DskipTests`. Go to the target directory and extract the package with `tar -zxvf lakehouse_benchmark_ingestion.tar.gz` to obtain lakehouse-benchmark-ingestion-1.0-SNAPSHOT.jar and the conf directory.\n4. Edit ingestion-conf.yaml in the conf directory and fill in the configuration items.\n5. Start the data synchronization tool with `java -cp lakehouse-benchmark-ingestion-1.0-SNAPSHOT.jar com.netease.arctic.benchmark.ingestion.MainRunner -confDir [confDir] -sinkType [arctic/iceberg/hudi] -sinkDatabase [dbName]`.\n6. Open the Flink Web UI at `localhost:8081` to monitor the synchronization.\n\n## Deploy with Docker\nSee the [Docker deployment documentation](https://github.com/NetEase/lakehouse-benchmark/tree/master/docker/benchmark) of the [lakehouse-benchmark](https://github.com/NetEase/lakehouse-benchmark) project.\n\n## Supported Parameters\n### Command-Line Parameters\n\n| Parameter | Required | Default | Description |\n|-----------|----------|---------|-------------|\n| confDir | Yes | (none) | Absolute path of the directory containing the configuration file ingestion-conf.yaml. If you followed Quick Start, this is the path of the extracted conf directory |\n| sinkType | Yes | (none) | Format of the target data lake; supports Arctic/Iceberg/Hudi |\n| sinkDatabase | Yes | (none) | Name of the target database |\n| restPort | No | 8081 | Port of the Flink Web UI |\n\n### Configuration File Parameters\nAll of the following parameters can be set in conf/ingestion-conf.yaml.\n\n| Parameter | Required | Default | Description |\n|-----------|----------|---------|-------------|\n| source.type | Yes | (none) | Type of the source database; currently only MySQL is supported |\n| source.username | Yes | (none) | Source database username |\n| source.password | Yes | (none) | Source database password |\n| source.hostname | Yes | (none) | Source database host address
 |\n| source.port | Yes | (none) | Source database port |\n| source.table.name | No | * | Names of the tables to synchronize; multiple tables can be specified. By default, the entire database is synchronized |\n| source.scan.startup.mode | No | initial | Startup mode of the MySQL CDC connector when consuming the binlog; supports initial/latest-offset |\n| source.server.timezone | No | (none) | Session time zone of the MySQL database server |\n| source.parallelism | No | 4 | Task parallelism for reading source data |\n| hadoop.user.name | No | (none) | Value to set for HADOOP_USER_NAME |\n\n**Arctic parameters**\n\n| Parameter | Required | Default | Description |\n|-----------|----------|---------|-------------|\n| arctic.metastore.url | Yes | (none) | URL of the Arctic metastore |\n| arctic.optimize.enable | Yes | true | Whether to enable Arctic Optimize |\n| arctic.optimize.group.name | No | default | Arctic Optimizer resource group |\n| arctic.optimize.table.quota | No | (none) | Optimizer resource quota for Arctic tables; multiple tables can be specified, passed as a Map |\n| arctic.write.upsert.enable | No | false | Whether to enable upsert |\n| arctic.sink.parallelism | No | 4 | Parallelism of the Arctic Writer |\n\n**Iceberg parameters**\n\n| Parameter | Required | Default | Description |\n|-----------|----------|---------|-------------|\n| iceberg.uri | Yes | (none) | Thrift URI of the Hive metastore |\n| iceberg.warehouse | Yes | (none) | Hive warehouse location |\n| iceberg.catalog-type | No | hive | Iceberg 
catalog type; supports hive/hadoop |\n| iceberg.write.upsert.enable | No | false | Whether to enable upsert |\n| iceberg.sink.parallelism | No | 4 | Parallelism of the Iceberg Writer |\n\n**Hudi parameters**\n\n| Parameter | Required | Default | Description |\n|-----------|----------|---------|-------------|\n| hudi.catalog.path | Yes | (none) | Path of the Hudi Catalog |\n| hudi.hive_sync.enable | No | true | Whether to enable Hive sync |\n| hudi.hive_sync.metastore.uris | No | (none) | Hive Metastore URL; required when Hive sync is enabled |\n| hudi.table.type | No | MERGE_ON_READ | Table type; supports MERGE_ON_READ/COPY_ON_WRITE |\n| hudi.read.tasks | No | 4 | Parallelism of the read operators |\n| hudi.compaction.tasks | No | 4 | Parallelism of online compaction |\n| hudi.write.tasks | No | 4 | Parallelism of the write operators |\n\n\n## Supported Databases and Data Lake Formats\n### Source Databases\n1. [MySQL](https://www.mysql.com/)\n### Target Data Lake Formats\n1. [Arctic](https://arctic.netease.com/ch/)\n2. [Iceberg](https://iceberg.apache.org/)\n3. 
[Hudi](https://hudi.apache.org/cn/)\n\n\n## Notes on Building from Source\n* The arctic-flink-runtime-1.14 dependency used by this project must be compiled from the Arctic source code. Download the [Arctic project](https://github.com/NetEase/arctic) and check out the 0.4.x branch with `git clone https://github.com/NetEase/arctic.git -b 0.4.x`, then build it with `mvn clean install -DskipTests -pl '!trino'`.\n* The hudi-flink1.14-bundle_2.12 dependency used by this project must be compiled from the Hudi source code. Download the [Hudi project](https://github.com/apache/hudi) and check out release-0.11.1 with `git clone https://github.com/apache/hudi.git -b release-0.11.1`, then build it with `mvn clean install -DskipTests -Dflink1.14 -Dscala-2.12`.\n\n## Dependency Versions\n\n| Maven dependency | Version |\n|------------------|---------|\n| Flink | 1.14.6 |\n| Flink-CDC | 2.3.0 |\n| Hadoop | 2.9.2 |\n| Hive | 2.1.1 |\n| Arctic | 0.4.x |\n| Iceberg | 0.14.0 |\n| Hudi | 0.11.1 |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnetease%2Flakehouse-benchmark-ingestion","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnetease%2Flakehouse-benchmark-ingestion","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnetease%2Flakehouse-benchmark-ingestion/lists"}