{"id":26898589,"url":"https://github.com/tianlangstudio/dataxserver","last_synced_at":"2025-05-13T01:49:49.714Z","repository":{"id":19316527,"uuid":"86768065","full_name":"TianLangStudio/DataXServer","owner":"TianLangStudio","description":"Adds remote multi-language invocation (Thrift Server, HTTP Server) and distributed execution (DataX on YARN) to DataX (https://github.com/alibaba/DataX)","archived":false,"fork":false,"pushed_at":"2023-04-18T08:50:54.000Z","size":1136,"stargazers_count":144,"open_issues_count":11,"forks_count":72,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-05-13T01:49:45.307Z","etag":null,"topics":["data-hamal","datax","datax-server","dataxonyarn","dataxserver","etl","hamal","http","http-server","server","thrift","thrift-server","yarn"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TianLangStudio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-03-31T02:08:56.000Z","updated_at":"2024-06-15T14:11:46.000Z","dependencies_parsed_at":"2023-01-11T20:25:16.096Z","dependency_job_id":null,"html_url":"https://github.com/TianLangStudio/DataXServer","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TianLangStudio%2FDataXServer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TianLangStudio%2FDataXServer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TianLangStudio%2FDataXServer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TianLangStudio%2FDataXServer/
manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TianLangStudio","download_url":"https://codeload.github.com/TianLangStudio/DataXServer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253856616,"owners_count":21974576,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-hamal","datax","datax-server","dataxonyarn","dataxserver","etl","hamal","http","http-server","server","thrift","thrift-server","yarn"],"created_at":"2025-04-01T05:48:08.636Z","updated_at":"2025-05-13T01:49:49.696Z","avatar_url":"https://github.com/TianLangStudio.png","language":"Scala","readme":"*DataX Server*\n================  \n\nAdds remote invocation (Thrift Server, HTTP Server) and distributed execution (DataX on YARN) to [DataX](https://github.com/alibaba/DataX).\n   \n**Feature**\n---------------\n- 1. Thrift Server \n- 2. DataX on YARN\n- 3. HTTP Server \n- 4. Single-machine, multi-threaded execution\n- 5. Single-machine, multi-process execution\n- 6. Distributed execution (on YARN)\n- 7. Hybrid execution (YARN plus local multi-process)\n- 8. 
Auto scaling\n## TODO\n- ~~1.Http Server~~   \n- ~~2.Refactor the code~~    \n- ~~3.Split the code into sub-projects by feature and reorganize package names so new features are easier to add~~\n- 4.Improve the documentation and examples\n\n## Deploy\n   Download the release package [DataXServer-0.0.1.tar.gz](http://pan.baidu.com/s/1hrHcbqs), extract it, and enter the 0.0.1 directory     \n   \n   Start the Thrift Server\n   ```shell\n   ./bin/startThriftServer.sh     \n   ```\n   Submit a test task to the Thrift Server with NodeJS  \n   ```shell\n   cd example/nodejs    \n   node submitStream2Stream.js \n   ```\n     \n   \n   \n   \n**Develop**\n---------------  \n  ### Download the source code\n  __The project depends on Alibaba DataX__\n  ```bash\n  git clone https://github.com/alibaba/DataX.git \n  cd DataX    \n  mvn install\n  \n  git clone https://github.com/TianLangStudio/DataXServer.git  \n  cd DataXServer  \n  mvn clean compile install -DskipTests\n  ```\n  ### Run the HTTP server in single-machine multi-threaded mode (assumes DataX is already deployed and can run job/test_job.json)\n  - Configure the DataX installation directory\n  \u003e Set the datax-home property in pom.xml to the directory where DataX is deployed\n  ```xml\n   \u003cdatax-home\u003e/data/test/datax\u003c/datax-home\u003e\n  ```\n  - Start the HTTP server\n  ```bash\n   cd httpserver\n   mvn scala:run -Dlauncher=httpserver -DskipTests\n  ```\n  - Submit a task and get its task ID\n  ```bash\n  curl -XPOST -d \"@/path/to/test-file\" 127.0.0.1:9808/dataxserver/task\n```\n  \u003e tianlang@tianlang:job$ curl  -XPOST -d \"@job/test_job.json\" 127.0.0.1:9808/dataxserver/task  \n  \u003e 0 (the task ID)\n  - Get the task's status, result, and elapsed time\n  ```bash\n  curl  127.0.0.1:9808/dataxserver/task/status/0\n  curl  127.0.0.1:9808/dataxserver/task/0\n  curl  127.0.0.1:9808/dataxserver/task/cost/0\n```\n![Log of a successful run](https://raw.githubusercontent.com/TianLangStudio/DataXServer/master/images/test_job_success.png) \n### Run in single-machine multi-process mode\n- Configure the DataX installation directory       \n        Same as in multi-threaded mode\n- Start the server\n ```bash\n   cd hamal-yarn\n   mvn scala:run -Dlauncher=httpserver-mp -DskipTests\n  ```\n- Submit and run tasks the same way as in multi-threaded mode  \n\n### Run in multi-machine multi-process mode (On Yarn)\n- Configure the DataX installation directory\nSet the datax.home property in hamal-yarn/src/main/resources/master.conf to the\nDataX installation directory  \n- Package\n```bash\ncd hamal-yarn\nmvn clean package -DskipTests\n\n```\n\n- Upload the jar to HDFS\nUpload hamal-yarn/target/hamal-yarn-*-with-dependencies.jar to HDFS at 
/app/hamal/master.jar \nUpload hamal-yarn/target/hamal-yarn-*-package.zip to HDFS at /app/hamal/executor.zip\n```bash\nhdfs dfs -put hamal-yarn-*-with-dependencies.jar /app/hamal/master.jar\nhdfs dfs -put hamal-yarn-*-package.zip /app/hamal/executor.zip\n\n```\n\n- Run the Master\n```bash\nyarn jar hamal-yarn-*-with-dependencies.jar  org.tianlangstudio.data.hamal.yarn.Client /app/hamal/master.jar\n```\nThe running Master is visible in the YARN UI\n\n- Submit and run tasks the same way as in multi-threaded mode\n\nAfter a task is submitted, the number of containers increases, and the Master log shows the current number of executors.\nThe maximum number of executors can be configured in master.conf; setting local.num.max to a non-zero value allows executors to be started on the local machine.\nAn executor is destroyed automatically after it has been idle for a while.\n\n![On Yarn](https://raw.githubusercontent.com/TianLangStudio/DataXServer/master/images/onyarn.png) \n![Hamal Master On Yarn Log](https://raw.githubusercontent.com/TianLangStudio/DataXServer/master/images/yarn-log.png) \n\n***For production use, consider changing the ID generation strategy, how submitted tasks are stored, and so on***  \n\n## QA\n- Compilation fails\n\u003e Check whether a dependency failed to download; you can install the dependencies locally  \n\u003e You can also try commenting out the `recompileMode` setting in the pom file  \n- Does DataX have to be installed on every machine in the cluster?  \n\u003e No. DataX can be bundled into the Executor deployment zip and placed on HDFS  \n- Do the Executor and Master communicate over HTTP or Thrift?  \n\u003e Communication between the Executor and Master is built on Akka  \n- Does the number of Executors grow and shrink with the number of tasks?  \n\u003e Yes, but it never exceeds the configured maximum number of Executors\n           \n## Document\nTODO\n## Join the group for questions and discussion\nQQ group: 579896894\n----------------\n![KeepLearning QQ](https://raw.githubusercontent.com/TianLangStudio/DataXServer/master/images/tianlangstudio-keeplearning-qrcode.jpg)  \n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftianlangstudio%2Fdataxserver","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftianlangstudio%2Fdataxserver","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftianlangstudio%2Fdataxserver/lists"}