{"id":18524962,"url":"https://github.com/paddlepaddle/elasticctr","last_synced_at":"2025-08-21T15:31:18.704Z","repository":{"id":37559431,"uuid":"225549581","full_name":"PaddlePaddle/ElasticCTR","owner":"PaddlePaddle","description":"ElasticCTR, PaddlePaddle's elastic-computing recommender system, is an open-source, enterprise-grade recommender system solution built on Kubernetes. It combines high-accuracy CTR models refined over time in Baidu's business scenarios, the large-scale distributed training capability of the open-source PaddlePaddle framework, and an industrial-grade elastic scheduling service for sparse parameters, helping users deploy a recommender system in a Kubernetes environment with one click. It offers high performance, industrial-grade deployment, and an end-to-end experience, and as an open-source suite it supports in-depth secondary development.","archived":false,"fork":false,"pushed_at":"2020-07-11T05:12:47.000Z","size":1422,"stargazers_count":180,"open_issues_count":3,"forks_count":45,"subscribers_count":9,"default_branch":"master","last_synced_at":"2024-11-14T02:11:55.362Z","etag":null,"topics":["ctr","hdfs","k8s","personalization","ranking","recommender-system"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PaddlePaddle.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-12-03T06:42:00.000Z","updated_at":"2024-09-05T08:28:01.000Z","dependencies_parsed_at":"2022-08-02T15:30:59.014Z","dependency_job_id":null,"html_url":"https://github.com/PaddlePaddle/ElasticCTR","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PaddlePaddle%2FElasticCTR","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PaddlePaddle%2FElasticCTR/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PaddlePaddle%2FElasticCTR/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PaddlePaddle%2FElasticCTR/manifests","owne
r_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PaddlePaddle","download_url":"https://codeload.github.com/PaddlePaddle/ElasticCTR/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230520391,"owners_count":18238948,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ctr","hdfs","k8s","personalization","ranking","recommender-system"],"created_at":"2024-11-06T17:44:01.112Z","updated_at":"2024-12-20T01:14:20.978Z","avatar_url":"https://github.com/PaddlePaddle.png","language":"Python","readme":"# ElasticCTR\n\nElasticCTR is a solution for one-click deployment of distributed training for CTR prediction tasks and of the serving pipeline. Users only need to configure the data source and the sample format to run the full sequence of training and prediction tasks.\n\n* [1. Overview](#head1)\n* [2. Cluster configuration](#head2)\n* [3. One-click deployment tutorial](#head3)\n* [4. Training progress tracking](#head4)\n* [5. Prediction service](#head5)\n\n## \u003cspan id='head1'\u003e1. 
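Overview\u003c/span\u003e\n\nThis project provides an end-to-end solution for CTR training and secondary development. Its main features are:\n\n1. Fast deployment\n\nThe solution ElasticCTR currently provides is deployed on a Baidu Cloud Kubernetes cluster; users can easily extend it to run ElasticCTR in other native Kubernetes environments.\n\n2. High performance\n\nElasticCTR uses the fully asynchronous distributed training mode provided by PaddlePaddle. While preserving model training quality, its near-linear scalability substantially reduces training resource costs. For online serving, ElasticCTR uses the high-throughput, low-latency sparse-parameter prediction engine in Paddle Serving, whose throughput under high concurrency is more than 10 times that of common open-source components.\n\n3. Customizable\n\nThrough a unified configuration file, users can change the training mode and basic settings, including online or offline training, the metrics visualized during training, and the storage configuration on HDFS. Beyond the configuration file, ElasticCTR uses a fully open-source software stack, which makes rapid secondary development and adaptation straightforward. The underlying Kubernetes and Volcano make it easy to apply flexible scheduling policies to upper-layer jobs. With PaddlePaddle's flexible network-building capability, its distributed training engine Fleet, and the remote prediction service Paddle Serving, users can iterate quickly on the training model, the parallel training mode, and the remote prediction service. With the training-job visualization provided by MLflow, users can quickly add whatever metrics system monitoring requires.\n\n\n\nFor the overall structure of this solution, see [ElasticCTR architecture](elasticctr_arch.md).\n\n\u003cp align=\"center\"\u003e\n    \u003cbr\u003e\n\u003cimg src='doc/ElasticCTR.png' width = \"800\" height = \"300\"\u003e\n    \u003cbr\u003e\n\u003c/p\u003e\n\n## \u003cspan id='head2'\u003e2. Cluster configuration\u003c/span\u003e\n\nBefore running this solution, you need a working Kubernetes (k8s) cluster with the Volcano component installed. Deploying a k8s environment is fairly involved and is not covered here. Baidu AI Cloud's CCE container engine is ready to use once you have applied for it; to create a k8s cluster on Baidu Cloud, see [Tutorial and user guide for creating k8s on Baidu Cloud](cluster_config.md). ElasticCTR can also be deployed on other clouds; see [Creating a k8s cluster on Huawei Cloud](huawei_k8s.md) and [Creating a k8s cluster on AWS](aws_k8s.md).\n\nOnce the k8s cluster is ready, configure HDFS as the source of the dataset; see the [HDFS configuration tutorial](HDFS_TUTORIAL.md).\n\n\n## \u003cspan id='head3'\u003e3. 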
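One-click deployment tutorial\u003c/span\u003e\n\nYou can deploy with the provided script `elastic-control.sh`. Before running it, make sure python3 is installed on your machine and that mlflow has been installed through pip, using the following command:\n```bash\npython3 -m pip install mlflow -i https://pypi.tuna.tsinghua.edu.cn/simple\n```\nThe script is used as follows:\n```bash\nbash elastic-control.sh [COMMAND] [OPTIONS]\n```\nThe available commands (COMMAND) are:\n- **-c|--config_client**    Retrieve the client binary used to send prediction service requests and receive prediction results\n- **-r|--config_resource**  Define the training configuration\n- **-a|--apply**            Apply the configuration and start training\n- **-l|--log**              Print the training status; make sure training has already been started\n\nWhen defining the training configuration, add options (OPTIONS) to specify the resources to configure. The available options are:\n- **-u|--cpu**              Number of CPU cores per training node\n- **-m|--mem**              Memory capacity per node\n- **-t|--trainer**          Number of trainer nodes\n- **-p|--pserver**          Number of parameter-server nodes\n- **-b|--cube**             Number of cube shards\n- **-hd|--hdfs_address**    HDFS address where the data files are stored\n\nNote: your data files must follow this format:\n```\n$show $click $feasign0:$slot0 $feasign1:$slot1 $feasign2:$slot2......\n```\nFor example:\n```\n1 0 17241709254077376921:0 132683728328325035:1 9179429492816205016:2 12045056225382541705:3\n```\n\n- **-f|--datafile**         Data path file; it must give the HDFS address and specify the start and end dates (the end date is optional)\n- **-s|--slot_conf**        Feature slot configuration file; note that the file suffix must be '.txt'\n\nBelow is the `data.config` file, in which `START_DATE_HR` and `END_DATE_HR` correspond to the HDFS paths configured in the previous step.\n```\nexport HDFS_ADDRESS=\"hdfs://${IP}:9000\" # HDFS address\nexport HDFS_UGI=\"root,i\" # HDFS username and password\nexport START_DATE_HR=20200401/00 # Start of the training set: 00:00 on April 1, 2020\nexport END_DATE_HR=20200401/03 # End of the training set: 03:00 on April 1, 2020\nexport DATASET_PATH=\"/train_data\" # Prefix of the training set on HDFS\nexport SPARSE_DIM=\"1000001\" # Sparse parameter dimension; can be left unchanged\n```\n\nExample usage of the script:\n```\nbash elastic-control.sh -r -u 4 -m 20 -t 2 -p 2 -b 5 -s slot.conf -f data.config\nbash elastic-control.sh -a\nbash elastic-control.sh -l\nbash elastic-control.sh -c\n```\n\n## \u003cspan id='head4'\u003e4. Training progress tracking\u003c/span\u003e\nWe provide two ways to track training progress; the command-line method is described below.\n\n1. View from the command line\n\nAt any time during training, you can run the following command to print the status logs of Trainer0 and the file server to standard output:\n```bash\nbash elastic-control.sh -l\n```\n\n## \u003cspan id='head5'\u003e5. 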
Prediction service\u003c/span\u003e\nYou can run the following command to view the file server log:\n```bash\nbash elastic-control.sh -l\n```\nOnce the log shows that a model has been produced, you can run prediction with the following command:\n```bash\nbash elastic-control.sh -c\n```\nThen follow the on-screen prompts to run prediction; the results are printed to standard output.\n![infer_help.png](https://github.com/suoych/WebChat/raw/master/infer_help.png)\n\n\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpaddlepaddle%2Felasticctr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpaddlepaddle%2Felasticctr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpaddlepaddle%2Felasticctr/lists"}