{"id":13488769,"url":"https://github.com/Qihoo360/XLearning-XDML","last_synced_at":"2025-03-28T01:37:47.526Z","repository":{"id":33855937,"uuid":"161740173","full_name":"Qihoo360/XLearning-XDML","owner":"Qihoo360","description":"extremely distributed machine learning","archived":false,"fork":false,"pushed_at":"2022-12-27T14:52:33.000Z","size":262,"stargazers_count":122,"open_issues_count":6,"forks_count":37,"subscribers_count":15,"default_branch":"master","last_synced_at":"2024-08-01T18:40:07.550Z","etag":null,"topics":["ai","distributed","hadoop","hazelcast","kudu","machine-learning","parameter-server","spark"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Qihoo360.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-12-14T06:15:16.000Z","updated_at":"2024-05-31T08:44:03.000Z","dependencies_parsed_at":"2023-01-15T02:59:04.364Z","dependency_job_id":null,"html_url":"https://github.com/Qihoo360/XLearning-XDML","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Qihoo360%2FXLearning-XDML","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Qihoo360%2FXLearning-XDML/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Qihoo360%2FXLearning-XDML/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Qihoo360%2FXLearning-XDML/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Qihoo360","download_url":"https://codeload.github.com/Qihoo360/XLearning-XDML/t
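ar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":222333976,"owners_count":16968058,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","distributed","hadoop","hazelcast","kudu","machine-learning","parameter-server","spark"],"created_at":"2024-07-31T18:01:21.472Z","updated_at":"2025-03-28T01:37:47.507Z","avatar_url":"https://github.com/Qihoo360.png","language":"Scala","funding_links":[],"categories":["Scala","人工智能"],"sub_categories":["机器学习"],"readme":"\u003cbr\u003e\n\u003cdiv\u003e\n  \u003ca href=\"https://github.com/Qihoo360/XLearning-XDML\"\u003e\n    \u003cimg width=\"400\" heigth=\"400\" src=\"./doc/img/logo.jpg\"\u003e\n  \u003c/a\u003e\n\u003c/div\u003e\n  \n[![license](https://img.shields.io/badge/license-Apache2.0-blue.svg?style=flat)](./LICENSE)\n[![Release Version](https://img.shields.io/badge/release-1.0-red.svg)]()\n[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)]()\n\n\n**XDML** is a distributed machine learning platform built on a parameter server with a dedicated caching mechanism.\nXDML incorporates recent research results from academia: while keeping model quality stable, it substantially speeds up convergence and improves the performance of models and algorithms. XDML also integrates several established open-source projects along with components developed in-house at 360, drawing on the strengths of each. It is compatible with the Hadoop ecosystem, which makes big-data frameworks easier to work with and frees developers from tedious work. XDML has been extensively tested and tuned on massive datasets inside 360, and it shows good stability, scalability, and compatibility on machine learning tasks with large data volumes and very high-dimensional features.\n\nAnyone interested in machine learning or distributed systems is welcome to contribute code and to submit Issues or Pull Requests.\n\n## Architecture\n![architecture](./doc/img/xdml.png)\n\nTo address very-large-scale machine learning scenarios, Qihoo 360 open-sourced XDML, its internal computing framework for very-large-scale machine learning. XDML is a distributed machine learning platform built on a parameter server with a dedicated caching mechanism. It has been tested and tuned on massive datasets inside 360, and it shows good stability, scalability, and compatibility on machine learning tasks with large data volumes and very high-dimensional features.\n\n## Features\n#### 1. Provides modules for feature preprocessing and analysis, offline training, and model management\n#### 2. Implements machine learning algorithms commonly used on large-scale data\n#### 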
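3. Builds on existing, mature technologies to keep the whole framework efficient and stable\n#### 4. Fully compatible with the Hadoop ecosystem and integrates seamlessly with existing big-data tools, improving the ability to process massive data\n#### 5. Applies deep engineering optimizations at both the system-architecture and algorithm levels, substantially improving performance without loss of accuracy\n\n\n## Code Structure\n\n### 1.ps  \nThe core parameter server architecture of XDML. It includes the following components: \n \n - [PS](./doc/PS.md)\n - [PSClient](./doc/PSClient.md)\n\n### 2.conf\nThe configuration package of XDML, covering parameter server configuration as well as job and model configuration. It includes the following components:\n\n - [JobConfiguration](./doc/JobConfigure.md)\n - [PSConfiguration](./doc/PSConfiguration.md)\n - ...\n\n### 3.task\nThe jobs that XDML submits to the PS, namely pull and push. It includes the following tasks:\n\n - Task\n - PullTask\n - PushTask\n\n### 4.optimization\nThe package of optimization algorithms for XDML models. It includes the following optimization algorithms:\n\n - [BinaryClassify](./doc/BinaryClassify.md)\n - [FFM](./doc/FFMProcessor.md)\n - ...\n\n### 5.ml\nSome of the machine learning models already implemented in XDML. It includes the following models:\n\n - [LogisticRegression](./doc/LogisticRegression.md)\n - [LogisticRegressionWithDCASGD](./doc/LogisticRegressionWithDCASGD.md)\n - [LogisticRegressionWithFTRL](./doc/LogisticRegressionWithFTRL.md)\n - [LogisticRegressionWithMomentum](./doc/LogisticRegressionWithMomentum.md)\n - [FieldwareFactorizationMachine](./doc/FieldawareFactorizationMachine.md)\n - ...\n\n### 6.feature\nThe feature analysis and feature processing modules of XDML.\n\n- [Feature analysis](./doc/FeatureAnalysis.md)\n\n  \tFeature analysis covers common metrics, such as skewness, kurtosis, and quantiles for numeric features, and label-related metrics such as AUC, NDCG, mutual information, and correlation coefficients.\n\n- [Feature processing](./doc/FeatureProcess.md)\n\t\n\tFeature processing covers common preprocessing methods for numeric and categorical features. It includes the following operators:\n\t- CategoryEncoder\n\t- MultiCategoryEncoder\n\t- NumericBuckter\n\t- NumericStandardizer\n\n### 7.model\nXDML includes linear models trained with the [Scope](https://arxiv.org/pdf/1602.00133.pdf) optimization algorithm proposed by Professor Wu-Jun Li of Nanjing University, as well as Spark pipeline wrappers for some [H2O](https://www.h2o.ai/) models. Specifically, it includes the following models:\n\n[Model:](./doc/Model.md)\n\n - LinearScope\n - MultiLinearScope\n - OVRLinearScope\n - H2ODRF\n - H2OGBM\n - H2OGLM\n - H2OMLP\n\n### 8.example\nExamples of job submission in XDML; see [Example](./doc/Example.md).\n\n## Build \u0026 Deployment Guide\n\nXDML is a parameter-server-based distributed machine learning platform with a dedicated caching mechanism, built on Kudu, HazelCast, and the Hadoop ecosystem.\n\n### Requirements\n- CentOS \u003e= 7\n- JDK \u003e= 1.8\n- Maven \u003e= 3.5.4\n- Scala \u003e= 2.11\n- Hadoop \u003e= 2.7.3\n- Spark \u003e= 2.3.0\n- sparkling-water-core \u003e= 2.3.0\n- Kudu \u003e= 1.9\n- HazelCast \u003e= 3.9.3\n\n### 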
Kudu Installation and Deployment\nXDML is built on Kudu, so deploy Kudu first. For installation and deployment instructions, see [Kudu](https://github.com/apache/kudu/tree/1.7.0).\n\n### Download the Source \n```git clone https://github.com/Qihoo360/XLearning-XDML```\n\n### Build\n```mvn clean package -Dmaven.test.skip=true```\nAfter the build completes, several files are generated in the `target` directory under the source root, including `xdml-1.0.jar` and `xdml-1.0-jar-with-dependencies.jar`. `xdml-1.0.jar` does not bundle third-party dependencies such as Spark and Kudu, while `xdml-1.0-jar-with-dependencies.jar` includes them.\n\n## Running the Example\n\n### Submission Parameters \n* **Algorithm parameters**  \n   * spark.xdml.learningRate: learning rate  \n* **Training parameters**  \n   * spark.xdml.job.type: job type  \n   * spark.xdml.train.data.path: path to the training data  \n   * spark.xdml.train.data.partitionNum: number of partitions for the training data  \n   * spark.xdml.model.path: path where the model is stored  \n   * spark.xdml.train.iter: number of training iterations  \n   * spark.xdml.train.batchsize: batch size of the training data  \n* **PS-related parameters**  \n   * spark.xdml.hz.clusterNum: number of machines in the Hazelcast cluster  \n   * spark.xdml.table.name: name of the Kudu table  \n\n### Submission Command    \nThe example training job can be submitted with the following command:  \n\n```\n  $SPARK_HOME/bin/spark-submit \\\n    --master yarn-cluster \\\n    --class net.qihoo.xitong.xdml.example.LRTest \\\n    --num-executors 50 \\\n    --executor-memory 40g \\\n    --executor-cores 2 \\\n    --driver-memory 4g \\\n    --conf \"spark.xdml.table.name=lrtest\" \\\n    --conf \"spark.xdml.job.type=train\" \\\n    --conf \"spark.xdml.train.data.path=$trainpath\" \\\n    --conf \"spark.xdml.train.data.partitionNum=50\" \\\n    --conf \"spark.xdml.hz.clusterNum=50\" \\\n    --conf \"spark.xdml.model.path=$modelpath\" \\\n    --conf \"spark.xdml.train.iter=5\" \\\n    --conf \"spark.xdml.train.batchsize=10000\" \\\n    --conf \"spark.xdml.learningRate=0.1\" \\\n    --jars xdml-1.0-jar-with-dependencies.jar \\\n    xdml-1.0-jar-with-dependencies.jar\n\n```\n\nNote: in the submission command, `$SPARK_HOME`, `$trainpath`, and `$modelpath` stand for the Spark client path, the HDFS path of the training data, and the HDFS path where the model is stored, respectively.  \n\n## FAQ\n[**XDML FAQ**](./doc/faq_cn.md)\n\n## References\nXDML draws on many excellent results from academia and industry, and we are grateful for them.\n\n- Shen-Yi Zhao, Ru Xiang, Ying-Hao Shi, Peng Gao, Wu-Jun Li, [SCOPE: Scalable Composite Optimization for 
Learning on Spark](https://arxiv.org/pdf/1602.00133.pdf). AAAI 2017: 2928-2934.\n- Shen-Yi Zhao, Gong-Duo Zhang, Ming-Wei Li, Wu-Jun Li. [Proximal SCOPE for Distributed Sparse Learning](https://arxiv.org/pdf/1803.05621.pdf). Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2018.\n- Shuxin Zheng, Qi Meng, Taifeng Wang, Wei Chen, Zhi-Ming Ma and Tie-Yan Liu, [Asynchronous Stochastic Gradient Descent with Delay Compensation](https://arxiv.org/pdf/1609.08326.pdf), ICML 2017.\n\n## Contact Us\n\nMail: \u003cg-xlearning-dev@360.cn\u003e     \nQQ group: 874050710  \n![qq](./doc/img/qq.jpg)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FQihoo360%2FXLearning-XDML","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FQihoo360%2FXLearning-XDML","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FQihoo360%2FXLearning-XDML/lists"}