{"id":33158630,"url":"https://github.com/Angel-ML/automl","last_synced_at":"2025-11-20T14:02:47.410Z","repository":{"id":99778612,"uuid":"199402434","full_name":"Angel-ML/automl","owner":"Angel-ML","description":"An automatic machine learning toolkit, including hyper-parameter tuning and feature engineering.","archived":false,"fork":false,"pushed_at":"2019-10-29T14:33:47.000Z","size":1782,"stargazers_count":58,"open_issues_count":2,"forks_count":21,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-30T16:02:44.251Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Angel-ML.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-07-29T07:34:58.000Z","updated_at":"2024-05-23T12:22:41.000Z","dependencies_parsed_at":"2023-03-13T15:45:50.253Z","dependency_job_id":null,"html_url":"https://github.com/Angel-ML/automl","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/Angel-ML/automl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Angel-ML%2Fautoml","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Angel-ML%2Fautoml/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Angel-ML%2Fautoml/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Angel-ML%2Fautoml/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Angel-ML","download_url":"https://codeload.github.com/Angel-ML/automl/tar
.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Angel-ML%2Fautoml/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":285447937,"owners_count":27173436,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-20T02:00:05.334Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-11-15T21:00:27.012Z","updated_at":"2025-11-20T14:02:47.405Z","avatar_url":"https://github.com/Angel-ML.png","language":"Scala","funding_links":[],"categories":["Libraries","人工智能"],"sub_categories":["机器学习"],"readme":"# AutoML\n\nAngel's automatic machine learning toolkit.\n\nAngel-AutoML provides automatic hyper-parameter tuning and feature engineering operators.\nIt is developed with Scala. \nAs a stand-alone library, Angel-AutoML can be easily integrated in Java and Scala projects.\n\nWe welcome everyone interested in machine learning to contribute code, create issues or pull requests. 
Please refer to the [Angel Contribution Guide](https://github.com/Tencent/angel/blob/master/CONTRIBUTING.md) for more details.\n\n## Hyper-parameter tuning\n\n### Strategies\nAngel-AutoML has three tuning strategies: grid search, random search, and Bayesian optimization.\n\n![Grid search and random search](docs/img/grid_vs_random.png)\n\n![Bayesian optimization](docs/img/bo.png)\n\n- **Grid search** divides the search space into an even grid, under the fundamental assumption that the hyper-parameters are uniformly distributed. \nThough intuitive, grid search has two significant drawbacks: 1) the computing cost increases exponentially with the number of hyper-parameters; \nand 2) the distributions of hyper-parameters are usually not uniform in real cases. \nThus, grid search might spend considerable effort on optimizing less important hyper-parameters in many cases.\n- **Random search** randomly samples a sequence of hyper-parameter combinations from the configuration space, \nand evaluates the sampled combinations. \nThough this approach is more likely to pay attention to the more important hyper-parameters, \nthere is still no guarantee of finding the optimal combination.\n- **Bayesian optimization (BO)** differs from the traditional model-free methods.\nIt treats the tuning problem as a black-box function, where the input is the hyper-parameter combination, \nand the output is a model metric such as accuracy or AUC.\nBO uses a cheap surrogate function to approximate the unknown target function. \nThe surrogate function generates the probabilistic mean and variance of a given hyper-parameter combination. 
\nThen, an acquisition function evaluates the expected improvement of the generated combination.\nThe hyper-parameter combination with the highest expected improvement is chosen for the next evaluation.\nThis suggest-evaluate-feedback process iterates until convergence.\nThis probabilistic approach enables Bayesian optimization to find the optimum with far fewer evaluations of the target function. \n\nFor BO, Angel-AutoML implements a series of surrogate functions and acquisition functions.\n-\t**Surrogate function**: Gaussian process and random forest. \nWe also implement EM+LBFGS to optimize the hyper-parameters of the Gaussian process kernel functions.  \n-\t**Acquisition function**: Probability of Improvement (PI), Expected Improvement (EI) and Upper Confidence Bound (UCB).\n\n### Usage\n\nThe tuning component of Angel-AutoML provides easy-to-use interfaces.\nUsers can integrate it into their programs with fewer than 10 lines of code.\n\n- **Define hyper-parameter space.**\nSupported formats for discrete hyper-parameters: {v1,v2,v3,v4} or {start: end: step}.\n```scala\nval param1 = ParamSpace.fromConfigString(\"param1\", \"{1.0,2.0,3.0,4.0,5.0}\")\nval param2 = ParamSpace.fromConfigString(\"param2\", \"{1:10:1}\")\n```\nSupported formats for continuous hyper-parameters: [start,end] or [start: end: num_of_elements]\n```scala\nval param1 = ParamSpace.fromConfigString(\"param1\", \"[1,10]\")\nval param2 = ParamSpace.fromConfigString(\"param2\", \"[1:10:10]\")\n```\n- **Create solver of hyper-parameter tuning.**\nThe first parameter is the array of hyper-parameters defined above.\nThe second parameter indicates whether the goal is to minimize the metric.\nThe third parameter selects the surrogate (Random, Grid, or GaussianProcess).\n```scala\nval solver: Solver = Solver(Array(param1, param2), true, surrogate = \"Random\")\n```\n- **Solver suggests a batch of hyper-parameter combinations.**\nThe default batch size is 100. 
You can change this value via TunerParam.setBatchSize().\n```scala\nval configs: Array[Configuration] = solver.suggest()\n```\n- **User evaluates the objective function with the suggested hyper-parameter combinations.**\n```scala\nval results: Array[Double] = objective.evaluate(configs)\n```\n- **User feeds the results to the solver.**\n```scala\nsolver.feed(configs, results)\n```\n- Return to step 3 (suggest) and iterate until convergence.\n\n## Feature engineering\n\nFeature engineering, such as feature selection and feature synthesis, is highly important in industry-level applications of machine learning.\nAngel-AutoML implements useful feature engineering operators with Spark MLlib.\nThey can be easily assembled into a Spark pipeline.\n\n### Feature selection\n\nSince the feature selection operators in Spark MLlib are not sufficient,\nwe enhance Spark by adding two categories of operators.\n- Statistic-based operators, including VarianceSelector and FtestSelector.\n- Model-based operators, including LassoSelector and RandomForestSelector.\n\n### Feature synthesis\n\nA majority of online recommendation systems choose linear models, such as Logistic Regression, \nas their machine learning model for their high throughput and low latency.\nBut Logistic Regression requires manual feature synthesis to achieve high accuracy,\nwhich makes automatic feature synthesis essential. 
\nHowever, existing automatic feature synthesis methods simply generate high-order cross features by Cartesian product, \nincurring the curse of dimensionality.\nTherefore, we propose Auto Feature Synthesis (AFS), an iterative approach to generate high-order features.\n\n![Automatic feature synthesis](docs/img/feature_synthesis.png)\n\nIn AFS, each iteration is composed of two stages:\n- Amplification stage: Cartesian product of arbitrary features.\n- Reduction stage: feature selection and feature re-indexing.\n\nThe above figure is an example of an AFS iteration:\n-\tThe features are first amplified through a **cartesian product operator**. \nThe number of features increases quadratically after this step.\n-\tNext, the most important features are selected from the previous step by a **feature selector operator** (e.g. VarianceSelector or RandomForestSelector).\n-\tThen, the selected features are re-indexed to reduce the feature space by a **feature re-index operator**.\n-\tFinally, the generated features and the original features are concatenated by a **vector assembler operator**.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAngel-ML%2Fautoml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FAngel-ML%2Fautoml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAngel-ML%2Fautoml/lists"}