{"id":13400622,"url":"https://github.com/rstudio/sparkxgb","last_synced_at":"2025-04-25T19:31:27.591Z","repository":{"id":33926436,"uuid":"158515685","full_name":"rstudio/sparkxgb","owner":"rstudio","description":"R interface for XGBoost on Spark","archived":false,"fork":false,"pushed_at":"2024-05-01T17:36:04.000Z","size":188,"stargazers_count":46,"open_issues_count":16,"forks_count":14,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-07-31T19:25:34.834Z","etag":null,"topics":["apache-spark","machine-learning","r","rstats","spark","xgboost"],"latest_commit_sha":null,"homepage":"https://spark.posit.co/packages/sparkxgb/","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rstudio.png","metadata":{"files":{"readme":"README.Rmd","changelog":"NEWS.md","contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-11-21T08:31:04.000Z","updated_at":"2024-05-06T15:14:01.000Z","dependencies_parsed_at":"2024-04-22T19:38:08.315Z","dependency_job_id":"106797a1-c30f-4c1f-b33c-b7bbcbe1cd0e","html_url":"https://github.com/rstudio/sparkxgb","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rstudio%2Fsparkxgb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rstudio%2Fsparkxgb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rstudio%2Fsparkxgb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rstudio%2Fsparkxgb/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/owners/rstudio","download_url":"https://codeload.github.com/rstudio/sparkxgb/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250882606,"owners_count":21502337,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","machine-learning","r","rstats","spark","xgboost"],"created_at":"2024-07-30T19:00:54.007Z","updated_at":"2025-04-25T19:31:27.288Z","avatar_url":"https://github.com/rstudio.png","language":"R","funding_links":[],"categories":["R","Sparklyr Analysis Tools"],"sub_categories":["Tree Model"],"readme":"---\noutput: github_document\n---\n\n\u003c!-- README.md is generated from README.Rmd. 
Please edit that file --\u003e\n\n```{r setup, include = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"man/figures/README-\",\n  out.width = \"100%\"\n)\n```\n# sparkxgb\n\n\u003c!-- badges: start --\u003e\n[![R-CMD-check](https://github.com/rstudio/sparkxgb/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/rstudio/sparkxgb/actions/workflows/R-CMD-check.yaml)\n[![Spark Tests](https://github.com/rstudio/sparkxgb/actions/workflows/Tests.yaml/badge.svg)](https://github.com/rstudio/sparkxgb/actions/workflows/Tests.yaml)\n[![Codecov test coverage](https://codecov.io/gh/rstudio/sparkxgb/branch/main/graph/badge.svg)](https://app.codecov.io/gh/rstudio/sparkxgb?branch=main)\n[![CRAN status](https://www.r-pkg.org/badges/version/sparkxgb)](https://CRAN.R-project.org/package=sparkxgb)\n\u003c!-- badges: end --\u003e\n\n## Overview\n\n**sparkxgb** is a [sparklyr](https://spark.posit.co/) extension that provides\nan interface to [XGBoost](https://github.com/dmlc/xgboost) on Spark.\n\n## Installation\n\n```r\ninstall.packages(\"sparkxgb\")\n```\n\n### Development version\n\nYou can install the development version of `sparkxgb` with:\n\n``` r\n# install.packages(\"pak\")\npak::pak(\"rstudio/sparkxgb\")\n```\n\n## Example\n\n**sparkxgb** supports the familiar formula interface for specifying models:\n\n```{r, message = FALSE}\nlibrary(sparkxgb)\nlibrary(sparklyr)\nlibrary(dplyr)\n\nsc \u003c- spark_connect(master = \"local\")\niris_tbl \u003c- sdf_copy_to(sc, iris)\n\nxgb_model \u003c- xgboost_classifier(\n  iris_tbl,\n  Species ~ .,\n  num_class = 3,\n  num_round = 50,\n  max_depth = 4\n)\n\nxgb_model %\u003e%\n  ml_predict(iris_tbl) %\u003e%\n  select(Species, predicted_label, starts_with(\"probability_\")) %\u003e%\n  glimpse()\n```\n\nIt also provides a Pipelines API, which means you can use an `xgboost_classifier`\nor `xgboost_regressor` in a pipeline like any other `Estimator`, and do things like\nhyperparameter 
tuning:\n\n```{r}\npipeline \u003c- ml_pipeline(sc) %\u003e%\n  ft_r_formula(Species ~ .) %\u003e%\n  xgboost_classifier(num_class = 3)\n\nparam_grid \u003c- list(\n  xgboost = list(\n    max_depth = c(1, 5),\n    num_round = c(10, 50)\n  )\n)\n\ncv \u003c- ml_cross_validator(\n  sc,\n  estimator = pipeline,\n  evaluator = ml_multiclass_classification_evaluator(\n    sc,\n    label_col = \"label\",\n    raw_prediction_col = \"rawPrediction\"\n  ),\n  estimator_param_maps = param_grid\n)\n\ncv_model \u003c- cv %\u003e%\n  ml_fit(iris_tbl)\n\nsummary(cv_model)\n\nspark_disconnect(sc)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frstudio%2Fsparkxgb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frstudio%2Fsparkxgb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frstudio%2Fsparkxgb/lists"}