{"id":18612378,"url":"https://github.com/winvector/bigdatarstrata2017","last_synced_at":"2026-03-02T05:01:35.386Z","repository":{"id":142726343,"uuid":"82865893","full_name":"WinVector/BigDataRStrata2017","owner":"WinVector","description":"All material for \"Modeling big data with R, sparklyr, and Apache Spark\" Strata Hadoop 2017.","archived":false,"fork":false,"pushed_at":"2020-06-22T21:11:55.000Z","size":31532,"stargazers_count":63,"open_issues_count":0,"forks_count":40,"subscribers_count":12,"default_branch":"main","last_synced_at":"2025-06-16T10:54:53.419Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/55791","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/WinVector.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2017-02-23T00:18:49.000Z","updated_at":"2024-07-18T16:29:56.000Z","dependencies_parsed_at":"2023-04-09T13:50:49.239Z","dependency_job_id":null,"html_url":"https://github.com/WinVector/BigDataRStrata2017","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/WinVector/BigDataRStrata2017","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WinVector%2FBigDataRStrata2017","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WinVector%2FBigDataRStrata2017/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WinVector%2FBigDataRStrata2017/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WinV
ector%2FBigDataRStrata2017/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/WinVector","download_url":"https://codeload.github.com/WinVector/BigDataRStrata2017/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WinVector%2FBigDataRStrata2017/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29993026,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-02T01:47:34.672Z","status":"online","status_checked_at":"2026-03-02T02:00:07.342Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T03:16:51.219Z","updated_at":"2026-03-02T05:01:35.373Z","avatar_url":"https://github.com/WinVector.png","language":"HTML","readme":"---\noutput:\n  md_document:\n    variant: markdown_github\n---\n\n\u003c!-- README.md is generated from README.Rmd. 
Please edit that file --\u003e\n\nMaterials for:\n\n\n#### [Modeling big data with R, sparklyr, and Apache Spark](https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/55791)\n\n    1:30pm–5:00pm Tuesday, March 14, 2017\n    Data science \u0026 advanced analytics\n    Location: LL21 C/D\n    Level: Intermediate\n    Secondary topics:  R\n\n    John Mount  (Win-Vector LLC)\n    \n\nWe have a short video showing how to install [Spark](http://spark.apache.org) using [R](https://cran.r-project.org) and [RStudio](https://www.rstudio.com) [here](https://youtu.be/qnINvPqcRvE).\n\nAlso please click through for slides from Edgar Ruiz's excellent [Strata Sparklyr presentation](https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/55800) and [cheat-sheet](http://spark.rstudio.com/images/sparklyr-cheatsheet.pdf).\n    \n\nDescription from Strata announcement\n-------------\n  \n  \n  Modeling big data with R, sparklyr, and Apache Spark\n  \n[John Mount](http://www.win-vector.com/site/staff/john-mount/) ([Win Vector LLC](http://www.win-vector.com/))\n1:30pm–5:00pm Tuesday, March 14, 2017\nData science \u0026 advanced analytics\nLocation: LL21 C/D\nLevel: Intermediate\nSecondary topics:  R\n\n#### Who is this presentation for?\n\nData scientists, data analysts, modelers, R users, Spark users, statisticians, and those in IT\n\n\n#### Prerequisite knowledge\n\n\n##### Basic familiarity with R\n\nExperience using the [dplyr](https://CRAN.R-project.org/package=dplyr) R package (If you have not used dplyr before, please read this chapter before coming to class.)\nMaterials or downloads needed in advance.\n\nA WiFi-enabled laptop (You'll be provided an [RStudio Server Pro](https://www.rstudio.com/products/rstudio-server-pro/) login for students to use on the day of the workshop.)\n\n##### What you'll learn\n\nLearn how to quickly set up a local Spark instance, store big data in Spark and then connect to the data with R, use R to apply 
machine-learning algorithms to big data stored in Spark, and filter and aggregate big data stored in Spark and then import the results into R for analysis and visualization\nUnderstand how to extend R and use [sparklyr](http://spark.rstudio.com) to access the entire Spark API\n\n### Description\n\nSparklyr, developed by RStudio in conjunction with IBM, Cloudera, and [H2O](http://www.h2o.ai), provides an R interface to Spark’s distributed machine-learning algorithms and much more. Sparklyr makes practical machine learning scalable and easy. With sparklyr, you can interactively manipulate Spark data using both dplyr and SQL (via DBI); filter and aggregate Spark datasets then bring them into R for analysis and visualization; orchestrate distributed machine learning from R using either Spark MLlib or H2O Sparkling Water; create extensions that call the full Spark API and provide interfaces to Spark packages; and establish Spark connections and browse Spark data frames within the RStudio IDE.\n\nJohn Mount demonstrates how to use sparklyr to analyze big data in Spark, covering filtering and manipulating Spark data to import into R and using R to run machine-learning algorithms on data in Spark. 
John also explores the sparklyr integration built into the RStudio IDE.\n\n\nDerived from [R for big data](https://conferences.oreilly.com/strata/strata-ny-2016/public/schedule/detail/52369) (GitHub: [https://github.com/rstudio/Strata2016](https://github.com/rstudio/Strata2016)).\n\nPublic repository is: [https://github.com/WinVector/BigDataRStrata2017](https://github.com/WinVector/BigDataRStrata2017).\n\n\n###### config\n\n\n\n \nCurrent list of CRAN packages used:\n\n```{r cranpackges, eval=FALSE}\n# often a good idea, though try \"n\" to build source\n# may interfere with us pinning h2o to a specific version\n# update.packages(ask=FALSE) \ncranpkgs \u003c- c(\n 'babynames',\n 'caret',\n 'DBI',\n 'devtools',\n 'dplyr',\n 'dygraphs',\n 'e1071',\n 'formatR',\n 'ggplot2',\n  # 'h2o', # installed a bit later\n 'lubridate',\n 'nycflights13',\n 'plotly',\n 'rbokeh',\n 'rsparkling',\n 'RSQLite',\n 'sparklyr',\n 'tidyr',\n 'tidyverse',\n 'titanic',\n 'xtable'\n )\ninstall.packages(cranpkgs)\n```\n\n```{r githubddevpkgs, eval=FALSE}\ndevpkgs \u003c- c(\n  'RStudio/EDAWR',\n  'WinVector/replyr',\n  'WinVector/WVPlots' )\n\nfor(pkgi in devpkgs) {\n  devtools::install_github(pkgi)\n}\n```\n\n\nAlso it is critical to look at [Exercises/solutions/RsparklingExample.Rmd](Exercises/solutions/RsparklingExample.Rmd) as it installs and configures some packages.  A refresh of all packages will break the matching version numbers required by `h2o` and `rsparkling`.  
So please work through the details in `RsparklingExample.Rmd` after updating and installing all the above packages.\n\nA copy of those notes is below (but it is better to look at `RsparklingExample.Rmd`).\n\n\n```{r installh2o, eval=FALSE}\n# updated from https://gist.github.com/edgararuiz/6453d44a91c85a87998cfeb0dfed9fa9\n# The following two commands remove any previously installed H2O packages for R.\nif (\"package:h2o\" %in% search()) { detach(\"package:h2o\", unload=TRUE) }\nif (\"h2o\" %in% rownames(installed.packages())) { remove.packages(\"h2o\") }\n\n# Next, we download packages that H2O depends on.\npkgs \u003c- c(\"methods\", \"statmod\", \"stats\",\n          \"graphics\", \"RCurl\", \"jsonlite\",\n          \"tools\", \"utils\")\nfor (pkg in pkgs) {\n  if (! (pkg %in% rownames(installed.packages()))) {\n     install.packages(pkg)\n  }\n}\n\n# Now we download, install and initialize the H2O package for R.\ninstall.packages(\"h2o\", type = \"source\", repos = \"http://h2o-release.s3.amazonaws.com/h2o/rel-turnbull/2/R\")\n\n# Installing 'rsparkling' from CRAN\ninstall.packages(\"rsparkling\")\noptions(rsparkling.sparklingwater.version = \"2.0.3\")\n# Reinstalling 'sparklyr' \ninstall.packages(\"sparklyr\")\nsparklyr::spark_install(version = \"2.0.0\")\n```\n\nNote: using `dplyr::compute()` (or `sparklyr::sdf_checkpoint()`) with `sparklyr` can have issues (see [sparklyr issue 721](https://github.com/rstudio/sparklyr/issues/721)).\n\n  \n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwinvector%2Fbigdatarstrata2017","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwinvector%2Fbigdatarstrata2017","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwinvector%2Fbigdatarstrata2017/lists"}