{"id":23821368,"url":"https://github.com/returnstring/cloudburst","last_synced_at":"2025-09-05T19:47:16.652Z","repository":{"id":147346254,"uuid":"196853440","full_name":"returnString/cloudburst","owner":"returnString","description":"Cloud-based distributed computing for R","archived":false,"fork":false,"pushed_at":"2019-08-08T21:27:43.000Z","size":33,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-02T08:40:13.217Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/returnString.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-07-14T15:33:46.000Z","updated_at":"2021-10-19T11:05:23.000Z","dependencies_parsed_at":"2023-07-07T06:31:12.347Z","dependency_job_id":null,"html_url":"https://github.com/returnString/cloudburst","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/returnString%2Fcloudburst","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/returnString%2Fcloudburst/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/returnString%2Fcloudburst/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/returnString%2Fcloudburst/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/returnString","download_url":"https://codeload.github.com/returnString/clou
dburst/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240102442,"owners_count":19748013,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-02T08:39:13.780Z","updated_at":"2025-02-21T23:42:23.070Z","avatar_url":"https://github.com/returnString.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"# cloudburst\nCloudburst brings cloud compute resources to your R project, allowing data scientists and engineers to build complex data processing pipelines with whatever resources are required.\n\nThere's no requirement to pre-provision clusters of machines, or configure auto-scaling to keep costs down for variable workloads; compute resources are spun up exactly as necessary, and stopped once their work is done.\n\nYou can split your work up into stages that look just like functions, flowing data through your process, and Cloudburst will automatically wire them up into a DAG for optimal parallel execution where possible.\n\nCurrently, Amazon Web Services (AWS) is the only supported provider, leveraging Fargate/ECS for compute and S3 for transient storage.\n\n## Demo\n```r\nlibrary(magrittr)\n\n# you need to initialise a provider; we'll use AWS for this example\ncloudburst::init_aws(\n   # s3 is the default storage backend for AWS; we need this to marshal results between stages\n  storage_bucket = \"my-s3-bucket\",\n   # let's indicate which cluster we're running in and which subnets to use for ECS\n  compute_cluster = \"data\",\n  compute_subnets = 
c(\"subnet-abcdef\", \"subnet-ghijkl\"),\n  compute_assign_public_ip = TRUE,\n  # you need a Docker image, see the \"Managing Dependencies\" section below\n  compute_image = \"12345.dkr.ecr.us-east-1.amazonaws.com/my-cloudburst-image:latest\",\n  # the execution role can just be the default ECS execution role for your account\n  compute_execution_role = \"arn:aws:iam::12345:role/ecsTaskExecutionRole\",\n  # the task role gives your R code access to any AWS services it might need, like S3\n  compute_task_role = \"arn:aws:iam::12345:role/my-cloudburst-role\"\n)\n\n# variables are transparently made available to stages as needed\nnum_observations \u003c- 1000\n\n# let's pretend we've got two stages that build large datasets somehow\nget_data_x \u003c- cloudburst::stage(cpu = 1024, memory = 2048, function() {\n  data.frame(x = runif(num_observations))\n})\n\nget_data_y \u003c- cloudburst::stage(cpu = 1024, memory = 2048, function() {\n  data.frame(y = rnorm(num_observations))\n})\n\n# and a third stage that does some \"intensive\" computation over the two\nbuild_model \u003c- cloudburst::stage(cpu = 2048, memory = 4096, function(data_x, data_y) {\n  data \u003c- cbind(data_x, data_y)\n  lm(y ~ x, data)\n})\n\n# stages are called just like regular functions\n# we just have to call 'execute' at the end to bring the result back to R\nbuild_model(get_data_x(), get_data_y()) %\u003e%\n  cloudburst::execute(\"super-complex-pipeline\") -\u003e result\n```\n\nThis would spin up two tasks to run `get_data_x` and `get_data_y` in parallel, each with 1 vCPU and 2GB of RAM, and then a third task on completion of both those stages to build the linear model, with 2 vCPUs and 4GB of RAM.\n\nOn completion, if we were to inspect `result`, we'd see a standard linear model, just as we'd expect from running `lm` in a normal R process.\n\n## Managing Dependencies\nMost projects aren't just base R; they require packages installed from CRAN or elsewhere, so we need to make sure that those same 
packages are available, no matter where your R code is being executed. To do this, we can use [Packrat](https://rstudio.github.io/packrat) to track our dependencies, and [Docker](https://www.docker.com) to bundle up an R environment with all the same packages as you're using locally.\n\nWe can tie this all together and automate it using the [containr](https://github.com/hypothesci/containr) package.\n\nYou can use `containr::docker_deploy` to automatically create a Docker image based on your installed version of R with all your Packrat dependencies, and push it to a Docker repository of your choosing. For example, with the example above in which we used AWS ECR to store our container, we could just run `containr::docker_deploy(\"12345.dkr.ecr.us-east-1.amazonaws.com/my-cloudburst-image:latest\")` from inside our project to bundle up all our required packages alongside the correct version of R.\n\n## Installation\nThis package and some of its dependencies are not yet available on CRAN and so must be installed directly from GitHub.\n\n```r\nlapply(c(\"aws.ecs\", \"containr\", \"cloudburst\"), function(p) remotes::install_github(paste0(\"hypothesci/\", p)))\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Freturnstring%2Fcloudburst","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Freturnstring%2Fcloudburst","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Freturnstring%2Fcloudburst/lists"}