{"id":37026841,"url":"https://github.com/spotify/spydra","last_synced_at":"2026-01-14T03:10:16.997Z","repository":{"id":20491108,"uuid":"89217740","full_name":"spotify/spydra","owner":"spotify","description":"Ephemeral Hadoop clusters using Google Compute Platform","archived":true,"fork":false,"pushed_at":"2022-03-31T11:16:50.000Z","size":576,"stargazers_count":136,"open_issues_count":12,"forks_count":34,"subscribers_count":21,"default_branch":"master","last_synced_at":"2025-07-06T11:49:44.992Z","etag":null,"topics":["dataproc","google-cloud","hadoop"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/spotify.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-04-24T08:48:42.000Z","updated_at":"2025-06-17T03:07:31.000Z","dependencies_parsed_at":"2022-07-26T06:47:07.007Z","dependency_job_id":null,"html_url":"https://github.com/spotify/spydra","commit_stats":null,"previous_names":[],"tags_count":26,"template":false,"template_full_name":null,"purl":"pkg:github/spotify/spydra","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spotify%2Fspydra","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spotify%2Fspydra/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spotify%2Fspydra/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spotify%2Fspydra/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/spotify","download_url":"https://codeload.github.com/spotify/spydra/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/spotify%2Fspydra/sbom","scorecard":{"id":842232,"data":{"date":"2025-08-11","repo":{"name":"github.com/spotify/spydra","commit":"3be506ce0be6a3e72b5de5c3161b724ea5af560a"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":2.5,"checks":[{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Code-Review","score":3,"reason":"Found 6/18 approved changesets -- score normalized to 3","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Maintained","score":0,"reason":"project is archived","details":["Warn: Repository is archived."],"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and 
uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Pinned-Dependencies","score":-1,"reason":"internal error: internal error: invalid Dockerfile: unterminated heredoc","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: Apache License 2.0: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Branch-Protection","score":-1,"reason":"internal error: error during branchesHandler.setup: internal error: 
githubv4.Query: Resource not accessible by integration","details":null,"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 23 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Vulnerabilities","score":0,"reason":"137 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: GHSA-h46c-h94j-95f3","Warn: Project is vulnerable to: GHSA-wf8f-6423-gfxg","Warn: Project is vulnerable to: GHSA-288c-cq4h-88gq","Warn: Project is vulnerable to: GHSA-4gq5-ch57-c2mg","Warn: Project is vulnerable to: GHSA-4w82-r329-3q67","Warn: Project is vulnerable to: GHSA-57j2-w4cx-62h2","Warn: Project is vulnerable to: GHSA-5949-rw7g-wx7w","Warn: Project is vulnerable to: GHSA-5r5r-6hpj-8gg9","Warn: Project is vulnerable to: 
GHSA-5ww9-j83m-q7qx","Warn: Project is vulnerable to: GHSA-645p-88qh-w398","Warn: Project is vulnerable to: GHSA-6fpp-rgj9-8rwc","Warn: Project is vulnerable to: GHSA-6wqp-v4v6-c87c","Warn: Project is vulnerable to: GHSA-85cw-hj65-qqv9","Warn: Project is vulnerable to: GHSA-89qr-369f-5m5x","Warn: Project is vulnerable to: GHSA-8c4j-34r4-xr8g","Warn: Project is vulnerable to: GHSA-8w26-6f25-cm9x","Warn: Project is vulnerable to: GHSA-9gph-22xh-8x98","Warn: Project is vulnerable to: GHSA-9m6f-7xcq-8vf8","Warn: Project is vulnerable to: GHSA-9mxf-g3x6-wv74","Warn: Project is vulnerable to: GHSA-c8hm-7hpq-7jhg","Warn: Project is vulnerable to: GHSA-cf6r-3wgc-h863","Warn: Project is vulnerable to: GHSA-cggj-fvv3-cqwv","Warn: Project is vulnerable to: GHSA-cjjf-94ff-43w7","Warn: Project is vulnerable to: GHSA-cmfg-87vq-g5g4","Warn: Project is vulnerable to: GHSA-cvm9-fjm9-3572","Warn: Project is vulnerable to: GHSA-f3j5-rmmp-3fc5","Warn: Project is vulnerable to: GHSA-f9hv-mg5h-xcw9","Warn: Project is vulnerable to: GHSA-f9xh-2qgp-cq57","Warn: Project is vulnerable to: GHSA-fmmc-742q-jg75","Warn: Project is vulnerable to: GHSA-fqwf-pjwf-7vqv","Warn: Project is vulnerable to: GHSA-gjmw-vf9h-g25v","Warn: Project is vulnerable to: GHSA-gwp4-hfv6-p7hw","Warn: Project is vulnerable to: GHSA-gww7-p5w4-wrfv","Warn: Project is vulnerable to: GHSA-h3cw-g4mq-c5x2","Warn: Project is vulnerable to: GHSA-h592-38cm-4ggp","Warn: Project is vulnerable to: GHSA-h822-r4r5-v8jg","Warn: Project is vulnerable to: GHSA-jjjh-jjxp-wpff","Warn: Project is vulnerable to: GHSA-m6x4-97wx-4q27","Warn: Project is vulnerable to: GHSA-mph4-vhrx-mv67","Warn: Project is vulnerable to: GHSA-mx7p-6679-8g3q","Warn: Project is vulnerable to: GHSA-mx9v-gmh4-mgqw","Warn: Project is vulnerable to: GHSA-p43x-xfjf-5jhr","Warn: Project is vulnerable to: GHSA-q93h-jc49-78gg","Warn: Project is vulnerable to: GHSA-qjw2-hr98-qgfh","Warn: Project is vulnerable to: GHSA-qr7j-h6gg-jmgc","Warn: Project is vulnerable to: 
GHSA-qxxx-2pp7-5hmx","Warn: Project is vulnerable to: GHSA-r3gr-cxrf-hg25","Warn: Project is vulnerable to: GHSA-r695-7vr9-jgc2","Warn: Project is vulnerable to: GHSA-rfx6-vp9g-rh7v","Warn: Project is vulnerable to: GHSA-rgv9-q543-rqg4","Warn: Project is vulnerable to: GHSA-rpr3-cw39-3pxh","Warn: Project is vulnerable to: GHSA-v585-23hc-c647","Warn: Project is vulnerable to: GHSA-vfqx-33qm-g869","Warn: Project is vulnerable to: GHSA-w3f4-3q6j-rh82","Warn: Project is vulnerable to: GHSA-wh8g-3j2c-rqj5","Warn: Project is vulnerable to: GHSA-x2w5-5m2g-7h5m","Warn: Project is vulnerable to: GHSA-h4x4-5qp2-wp46","Warn: Project is vulnerable to: GHSA-4jrv-ppp4-jm57","Warn: Project is vulnerable to: GHSA-5mg8-w23w-74h3","Warn: Project is vulnerable to: GHSA-7g45-4rm6-3mm3","Warn: Project is vulnerable to: GHSA-mvr2-9pj6-7w5j","Warn: Project is vulnerable to: GHSA-f263-c949-w85g","Warn: Project is vulnerable to: GHSA-hw42-3568-wj87","Warn: Project is vulnerable to: GHSA-4gg5-vx3j-xwc7","Warn: Project is vulnerable to: GHSA-735f-pc8j-v9w8","Warn: Project is vulnerable to: GHSA-77rm-9x9h-xj3g","Warn: Project is vulnerable to: GHSA-g5ww-5jh7-63cx","Warn: Project is vulnerable to: GHSA-h4h5-3hr4-j3g2","Warn: Project is vulnerable to: GHSA-wrvw-hg22-4m67","Warn: Project is vulnerable to: GHSA-3vrc-rrpw-r5pw","Warn: Project is vulnerable to: GHSA-pfh2-hfmq-phg5","Warn: Project is vulnerable to: GHSA-q446-82vq-w674","Warn: Project is vulnerable to: GHSA-5m48-vr54-vmh3","Warn: Project is vulnerable to: GHSA-6phf-73q6-gh87","Warn: Project is vulnerable to: GHSA-wxr5-93ph-8wr9","Warn: Project is vulnerable to: GHSA-6hgm-866r-3cjv","Warn: Project is vulnerable to: GHSA-fjq5-5j5f-mvxh","Warn: Project is vulnerable to: GHSA-pvp8-3xj6-8c6x","Warn: Project is vulnerable to: GHSA-3832-9276-x7gf","Warn: Project is vulnerable to: GHSA-78wr-2p64-hpwj","Warn: Project is vulnerable to: GHSA-gwrp-pvrq-jmwv","Warn: Project is vulnerable to: GHSA-j288-q9x7-2f5v","Warn: Project is vulnerable to: 
GHSA-cgp8-4m63-fhh5","Warn: Project is vulnerable to: GHSA-5mcr-gq6c-3hq2","Warn: Project is vulnerable to: GHSA-7vpq-g998-qpv7","Warn: Project is vulnerable to: GHSA-9vjp-v76f-g363","Warn: Project is vulnerable to: GHSA-cqqj-4p63-rrmm","Warn: Project is vulnerable to: GHSA-f256-j965-7f32","Warn: Project is vulnerable to: GHSA-grg4-wf29-r9vv","Warn: Project is vulnerable to: GHSA-p2v9-g2qv-p635","Warn: Project is vulnerable to: GHSA-wm47-8v5p-wjpj","Warn: Project is vulnerable to: GHSA-wx5j-54mm-rqqq","Warn: Project is vulnerable to: GHSA-xfv3-rrfm-f2rv","Warn: Project is vulnerable to: GHSA-2qrg-x229-3v8q","Warn: Project is vulnerable to: GHSA-65fg-84f6-3jq3","Warn: Project is vulnerable to: GHSA-f7vh-qwp3-x37m","Warn: Project is vulnerable to: GHSA-fp5r-v3w9-4333","Warn: Project is vulnerable to: GHSA-w9p3-5cr8-m3jj","Warn: Project is vulnerable to: GHSA-493p-pfq6-5258","Warn: Project is vulnerable to: GHSA-v528-7hrm-frqp","Warn: Project is vulnerable to: GHSA-r7pg-v2c8-mfg3","Warn: Project is vulnerable to: GHSA-rhrv-645h-fjfh","Warn: Project is vulnerable to: GHSA-4g9r-vxhx-9pgx","Warn: Project is vulnerable to: GHSA-7hfm-57qf-j43q","Warn: Project is vulnerable to: GHSA-crv7-7245-f45f","Warn: Project is vulnerable to: GHSA-mc84-pj99-q6hh","Warn: Project is vulnerable to: GHSA-xqfj-vm6h-2x34","Warn: Project is vulnerable to: GHSA-7q56-mp4c-gggg","Warn: Project is vulnerable to: GHSA-8r28-r8cp-g6cp","Warn: Project is vulnerable to: GHSA-8wm5-8h9c-47pc","Warn: Project is vulnerable to: GHSA-f5fw-25gw-5m92","Warn: Project is vulnerable to: GHSA-f8vc-wfc8-hxqh","Warn: Project is vulnerable to: GHSA-g48f-ff5h-5f64","Warn: Project is vulnerable to: GHSA-gx2c-fvhc-ph4j","Warn: Project is vulnerable to: GHSA-h24p-qwf4-84q8","Warn: Project is vulnerable to: GHSA-mf7c-35mq-75pj","Warn: Project is vulnerable to: GHSA-rmpj-7c96-mrg8","Warn: Project is vulnerable to: GHSA-7r82-7xv7-xcpj","Warn: Project is vulnerable to: GHSA-cfh5-3ghh-wfjx","Warn: Project is vulnerable to: 
GHSA-fmj5-wv96-r2ch","Warn: Project is vulnerable to: GHSA-2hw2-62cp-p9p7","Warn: Project is vulnerable to: GHSA-7286-pgfv-vxvh","Warn: Project is vulnerable to: GHSA-7cwj-j333-x7f7","Warn: Project is vulnerable to: GHSA-ccqf-c5hq-77mp","Warn: Project is vulnerable to: GHSA-c27h-mcmw-48hv","Warn: Project is vulnerable to: GHSA-r6j9-8759-g62w","Warn: Project is vulnerable to: GHSA-56h3-78gp-v83r","Warn: Project is vulnerable to: GHSA-7rf3-mqpx-h7xg","Warn: Project is vulnerable to: GHSA-grr4-wv38-f68w","Warn: Project is vulnerable to: GHSA-q6g2-g7f3-rr83","Warn: Project is vulnerable to: GHSA-x27m-9w8j-5vcw","Warn: Project is vulnerable to: GHSA-3vqj-43w4-2q58","Warn: Project is vulnerable to: GHSA-4jq9-2xhw-jpx7","Warn: Project is vulnerable to: GHSA-55g7-9cwv-5qfv","Warn: Project is vulnerable to: GHSA-fjpj-2g6w-x25r","Warn: Project is vulnerable to: GHSA-pqr6-cmr2-h8hf","Warn: Project is vulnerable to: GHSA-qcwq-55hx-v3vh"],"documentation":{"short":"Determines if the project has open, known unfixed 
vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}}]},"last_synced_at":"2025-08-23T20:44:09.505Z","repository_id":20491108,"created_at":"2025-08-23T20:44:09.505Z","updated_at":"2025-08-23T20:44:09.505Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28408814,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T01:52:23.358Z","status":"online","status_checked_at":"2026-01-14T02:00:06.678Z","response_time":107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataproc","google-cloud","hadoop"],"created_at":"2026-01-14T03:10:16.117Z","updated_at":"2026-01-14T03:10:16.990Z","avatar_url":"https://github.com/spotify.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spydra (Beta / Inactive)\n\n[![License](https://img.shields.io/github/license/spotify/spydra.svg)](LICENSE)\n\n**Note** This project is inactive. \n\nEphemeral Hadoop clusters using Google Compute Platform\n\n## Description\n`Spydra` is \"Hadoop Cluster as a Service\" implemented as a library utilizing [Google Cloud Dataproc](https://cloud.google.com/dataproc/) \nand [Google Cloud Storage](https://cloud.google.com/storage/). The intention of `Spydra` is to enable the use of ephemeral Hadoop clusters while hiding\nthe complexity of cluster lifecycle management and keeping troubleshooting simple. 
`Spydra` is designed to be integrated\nas a `hadoop jar` replacement.\n\n`Spydra` is part of Spotify's effort to migrate its data infrastructure to Google Compute Platform and is being used in\nproduction. The principles and the design of `Spydra` are based on our experiences in scaling and maintaining our Hadoop\ncluster to over 2500 nodes and over 100 PBs of capacity running about 20,000 independent jobs per day. \n\n`Spydra` supports submitting data processing jobs to Dataproc as well as to existing on-premise Hadoop infrastructure \nand is designed to ease the migration to and/or dual use of Google Cloud Platform and on-premise infrastructure.\n\n`Spydra` is designed to be very configurable and allows the usage of all job types and configurations supported by the \n[gcloud dataproc clusters create](https://cloud.google.com/sdk/gcloud/reference/dataproc/clusters/create) and\n[gcloud dataproc jobs submit](https://cloud.google.com/sdk/gcloud/reference/dataproc/jobs/submit/) commands.\n\n### Development Status\n`Spydra` is the rewrite of a concept that has been developed at Spotify for more than a year. The current version of\n`Spydra` is in beta, used in production at Spotify, and actively developed and supported by our data infrastructure team.\n\n`Spydra` is in beta and things might change but we are aiming at not breaking the currently exposed APIs and configuration.\n\n### Spydra at Spotify\nAt Spotify, `Spydra` is being used for our on-going migration to Google Cloud Platform. It handles the \nsubmission of on-premise Hadoop jobs as well as Dataproc jobs, simplifying the switch from on-premise Hadoop\nto Dataproc.\n\n`Spydra` is packaged in a [docker](https://www.docker.com/) image that is used to deploy data\npipelines. 
This docker image includes Hadoop tools and configurations to be able to submit to our on-premise Hadoop\ncluster as well as an installation of [gcloud](https://cloud.google.com/sdk/gcloud/) and other basic dependencies\nrequired to execute Hadoop jobs in our environment. Pipelines are then scheduled using [Styx](https://github.com/spotify/styx)\nand orchestrated by [Luigi](https://github.com/spotify/luigi) which then invokes `Spydra` instead of `hadoop jar`.\n\n### Design\n\n`Spydra` is built as a wrapper around Google Cloud Dataproc and designed not to have any central component. It exposes\nall functionality supported by Dataproc via its own configuration while adding some defaults. `Spydra` manages\nclusters and submits jobs by invoking the `gcloud dataproc` command. `Spydra` ensures that clusters are eventually deleted\nby updating a heartbeat marker in the cluster's metadata and utilizes [initialization-actions](https://cloud.google.com/dataproc/docs/concepts/init-actions)\nto set up a self-deletion script on the cluster to handle the deletion of the cluster in the event of client failures.\n\nFor submitting jobs to an existing on-premise Hadoop infrastructure, `Spydra` utilizes the `hadoop jar` command which is\nrequired to be installed and configured in the environment. \n\nFor Dataproc as well as on-premise submissions, `Spydra` will act similarly to `hadoop jar` and print out driver output.\n\n#### Credentials\n`Spydra` is designed to ease the usage of Google Compute Platform credentials by utilizing \n[service accounts](https://cloud.google.com/compute/docs/access/service-accounts). The same credential that is \nused locally by `Spydra` to manage the cluster and submit jobs is also by default forwarded to the Hadoop cluster when\ncalling Dataproc. 
This means that access rights to resources need only be given to a single set of credentials.\n\n#### Storing Execution Data and Logs\nTo make job execution data available after an ephemeral cluster was shut down, and to provide similar functionality to\nthe Hadoop MapReduce History Server, `Spydra` stores execution data and logs on Google Cloud Storage, grouping it by \na user-defined client id. Typically the client id is unique per job. The execution data and logs are then made available via \n`Spydra` commands. These allow spinning up a local MapReduce History Server to access execution data and logs\nas well as dumping them.\n\n#### Autoscaler\n`Spydra` has an **experimental** autoscaler which can be executed on the cluster. It monitors the current resource\nutilization on the cluster and scales the cluster according to a user-defined utilization factor and maximum worker count\nby adding [preemptible VMs](https://cloud.google.com/dataproc/docs/concepts/preemptible-vms). Note that the use of \npreemptible VMs might negatively impact performance as nodes might be shut down at any time.\n\nThe autoscaler is installed on the cluster using a Dataproc [initialization-action](https://cloud.google.com/dataproc/docs/concepts/init-actions).\n\n#### Cluster Pooling\n`Spydra` has **experimental** support for cluster pooling within a single Google Compute Platform project. Cluster pooling\ncan be used to limit the resources used by the job submissions, and also limit the cluster initialization overhead.\nThe maximum number of clusters to be used can be defined as well as their maximum lifetime. Upon job submission, a random cluster\nis chosen to submit the job into. When reaching their maximum lifetime, pooled clusters are deleted by the self-deletion\nmechanism.\n\n## Usage\n### Installation\nThere's a pre-built [Spydra on maven central](https://search.maven.org/#search%7Cga%7C1%7Cg%3A%22com.spotify.data.spydra%22%20a%3A%22spydra%22). 
This is built using the parameters from `.travis.yml`; the bucket `spydra-init-actions` is provided by Spotify.\n\n### Prerequisites\nTo be able to use Dataproc and on-premise Hadoop, a few things need to be set up before using `Spydra`.\n\n* Java 8\n* A [Google Cloud Platform project](https://cloud.google.com/resource-manager/docs/creating-managing-projects)\n  with the right (Google Cloud Dataproc API) [APIs enabled](https://support.google.com/cloud/answer/6158841?hl=en)\n* A [service account](https://cloud.google.com/compute/docs/access/service-accounts) with\n  [project editor](https://cloud.google.com/compute/docs/access/iam) rights in your project. The service account can be specified in two ways:\n  * A JSON key for the service account, and the environment variable [GOOGLE_APPLICATION_CREDENTIALS](https://developers.google.com/identity/protocols/application-default-credentials)\n    needs to point to the location of this service account JSON key. This cannot be a user credential.\n  * If the [GOOGLE_APPLICATION_CREDENTIALS](https://developers.google.com/identity/protocols/application-default-credentials)\n    environment variable is not set, Spydra will attempt to use application default credentials. In a local development environment\n    application default credentials can be obtained by authenticating with the command\n    [gcloud auth application-default login](https://cloud.google.com/sdk/gcloud/reference/auth/application-default/login). 
When running\n    on Google Compute Platform managed nodes, the application default credentials are provided by the default service account of the node.\n* [gcloud](https://cloud.google.com/sdk/gcloud/) needs to be installed\n* `gcloud` needs to be [authenticated using the service account](https://cloud.google.com/sdk/gcloud/reference/auth/)\n* [hadoop jar](https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/CommandsManual.html#jar)\n  needs to be installed and configured to submit to your cluster\n\n### Spydra CLI\n\n`Spydra` CLI supports multiple sub-commands:\n\n* [`submit`](#submission) - submitting jobs to on-premise Hadoop and GCP Dataproc\n* [`run-jhs`](#running-an-embedded-jobhistoryserver) - embedded history server\n* [`dump-logs`](#retrieving-logs) - viewing logs\n* [`dump-history`](#retrieving-history-data) - viewing history\n\n#### Submission\n\n```\n$ java -jar spydra/target/spydra-VERSION-jar-with-dependencies.jar submit --help\n\nusage: submit [options] [jobArgs]\n    --clientid \u003carg\u003e     client id, used as identifier in job history output\n    --spydra-json \u003carg\u003e  path to the spydra configuration json\n    --jar \u003carg\u003e          main jar path, overwrites the configured one if\n                         set\n    --jars \u003carg\u003e         jar files to be shipped with the job, can occur\n                         multiple times, overwrites the configured ones if\n                         set\n    --job-name \u003carg\u003e     job name, used as dataproc job id\n -n,--dry-run            Do a dry run without executing anything\n```\n\nOnly a few basic things can be supplied on the command line; a client-id (an arbitrary identifier\nof the client running `Spydra`), the main and additional JAR files for the job, and arguments for\nthe job. For any use-case requiring more details, the user needs to create a JSON file and supply\nthe path to that as a parameter. 
All the command-line options will override the corresponding\noptions in the JSON config. Apart from all the command-line options and some general settings,\nthe JSON config can also transparently pass along parameters to the `gcloud` command for\n[cluster creation](https://cloud.google.com/sdk/gcloud/reference/dataproc/clusters/create) or\n[job submission](https://cloud.google.com/sdk/gcloud/reference/dataproc/jobs/submit/hadoop).\n\nA job name can also be supplied. This will be sanitized and have a unique identifier attached\nto it, which will then be used as the Dataproc job ID. This is useful in finding the job in\nthe Google Cloud Console.\n\n##### The spydra-json argument\nAll properties that cannot be controlled via the few arguments of the *submit* command can be set in the\nconfiguration file supplied with the `--spydra-json` parameter. The configuration file follows the structure of the \n`gcloud dataproc clusters create` and `gcloud dataproc jobs submit` commands and allows setting all\nthe possible arguments for these commands. The basic structure looks as follows:\n\n```json\n{\n  \"client_id\": \"spydra-test\",                 # Spydra client id. Usually left out as set by the frameworks during runtime.\n  \"cluster_type\": \"dataproc\",                 # Where to execute. Either dataproc or onpremise. Defaults to onpremise.\n  \"job_type\": \"hadoop\",                       # Defaults to hadoop. 
For supported types see gcloud dataproc jobs submit --help\n  \"log_bucket\": \"spydra-test-logs\",           # The bucket where Hadoop logs and history information are stored.\n  \"region\": \"europe-west1\",                   # The region in which the cluster is spun up\n  \"cluster\": {                                # All cluster related configuration\n    \"options\": {                              # Map supporting all options from the gcloud dataproc clusters create command\n      \"project\": \"spydra-test\",\n      \"num-workers\": \"13\",\n      \"worker-machine-type\": \"n1-standard-2\", # The default machine type used by Dataproc is n1-standard-8.\n      \"master-machine-type\": \"n1-standard-4\"\n    }\n  },\n  \"submit\": {                                 # All configuration related to job submission\n    \"job_args\": [                             # Job arguments. Usually left out as set by the frameworks during runtime.\n      \"pi\",\n      \"2\",\n      \"2\"\n    ],\n    \"options\": {                              # Map supporting all options from the gcloud dataproc jobs submit [hadoop,spark,hive...] command\n      \"jar\": \"/path/my.jar\"                   # Path of the job jar file. 
Usually left out as set by the frameworks during runtime.\n    }\n  }\n}\n```\n\nFor details on the format of the JSON file see\n[this schema](/spydra/src/main/resources/spydra_config_schema.json) and\n[these examples](spydra/src/main/resources/config_examples/).\n\n##### Minimal Submission Example\n\nUsing only the command-line:\n```\n$ java -jar spydra/target/spydra-VERSION-jar-with-dependencies.jar submit --clientid simple-spydra-test --jar hadoop-mapreduce-examples.jar pi 8 100\n```\n\nJSON config:\n```\n$ cat example.json\n{\n  \"client_id\": \"simple-spydra-test\",\n  \"cluster_type\": \"dataproc\",\n  \"log_bucket\": \"spydra-test-logs\",\n  \"region\": \"europe-west1\",\n  \"cluster\": {\n    \"options\": {\n      \"project\": \"spydra-test\"\n    }\n  },\n  \"submit\": {\n    \"job_args\": [\n      \"pi\",\n      \"8\",\n      \"100\"\n    ],\n    \"options\": {\n      \"jar\": \"hadoop-mapreduce-examples.jar\"\n    }\n  }\n}\n$ spydra submit --spydra-json example.json\n```\n\n##### Cluster Autoscaling (Experimental)\nThe `Spydra` autoscaler provides automatic sizing for `Spydra` clusters by adding preemptible worker\nnodes until a user-supplied percentage of containers is running in parallel on the cluster.\nIt enables cluster sizes to automatically adjust to growing resource needs over time and removes\nthe need to come up with a good size when scheduling a job executed on `Spydra`.\nThe autoscaler has two modes: upscale only and downscale.\n\nDownscale will remove nodes when the cluster is not fully utilized. After\nchoosing to downscale, it will wait for the `downscale_timeout` to allow active\njobs to complete before terminating nodes. Note that though nodes may not have\nactive YARN containers running, active jobs may be storing intermediate\n\"shuffle\" data on them. 
See [Dataproc Graceful\nDownscale](https://cloud.google.com/dataproc/docs/concepts/scaling-clusters#using_graceful_decommissioning)\nfor more information.\n\nTo enable autoscaling, add an autoscaler section similar to the one below to your `Spydra` configuration.\n\n```json\n{\n  \"cluster\": {...},\n  \"submit\": {...},\n  \"auto_scaler\": {\n    \"interval\": \"2\",        # Execution interval of the autoscaler in minutes\n    \"max\": \"20\",            # Maximum number of workers\n    \"factor\": \"0.3\",        # Fraction of YARN containers that should be running at any point in time, from 0.0 to 1.0.\n    \"downscale\": \"false\",   # Whether or not to downscale.\n    # If downscale is enabled, how long in minutes to wait for active jobs to finish\n    # before terminating nodes and potentially interrupting those jobs.\n    # Note that the autoscaler will not be able to add nodes during this interval.\n    \"downscale_timeout\": \"10\"\n  }\n}\n```\n\n##### Static Cluster Submission\nIf you prefer to manage your Dataproc clusters manually, you can still use Spydra for job submission and skip the dynamic cluster creation part. The only change needed in the Spydra configuration is to specify the name of the cluster you want to submit the job to. 
Here is an example:\n\n```json\n{\n  \"client_id\": \"simple-spydra-test\",\n  \"cluster_type\": \"dataproc\",\n  \"log_bucket\": \"spydra-test-logs\",\n  \"submit\": {\n    \"job_args\": [\n      \"pi\",\n      \"8\",\n      \"100\"\n    ],\n    \"options\": {\n      \"project\": \"spydra-test\",\n      \"cluster\": \"NAME_OF_YOUR_CLUSTER\",\n      \"jar\": \"hadoop-mapreduce-examples.jar\"\n    }\n  }\n}\n```\nAlso notice that the `project` parameter is specified in the `submit/options` section instead of the `cluster/options` section.\n\n##### Cluster Pooling (Experimental)\nDisclaimer: The usage of the pooling is experimental!\n\nThe `Spydra` cluster pooling provides automatic pooling for `Spydra` clusters by selecting an existing\ncluster according to certain conditions.\n\nTo enable cluster pooling, add a pooling section similar to the one below to your `Spydra` configuration.\n\n```json\n{\n  \"cluster\": {...},\n  \"submit\": {...},\n  \"pooling\": {\n    \"limit\": 2,        # Limit of concurrent clusters\n    \"max_age\": \"P1D\"   # A java.time.Duration for the maximum age of a cluster\n  }\n}\n```\n\n##### Submission Gotchas\n   * You can use `--` if you need to pass a parameter starting with dashes to your job,\n     e.g. `submit --jar=jar ... -- -myParam`\n   * Don't forget to specify `=` for arguments like `--jar=$jar`, otherwise the CLI parsing\n     will break.\n   * If the specified jar contains a Main-Class entry in its manifest, specifying --mainclass\n     will often lead to undesired behaviour, as the value of main-class will be passed as\n     an argument to the application instead of invoking this class.\n   * Not setting the default fs to GCS using the `fs.defaultFS` property can lead to crashes\n     and undesired behavior as a lot of the frameworks use the default filesystem implementation\n     instead of getting the correct filesystem for a given URI. 
It can also lead to the Crunch\n     output committer working very slowly while copying all files from HDFS to GCS in a\n     last non-distributed step.\n\n#### Running an Embedded JobHistoryServer\nThe *run-jhs* command is designed for interactive exploration of job execution. It spawns an embedded\nJobHistoryServer that can display all jobs executed using the client id associated with your job submission.\nFamiliarity with the use of JobHistoryServer from on-premise Hadoop is assumed.\nThe JHS is accessible on the default port 19888.\n\nThe client id used when executing the job and the log bucket are required for running the *run-jhs* command.\n\n```java -jar spydra/target/spydra-VERSION-jar-with-dependencies.jar run-jhs --clientid=JOB_CLIENT_ID --log-bucket=LOG_BUCKET```\n\n#### Retrieving Logs\nThe *dump-logs* command will dump logs for an application to stdout. Currently, only the full logs of\nthe YARN application can be dumped, similar to YARN logs when no specific container is specified.\nThis is useful for processing/exploration with further tools in the shell.\n\nThe client id used when executing the job, the Hadoop application id, and the log bucket are required\nfor running the *dump-logs* command.\n\n```java -jar spydra/target/spydra-VERSION-jar-with-dependencies.jar dump-logs --clientid=MY_CLIENT_ID --username=HADOOP_USER_NAME --log-bucket=LOG_BUCKET --application=APPLICATION_ID```\n\n#### Retrieving History Data\nThe history files can be dumped as in regular Hadoop using the *dump-history* command.\n\nThe client id used when executing the job, the Hadoop application id, and the log bucket are required\nfor running the *dump-history* command.\n\n```java -jar spydra/target/spydra-VERSION-jar-with-dependencies.jar dump-history --clientid=MY_CLIENT_ID --log-bucket=LOG_BUCKET --application=APPLICATION_ID```\n\n## Accessing Hadoop Web Interfaces for Ephemeral Clusters\n[Dataprocxy](https://github.com/spotify/dataprocxy) can be used to open the web interfaces of the 
Hadoop daemons of\nan ephemeral cluster as long as the cluster is running.\n\n\n## Building\n\n### Prerequisites\n* Java JDK 8\n* Maven 3.2.2\n* A Google Compute Platform project with Dataproc enabled\n* A Google Cloud Storage bucket for uploading init-actions. Ensure that this bucket is readable with all credentials used with `Spydra`.\n* A Google Cloud Storage bucket for storing integration test logs\n* A JSON key for a [service account](https://cloud.google.com/compute/docs/access/service-accounts)\n  with editor access to the project and bucket.\n* The environment variable `GOOGLE_APPLICATION_CREDENTIALS` pointing at the location of the service\n  account JSON key\n* [gcloud](https://cloud.google.com/sdk/gcloud/) authenticated with the service account\n* [gsutil](https://cloud.google.com/storage/docs/gsutil) authenticated with the service account\n\n### Integration Test Configuration\n\nIn order to run integration tests, basic configuration needs to be provided during the build process.\nCreate a file named *integration-test-config.json* similar to the one below and reference\nit during the Maven invocation.\n\n```json\n{\n  \"log_bucket\": \"YOUR_GCS_LOG_BUCKET\",\n  \"region\": \"europe-west1\"\n}\n```\n\nReplace YOUR_GCS_LOG_BUCKET with a bucket you have in your GCP project for storing the logs.\n\nThe project will be taken from the service account credentials, so you do not need to specify\nthe *project* parameter in *integration-test-config.json* (or elsewhere).\n\nNotice that the file name must be exactly *integration-test-config.json*, as that is what the\nintegration test will search for when it runs in the Maven verify phase.\n\n#### Integration testing with application default credentials\n\nDue to a limitation in the\n[GCS Connector library](https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage),\nthe integration tests do not work when using application default credentials, unless the tests are launched on\na Google Compute 
Platform managed node. Scripts for launching the tests in a Google Kubernetes Engine cluster\nhave been provided in [integration_test_k8s](./integration_test_k8s).\n\n### Build, Test and Package\n\nIn the following command, replace YOUR_INIT_ACTION_BUCKET with the bucket you created\nwhen setting up the prerequisites and YOUR_TEST_CONFIG_DIR with a directory name containing\nthe file *integration-test-config.json* you created in the previous step. YOUR_TEST_CONFIG_DIR\ncannot be the same as the package root, so create a separate directory for this purpose.\nThen execute the Maven command:\n\n```\nmvn clean install -Dinit-action-uri=gs://YOUR_INIT_ACTION_BUCKET/spydra -Dtest-configuration-dir=YOUR_TEST_CONFIG_DIR\n```\n\nExecuting the Maven command above will run the integration tests and create\na spydra-VERSION-jar-with-dependencies.jar under spydra/target that packages `Spydra`,\nwhich can be executed with `java -jar`. Use `package` instead of `install`\nto run just the unit tests and package Spydra.\n\nIf you want to copy the init-scripts into the defined init-action bucket, activate the profile\n`install-init-scripts`:\n\n```\nmvn clean install -Pinstall-init-scripts -Dinit-action-uri=gs://YOUR_INIT_ACTION_BUCKET/spydra -Dtest-configuration-dir=YOUR_TEST_CONFIG_DIR\n```\n\nDo not run the Maven `deploy` step, as it will try to upload the created packages into Spotify-owned\nrepositories, which will fail unless you have Spotify-specific credentials.\n\n## Communications\n\nIf you use Spydra and experience any issues, please create an issue in this GitHub project\n[here](https://github.com/spotify/spydra/issues/new).\n\nYou can also ask for help and talk to us about Spydra-related issues in\n[Spotify FOSS Slack](https://spotify-foss.slack.com) in the channel\n[#spydra](https://spotify-foss.slack.com/messages/C58L8GVS8/).\n\n## Contributing\n\nThis project adheres to the [Open Code of Conduct][code-of-conduct]. 
By participating,\nyou are expected to honor this code.\n\n[code-of-conduct]: https://github.com/spotify/code-of-conduct/blob/master/code-of-conduct.md\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspotify%2Fspydra","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fspotify%2Fspydra","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspotify%2Fspydra/lists"}