{"id":21881201,"url":"https://github.com/wtsi-hgi/hgi-cloud","last_synced_at":"2025-07-16T21:07:15.485Z","repository":{"id":40642278,"uuid":"197594876","full_name":"wtsi-hgi/hgi-cloud","owner":"wtsi-hgi","description":"terraform and ansible codebase to provision clusters (e.g. hail/spark) at Sanger","archived":false,"fork":false,"pushed_at":"2023-09-08T09:31:43.000Z","size":1644,"stargazers_count":1,"open_issues_count":3,"forks_count":3,"subscribers_count":3,"default_branch":"develop","last_synced_at":"2025-04-15T05:16:49.541Z","etag":null,"topics":["ansible","hail","iac","openstack","packer","spark","terraform"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wtsi-hgi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-07-18T13:42:32.000Z","updated_at":"2020-06-03T20:18:19.000Z","dependencies_parsed_at":"2022-08-31T08:11:12.122Z","dependency_job_id":null,"html_url":"https://github.com/wtsi-hgi/hgi-cloud","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wtsi-hgi%2Fhgi-cloud","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wtsi-hgi%2Fhgi-cloud/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wtsi-hgi%2Fhgi-cloud/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wtsi-hgi%2Fhgi-cloud/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wtsi-hgi","download_url":"https://codeload.github.com/
wtsi-hgi/hgi-cloud/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249010221,"owners_count":21197796,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ansible","hail","iac","openstack","packer","spark","terraform"],"created_at":"2024-11-28T09:18:08.202Z","updated_at":"2025-04-15T05:16:55.002Z","avatar_url":"https://github.com/wtsi-hgi.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# hgi-systems-cluster-spark\n\nA reboot of the HGI's IaC project. This specific project has been created to\naddress one, simple, initial objective: the lifecycle management of a spark cluster.\n\n# Why a reboot?\n\nThe code was not effective any more: the team was not confident with the\ncodebase, the building process and the infrastructure generated by the code\nwas missing a number of must-have features for today's infrastructures.\nWe chose to have a fresh start on the IaC, rather than refactoring legacy\ncode. This will let us choose simple and effective objectives, outline better\nrequirements, and design around operability from the very beginning.\n\n# Guide\n\n## Using this repository\n1. `terraform 0.11` executable anywhere in your `PATH`\n2. `packer 1.4` executable anywhere in your `PATH`\n3. `docker` distribution [installed](https://docs.docker.com/install/linux/docker-ce/ubuntu/)\n4. 
Ensure that the following packages are installed:\n   * build-essential\n   * cmake\n   * g++\n   * libatlas3-base\n   * liblz4-dev\n   * libnetlib-java\n   * libopenblas-base\n   * make\n   * openjdk-8-jdk\n   * python3\n   * python3-dev\n   * python3-pip\n   * r-base\n   * r-recommended\n   * scala\n5. Ensure that the Python requirements in `requirements.txt` are installed\n6. Follow the [setup](docs/setup.md) runbook\n\n## Running tasks\n`invoke.sh` is a shell script that wraps `pyinvoke`'s quite extensive list of\ntasks and collections, and makes its usage even easier. To\nunderstand how to use `invoke.sh`, you can run:\n```\nbash invoke.sh --help\n```\nTo get an idea of what the tasks are and what they do, please have a look at the\n[tasks](tasks/README.md) documentation.\nFor a quick list of example usages, please refer to the\n[users](docs/runbook_users.md) or [ops](docs/runbook_ops.md) runbooks.\n\n# Try your Jupyter's notebook\n\n## Jupyter Notebook\nOpen your hail-master Jupyter URL http://\\\u003cIP\\_OR\\_NAME\\\u003e/jupyter/ in a web\nbrowser, create a notebook, then initialise it:\n```\nimport os\nimport hail\nimport pyspark\n\ntmp_dir = os.path.join(os.environ['HAIL_HOME'], 'tmp')\nsc = pyspark.SparkContext()\n\nhail.init(sc=sc, tmp_dir=tmp_dir)\n```\n\n## Interactive pyspark\n\n(TODO: include a .ssh/config snippet to allow for an easier ssh run)\n`ssh` into your hail-master node:\n```\n$ ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ubuntu@\u003cIP_OR_NAME\u003e\n```\nOnce you've logged in, become the application user (i.e. hgi -- for now)\n```\n$ sudo --login --user=hgi --group=hgi\n```\nThe `--login` option will create a login shell that will have a lot of\npre-configured environment variables and commands, including a pre-configured\nalias to `pyspark`, so you should not need to remember any option. 
Once you have\nstarted `pyspark`, you can initialise hail like this:\n\n```\nWelcome to\n      ____              __\n     / __/__  ___ _____/ /__\n    _\\ \\/ _ \\/ _ `/ __/  '_/\n   /__ / .__/\\_,_/_/ /_/\\_\\   version 2.4.3\n      /_/\n\nUsing Python version 3.7.3 (default, Mar 27 2019 22:11:17)\nSparkSession available as 'spark'.\n\u003e\u003e\u003e import os\n\u003e\u003e\u003e import hail\n\u003e\u003e\u003e tmp_dir = os.path.join(os.environ['HAIL_HOME'], 'tmp')\n\u003e\u003e\u003e hail.init(sc=sc, tmp_dir=tmp_dir)\n```\n\n## Non-interactive pyspark\nHail initialisation in a non-interactive `pyspark` session is the same as for\nthe Jupyter Notebooks:\n```\nimport os\nimport hail\nimport pyspark\n\ntmp_dir = os.path.join(os.environ['HAIL_HOME'], 'tmp')\nsc = pyspark.SparkContext()\n\nhail.init(sc=sc, tmp_dir=tmp_dir)\n```\n\n# How to contribute\nRead the [CONTRIBUTING.md](CONTRIBUTING.md) file\n\n# License\nRead the [LICENSE.txt](LICENSE.txt) file\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwtsi-hgi%2Fhgi-cloud","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwtsi-hgi%2Fhgi-cloud","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwtsi-hgi%2Fhgi-cloud/lists"}