# hgi-systems-cluster-spark

A reboot of the HGI's IaC project. This specific project was created to
address one simple, initial objective: the lifecycle management of a Spark cluster.

# Why a reboot?

The old code was no longer effective: the team was not confident with the
codebase or the build process, and the infrastructure generated by the code
was missing a number of must-have features for today's infrastructures.
We chose a fresh start on the IaC, rather than refactoring legacy code.
This lets us choose simple and effective objectives, outline better
requirements, and design around operability from the very beginning.

# Guide

## Using this repository
1. Ensure a `terraform 0.11` executable is available anywhere in your `PATH`
2. Ensure a `packer 1.4` executable is available anywhere in your `PATH`
3. Ensure the `docker` distribution is [installed](https://docs.docker.com/install/linux/docker-ce/ubuntu/)
4. Ensure that the following packages are installed:
* build-essential
* cmake
* g++
* libatlas3-base
* liblz4-dev
* libnetlib-java
* libopenblas-base
* make
* openjdk-8-jdk
* python3
* python3-dev
* python3-pip
* r-base
* r-recommended
* scala
5. Ensure that the Python requirements in `requirements.txt` are installed (an example install is sketched just after this list)
6. Follow the [setup](docs/setup.md) runbook
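
The following is a minimal sketch of steps 4 and 5, assuming an Ubuntu/Debian host with `apt` available; adapt the commands to your distribution:
```
# Sketch only: install the system packages listed above (assumes Ubuntu/Debian)
sudo apt-get update
sudo apt-get install -y \
    build-essential cmake g++ libatlas3-base liblz4-dev libnetlib-java \
    libopenblas-base make openjdk-8-jdk python3 python3-dev python3-pip \
    r-base r-recommended scala

# Install the Python requirements from the repository root
pip3 install -r requirements.txt
```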

## Running tasks
`invoke.sh` is a shell script that wraps `pyinvoke`'s quite extensive list of
tasks and collections and makes their usage even easier. To understand how to
use `invoke.sh`, you can run:
```
bash invoke.sh --help
```
To get an idea of what the tasks are and what they do, please have a look at the
[tasks](tasks/README.md) documentation.
For a quick list of example usages, please refer to the
[users](docs/runbook_users.md) or [ops](docs/runbook_ops.md) runbooks.
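
Assuming the wrapper forwards standard `pyinvoke` flags (an assumption, not a documented guarantee), you can also list the available tasks directly:
```
bash invoke.sh --list
```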

# Try your Jupyter notebook

## Jupyter Notebook
Open your hail-master Jupyter URL `http://<hail-master address>/jupyter/` in a web
browser, create a notebook, then initialise it:
```
import os
import hail
import pyspark

tmp_dir = os.path.join(os.environ['HAIL_HOME'], 'tmp')
sc = pyspark.SparkContext()

hail.init(sc=sc, tmp_dir=tmp_dir)
```

## Interactive pyspark

A minimal `~/.ssh/config` entry can make the ssh step easier. The following is a
sketch only: the `hail-master` alias and its address are placeholders, and the
options simply mirror the command shown below.
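```
# Sketch: append a host entry for your hail-master node (alias and address are placeholders)
cat >> ~/.ssh/config <<'EOF'
Host hail-master
    HostName <hail-master address>
    User ubuntu
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
EOF
```
With such an entry in place, a plain `ssh hail-master` should be enough. Otherwise,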
`ssh` into your hail-master node:
```
$ ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ubuntu@<hail-master address>
```
Once you've logged in, become the application user (i.e. `hgi`, for now):
```
$ sudo --login --user=hgi --group=hgi
```
The `--login` option creates a login shell with many pre-configured environment
variables and commands, including an alias for `pyspark`, so you should not need
to remember any options. Once you have started `pyspark`, you can initialise hail
like this:

```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Python version 3.7.3 (default, Mar 27 2019 22:11:17)
SparkSession available as 'spark'.
>>> import os
>>> import hail
>>> tmp_dir = os.path.join(os.environ['HAIL_HOME'], 'tmp')
>>> hail.init(sc=sc, tmp_dir=tmp_dir)
```

## Non-interactive pyspark
Hail initialisation in a non-interactive `pyspark` session is the same as for
the Jupyter Notebooks:
```
import os
import hail
import pyspark

tmp_dir = os.path.join(os.environ['HAIL_HOME'], 'tmp')
sc = pyspark.SparkContext()

hail.init(sc=sc, tmp_dir=tmp_dir)
```
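
How you submit such a script depends on how the cluster images configure Spark
and Hail; as a purely hypothetical sketch (the script name is a placeholder, and
any extra Hail jars or configuration are assumed to be handled by the image), a
non-interactive run might look like:
```
spark-submit my_hail_job.py
```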

# How to contribute
Read the [CONTRIBUTING.md](CONTRIBUTING.md) file.

# License
Read the [LICENSE.md](LICENSE.md) file.