# hgi-systems-cluster-spark

A reboot of the HGI's IaC project. This specific project was created to
address one simple, initial objective: the lifecycle management of a Spark cluster.

# Why a reboot?

The old code was no longer effective: the team was not confident with the
codebase or the build process, and the infrastructure generated by the code
was missing a number of must-have features for today's infrastructures.
We chose a fresh start on the IaC, rather than refactoring legacy code.
This lets us choose simple and effective objectives, outline better
requirements, and design around operability from the very beginning.

# Guide

## Using this repository
1. Ensure a `terraform 0.11` executable is available anywhere in your `PATH`
2. Ensure a `packer 1.4` executable is available anywhere in your `PATH`
3. Ensure the `docker` distribution is [installed](https://docs.docker.com/install/linux/docker-ce/ubuntu/)
4. Ensure that the following packages are installed:
* build-essential
* cmake
* g++
* libatlas3-base
* liblz4-dev
* libnetlib-java
* libopenblas-base
* make
* openjdk-8-jdk
* python3
* python3-dev
* python3-pip
* r-base
* r-recommended
* scala
5. Ensure that the Python requirements in `requirements.txt` are installed (an example install is sketched just after this list)
6. Follow the [setup](docs/setup.md) runbook
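
The following is a minimal sketch of steps 4 and 5, assuming an Ubuntu/Debian host with `apt` available; adapt the commands to your distribution:
```
# Sketch only: install the system packages listed above (assumes Ubuntu/Debian)
sudo apt-get update
sudo apt-get install -y \
    build-essential cmake g++ libatlas3-base liblz4-dev libnetlib-java \
    libopenblas-base make openjdk-8-jdk python3 python3-dev python3-pip \
    r-base r-recommended scala

# Install the Python requirements from the repository root
pip3 install -r requirements.txt
```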

## Running tasks
`invoke.sh` is a shell script that wraps `pyinvoke`'s quite extensive list of
tasks and collections and makes their usage even easier. To understand how to
use `invoke.sh`, you can run:
```
bash invoke.sh --help
```
To get an idea of what the tasks are and what they do, please have a look at the
[tasks](tasks/README.md) documentation.
For a quick list of example usages, please refer to the
[users](docs/runbook_users.md) or [ops](docs/runbook_ops.md) runbooks.
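
Assuming the wrapper forwards standard `pyinvoke` flags (an assumption, not a documented guarantee), you can also list the available tasks directly:
```
bash invoke.sh --list
```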

# Try your Jupyter notebook

## Jupyter Notebook
Open your hail-master Jupyter URL `http://<hail-master address>/jupyter/` in a web
browser, create a notebook, then initialise it:
```
import os
import hail
import pyspark

tmp_dir = os.path.join(os.environ['HAIL_HOME'], 'tmp')
sc = pyspark.SparkContext()

hail.init(sc=sc, tmp_dir=tmp_dir)
```

## Interactive pyspark

A minimal `~/.ssh/config` entry can make the ssh step easier. The following is a
sketch only: the `hail-master` alias and its address are placeholders, and the
options simply mirror the command shown below.
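```
# Sketch: append a host entry for your hail-master node (alias and address are placeholders)
cat >> ~/.ssh/config <<'EOF'
Host hail-master
    HostName <hail-master address>
    User ubuntu
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
EOF
```
With such an entry in place, a plain `ssh hail-master` should be enough. Otherwise,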
`ssh` into your hail-master node:
```
$ ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ubuntu@<hail-master address>
```
Once you've logged in, become the application user (i.e. `hgi`, for now):
```
$ sudo --login --user=hgi --group=hgi
```
The `--login` option creates a login shell with many pre-configured environment
variables and commands, including an alias for `pyspark`, so you should not need
to remember any options. Once you have started `pyspark`, you can initialise hail
like this:

```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Python version 3.7.3 (default, Mar 27 2019 22:11:17)
SparkSession available as 'spark'.
>>> import os
>>> import hail
>>> tmp_dir = os.path.join(os.environ['HAIL_HOME'], 'tmp')
>>> hail.init(sc=sc, tmp_dir=tmp_dir)
```

## Non-interactive pyspark
Hail initialisation in a non-interactive `pyspark` session is the same as for
the Jupyter Notebooks:
```
import os
import hail
import pyspark

tmp_dir = os.path.join(os.environ['HAIL_HOME'], 'tmp')
sc = pyspark.SparkContext()

hail.init(sc=sc, tmp_dir=tmp_dir)
```
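
How you submit such a script depends on how the cluster images configure Spark
and Hail; as a purely hypothetical sketch (the script name is a placeholder, and
any extra Hail jars or configuration are assumed to be handled by the image), a
non-interactive run might look like:
```
spark-submit my_hail_job.py
```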

# How to contribute
Read the [CONTRIBUTING.md](CONTRIBUTING.md) file.

# License
Read the [LICENSE.md](LICENSE.md) file.