Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/wtsi-hgi/hgi-cloud
terraform and ansible codebase to provision clusters (e.g. hail/spark) at Sanger
- Host: GitHub
- URL: https://github.com/wtsi-hgi/hgi-cloud
- Owner: wtsi-hgi
- License: bsd-3-clause
- Created: 2019-07-18T13:42:32.000Z (over 5 years ago)
- Default Branch: develop
- Last Pushed: 2023-09-08T09:31:43.000Z (over 1 year ago)
- Last Synced: 2024-04-15T03:07:19.050Z (10 months ago)
- Topics: ansible, hail, iac, openstack, packer, spark, terraform
- Language: Python
- Homepage:
- Size: 1.57 MB
- Stars: 1
- Watchers: 4
- Forks: 3
- Open Issues: 3
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.txt
- Code of conduct: CODE_OF_CONDUCT.md
README
# hgi-systems-cluster-spark
A reboot of HGI's IaC project. This project was created to address one simple, initial objective: the lifecycle management of a Spark cluster.

# Why a reboot?
The old code was no longer effective: the team was not confident with the codebase or the build process, and the infrastructure generated by the code was missing a number of must-have features for today's infrastructures.
We chose to make a fresh start on the IaC, rather than refactoring legacy code. This lets us choose simple and effective objectives, outline better requirements, and design for operability from the very beginning.

# Guide
## Using this repository
1. Ensure a `terraform 0.11` executable is anywhere in your `PATH`
2. Ensure a `packer 1.4` executable is anywhere in your `PATH`
3. Ensure a `docker` distribution is [installed](https://docs.docker.com/install/linux/docker-ce/ubuntu/) (a quick check for steps 1-3 is sketched after this list)
4. Ensure that the following packages are installed:
* build-essential
* cmake
* g++
* libatlas3-base
* liblz4-dev
* libnetlib-java
* libopenblas-base
* make
* openjdk-8-jdk
* python3
* python3-dev
* python3-pip
* r-base
* r-recommended
* scala
5. Ensure that the Python requirements in `requirements.txt` are installed
6. Follow the [setup](docs/setup.md) runbook
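A quick way to check steps 1-3 before going further is a small script along these lines (a hypothetical helper, not part of this repository; it only relies on the standard `--version` flag that `terraform`, `packer` and `docker` all support):

```
#!/usr/bin/env python3
"""Hypothetical helper: verify the tool prerequisites above are available."""
import shutil
import subprocess

for tool in ("terraform", "packer", "docker"):
    if shutil.which(tool) is None:
        raise SystemExit(f"{tool} not found in PATH")
    # All three tools print their version with `--version`; show the first line.
    result = subprocess.run([tool, "--version"], capture_output=True, text=True)
    print(f"{tool}: {result.stdout.splitlines()[0]}")
```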
## Running tasks

`invoke.sh` is a shell script made to wrap `pyinvoke`'s quite extensive list of tasks and collections, and to make their usage even easier. To understand how to use `invoke.sh`, you can run:
```
bash invoke.sh --help
```
To get an idea of what the tasks are and do, please have a look at the
[tasks](tasks/README.md) documentation.
For a quick list of example usages, please refer to the
[users](docs/runbook_users.md) or [ops](docs/runbook_ops.md) runbooks.
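For background, `pyinvoke` tasks are ordinary Python functions decorated with `@task` and grouped into collections, which is what `invoke.sh` ultimately wraps. The sketch below is purely illustrative: the decorator and `c.run()` come from `pyinvoke` itself, but the task names, template path and variable names are made up and do not correspond to the real tasks under `tasks/`:

```
# Illustrative pyinvoke task module; not the repository's actual tasks/.
from invoke import task

@task
def image(c, flavour="hail"):
    """Build a machine image (hypothetical packer template path)."""
    c.run(f"packer build packer/{flavour}.json")

@task(pre=[image])
def cluster(c, env="dev"):
    """Provision the cluster after the image exists (hypothetical terraform dir)."""
    c.run(f"terraform apply -var environment={env} terraform/")
```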
# Try your Jupyter notebook

## Jupyter Notebook
Open your hail-master Jupyter URL `http://<hail-master>/jupyter/` in a web browser, create a notebook, then initialise it:
```
import os
import hail
import pyspark

tmp_dir = os.path.join(os.environ['HAIL_HOME'], 'tmp')
sc = pyspark.SparkContext()
hail.init(sc=sc, tmp_dir=tmp_dir)
```

## Interactive pyspark
(TODO: include a .ssh/config snippet to allow for an easier ssh run)
`ssh` into your hail-master node:
```
$ ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ubuntu@<hail-master>
```
Once you've logged in, become the application user (i.e. `hgi`, for now):
```
$ sudo --login --user=hgi --group=hgi
```
The `--login` option will create a login shell with many pre-configured environment variables and commands, including an alias for `pyspark`, so you should not need to remember any options. Once you've started `pyspark`, you can initialise hail like this:
```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Python version 3.7.3 (default, Mar 27 2019 22:11:17)
SparkSession available as 'spark'.
>>> import os
>>> import hail
>>> tmp_dir = os.path.join(os.environ['HAIL_HOME'], 'tmp')
>>> hail.init(sc=sc, tmp_dir=tmp_dir)
```

# Non-interactive pyspark
Hail initialisation in a non-interactive `pyspark` session is the same as for
the Jupyter Notebooks:
```
import os
import hail
import pyspark

tmp_dir = os.path.join(os.environ['HAIL_HOME'], 'tmp')
sc = pyspark.SparkContext()
hail.init(sc=sc, tmp_dir=tmp_dir)
```
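Once `hail.init()` has returned, a quick way to confirm that the session really drives the cluster is to run a tiny job. The snippet below (continuing the session above) uses Hail's built-in `balding_nichols_model` dataset generator purely as a sanity check; the sizes are arbitrary:

```
# Generate a small random dataset and count it; the count forces a Spark job
# to run on the executors rather than only on the driver.
mt = hail.balding_nichols_model(n_populations=3, n_samples=100, n_variants=100)
print(mt.count())  # (n_variants, n_samples) -> (100, 100)
```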
# How to contribute

Read the [CONTRIBUTING.md](CONTRIBUTING.md) file.

# License

Read the [LICENSE.md](LICENSE.md) file.