https://github.com/lresende/ansible-spark-cluster
Ansible roles to install a Spark Standalone cluster (HDFS/Spark/Jupyter Notebook) or an Ambari-based Spark cluster
anaconda ansible ansible-roles apache-ambari apache-spark jupyter-enterprise-gateway jupyter-notebook
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/lresende/ansible-spark-cluster
- Owner: lresende
- License: apache-2.0
- Created: 2016-05-23T05:52:30.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2024-01-30T13:54:26.000Z (12 months ago)
- Last Synced: 2024-09-30T17:42:03.872Z (4 months ago)
- Topics: anaconda, ansible, ansible-roles, apache-ambari, apache-spark, jupyter-enterprise-gateway, jupyter-notebook
- Language: Shell
- Homepage:
- Size: 202 KB
- Stars: 60
- Watchers: 9
- Forks: 35
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
This repository defines multiple Ansible roles to help deploy different modes of a Spark cluster and
a Data Science Platform based on the Anaconda and Jupyter Notebook stack.

# Requirements
You will need a driver machine with Ansible installed and a clone of the current repository:

* If you are running on cloud (public/private network)
  * Install Ansible on the edge node (with public IP)
* If you are running on private cloud (public network access to all nodes)
  * Install Ansible on your laptop and drive the deployment from it

### Installing Ansible on RHEL
```
curl -O https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo rpm -i epel-release-latest-7.noarch.rpm
sudo yum update -y
sudo yum install -y ansible
```

### Installing Ansible on Mac

* Install Anaconda
* Use pip to install Ansible

```
pip install --upgrade ansible
```

### Updating Ansible configuration

To allow variables to be overridden from the host inventory, add the following configuration to your ~/.ansible.cfg file:
```
[defaults]
host_key_checking = False
hash_behaviour = merge
```

### Supported/Tested Platform

* RHEL 7.x
* Ansible 2.6.3

# Defining your cluster deployment metadata (host inventory)

Ansible uses 'host inventory' files to define the cluster configuration, nodes, and groups of nodes
that serve a given purpose (e.g. master node).

Below is a sample host inventory definition:
```
[all:vars]
ansible_connection=ssh
#ansible_user=root
#ansible_ssh_private_key_file=~/.ssh/ibm_rsa
gather_facts=True
gathering=smart
host_key_checking=False
install_java=True
install_temp_dir=/tmp/ansible-install
install_dir=/opt
python_version=2

[master]
lresende-elyra-node-1 ansible_host=IP ansible_host_private=IP ansible_host_id=1

[nodes]
lresende-elyra-node-2 ansible_host=IP ansible_host_private=IP ansible_host_id=2
lresende-elyra-node-3 ansible_host=IP ansible_host_private=IP ansible_host_id=3
lresende-elyra-node-4 ansible_host=IP ansible_host_private=IP ansible_host_id=4
lresende-elyra-node-5 ansible_host=IP ansible_host_private=IP ansible_host_id=5
```
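Before running a playbook, it can help to confirm which hosts ended up in which group. The following is a minimal sketch, not part of this repository: `list_group_hosts` is a hypothetical helper that reads an INI-style inventory like the sample above and prints the host aliases under one group. It assumes plain `[group]` sections only (no `[group:children]` handling); Ansible's own `ansible-inventory` command remains the authoritative way to inspect an inventory.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with this repository):
# print the host aliases listed under one group of an INI-style inventory.
#   $1 = inventory file
#   $2 = group name, e.g. master or nodes
list_group_hosts() {
  awk -v group="[$2]" '
    # A section header switches the "inside the requested group" flag.
    /^\[/ { in_group = ($0 == group); next }
    # Inside the group, print the first field (the host alias),
    # skipping blank lines and comment lines.
    in_group && NF && $1 !~ /^#/ { print $1 }
  ' "$1"
}
```

For example, `list_group_hosts hosts-fyre-spark nodes` would print one worker alias per line, assuming `hosts-fyre-spark` follows the layout of the sample inventory.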
Some specific configurations are:
* install_java=True : install/update java 8
* install_temp_dir=/tmp/ansible-install : temporary folder used for install files
* install_dir=/opt : where packages are installed (e.g. Spark)
* python_version=2 : python version to use; this influences which version of Anaconda is downloaded

**Note:** ansible_host_id is only used when deploying a "Spark Standalone" cluster.

**Note:** Ambari currently supports only Python 2.x

# Deploying Spark using Ambari and HDP distribution
In this scenario, a minimal blueprint is used to deploy the required components
to run YARN and Spark.

### Related ansible roles
* **Common** Deploys Java and common dependencies
* **Ambari** Deploys Ambari cluster with HDP Stack

### Deployment playbook
The sample playbook below can be used to deploy Spark using an HDP distribution:
```
- name: ambari setup
  hosts: all
  remote_user: root
  roles:
    - role: common
    - role: ambari
```

### Deploying
```
ansible-playbook --verbose <playbook> -i <hosts inventory>
```

Example:
```
ansible-playbook --verbose setup-ambari.yml -c paramiko -i hosts-fyre-ambari
```

# Deploying Spark standalone
In this scenario, a Standalone Spark cluster will be deployed with a few optional components.
### Related ansible roles
* **Common** Deploys Java and common dependencies
* **HDFS** Deploys HDFS filesystem using slave nodes as data nodes
* **Spark** Deploys Spark in Standalone mode using slave nodes as workers
* **Spark-Cluster-Admin** Utility scripts for managing Spark cluster
* **ElasticSearch** Deploys ElasticSearch nodes on all slave nodes
* **Zookeeper** Deploys Zookeeper on all nodes (required by Kafka)
* **Kafka** Deploys Kafka nodes on all slave nodes

### Deployment playbook
```
- name: spark setup
  hosts: all
  remote_user: root
  roles:
    - role: common
    - role: hdfs
    - role: spark
    - role: spark-cluster-admin
```
**Note:** When deploying Kafka, the Zookeeper role is required
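The dependency in the note above can be checked mechanically before a deployment. The sketch below is an illustration only: `check_kafka_requires_zookeeper` is a hypothetical helper, and the lowercase role names `kafka` and `zookeeper` are assumptions based on the naming of the other roles (`common`, `hdfs`, `spark`). It only looks for `- role: <name>` lines and does not parse YAML.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not shipped with this repository):
# fail if a playbook lists the kafka role without the zookeeper role.
# Role names "kafka" and "zookeeper" are assumed, not confirmed by the README.
#   $1 = playbook file
check_kafka_requires_zookeeper() {
  playbook=$1
  kafka_re='^[[:space:]]*-[[:space:]]*role:[[:space:]]*kafka[[:space:]]*$'
  zookeeper_re='^[[:space:]]*-[[:space:]]*role:[[:space:]]*zookeeper[[:space:]]*$'

  # Complain only when kafka is present and zookeeper is absent.
  if grep -Eq "$kafka_re" "$playbook" && ! grep -Eq "$zookeeper_re" "$playbook"; then
    echo "error: $playbook uses the kafka role without the zookeeper role" >&2
    return 1
  fi
  return 0
}
```

One way to use it is to chain the check in front of the deployment command, for example `check_kafka_requires_zookeeper setup-spark-standalone.yml && ansible-playbook --verbose setup-spark-standalone.yml -c paramiko -i hosts-fyre-spark`.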
### Deploying
```
ansible-playbook --verbose <playbook> -i <hosts inventory>
```

Example:
```
ansible-playbook --verbose setup-spark-standalone.yml -c paramiko -i hosts-fyre-spark
```

# Deploying Data Science Platform components
In this scenario, an existing Spark cluster is updated with the components needed to build a data science platform
based on the Anaconda and Jupyter Notebook stack.

### Related ansible roles
* **Anaconda** Deploys Anaconda Python distribution on all nodes
* **Notebook** Deploys Notebook Platform

### Deployment playbook
```
- name: anaconda
  hosts: all
  vars:
    anaconda:
      update_path: true
  remote_user: root
  roles:
    - role: anaconda

- name: notebook platform dependencies
  hosts: all
  vars:
    notebook:
      use_anaconda: true
      deploy_kernelspecs_to_workers: false
  remote_user: root
  roles:
    - role: notebook
```

**Playbook Configuration**
* **use_anaconda**: Flag to identify if anaconda is available and should be used as python package manager
* **deploy_kernelspecs_to_workers**: optionally deploy kernelspecs for Python, R, and Scala to all nodes

# Legal Disclaimers
* The **Ambari** role will install [MySQL community edition](https://www.mysql.com/products/community/)
  which is available under GPL license.
* The **Notebook** role will install [R](https://www.r-project.org/) which is available under [GPL2 | GPL 3](https://www.r-project.org/Licenses/)

By deploying these packages via the ansible utility scripts in this project you are accepting the
license terms for these components.