Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Testing Sandbox for Hadoop Ecosystem Components
- Host: GitHub
- URL: https://github.com/awesome-kyuubi/hadoop-testing
- Owner: awesome-kyuubi
- License: apache-2.0
- Created: 2023-03-01T09:20:19.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-10-29T09:01:44.000Z (14 days ago)
- Last Synced: 2024-10-29T10:56:26.772Z (14 days ago)
- Topics: docker-compose, hadoop, sandbox
- Language: Jinja
- Homepage:
- Size: 2.04 MB
- Stars: 33
- Watchers: 1
- Forks: 12
- Open Issues: 20
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
Hadoop Testing
==============
This serves as a testing sandbox for Hadoop, equipped with fundamental components
of the Hadoop ecosystem to facilitate the rapid establishment of test environments. We deploy a big data ecosystem across multiple Docker containers to simulate a production environment. Generally speaking, it supports two deployment modes (standalone and mixed). Standalone mode is like a SaaS service provided by cloud vendors, while the mixed mode is like the semi-managed EMR service of cloud vendors. The whole deployment architecture is shown below:
![deployment_architecture](./docs/imgs/deployment_architecture.png)
> Draw by [excalidraw](https://excalidraw.com/)
## Features
* Realistic simulation of production environment;
* Kerberos ready, and optional;
* Lightweight, highly scalable and tailored Hadoop ecosystem;
* Multi-purpose, multi-scenario, suitable for:
- Component developer: unit and integration testing;
- DevOps engineer: parameter adjustment verification, compatibility testing of component upgrades;
- Solution architect: sandbox simulation of migration work, workshop demonstrations;
- Data ETL engineer: a test environment that is easy to build and destroy.

## Components
The supported components are listed below:
| Name | Version | Kerberos Ready | Optional | Default Enabled | Variables |
| -------------- | ------- | -------------- | -------- | --------------- | -------------------------------------- |
| JDK 8 | 8.0.392 | Not Applicable | No | Yes | |
| JDK 17 | 17.0.9 | Not Applicable | No | Yes | |
| JDK 21 | 21.0.1 | Not Applicable | Yes | No | jdk21_enabled |
| KDC | latest | Yes | Yes | No | kerberos_enabled |
| MySQL | 8.0 | No | No | Yes | |
| ZooKeeper | 3.8.4 | Not Yet | No | Yes | |
| Hadoop HDFS | 3.3.6 | Yes | No | Yes | |
| Hadoop YARN | 3.3.6 | Yes | No | Yes | |
| Hive Metastore | 2.3.9 | Yes | No | Yes | |
| HiveServer2 | 2.3.9 | Yes | No | Yes | |
| Kyuubi | 1.10.0 | Yes | No | Yes | |
| Spark | 3.5.1 | Yes | Yes | Yes | spark_enabled |
| Flink | 1.20.0 | Yes | Yes | No | flink_enabled |
| Trino | 436 | Not Yet | Yes | No | trino_enabled |
| Ranger | 2.4.0 | Not Yet | Yes | No | ranger_enabled |
| Zeppelin | 0.11.0 | Not Yet | Yes | Yes | zeppelin_enabled, zeppelin_custom_name |
| Kafka | 2.8.1 | Not Yet | Yes | No | kafka_enabled |
| Grafana | 11.1.3 | Not Applicable | Yes | No | grafana_enabled |
| Prometheus | 2.53.1 | Not Applicable | Yes | No | promeheus_enabled |
| Loki | 3.1.0 | Not Applicable | Yes | No | loki_enabled |
| Iceberg | 1.5.2 | Yes | Yes | Yes | iceberg_enabled |
| Hudi           | 0.14.1  | Yes            | Yes      | No              | hudi_enabled                           |

**Note**:
- Most components respect `JAVA_HOME`, which is configured as JDK 8
- Spark is configured to use JDK 17
- Trino is configured to use JDK 21

## Prepare
This project uses [Ansible](https://www.ansible.com/) to render the Dockerfile, shell scripts, and configuration files from the templates. Please make sure you have installed it before building.
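If you do not plan to use pyenv (covered next), one common way to install Ansible directly (not specific to this repo) is:
```bash
# Install Ansible for the current user; any recent release should work, but check
# requirements.txt for the version this project actually expects.
python3 -m pip install --user ansible
```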
### (Optional, Recommended) Install pyenv
Since Ansible strongly depends on the Python environment, it is recommended to use [pyenv-virtualenv](https://github.com/pyenv/pyenv-virtualenv) to keep the Python environment independent and easy to manage.
Here we provide guides for macOS and CentOS users.
#### macOS
Install from Homebrew
```bash
brew install pyenv pyenv-virtualenv
```

Append the following to `~/.zshrc`, then run `source ~/.zshrc` or open a new terminal for the change to take effect.
```bash
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"
```

#### CentOS
Before installing, we need to install some required packages.
```bash
yum install gcc make patch zlib-devel bzip2 bzip2-devel readline-devel sqlite sqlite-devel openssl-devel tk-devel libffi-devel xz-devel
```

Then, install pyenv:
```bash
curl https://pyenv.run | bash
```

If you use `bash`, add the following to `~/.bash_profile` or `~/.bashrc`:
```bash
export PYENV_ROOT="$HOME/.pyenv"
[[ -d $PYENV_ROOT/bin ]] && export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
```

Then add the following to `~/.bashrc`:
```bash
eval "$(pyenv virtualenv-init -)"
```

Finally, source `~/.bash_profile` and `~/.bashrc`.
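That is, after editing the files:
```bash
source ~/.bash_profile
source ~/.bashrc
```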
### (Optional) Configure SSH
This step allows you to SSH into all the `hadoop-*` containers from your host, so that you can use Ansible to control them.
macOS should have `nc` pre-installed; on CentOS you can install it manually with the following command:
```bash
yum install epel-release && yum install -y nc
```

Then configure the `~/.ssh/config` file on your host:
```bash
Host hadoop-*
Hostname %h.orb.local
User root
Port 22
ForwardAgent yes
IdentityFile ~/.ssh/id_rsa_hadoop_testing
StrictHostKeyChecking no
ProxyCommand nc -x 127.0.0.1:18070 %h %p
```

**Note**: DO NOT forget to restrict the key file's permissions by running this command:
```bash
chmod 600 ~/.ssh/id_rsa_hadoop_testing
```

After all the containers have been launched, verify that Ansible can reach them with this command:
```bash
ansible-playbook test-ssh.yaml
```

It should print OS information for all nodes (including the host and the Hadoop-related containers).
If not, add the `-vvv` option to debug it.
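For example:
```bash
# Re-run the connectivity check with verbose output to see where it fails
ansible-playbook test-ssh.yaml -vvv
```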
### Use pyenv
Create virtualenv
```bash
pyenv install 3.11
pyenv virtualenv 3.11 hadoop-testing
```

Set it as the local virtualenv for this directory:
```bash
pyenv local hadoop-testing
```

Install packages into the isolated virtualenv:
```bash
pip install -r requirements.txt
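# If requirements.txt includes Ansible (an assumption; check the file), this should now work:
ansible --version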
```

## How to use
First, use Ansible to render the template files, including `download.sh`, `.env`, `compose.yaml`, `Dockerfile`, configuration files, etc.
```bash
ansible-playbook build.yaml
```

By default, all services have authentication disabled. You can enable Kerberos by passing the `kerberos_enabled` variable:
```bash
ansible-playbook build.yaml -e "{kerberos_enabled: true}"
```

Some components are disabled by default; you can enable them by passing the corresponding `*_enabled` variables:
```bash
ansible-playbook build.yaml -e "{jdk21_enabled: true, trino_enabled: true}"
```

Note: the full list of variables is defined in `host_vars/local.yaml`.
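Since Ansible's `-e` also accepts a file reference with the `@` prefix, a set of overrides can be kept in a local YAML file. The file name below is just an example; the variable names come from the component table above:
```bash
# Hypothetical overrides file; variables must match those defined in host_vars/local.yaml
cat > my-overrides.yaml <<'EOF'
kerberos_enabled: true
flink_enabled: true
EOF

ansible-playbook build.yaml -e @my-overrides.yaml
```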
If something goes wrong, consider adding the `-vvv` flag to debug the playbook:
```bash
ansible-playbook build.yaml -vvv
```

Download all required artifacts, which will be used to build the Docker images.
This script downloads a large number of artifacts; depending on your network bandwidth,
it may take a few minutes or even hours to complete. You can also download the artifacts
manually and put them into the `download` directory; the script won't download them again
if they already exist.

```bash
./download.sh
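# Optional: artifacts already present in ./download are skipped, so you can pre-fetch
# them yourself. The file name and URL below are illustrative; check download.sh for
# the exact ones the project uses.
#   curl -L -o download/spark-3.5.1-bin-hadoop3.tgz \
#     https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz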
```

Build the Docker images:
```bash
./build-image.sh
```

Run the testing playground:
```bash
docker compose up
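# To watch a specific container's startup progress from another terminal
# (the service name here is an assumption; check compose.yaml for the actual names):
#   docker compose logs -f hadoop-master1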
```

Note: depending on the performance of your hardware, some components may take dozens of seconds
after the containers launch to complete their initial work. Generally, the slowest step is creating
folders on HDFS; progress can be observed by monitoring the container logs.

## Access services
### Networks
#### Option 1: OrbStack (macOS only)
For macOS users, it's recommended to use [OrbStack](https://docs.orbstack.dev/) as the container runtime. OrbStack provides an out-of-box [container domain name resolving feature](https://docs.orbstack.dev/docker/domains) to allow accessing each container via `.orb.local`.
#### Option 2: Socks5 Proxy
For other platforms, or if you start the containers on a remote server, we provide a SOCKS5 proxy server in a container named `socks5`. It listens on port 18070, which is exposed to the Docker host by default; you can forward traffic through this proxy to access services running in the other containers.
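Command-line tools can be routed through the same proxy as well. A quick reachability check might look like the sketch below, assuming the HDFS NameNode UI endpoint listed under service endpoints:
```bash
# Delegate hostname resolution to the socks5 proxy on port 18070,
# matching the *.orb.local forwarding described for the browser setup.
curl --socks5-hostname 127.0.0.1:18070 http://hadoop-master1.orb.local:9870
```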
In a browser, for example, use [SwitchyOmega](https://github.com/FelisCatus/SwitchyOmega) to forward traffic for `*.orb.local` to `:18070`.
![img](docs/imgs/switchy-omega-1.png "step 1")
![img](docs/imgs/switchy-omega-2.png "step 2")
![img](docs/imgs/switchy-omega-3.png "step 3")

### Service endpoints
Once the testing environment is fully operational, the following services will be accessible:
- Grafana: http://grafana.orb.local:3000
- Prometheus: http://prometheus.orb.local:9090
- Kyuubi UI: http://hadoop-master1.orb.local:10099
- Spark History Server: http://hadoop-master1.orb.local:18080
- Flink History Server: http://hadoop-master1.orb.local:8082
- Hadoop HDFS: http://hadoop-master1.orb.local:9870
- Hadoop YARN: http://hadoop-master1.orb.local:8088
- Hadoop MapReduce JobHistory: http://hadoop-master1.orb.local:19888
- Ranger Admin: http://hadoop-master1.orb.local:6080 (admin/Ranger@admin123)
- Trino Web UI: http://hadoop-master1.orb.local:18081 (admin/)
- Zeppelin: http://hadoop-master1.orb.local:8081

![img](docs/imgs/namenode-ui.png)
## Roadmap
1. Add more components, such as LDAP, HBase, Superset etc.
2. Fully templatized. Leverage Ansible and Jinja2 to templatize the Dockerfiles, shell scripts, and configuration files, so that users can easily customize the testing environment by modifying the configurations, e.g. only enabling a subset of components, and changing the version of the components.
3. Provide user-friendly docs, with some basic tutorials and examples, e.g. how to create a customized testing environment, how to run some basic examples, how to add a new component, etc.