Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Docker image with Hadoop cluster
- Host: GitHub
- URL: https://github.com/mobiletelesystems/hadoop-docker
- Owner: MobileTeleSystems
- License: apache-2.0
- Created: 2023-04-21T13:27:10.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-11-14T08:47:10.000Z (3 months ago)
- Last Synced: 2024-11-14T09:34:39.490Z (3 months ago)
- Topics: docker-compose-template, docker-image, hadoop
- Language: Shell
- Homepage:
- Size: 114 KB
- Stars: 0
- Watchers: 7
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.txt
README
# [Hadoop docker image](https://github.com/MobileTeleSystems/hadoop-docker)
[![Repo Status](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![Build Status](https://github.com/MobileTeleSystems/hadoop-docker/workflows/Test%20build/badge.svg)](https://github.com/MobileTeleSystems/hadoop-docker/actions)
[![Docker Pulls](https://img.shields.io/docker/pulls/mtsrus/hadoop)](https://hub.docker.com/r/mtsrus/hadoop)
**Test purpose only!**
## HDFS
All-in-one HDFS container with:
* HDFS namenode
* HDFS secondary namenode
* HDFS datanode
### Versions
* `mtsrus/hadoop:hadoop2.7.3-hdfs`
* `mtsrus/hadoop:hadoop2-hdfs` - same as above
* `mtsrus/hadoop:hadoop3.3.6-hdfs`
* `mtsrus/hadoop:hadoop3-hdfs` - same as above
### Prerequisites
Minimal resources to start with are:
* 200m CPU
* 700Mb RAM
* 1Gb storage
### Examples
See [docker-compose.yml](hdfs/docker-compose.yml).
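If you just want a quick standalone container instead of the full compose stack, a minimal `docker run` sketch (the container name is illustrative; the image tag comes from the Versions list above, the ports are described in the Port numbers section below):

```bash
# All-in-one HDFS container: namenode, secondary namenode and datanode.
# 9820 is HDFS IPC, 9870 is WebHDFS.
docker run -d \
  --name hadoop-hdfs \
  -p 9820:9820 \
  -p 9870:9870 \
  mtsrus/hadoop:hadoop3-hdfs
```

Once the container reports its services as alive, WebHDFS should be reachable on port 9870.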
### Port numbers
NOTE: Hadoop 2 image uses the same port numbers as Hadoop 3:
- `9820:9820` - HDFS IPC
- `9870:9870` - WebHDFS
### Configuration
#### `/var/hadoop/conf/*.xml` files
##### Defaults
* [/var/hadoop/conf/core-site.xml](hdfs/conf/hadoop/core-site.xml)
* [/var/hadoop/conf/hdfs-site.xml](hdfs/conf/hadoop/hdfs-site.xml)
You can mount custom config files into the `/var/hadoop/conf` directory inside the container to override the default Hadoop configuration.
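For example, a sketch of overriding the defaults with your own files (the host-side `./conf` directory is illustrative):

```bash
# Mount custom core-site.xml / hdfs-site.xml over the defaults inside the container.
docker run -d \
  --name hadoop-hdfs \
  -v "$(pwd)/conf/core-site.xml:/var/hadoop/conf/core-site.xml" \
  -v "$(pwd)/conf/hdfs-site.xml:/var/hadoop/conf/hdfs-site.xml" \
  mtsrus/hadoop:hadoop3-hdfs
```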
##### Substitutions
The following substitutions are replaced with proper values:
* `{{hostname}}` - current hostname
#### Container env variables
* `WAIT_TIMEOUT_SECONDS=120` - timeout in seconds after starting each service to check if it is alive
#### `/var/hadoop/conf/hadoop-env.sh` environment variables
##### Defaults
* `export HADOOP_HEAPSIZE=512` - max JVM memory in megabytes, applied to all Hadoop components (unless overridden)
If the container fails with `OutOfMemory`, increase this value, e.g. to `1024` or `2048`.
##### Variables per service
* `export HADOOP_NAMENODE_OPTS=-Xmx2048m` - max JVM memory for Namenode
* `export HADOOP_SECONDARYNAMENODE_OPTS=-Xmx2048m` - max JVM memory for Secondary Namenode
* `export HADOOP_DATANODE_OPTS=-Xmx1024m` - max JVM memory for Datanode
See https://hadoop.apache.org/docs/r2.7.6/hadoop-project-dist/hadoop-common/ClusterSetup.html#Configuring_Environment_of_Hadoop_Daemons
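One possible way to apply these settings is to mount a customized `hadoop-env.sh`, in the same way as the XML configs above. This is only a sketch: you may prefer to start from the file shipped in the image rather than replacing it wholesale.

```bash
# Write a hadoop-env.sh fragment raising the global heap limit and the Namenode heap.
# NOTE: replacing the whole file is an assumption; merge with the image's original if needed.
cat > hadoop-env.sh <<'EOF'
export HADOOP_HEAPSIZE=1024
export HADOOP_NAMENODE_OPTS=-Xmx2048m
EOF

# Mount it over the default file inside the container.
docker run -d \
  --name hadoop-hdfs \
  -v "$(pwd)/hadoop-env.sh:/var/hadoop/conf/hadoop-env.sh" \
  mtsrus/hadoop:hadoop3-hdfs
```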
## Yarn
All-in-one Yarn container with:
* HDFS namenode
* HDFS secondary namenode
* HDFS datanode
* Yarn ResourceManager
* Yarn NodeManager
* MapReduce JobHistory server (if `WITH_JOBHISTORY_SERVER=true`)
### Versions
* `mtsrus/hadoop:hadoop2.7.3-yarn`
* `mtsrus/hadoop:hadoop2-yarn` - same as above
* `mtsrus/hadoop:hadoop3.3.6-yarn`
* `mtsrus/hadoop:hadoop3-yarn` - same as above
### Prerequisites
Minimal resources to start with are:
* 400m CPU
* 1Gb RAM
* 1Gb storage
### Examples
See [docker-compose.yml](yarn/docker-compose.yml).
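As with the HDFS image, a minimal standalone `docker run` sketch (the container name is illustrative; the ports and the `WITH_JOBHISTORY_SERVER` variable are documented below):

```bash
# All-in-one Yarn container with the MapReduce JobHistory server enabled.
docker run -d \
  --name hadoop-yarn \
  -e WITH_JOBHISTORY_SERVER=true \
  -p 9820:9820 -p 9870:9870 \
  -p 8042:8042 -p 8088:8088 \
  -p 10020:10020 -p 19888:19888 \
  mtsrus/hadoop:hadoop3-yarn
```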
### Port numbers
NOTE: Hadoop 2 image uses the same port numbers as Hadoop 3:
- `9820:9820` - HDFS IPC
- `9870:9870` - HDFS WebHDFS
- `8042:8042` - NodeManager UI
- `8088:8088` - Yarn UI
If `WITH_JOBHISTORY_SERVER=true`:
- `10020:10020` - MapReduce JobServer
- `19888:19888` - MapReduce JobServer History
### Configuration
#### `/var/hadoop/conf/*.xml` files
##### Default
* [/var/hadoop/conf/core-site.xml](hdfs/conf/hadoop/core-site.xml)
* [/var/hadoop/conf/hdfs-site.xml](hdfs/conf/hadoop/hdfs-site.xml)
* [/var/hadoop/conf/yarn-site.xml](yarn/conf/hadoop/yarn-site.xml)
* [/var/hadoop/conf/capacity-scheduler.xml](yarn/conf/hadoop/capacity-scheduler.xml)
* [/var/hadoop/conf/mapred-site.xml](yarn/conf/hadoop/mapred-site.xml)
You can mount custom config files into the `/var/hadoop/conf` directory inside the container to override the default Hadoop configuration.
##### Substitutions
The following substitutions are replaced with proper values:
* `{{hostname}}` - current hostname
#### Container env variables
* `WAIT_TIMEOUT_SECONDS=120` - timeout in seconds after starting each service to check if it is alive
* `WITH_JOBHISTORY_SERVER=false` - set to `true` to start the MapReduce JobHistory server
#### `/var/hadoop/conf/hadoop-env.sh` environment variables
See HDFS image documentation.
#### `/var/hadoop/conf/yarn-env.sh` environment variables
* `export YARN_RESOURCEMANAGER_OPTS=-Xmx1024m` - max JVM memory for Yarn ResourceManager
* `export YARN_NODEMANAGER_OPTS=-Xmx1024m` - max JVM memory for NodeManager
* `export HADOOP_JOB_HISTORYSERVER_OPTS=-Xmx1024m` - max JVM memory for MapReduce JobHistory server
See https://hadoop.apache.org/docs/r2.7.6/hadoop-project-dist/hadoop-common/ClusterSetup.html#Configuring_Environment_of_Hadoop_Daemons
## Hive
All-in-one Hive container with:
* HDFS namenode
* HDFS secondary namenode
* HDFS datanode
* Yarn ResourceManager
* Yarn NodeManager
* MapReduce JobHistory server
* Hive server
* Hive Metastore server
### Versions
* `mtsrus/hadoop:hadoop2.7.3-hive2.3.10`
* `mtsrus/hadoop:hadoop2-hive` - same as above
* `mtsrus/hadoop:hadoop3.3.6-hive3.1.3`
* `mtsrus/hadoop:hadoop3-hive` - same as above
### Prerequisites
Minimal resources to start with are:
* 500m CPU
* 2Gb RAM
* 1Gb storage
* A running RDBMS (e.g. Postgres) instance for the Metastore
### Examples
See [docker-compose.yml](hive/docker-compose.yml).
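Since the Metastore needs a running RDBMS, a standalone sketch has to bring one along; here a throwaway Postgres container on a shared network (the network, container names and Postgres tag are illustrative):

```bash
# Postgres for the Hive Metastore, matching the default connection settings
# documented below (database "metastore", user/password "hive", host "postgres").
docker network create hadoop
docker run -d --name postgres --network hadoop \
  -e POSTGRES_DB=metastore \
  -e POSTGRES_USER=hive \
  -e POSTGRES_PASSWORD=hive \
  postgres:15

# All-in-one Hive container; no HIVE_METASTORE_DB_* overrides are needed
# because the defaults already point at the container above.
docker run -d --name hadoop-hive --network hadoop \
  -p 9820:9820 -p 9870:9870 \
  -p 10000:10000 -p 10002:10002 -p 9083:9083 \
  mtsrus/hadoop:hadoop3-hive
```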
### Port numbers
NOTE: Hadoop 2 image uses the same port numbers as Hadoop 3:
- `9820:9820` - HDFS IPC
- `9870:9870` - HDFS WebHDFS
If `WITH_HIVE_SERVER=true`:
- `8042:8042` - NodeManager UI
- `8088:8088` - Yarn UI
- `19888:19888` - MapReduce JobServer History
- `10000:10000` - Hive server
- `10002:10002` - Hive Admin UI
If `WITH_HIVE_METASTORE_SERVER=true`:
- `9083:9083` - Hive Metastore server
### Configuration
#### `/var/hive/conf/*.xml` and `/var/hadoop/conf/*.xml` files
##### Defaults
* [/var/hadoop/conf/hive-site.xml](hive/conf/hive/hive-site.xml)
You can mount custom config files into the `/var/hive/conf` directory inside the container to override the default Hive configuration.
HDFS and Yarn configs can still be mounted into the `/var/hadoop/conf` directory.
##### Substitutions
The following substitutions are replaced with proper values:
* `{{hostname}}` - current hostname
* `{{HIVE_METASTORE_DB_URL}}` - `HIVE_METASTORE_DB_URL` env variable (default `jdbc:postgresql://postgres:5432/metastore`)
* `{{HIVE_METASTORE_DB_DRIVER}}` - `HIVE_METASTORE_DB_DRIVER` env variable (default `org.postgresql.Driver`)
* `{{HIVE_METASTORE_DB_USER}}` - `HIVE_METASTORE_DB_USER` env variable (default `hive`)
* `{{HIVE_METASTORE_DB_PASSWORD}}` - `HIVE_METASTORE_DB_PASSWORD` env variable (default `hive`)
#### Metastore database
Hive stores metadata in `{{HIVE_METASTORE_DB_URL}}` using driver from `{{HIVE_METASTORE_DB_DRIVER}}`. By default, Postgres is used.
You can change URL components by setting environment variables mentioned above, or replace the entire URL by updating the `/var/hive/conf/hive-site.xml` file.
You can also use any other supported RDBMS, like MySQL, by changing the connection URL and embedding/mounting the JDBC driver at the `/opt/hive/lib/drivername.jar` path inside the container. The Postgres JDBC driver is already embedded in the image.
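For instance, a sketch of switching the Metastore to MySQL; the JDBC URL, driver class and jar file name are illustrative and depend on the MySQL host and the Connector/J version you download:

```bash
# Point the Metastore at MySQL and mount the JDBC driver into the Hive lib directory.
docker run -d --name hadoop-hive \
  -e HIVE_METASTORE_DB_URL=jdbc:mysql://mysql:3306/metastore \
  -e HIVE_METASTORE_DB_DRIVER=com.mysql.cj.jdbc.Driver \
  -e HIVE_METASTORE_DB_USER=hive \
  -e HIVE_METASTORE_DB_PASSWORD=hive \
  -v "$(pwd)/mysql-connector-j.jar:/opt/hive/lib/mysql-connector-j.jar" \
  mtsrus/hadoop:hadoop3-hive
```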
#### Container env variables
* `WAIT_TIMEOUT_SECONDS=120` - timeout in seconds after starting each service to check if it is alive
* `WITH_HIVE_SERVER=true` - set to `false` to disable Hive server
* `WITH_HIVE_METASTORE_SERVER=true` - set to `false` to disable Hive metastore server
#### `/var/hadoop/conf/hadoop-env.sh` environment variables
See HDFS image documentation.
#### `/var/hadoop/conf/yarn-env.sh` environment variables
See Yarn image documentation.
#### `/var/hive/conf/hive-env.sh` environment variables
* `export HIVE_SERVER2_HEAPSIZE=256` - max JVM memory in megabytes for Hive server
* `export HIVE_METASTORE_HEAPSIZE=256` - max JVM memory in megabytes for Hive metastore server
See https://www.alibabacloud.com/help/en/emr/emr-on-ecs/user-guide/modify-the-memory-parameters-of-hive
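A sketch of raising these limits via a mounted `hive-env.sh`, assuming overrides are picked up the same way as for the Hadoop configs (host paths are illustrative; database settings are omitted, see the Metastore database section above):

```bash
# hive-env.sh fragment raising both heap limits, mounted over the default file.
# NOTE: replacing the whole file is an assumption; merge with the image's original if needed.
cat > hive-env.sh <<'EOF'
export HIVE_SERVER2_HEAPSIZE=512
export HIVE_METASTORE_HEAPSIZE=512
EOF

docker run -d --name hadoop-hive \
  -v "$(pwd)/hive-env.sh:/var/hive/conf/hive-env.sh" \
  mtsrus/hadoop:hadoop3-hive
```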