Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dhzdhd/hadoop-docker-compose
- Host: GitHub
- URL: https://github.com/dhzdhd/hadoop-docker-compose
- Owner: dhzdhd
- Created: 2024-02-06T16:13:17.000Z (11 months ago)
- Default Branch: master
- Last Pushed: 2024-04-08T09:58:07.000Z (9 months ago)
- Last Synced: 2024-04-08T11:01:47.373Z (9 months ago)
- Language: Dockerfile
- Size: 61.5 KB
- Stars: 4
- Watchers: 1
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Hadoop Docker setup
## About
A Docker Compose setup for the following tools:
- Hadoop
- Pig
- HBase
- Hive
- Spark
- ZooKeeper (started by `init-extra`)
- Mahout (started by `init-extra`)
- Kafka (started by `init-extra`)

The setup currently consists of only one master node, which is enough to run programs.
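For orientation, a single-master stack like this might be wired up roughly as below. This is a hypothetical sketch, not the repo's actual compose file: the service name, build context, and volume path are assumptions, while the ports and the `/workdir` mapping are taken from the notes later in this README.

```yaml
# Hypothetical sketch of a single-master compose file for this kind of stack.
services:
  master:                   # the single master node described above
    build: .                # assumption: image built from the repo's Dockerfile
    container_name: master
    hostname: master        # matches the hdfs://master:9000 URL used below
    ports:
      - "8088:8088"         # YARN ResourceManager web UI
      - "9870:9870"         # HDFS NameNode web UI
      - "9000:9000"         # HDFS NameNode RPC (not a web endpoint)
    volumes:
      - ./workdir:/workdir  # /workdir in the container is mapped to the host
```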
## Credit
Derived from the [docker-hadoop repo](https://github.com/silicoflare/docker-hadoop).
## Setup
- Windows
1. Install [Docker Desktop](https://docs.docker.com/desktop/install/windows-install/).
2. Run Docker Desktop to start the Docker engine
3. Clone the [repository](https://github.com/dhzdhd/hadoop-docker-compose.git) locally (`git clone https://github.com/dhzdhd/hadoop-docker-compose.git`)
4. `cd hadoop-docker-compose`
5. If you are using the terminal
1. Run `./run.ps1` in the repository root directory.
2. Run `init`
6. If you want to use the Docker Desktop UI instead
1. Run `exit`
2. Open the Docker Desktop UI and start the container (run button)
3. Click on the container named `master`
4. Head on to the `Exec` tab
5. Run the following in the terminal displayed
- `bash`
- `init`
7. Run a Hadoop command (for example, `hadoop version`) to verify the build succeeded.
- Linux
1. Install [Docker Engine](https://docs.docker.com/engine/install/).
2. Follow only the [Configure Docker to start on boot with systemd](https://docs.docker.com/engine/install/linux-postinstall/#configure-docker-to-start-on-boot-with-systemd) step of the post-installation guide.
3. Clone the [repository](https://github.com/dhzdhd/hadoop-docker-compose.git) locally (`git clone https://github.com/dhzdhd/hadoop-docker-compose.git`)
4. `cd hadoop-docker-compose`
5. Run `chmod 777 ./run.sh` in the repository root directory.
6. Run `./run.sh`
7. Run `init`
8. Run a Hadoop command (for example, `hadoop version`) to verify the build succeeded.
- Read the `Post install` section below (a condensed command sketch follows this list).
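Distilled from the steps above, the whole terminal flow looks roughly like this (bash shown; on Windows, run `./run.ps1` from PowerShell instead of `./run.sh`; `hadoop version` is just one possible sanity check):

```sh
git clone https://github.com/dhzdhd/hadoop-docker-compose.git
cd hadoop-docker-compose
chmod 777 ./run.sh   # Linux only: make the script executable
./run.sh             # build/start the container (./run.ps1 on Windows)
init                 # run inside the container shell to start the Hadoop services
hadoop version       # sanity check that the tools are available
```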
## Post install
1. `cd hadoop-docker-compose`
2. Run `./run.ps1` or `./run.sh` depending on your platform.
3. Run `init`.

## Setup Extra (only if you need Mahout & Kafka)
- Run the above steps but use `init-extra` instead of `init`
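A minimal sketch of that variation, assuming the same scripts as above:

```sh
./run.sh     # or ./run.ps1 on Windows
init-extra   # like `init`, but also brings up ZooKeeper, Mahout, and Kafka
```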
## Note
### Data loss
- The current configuration does not persist any data stored on HDFS/HBase. Back up everything you create to the `/workdir` directory, which is mapped directly to your host system.
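For example, anything produced on HDFS can be pulled down into `/workdir` before you exit; the HDFS path here is a hypothetical example:

```sh
# /user/root/output is a hypothetical HDFS path; substitute your own
# /workdir is mapped to the host, so the copy survives the container
hadoop fs -get /user/root/output /workdir/output-backup
```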
### Hive MetastoreSessionClient error
- Hive should always be started from the `/` directory (root)
- When you enter the container, the default directory is the root directory
- Starting Hive from any other directory causes the metastore to get corrupted; see the sketch below.
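A minimal sketch of the safe way to launch Hive:

```sh
cd /    # Hive must be launched from the root directory
hive    # start the Hive CLI from /
```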
### Any sort of Hadoop error (MapReduce/Hive/HBase)
- Exit the container (`exit`), rerun `./run.ps1` or `./run.sh`, and run `init`.
### Pig takes way too much time and produces errors in MapReduce mode
- Start the job history server with `mapred historyserver &` (press Enter to get your prompt back once the logs stop)
- Run Pig in MapReduce mode as normal; see the sketch below.
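The two commands together:

```sh
mapred historyserver &   # start the MapReduce job history server in the background
pig -x mapreduce         # then run Pig in MapReduce mode as usual
```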
### `./run.ps1` not recognised as a command (Windows only)
- This error occurs when you run the script from Command Prompt
- run.ps1 is a PowerShell script and hence must be run with PowerShell; see the sketch below.
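If you are stuck in a cmd.exe session, you can hand the script to PowerShell explicitly:

```powershell
# from cmd.exe, run the script via PowerShell instead of invoking it directly
powershell -File .\run.ps1
```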
### run.ps1 cannot be loaded because running scripts is disabled on this system (Windows only)
- Enable running foreign scripts in `Developer Settings` in the Settings application.
- Refer to online documentation for more information, or relax the PowerShell execution policy as sketched below.
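A common fix (general Windows advice, not specific to this repo) is to allow locally created scripts for your user account from a PowerShell session:

```powershell
# permit locally written scripts (and signed remote ones) for the current user
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```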
### Hadoop errors out after `restart`
- `restart` has been deprecated
- Simply run `exit`, then `./run.ps1` or `./run.sh`, and `init`

### Error on container build (download step)
- Your workplace/institute has probably blocked a link used to download one of the Hadoop tools.
- Connect to your own internet connection/hotspot to continue building.

### Accessing your files for backup/host usage
- The `/workdir` directory in the container is linked to `/workdir` on the host
- Any files you want to access on the host have to be saved/moved to `/workdir`

### Accessing HDFS through URL
- You can access HDFS through the URL `hdfs://master:9000`
- For example
- `hadoop fs -ls /` is equivalent to `hadoop fs -ls hdfs://master:9000/`

### Accessing the web UI
- Access the web interfaces at `localhost:8088` (YARN ResourceManager) and `localhost:9870` (HDFS NameNode)
- Port `9000` is also exposed to the host, but it is the HDFS RPC endpoint and is not meant to be accessed from a browser