https://github.com/ethicalml/kafka-spark-streaming-zeppelin-docker

One click deploy docker-compose with Kafka, Spark Streaming, Zeppelin UI and Monitoring (Grafana + Kafka Manager)

Host: GitHub
URL: https://github.com/ethicalml/kafka-spark-streaming-zeppelin-docker
Owner: EthicalML
License: mit
Created: 2019-04-29T13:58:08.000Z (about 6 years ago)
Default Branch: master
Last Pushed: 2021-07-20T21:44:22.000Z (almost 4 years ago)
Last Synced: 2025-03-23T17:12:29.575Z (4 months ago)
Topics: docker, docker-compose, kafka, kafka-spark, kafka-spark-streaming, kafka-zeppelin, spark, spark-kafka, spark-streaming-kafka, spark-zeppelin, streaming, zeppelin
Homepage:
Size: 1020 KB
Stars: 120
Watchers: 10
Forks: 74
Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE

README

![GitHub](https://img.shields.io/badge/Release-PROD-green.svg)

![GitHub](https://img.shields.io/badge/Version-0.0.1-lightgrey.svg)

![GitHub](https://img.shields.io/badge/License-MIT-blue.svg)

# One Click Deploy: Kafka Spark Streaming with Zeppelin UI

This repository contains a docker-compose stack with Kafka and Spark Streaming, plus monitoring through Kafka Manager and a Grafana dashboard. The networking is set up so that the Kafka brokers can be accessed from the host.

It also comes with a producer-consumer example using a small subset of the [US Census adult income prediction dataset](https://www.kaggle.com/johnolafenwa/us-census-data).

## High-level features

* Monitoring with Grafana
* Zeppelin UI
* Kafka access from host
* Multiple Spark interpreters

## Detail Summary

| Container | Image | Tag | Accessible at |
|-|-|-|-|
| zookeeper | wurstmeister/zookeeper | latest | 172.25.0.11:2181 |
| kafka1 | wurstmeister/kafka | 2.12-2.2.0 | 172.25.0.12:9092 (port 8080 for JMX metrics) |
| kafka2 | wurstmeister/kafka | 2.12-2.2.0 | 172.25.0.13:9092 (port 8080 for JMX metrics) |
| kafka_manager | hlebalbau/kafka_manager | 1.3.3.18 | 172.25.0.14:9000 |
| prometheus | prom/prometheus | v2.8.1 | 172.25.0.15:9090 |
| grafana | grafana/grafana | 6.1.1 | 172.25.0.16:3000 |
| zeppelin | apache/zeppelin | 0.8.1 | 172.25.0.19:8080 |

# Quickstart

The easiest way to understand the setup is by diving into it and interacting with it.

## Running Docker Compose

To start the stack, run the following command in the repository folder:

```
docker-compose up -d
```

This runs the containers detached. To follow the logs, run:

```
docker-compose logs -f -t --tail=10
```

To see the memory and CPU usage (which comes in handy to ensure docker has enough memory) use:

```
docker stats
```
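Once the containers are up, it can help to confirm that each service actually accepts connections. The following is a minimal sketch, assuming the host can route to the 172.25.0.x addresses listed in the Detail Summary table; the function names and the service labels in the dictionary are illustrative and not part of the repository.

```python
import socket

# Addresses taken from the Detail Summary table; the dictionary keys are
# illustrative labels only.
SERVICES = {
    "zookeeper": ("172.25.0.11", 2181),
    "kafka-broker-1": ("172.25.0.12", 9092),
    "kafka-broker-2": ("172.25.0.13", 9092),
    "kafka-manager": ("172.25.0.14", 9000),
    "prometheus": ("172.25.0.15", 9090),
    "grafana": ("172.25.0.16", 3000),
    "zeppelin": ("172.25.0.19", 8080),
}


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services=SERVICES, timeout=2.0):
    """Map each service label to whether its port accepts connections."""
    return {
        name: is_port_open(host, port, timeout)
        for name, (host, port) in services.items()
    }


# Example usage once the stack is running:
#   for name, ok in check_services().items():
#       print(name, "up" if ok else "DOWN")
```

An open port only shows that the container is listening; it does not prove the service inside has finished initialising, so the logs remain the better source when something looks wrong.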

## Accessing the notebook

You can access the default notebook by going to http://172.25.0.19:8080/#/notebook/2EAB941ZD. Now we can start running the cells.

### 1) Setup

#### Install the kafka-python dependency

![](images/zeppelin-1.jpg)

### 2) Producer

The notebook has an interpreter called `%producer.pyspark`, which lets the producer run in parallel with the consumer.

#### Load our example dummy dataset

We have made available a 1000-row version of the [US Census adult income prediction dataset](https://www.kaggle.com/johnolafenwa/us-census-data).

![](images/zeppelin-2.jpg)

#### Start the stream of rows

We now repeatedly take one row at random and send it using our kafka-python producer. The topic will be created automatically if it doesn't exist (given that `auto.create.topics.enable` is set to true).

![](images/zeppelin-3.jpg)
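For readers without the notebook open, the producer step could look roughly like the sketch below. It is not the notebook's actual cell: the function names, the JSON serialisation, and the `n_messages` and `delay_s` parameters are assumptions made for illustration. The topic name `default_topic` and the broker addresses come from elsewhere in this README.

```python
import json
import random
import time


def stream_random_rows(producer, rows, topic="default_topic",
                       n_messages=10, delay_s=0.0):
    """Send n_messages randomly chosen rows to topic, one message per row.

    `producer` is any object exposing kafka-python style send() and flush().
    """
    for _ in range(n_messages):
        row = random.choice(rows)
        # Assumption: rows are serialised as UTF-8 encoded JSON.
        producer.send(topic, value=json.dumps(row).encode("utf-8"))
        time.sleep(delay_s)
    # Block until all buffered messages have been handed to the brokers.
    producer.flush()


def make_producer(bootstrap_servers=("172.25.0.12:9092", "172.25.0.13:9092")):
    """Build a kafka-python producer pointed at the two brokers in the stack."""
    # Imported lazily so the helper above can be used without the package;
    # requires `pip install kafka-python`.
    from kafka import KafkaProducer
    return KafkaProducer(bootstrap_servers=list(bootstrap_servers))


# Example usage with rows loaded as a list of dicts:
#   stream_random_rows(make_producer(), rows, n_messages=100, delay_s=0.5)
```

Passing the producer in as an argument keeps the sending loop independent of the Kafka connection, which makes it easy to try out against a stub before pointing it at the real brokers.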

### 3) Consumer

We now use the %consumer.pyspark interpreter to run our pyspark job in parallel to the producer.

#### Connect to the stream and print

Now we can run the spark stream job to connect to the topic and listen to data. The job will listen for windows of 2 seconds and will print the ID and "label" for all the rows within that window.

![](images/zeppelin-4.jpg)
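As a rough illustration of the consumer step, the sketch below uses the DStream API with a 2-second batch interval. It is an assumption-laden outline rather than the notebook's actual code: it presumes Spark 2.x with the `spark-streaming-kafka-0-8` package on the classpath, JSON-encoded messages, and row fields named `id` and `label`. The function names are hypothetical.

```python
import json


def extract_id_and_label(message_value):
    """Parse one Kafka message value (JSON text or bytes) into (id, label)."""
    if isinstance(message_value, bytes):
        message_value = message_value.decode("utf-8")
    row = json.loads(message_value)
    # "id" and "label" are assumed field names; missing fields become None.
    return row.get("id"), row.get("label")


def start_stream(sc, topic="default_topic",
                 brokers="172.25.0.12:9092,172.25.0.13:9092",
                 batch_seconds=2):
    """Print (id, label) for the rows received in each batch."""
    # Lazy imports: pyspark.streaming.kafka exists in Spark 2.x only and
    # needs the spark-streaming-kafka-0-8 package.
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    ssc = StreamingContext(sc, batch_seconds)
    stream = KafkaUtils.createDirectStream(
        ssc, [topic], {"metadata.broker.list": brokers}
    )
    # Each record is a (key, value) tuple; only the value carries the row.
    stream.map(lambda kv: extract_id_and_label(kv[1])).pprint()
    ssc.start()
    return ssc


# Example usage inside a pyspark interpreter, where `sc` already exists:
#   ssc = start_stream(sc)
#   ssc.awaitTerminationOrTimeout(30)
#   ssc.stop(stopSparkContext=False)
```

Keeping the parsing in a plain function means it can be checked on sample messages before any streaming job is started.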

### 4) Monitor Kafka

We can now use Kafka Manager to inspect the current Kafka setup.

#### Setup Kafka Manager

Kafka Manager needs to be configured before first use. To do this, open http://172.25.0.14:9000/addCluster and fill in the following two fields:

* Cluster name: Kafka

* Zookeeper hosts: 172.25.0.11:2181

Optionally, you can also tick the following:

* Enable JMX Polling

* Poll consumer information

#### Access the topic information

If your cluster was named "Kafka", then you can go to http://172.25.0.14:9000/clusters/Kafka/topics/default_topic, where you will be able to see the partition offsets. Given that the topic was created automatically, it will have only 1 partition.

![](images/zeppelin-4.jpg)
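The same partition and offset information can also be read from the host with kafka-python, which is a quick way to confirm that messages are arriving. This is a minimal sketch: `end_offsets_by_partition` is a hypothetical helper name, and the `namedtuple` fallback exists only so the helper can be read and exercised without kafka-python installed.

```python
from collections import namedtuple

try:
    from kafka import TopicPartition
except ImportError:
    # Stand-in with the same fields, used only when kafka-python is absent.
    TopicPartition = namedtuple("TopicPartition", ["topic", "partition"])


def end_offsets_by_partition(consumer, topic="default_topic"):
    """Return {partition_id: end_offset} for topic, or {} if it is unknown."""
    partitions = consumer.partitions_for_topic(topic)
    if not partitions:
        return {}
    tps = [TopicPartition(topic, p) for p in sorted(partitions)]
    offsets = consumer.end_offsets(tps)
    return {tp.partition: offset for tp, offset in offsets.items()}


# Example usage against the running stack:
#   from kafka import KafkaConsumer
#   consumer = KafkaConsumer(bootstrap_servers=["172.25.0.12:9092"])
#   print(end_offsets_by_partition(consumer))
```

For an automatically created topic the result should contain a single key, `0`, matching the one partition shown in Kafka Manager.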

#### Visualise metrics in Grafana

Finally, you can access the default kafka dashboard in Grafana (username is "admin" and password is "password") by going to http://172.25.0.16:3000/d/xyAGlzgWz/kafka?orgId=1

![](images/grafanakafka.jpg)