Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/hieuung/streaming-kafka
Using various data processing tools for a real-time data pipeline with Kafka
- Host: GitHub
- URL: https://github.com/hieuung/streaming-kafka
- Owner: hieuung
- Created: 2024-04-24T04:30:55.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-04-30T08:41:15.000Z (8 months ago)
- Last Synced: 2024-10-19T03:17:43.462Z (2 months ago)
- Topics: apache-beam, apache-flink, apache-spark, kafka, kafka-consumer, kafka-producer, spark-streaming, spark-streaming-kafka
- Language: Python
- Homepage:
- Size: 4.68 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
README
# STREAMING INTEGRATION WITH KAFKA
## Tech stack
- Kafka, Kafka-ui
- Apache Spark
- Apache Flink
- Apache Beam
- Docker Compose

## Implementation
### Set up Kafka.
```sh
docker-compose up -d kafka zookeeper kafka-ui
```
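Before moving on, the broker can also be probed from the host with a few lines of Python (a sketch only; it assumes the compose file maps the broker to `localhost:9092` and that `kafka-python` is installed):
```python
# Sketch: confirm the broker answers by listing its topics.
# Assumes the broker is published on localhost:9092 (check docker-compose.yml)
# and that `pip install kafka-python` has been run.
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
print(consumer.topics())  # set of topic names; empty on a fresh cluster
consumer.close()
```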
Verify the Kafka cluster on [Kafka-ui](http://localhost:8080).

### Publish messages.
```sh
docker-compose up -d producer
```
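The `producer` service is just a Python script writing to a topic. A minimal equivalent with `kafka-python` looks like this (a sketch only; the topic name is borrowed from the Beam example below, and the payload shape is an assumption):
```python
# Sketch of a producer like the compose `producer` service.
# Topic name and payload shape are assumptions, not the repo's exact code.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for i in range(100):
    producer.send("hieuung", {"id": i, "ts": time.time()})
    time.sleep(0.1)

producer.flush()
producer.close()
```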
Verify the messages on [Kafka-ui](http://localhost:8080).

### STREAMING WITH APACHE SPARK.
#### Set up a long-running Spark cluster.
```sh
docker-compose up -d spark-master spark-worker
```
---
> **_NOTE:_** If the Docker image build is too slow (caused by a slow download of the Spark distribution), download it manually first:
```sh
export SPARK_VERSION=3.0.2
export HADOOP_VERSION=3.2
export SPARK_HOME=/opt/spark
# fetch the Spark distribution used by the image build
curl -LO "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz"
```
---

#### Verify the Spark cluster on [Spark-ui](http://localhost:9090)
#### Access the master node to start the streaming job (transform and publish back to Kafka)
```sh
docker exec -it spark-master /bin/sh
```

#### Submit job
```sh
/opt/spark/bin/spark-submit --master spark://spark-master:7077 \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.2 \
  /opt/spark-apps/spark-consumer.py
```

#### Verify the Spark job on [Spark-ui](http://localhost:9090) and the messages on [Kafka-ui](http://localhost:8080)
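For reference, `spark-consumer.py` boils down to a Kafka-to-Kafka Structured Streaming job along these lines (a sketch; the topic names and the upper-casing transform are assumptions, and `kafka:9092` assumes the job runs inside the compose network):
```python
# Sketch of a Kafka-to-Kafka Structured Streaming job.
# Topic names and the upper-casing transform are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

spark = SparkSession.builder.appName("spark-consumer").getOrCreate()

# Read the raw stream from Kafka.
source = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "hieuung")
    .load())

# Toy transform: upper-case the message body.
transformed = source.select(
    col("key"),
    upper(col("value").cast("string")).alias("value"))

# Publish the transformed stream back to Kafka.
query = (transformed.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("topic", "hieuung-transformed")
    .option("checkpointLocation", "/tmp/spark-checkpoint")
    .start())

query.awaitTermination()
```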
### STREAMING WITH APACHE FLINK.
#### Set up a long-running Flink cluster.
##### Download essential `external_jars`
```
flink-sql-connector-kafka-3.1.0-1.18.jar
```
##### Build the pyflink Docker image
```sh
docker build --tag pyflink:latest ./pyflink
```
##### Start Flink cluster
```sh
docker-compose up -d taskmanager jobmanager
```
---

##### Verify the Flink cluster on [Flink-ui](http://localhost:8081)
#### Access the jobmanager node to start the streaming job.
```sh
docker exec -it streaming-kafka_jobmanager_1 /bin/sh
```

#### Submit job
```sh
./bin/flink run -py ./app/test-stream.py \
--jarfile ./app/external_jars/flink-sql-connector-kafka-3.1.0-1.18.jar
```

##### Verify the Flink job on [Flink-ui](http://localhost:8081) and the messages on [Kafka-ui](http://localhost:8080)
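For reference, a PyFlink Table API job wired to Kafka follows this shape (a sketch; the schema, topic, and group id are assumptions, and the `kafka` connector is provided by the `flink-sql-connector-kafka` jar passed via `--jarfile` above):
```python
# Sketch of a PyFlink Table API job reading from Kafka.
# Schema, topic, and group id are assumptions.
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

# The 'kafka' connector comes from flink-sql-connector-kafka-3.1.0-1.18.jar,
# supplied with --jarfile when the job is submitted.
t_env.execute_sql("""
    CREATE TABLE source (
        id INT,
        ts DOUBLE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'hieuung',
        'properties.bootstrap.servers' = 'kafka:9092',
        'properties.group.id' = 'flink-consumer',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Print incoming rows; a real job would transform and write to a sink.
t_env.execute_sql("SELECT * FROM source").print()
```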
### STREAMING WITH APACHE BEAM.
#### Set up a long-running Flink cluster (or Spark cluster), as above.
#### Verify the Flink cluster on [Flink-ui](http://localhost:8081)
#### Submit Beam job to cluster
```sh
python3 beam-consumer.py --runner FlinkRunner \
--bootstrap_servers localhost:9092 \
--topics hieuung \
--flink_master localhost:8081
```

#### Verify the Flink job on [Flink-ui](http://localhost:8081)
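The submitted script follows the standard Beam Kafka-read pattern; a minimal sketch of what `beam-consumer.py` might contain (flag names mirror the submit command above; the print step is an assumption). Note that `ReadFromKafka` is a cross-language transform, so Beam needs a Java expansion service, which it starts automatically when Docker is available:
```python
# Sketch of a Beam pipeline reading from Kafka on the Flink runner.
# Flag names mirror the submit command above; the print step is an assumption.
import argparse

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

parser = argparse.ArgumentParser()
parser.add_argument("--bootstrap_servers", default="localhost:9092")
parser.add_argument("--topics", default="hieuung")
args, beam_args = parser.parse_known_args()

# Remaining flags (--runner, --flink_master) are passed through to Beam.
options = PipelineOptions(beam_args)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadKafka" >> ReadFromKafka(
           consumer_config={"bootstrap.servers": args.bootstrap_servers},
           topics=[args.topics])
     | "Print" >> beam.Map(print))
```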