https://github.com/phaniteja5789/real-time-data-processing-pipeline-development
This project performs analytics on streaming data.
- Host: GitHub
- URL: https://github.com/phaniteja5789/real-time-data-processing-pipeline-development
- Owner: phaniteja5789
- Created: 2022-02-07T15:27:56.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2022-02-07T16:34:58.000Z (almost 4 years ago)
- Last Synced: 2025-02-17T00:44:26.187Z (11 months ago)
- Topics: kafka-producer-consumer, kafka-streams, pyspark-python, python3
- Language: Python
- Homepage:
- Size: 9.81 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# StreamingAnalytics
This project performs analytics on streaming data.
*Flow diagram of the project (image available in the repository)*
**DataSimulator.py**

This Python script generates JSON messages and appends them to a file named TemperatureRecorded.txt.

**Execution command:**

```bash
python DataSimulator.py 100
```

The command takes two command-line arguments in total:

- `argv[0]` = the script name
- `argv[1]` = the total number of JSON messages to generate

Once the command has run, it produces a text file named TemperatureRecorded.txt in the current working directory.
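The README does not inline the script's source. A minimal sketch consistent with its description might look like the following; the JSON field names and value ranges are assumptions, not taken from the repository:

```python
import json
import random
import sys
import time

def main():
    # argv[1] = total number of JSON messages to generate
    count = int(sys.argv[1])
    with open("TemperatureRecorded.txt", "a") as out:
        for _ in range(count):
            message = {
                "sensor_id": random.randint(1, 10),                   # assumed field
                "temperature": round(random.uniform(15.0, 45.0), 2),  # assumed field
                "timestamp": time.time(),                             # assumed field
            }
            out.write(json.dumps(message) + "\n")

if __name__ == "__main__":
    main()
```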
**The data now needs to be sent to Kafka.**
**List the active topics in the Kafka cluster:**

```bash
bin/kafka-topics.sh --list --zookeeper localhost:2181
```

ZooKeeper is running on port 2181.
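For a quick check from Python, the broker's topic list can also be read with the kafka-python client (kafka-python is not part of this repository; this is only an assumed convenience):

```python
from kafka import KafkaConsumer

# Talk to the Kafka broker directly (port 9092), not ZooKeeper
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
print(consumer.topics())  # set of topic names known to the cluster
consumer.close()
```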
**Create a topic named "SensorAnalytics":**

```bash
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 2 --topic SensorAnalytics
```

- Topic name: SensorAnalytics
- Replication factor: 1 (each partition has a single replica)
- Partitions: 2 (the topic is split across two partitions)
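The same topic could also be created programmatically with kafka-python's admin client; this is a hypothetical alternative, not what the README uses:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Mirror the CLI flags: 2 partitions, replication factor 1
admin.create_topics([
    NewTopic(name="SensorAnalytics", num_partitions=2, replication_factor=1)
])
admin.close()
```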
**Produce the data into the topic with the command below:**

```bash
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic SensorAnalytics < TemperatureRecorded.txt
```

The data is now stored inside the Kafka cluster under its logical storage unit, the topic.
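Equivalently, the file could be replayed into the topic from Python with kafka-python, a sketch under the assumption that kafka-python is installed (the console producer above is what the project actually uses):

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Send one JSON message per line of the simulator's output file
with open("TemperatureRecorded.txt") as f:
    for line in f:
        producer.send("SensorAnalytics", line.strip().encode("utf-8"))

producer.flush()  # block until all queued messages are delivered
producer.close()
```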
**Submit the Spark job with spark-submit using the command below:**

```bash
spark-submit --jars spark-streaming-kafka-0-8-assembly_2.11-2.4.7.jar StreamingMetrics.py
```

Inside StreamingMetrics.py, the job connects to Kafka via the KafkaUtils class and creates a DStream by subscribing to the "SensorAnalytics" topic. Once the DStream is received from Kafka, RDD operations are applied to it.
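The README does not show the file's contents. Below is a minimal sketch of how such a job could be wired up against the spark-streaming-kafka-0-8 API in Spark 2.4; the batch interval, the use of the direct (receiver-less) stream, and the JSON field names are all assumptions:

```python
import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # needs the 0-8 assembly jar passed via --jars

sc = SparkContext(appName="StreamingMetrics")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches (interval is an assumption)

# Subscribe to the SensorAnalytics topic
stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["SensorAnalytics"],
    kafkaParams={"metadata.broker.list": "localhost:9092"},
)

# Each record arrives as a (key, value) pair; the value is one JSON message
readings = stream.map(lambda kv: json.loads(kv[1]))

# Example RDD-style operation: average temperature per batch (field name assumed)
temperatures = readings.map(lambda r: r["temperature"])

def print_batch_average(rdd):
    if not rdd.isEmpty():
        print("Average temperature this batch:", rdd.mean())

temperatures.foreachRDD(print_batch_average)

ssc.start()
ssc.awaitTermination()
```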
**Tech stack used:**

1. Python
2. PySpark
3. Kafka