https://github.com/phaniteja5789/real-time-data-processing-pipeline-development
This project performs analytics on streaming data.
- Host: GitHub
- URL: https://github.com/phaniteja5789/real-time-data-processing-pipeline-development
- Owner: phaniteja5789
- Created: 2022-02-07T15:27:56.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2022-02-07T16:34:58.000Z (almost 4 years ago)
- Last Synced: 2025-02-17T00:44:26.187Z (11 months ago)
- Topics: kafka-producer-consumer, kafka-streams, pyspark-python, python3
- Language: Python
- Homepage:
- Size: 9.81 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# StreamingAnalytics
This project performs analytics on streaming data.
*Flow diagram of the project (image available in the repository)*
**DataSimulator.py**

This Python script generates JSON messages and appends them to a file named TemperatureRecorded.txt.

**Execution command:**

```bash
python DataSimulator.py 100
```

The command takes two command-line arguments in total:

- `argv[0]` = the script name
- `argv[1]` = the total number of JSON messages to generate

Once the command has run, it produces a text file named TemperatureRecorded.txt in the current working directory.
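The README does not inline the script's source. A minimal sketch consistent with its description might look like the following; the JSON field names and value ranges are assumptions, not taken from the repository:

```python
import json
import random
import sys
import time

def main():
    # argv[1] = total number of JSON messages to generate
    count = int(sys.argv[1])
    with open("TemperatureRecorded.txt", "a") as out:
        for _ in range(count):
            message = {
                "sensor_id": random.randint(1, 10),                   # assumed field
                "temperature": round(random.uniform(15.0, 45.0), 2),  # assumed field
                "timestamp": time.time(),                             # assumed field
            }
            out.write(json.dumps(message) + "\n")

if __name__ == "__main__":
    main()
```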
**The data now needs to be sent to Kafka.**
**List the active topics in the Kafka cluster:**

```bash
bin/kafka-topics.sh --list --zookeeper localhost:2181
```

ZooKeeper is running on port 2181.
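For a quick check from Python, the broker's topic list can also be read with the kafka-python client (kafka-python is not part of this repository; this is only an assumed convenience):

```python
from kafka import KafkaConsumer

# Talk to the Kafka broker directly (port 9092), not ZooKeeper
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
print(consumer.topics())  # set of topic names known to the cluster
consumer.close()
```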
**Create a topic named "SensorAnalytics":**

```bash
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 2 --topic SensorAnalytics
```

- Topic name: SensorAnalytics
- Replication factor: 1 (each partition has a single replica)
- Partitions: 2 (the topic is split across two partitions)
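The same topic could also be created programmatically with kafka-python's admin client; this is a hypothetical alternative, not what the README uses:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Mirror the CLI flags: 2 partitions, replication factor 1
admin.create_topics([
    NewTopic(name="SensorAnalytics", num_partitions=2, replication_factor=1)
])
admin.close()
```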
**Produce the data into the topic with the command below:**

```bash
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic SensorAnalytics < TemperatureRecorded.txt
```

The data is now stored inside the Kafka cluster under its logical storage unit, the topic.
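Equivalently, the file could be replayed into the topic from Python with kafka-python, a sketch under the assumption that kafka-python is installed (the console producer above is what the project actually uses):

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Send one JSON message per line of the simulator's output file
with open("TemperatureRecorded.txt") as f:
    for line in f:
        producer.send("SensorAnalytics", line.strip().encode("utf-8"))

producer.flush()  # block until all queued messages are delivered
producer.close()
```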
**Submit the Spark job with spark-submit using the command below:**

```bash
spark-submit --jars spark-streaming-kafka-0-8-assembly_2.11-2.4.7.jar StreamingMetrics.py
```

Inside StreamingMetrics.py, the job connects to Kafka via the KafkaUtils class and creates a DStream by subscribing to the "SensorAnalytics" topic. Once the DStream is received from Kafka, RDD operations are applied to it.
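The README does not show the file's contents. Below is a minimal sketch of how such a job could be wired up against the spark-streaming-kafka-0-8 API in Spark 2.4; the batch interval, the use of the direct (receiver-less) stream, and the JSON field names are all assumptions:

```python
import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # needs the 0-8 assembly jar passed via --jars

sc = SparkContext(appName="StreamingMetrics")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches (interval is an assumption)

# Subscribe to the SensorAnalytics topic
stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["SensorAnalytics"],
    kafkaParams={"metadata.broker.list": "localhost:9092"},
)

# Each record arrives as a (key, value) pair; the value is one JSON message
readings = stream.map(lambda kv: json.loads(kv[1]))

# Example RDD-style operation: average temperature per batch (field name assumed)
temperatures = readings.map(lambda r: r["temperature"])

def print_batch_average(rdd):
    if not rdd.isEmpty():
        print("Average temperature this batch:", rdd.mean())

temperatures.foreachRDD(print_batch_average)

ssc.start()
ssc.awaitTermination()
```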
**Tech stack used:**

1. Python
2. PySpark
3. Kafka