https://github.com/gakas14/kafka_streaming_project

# Kafka_streaming_project
This project uses a Kaggle dataset containing metadata for the TV shows and movies on Netflix. It simulates real-time streaming of movie details with Kafka, using Python, Amazon EC2, Apache Kafka, Amazon S3, AWS Glue, Athena, and SQL.

![kafka_netflix_data](https://github.com/gakas14/Kafka_streaming_project/assets/74584964/74278df5-b18f-40a0-a7d3-485d20027cf0)

## Launch an EC2 instance on AWS

## Connect to your instance

## Install Kafka
```
wget https://downloads.apache.org/kafka/3.7.0/kafka_2.13-3.7.0.tgz
tar -xvf kafka_2.13-3.7.0.tgz
```

## Install Java
```
sudo yum install java-1.8.0
java -version
```

## Edit the security group's inbound rules to allow requests from your local machine
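The rule can also be added from code instead of the console. The following is a minimal sketch using boto3's `authorize_security_group_ingress`. It assumes that only the Kafka port 9092 (the port used in the commands in this README) needs to be reachable, and only from a single client IP. The function names, the security group ID, and the IP address are hypothetical placeholders.

```python
def kafka_ingress_rule(client_ip, port=9092):
    """Build one inbound rule allowing TCP on `port` from a single IPv4 address."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        # /32 restricts the rule to exactly one address (assumption: one client machine).
        "IpRanges": [{"CidrIp": f"{client_ip}/32", "Description": "Kafka client"}],
    }


def allow_kafka_access(group_id, client_ip, ec2_client=None):
    """Add the rule to a security group. Needs boto3 and AWS credentials."""
    if ec2_client is None:
        import boto3  # imported lazily so the rule builder works without boto3
        ec2_client = boto3.client("ec2")
    return ec2_client.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[kafka_ingress_rule(client_ip)],
    )
```

Restricting the rule to one address and one port is a design choice in this sketch; it is narrower than opening all traffic.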

## Change the server to run on the public IP of the EC2 instance
```
sudo nano config/server.properties
```
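This README does not say which property to change. A common choice for this step is `advertised.listeners`, which tells clients which address to connect to; treating it as the intended property is an assumption. The sketch below rewrites that property in the text of `server.properties`. The function name and the IP address are hypothetical.

```python
import re


def set_advertised_listeners(properties_text, public_ip, port=9092):
    """Return the properties text with advertised.listeners set to public_ip."""
    new_line = f"advertised.listeners=PLAINTEXT://{public_ip}:{port}"
    # Match the property whether or not it is commented out with a leading '#'.
    pattern = re.compile(r"^#?[ \t]*advertised\.listeners=.*$", re.MULTILINE)
    if pattern.search(properties_text):
        return pattern.sub(lambda _match: new_line, properties_text, count=1)
    # Property absent: append it on its own line.
    separator = "" if (not properties_text or properties_text.endswith("\n")) else "\n"
    return f"{properties_text}{separator}{new_line}\n"
```

Editing the file by hand in `nano` has the same effect; the function only makes the change explicit and repeatable.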

## Start ZooKeeper
```
bin/zookeeper-server-start.sh config/zookeeper.properties
```

## Start the Kafka server
```
export KAFKA_HEAP_OPTS="-Xmx256M -Xms128M"
cd kafka_2.13-3.7.0
bin/kafka-server-start.sh config/server.properties
```
## Create a topic
```
bin/kafka-topics.sh --create --topic netflix_data --bootstrap-server <EC2_PUBLIC_IP>:9092 --replication-factor 1 --partitions 1
```

## Start Producer
```
cd kafka_2.13-3.7.0
bin/kafka-console-producer.sh --topic netflix_data --bootstrap-server <EC2_PUBLIC_IP>:9092
```

## Start Consumer
```
cd kafka_2.13-3.7.0
bin/kafka-console-consumer.sh --topic netflix_data --bootstrap-server <EC2_PUBLIC_IP>:9092
```

## Create an S3 bucket

## Open Jupyter Notebook and create a producer and consumer
### Producer
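The notebook code is not reproduced in this README, so the following is a minimal sketch and not the author's code. It assumes the `kafka-python` package, a CSV export of the Kaggle dataset, and the `netflix_data` topic created earlier. Each CSV row is sent as one JSON message with a short pause between rows to mimic a live feed. All function names are hypothetical.

```python
import csv
import json
import time


def serialize(record):
    """Encode one record as UTF-8 JSON bytes."""
    return json.dumps(record).encode("utf-8")


def make_producer(bootstrap_server):
    """Create a KafkaProducer. Needs `pip install kafka-python`."""
    from kafka import KafkaProducer  # lazy import: only needed against a real broker
    return KafkaProducer(bootstrap_servers=[bootstrap_server], value_serializer=serialize)


def read_rows(csv_path):
    """Yield each CSV row as a dict keyed by the header names."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle)


def stream_rows(producer, topic, rows, delay_seconds=1.0):
    """Send rows one at a time with a pause between them; return the count sent."""
    sent = 0
    for row in rows:
        producer.send(topic, value=row)
        sent += 1
        time.sleep(delay_seconds)
    producer.flush()  # block until buffered messages are delivered
    return sent
```

Against the broker, a call would look like `stream_rows(make_producer("<EC2_PUBLIC_IP>:9092"), "netflix_data", read_rows("netflix_titles.csv"))`, where the CSV file name is a placeholder.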

### Consumer
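The consumer code is likewise not reproduced here, so this is a minimal sketch under stated assumptions: the `kafka-python` and `boto3` packages, JSON-encoded messages on the `netflix_data` topic, and one S3 object per message. The function names, the bucket name, and the key pattern `netflix_data/record_<n>.json` are hypothetical.

```python
import json
from itertools import islice


def deserialize(raw_bytes):
    """Decode UTF-8 JSON bytes back into a Python object."""
    return json.loads(raw_bytes.decode("utf-8"))


def make_consumer(topic, bootstrap_server):
    """Create a KafkaConsumer. Needs `pip install kafka-python`."""
    from kafka import KafkaConsumer  # lazy import: only needed against a real broker
    return KafkaConsumer(
        topic, bootstrap_servers=[bootstrap_server], value_deserializer=deserialize
    )


def make_s3_writer(bucket):
    """Return a write(key, body) function backed by boto3's put_object."""
    import boto3  # lazy import: needs AWS credentials at call time
    s3 = boto3.client("s3")

    def write(key, body):
        s3.put_object(Bucket=bucket, Key=key, Body=body)

    return write


def consume_to_store(messages, write, prefix="netflix_data", limit=None):
    """Write each message value as its own JSON object; return the count written."""
    count = 0
    # islice stops after `limit` messages without pulling an extra one;
    # with limit=None it keeps going for as long as messages arrive.
    for message in islice(messages, limit):
        write(f"{prefix}/record_{count}.json", json.dumps(message.value).encode("utf-8"))
        count += 1
    return count
```

A call against the broker would look like `consume_to_store(make_consumer("netflix_data", "<EC2_PUBLIC_IP>:9092"), make_s3_writer("<your-bucket>"))`. Without a `limit`, iterating a `KafkaConsumer` blocks while waiting for new messages, so the call runs until interrupted.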

## Check the data in the S3 bucket

## Build a crawler in AWS Glue
### Add the S3 bucket as a data source

### Create a database

### Run the crawler

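The console steps in this section (data source, database, run) correspond to two Glue API calls. The following is a minimal sketch using boto3's `create_crawler` and `start_crawler`. The database name `netflix_movies_db` is taken from the Athena queries in this README; the crawler name, IAM role ARN, S3 path, and function names are hypothetical placeholders.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Build the keyword arguments for Glue's create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,  # IAM role that Glue assumes to read the bucket
        "DatabaseName": database,  # catalog database that receives the table
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request, glue_client=None):
    """Create the crawler, start it, and return its name."""
    if glue_client is None:
        import boto3  # lazy import: needs AWS credentials at call time
        glue_client = boto3.client("glue")
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])
    return request["Name"]
```

A call would look like `create_and_start_crawler(crawler_request("<crawler-name>", "<role-arn>", "netflix_movies_db", "s3://<your-bucket>/"))`.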

## Run queries on the table in Athena

![Screen Shot 2024-03-26 at 1 48 03 PM](https://github.com/gakas14/Kafka_streaming_project/assets/74584964/928adb2c-0d20-4f58-975c-a1ec7397f429)

### We can run different types of queries
#### Query titles released in 2020
```
SELECT * FROM "netflix_movies_db"."gakas_kafka_netflix_data" WHERE release_year=2020;
```

#### Count titles by type
```
SELECT type, count(*) FROM "netflix_movies_db"."gakas_kafka_netflix_data" GROUP BY type;
```
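The same queries can be submitted from Python instead of the Athena console. The following is a minimal sketch using boto3's `start_query_execution` and `get_query_execution`. It assumes an S3 location for query results, which Athena requires; that location and the names `COUNT_BY_TYPE` and `run_query` are hypothetical.

```python
import time

# The count-by-type query from this section, with an alias added for the count column.
COUNT_BY_TYPE = (
    "SELECT type, count(*) AS title_count "
    'FROM "netflix_movies_db"."gakas_kafka_netflix_data" GROUP BY type'
)


def run_query(athena, sql, output_location, poll_seconds=1.0):
    """Start a query, wait until it finishes, and return (state, execution id)."""
    started = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        info = athena.get_query_execution(QueryExecutionId=query_id)
        state = info["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state, query_id
        time.sleep(poll_seconds)  # still QUEUED or RUNNING
```

A call would look like `run_query(boto3.client("athena"), COUNT_BY_TYPE, "s3://<results-bucket>/athena/")`; the rows can then be fetched with `get_query_results` using the returned execution ID.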
