https://github.com/gakas14/kafka_streaming_project
This project simulates real-time streaming of movie details using Kafka. It uses Python, Amazon EC2, Apache Kafka, AWS Glue, Amazon Athena, and SQL.
- Host: GitHub
- URL: https://github.com/gakas14/kafka_streaming_project
- Owner: gakas14
- Created: 2024-03-26T04:52:26.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-03-26T06:59:53.000Z (over 2 years ago)
- Last Synced: 2025-06-01T02:10:26.200Z (about 1 year ago)
- Topics: aws, aws-athena, aws-glue, ec2-instance, jupyter-notebook, kafka, netflix-dataset, pyhton3, s3-bucket, sql
- Language: Jupyter Notebook
- Homepage:
- Size: 1.51 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Kafka_streaming_project
##### The dataset for this project comes from Kaggle and contains metadata for the TV shows and movies on Netflix. The project simulates real-time streaming of movie details using Kafka. It uses Python, Amazon EC2, Apache Kafka, AWS Glue, Amazon Athena, and SQL.

## Launch an EC2 instance on AWS

## Connect to your instance

## Install Kafka
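```
# Download and unpack Kafka 3.7.0 (built for Scala 2.13)
# If this link returns 404, older releases are kept at https://archive.apache.org/dist/kafka/3.7.0/kafka_2.13-3.7.0.tgz
wget https://downloads.apache.org/kafka/3.7.0/kafka_2.13-3.7.0.tgz
tar -xvf kafka_2.13-3.7.0.tgz
```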
## Install Java
```
sudo yum install java-1.8.0
java -version
```
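## Edit the security group's inbound rules to allow requests from your local machine
## Configure Kafka to advertise the public IP of the EC2 instance
```
sudo nano config/server.properties
# In the file, set: advertised.listeners=PLAINTEXT://{Put the Public IP of your EC2 Instance}:9092
```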
## Start ZooKeeper
```
bin/zookeeper-server-start.sh config/zookeeper.properties
```
## Start the Kafka server
```
export KAFKA_HEAP_OPTS="-Xmx256M -Xms128M"
cd kafka_2.13-3.7.0
bin/kafka-server-start.sh config/server.properties
```
## Create a topic
```
bin/kafka-topics.sh --create --topic netflix_data --bootstrap-server {Put the Public IP of your EC2 Instance:9092} --replication-factor 1 --partitions 1
```

## Start a console producer
```
cd kafka_2.13-3.7.0
bin/kafka-console-producer.sh --topic netflix_data --bootstrap-server {Put the Public IP of your EC2 Instance:9092}
```
## Start a console consumer
```
cd kafka_2.13-3.7.0
bin/kafka-console-consumer.sh --topic netflix_data --bootstrap-server {Put the Public IP of your EC2 Instance:9092}
```
## Create an S3 bucket

## Open Jupyter Notebook and create a producer and consumer
### Producer
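The notebook cell for this step is not reproduced here, so the following is only a minimal sketch of what a producer could look like. It assumes the `kafka-python` package, a local copy of the Kaggle CSV under the hypothetical name `netflix_titles.csv`, and a placeholder broker address; none of these are confirmed by the repository.

```python
import csv
import json
import time


def serialize_record(record):
    """Encode one row (a dict) as UTF-8 JSON bytes for Kafka."""
    return json.dumps(record).encode("utf-8")


def stream_rows(producer, topic, rows, delay_seconds=1.0):
    """Send rows one at a time, pausing between sends to mimic a live feed."""
    sent = 0
    for row in rows:
        producer.send(topic, value=row)
        sent += 1
        time.sleep(delay_seconds)
    producer.flush()  # block until buffered messages have been delivered
    return sent


def main():
    # Imported here so the helpers above work without kafka-python installed.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=["<EC2_PUBLIC_IP>:9092"],  # placeholder: your instance's public IP
        value_serializer=serialize_record,
    )
    # "netflix_titles.csv" is an assumed file name for the Kaggle dataset.
    with open("netflix_titles.csv", newline="", encoding="utf-8") as f:
        stream_rows(producer, "netflix_data", csv.DictReader(f))
```

In a notebook cell, calling `main()` starts the stream once the broker is reachable. The one-second delay is an arbitrary choice that makes the messages arrive gradually instead of in one batch.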


### Consumer
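The consumer cell is likewise not reproduced here. As a hedged sketch, a consumer could read from the `netflix_data` topic and write each message to S3 as its own JSON object. The use of `kafka-python` and `boto3`, the bucket placeholder, and the `netflix/record_N.json` key layout are all assumptions made for illustration.

```python
import json


def object_key(prefix, index):
    """Build the S3 key for the index-th message, e.g. 'netflix/record_0.json'."""
    return f"{prefix}/record_{index}.json"


def store_messages(messages, put_object, bucket, prefix="netflix"):
    """Write each message value to S3 as a separate JSON object.

    `messages` is any iterable of objects with a `.value` attribute, which is
    what kafka-python's KafkaConsumer yields. `put_object` is a callable that
    accepts the Bucket, Key and Body keyword arguments of boto3's S3 client
    method of the same name.
    """
    count = 0
    for index, message in enumerate(messages):
        body = json.dumps(message.value).encode("utf-8")
        put_object(Bucket=bucket, Key=object_key(prefix, index), Body=body)
        count += 1
    return count


def main():
    # Imported here so the helpers above work without these packages installed.
    import boto3
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "netflix_data",
        bootstrap_servers=["<EC2_PUBLIC_IP>:9092"],  # placeholder: your instance's public IP
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )
    s3 = boto3.client("s3")
    # The consumer iterator blocks while waiting for new messages.
    store_messages(consumer, s3.put_object, bucket="<YOUR_BUCKET_NAME>")
```

Writing one object per message keeps the sketch simple and gives a Glue crawler plain JSON files to scan. Batching several messages per object would reduce the number of S3 requests.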

## Check the data in the S3 bucket
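The bucket can be inspected in the S3 console. As an alternative, the sketch below lists the stored keys from Python. It assumes `boto3` with configured credentials, and the bucket name is a placeholder.

```python
def list_keys(s3_client, bucket, prefix=""):
    """Return every object key under `prefix`, following pagination."""
    keys = []
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        response = s3_client.list_objects_v2(**kwargs)
        keys.extend(item["Key"] for item in response.get("Contents", []))
        if not response.get("IsTruncated"):
            return keys
        # More results remain: request the next page.
        kwargs["ContinuationToken"] = response["NextContinuationToken"]


def main():
    # Imported here so list_keys can be used with any compatible client object.
    import boto3

    for key in list_keys(boto3.client("s3"), "<YOUR_BUCKET_NAME>"):
        print(key)
```

`list_objects_v2` returns at most 1,000 keys per call, so the loop matters once the consumer has written more objects than that.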

## Build a crawler in AWS Glue
### Add the S3 bucket as a data source

### Create a database

### Run the crawler
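These crawler steps are done in the AWS console. For reference, a rough equivalent with `boto3` might look like the sketch below. The crawler name and the role ARN and bucket placeholders are hypothetical; the database name `netflix_movies_db` is the one used in the Athena queries in this README.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Build the keyword arguments for boto3's Glue create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def main():
    # Imported here so crawler_request can be used without boto3 installed.
    import boto3

    glue = boto3.client("glue")
    request = crawler_request(
        name="netflix_crawler",            # hypothetical crawler name
        role_arn="<GLUE_ROLE_ARN>",        # placeholder: IAM role with S3 and Glue access
        database="netflix_movies_db",
        s3_path="s3://<YOUR_BUCKET_NAME>/",
    )
    # create_database raises an error if the database already exists.
    glue.create_database(DatabaseInput={"Name": request["DatabaseName"]})
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
```

When the crawler finishes, the inferred table appears in the chosen database and can be queried from Athena.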


## Run queries on the table in Athena


### We can run different types of queries
#### Query movies released in 2020
```
SELECT * FROM "netflix_movies_db"."gakas_kafka_netflix_data" WHERE release_year=2020;
```


#### Count titles by type
```
SELECT type, COUNT(*) FROM "netflix_movies_db"."gakas_kafka_netflix_data" GROUP BY type;
```
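Queries like these can also be submitted from Python instead of the Athena console. The sketch below assumes `boto3` and an S3 location for query results, shown as a placeholder; the polling interval is an arbitrary choice.

```python
import time


def run_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Start an Athena query, wait until it finishes, and return (query_id, state)."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)  # still QUEUED or RUNNING


def main():
    # Imported here so run_query can be used with any compatible client object.
    import boto3

    athena = boto3.client("athena")
    sql = 'SELECT type, COUNT(*) FROM "netflix_movies_db"."gakas_kafka_netflix_data" GROUP BY type;'
    query_id, state = run_query(
        athena, sql, "netflix_movies_db", "s3://<YOUR_RESULTS_BUCKET>/athena/"  # placeholder
    )
    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        print(rows)
```

Athena runs queries asynchronously, which is why the function polls `get_query_execution` until the query reaches a terminal state before the results are fetched.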
