https://github.com/gakas14/kafka_streaming_project

The project is to simulate Real-time streaming for movie details using Kafka. We used different technologies such as Python, Amazon EC2, Apache Kafka, Glue, Athena, and SQL.

aws aws-athena aws-glue ec2-instance jupyter-notebook kafka netflix-dataset pyhton3 s3-bucket sql

Host: GitHub
URL: https://github.com/gakas14/kafka_streaming_project
Owner: gakas14
Created: 2024-03-26T04:52:26.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-03-26T06:59:53.000Z (over 2 years ago)
Last Synced: 2025-06-01T02:10:26.200Z (about 1 year ago)
Topics: aws, aws-athena, aws-glue, ec2-instance, jupyter-notebook, kafka, netflix-dataset, pyhton3, s3-bucket, sql
Language: Jupyter Notebook
Homepage:
Size: 1.51 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Kafka_streaming_project

##### The dataset for this project comes from Kaggle and contains metadata for the TV shows and movies on Netflix. The project simulates real-time streaming of movie details with Kafka, using Python, Amazon EC2, Apache Kafka, AWS Glue, Amazon Athena, and SQL.

![kafka_netflix_data](https://github.com/gakas14/Kafka_streaming_project/assets/74584964/74278df5-b18f-40a0-a7d3-485d20027cf0)

## Launch an EC2 instance on AWS



## Connect to your instance 



## Install Kafka 

```

  wget https://downloads.apache.org/kafka/3.7.0/kafka_2.13-3.7.0.tgz

  tar -xvf kafka_2.13-3.7.0.tgz

```

## Install Java  

```

  sudo yum install java-1.8.0

  java -version

```

## Edit the security group's inbound rules to allow requests from your local machine

## Configure the Kafka server to use the public IP of the EC2 instance

```

sudo nano config/server.properties

```
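The README does not say which property to change. A common choice, assumed here, is to set `advertised.listeners` to the instance's public IP so that clients outside AWS can reach the broker. The sketch below shows that edit as a small Python helper. The function name `set_advertised_listener` and the example IP are hypothetical, and editing the line by hand in `nano` achieves the same result.

```python
import re


def set_advertised_listener(properties_text: str, public_ip: str, port: int = 9092) -> str:
    """Return server.properties text with advertised.listeners set to public_ip.

    Assumption: advertised.listeners is the property to change; the README
    itself does not name it.
    """
    new_line = f"advertised.listeners=PLAINTEXT://{public_ip}:{port}"
    # Match the property whether it is active or commented out with '#'.
    pattern = re.compile(r"^[ \t]*#?[ \t]*advertised\.listeners[ \t]*=.*$", re.MULTILINE)
    if pattern.search(properties_text):
        # Replace only the first occurrence and leave every other line untouched.
        return pattern.sub(lambda _match: new_line, properties_text, count=1)
    # Property absent: append it on its own line.
    separator = "" if (not properties_text or properties_text.endswith("\n")) else "\n"
    return f"{properties_text}{separator}{new_line}\n"


# Example with a documentation-range IP (replace with your instance's public IP).
sample = "broker.id=0\n#advertised.listeners=PLAINTEXT://your.host.name:9092\n"
updated = set_advertised_listener(sample, "203.0.113.10")
```

After the change, restart the Kafka server so the new listener address takes effect.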

## Start ZooKeeper

```

bin/zookeeper-server-start.sh config/zookeeper.properties

```

## Start Kafka server 

```

export KAFKA_HEAP_OPTS="-Xmx256M -Xms128M"

cd kafka_2.13-3.7.0

bin/kafka-server-start.sh config/server.properties

```

## Create a topic 

```

bin/kafka-topics.sh --create --topic netflix_data --bootstrap-server {Put the Public IP of your EC2 Instance:9092} --replication-factor 1 --partitions 1

```



## Start Producer 

```

cd kafka_2.13-3.7.0

bin/kafka-console-producer.sh --topic netflix_data --bootstrap-server {Put the Public IP of your EC2 Instance:9092}

```

## Start Consumer 

```

cd kafka_2.13-3.7.0

bin/kafka-console-consumer.sh --topic netflix_data --bootstrap-server {Put the Public IP of your EC2 Instance:9092}

```

## Create an S3 bucket



## Open Jupyter Notebook and create a producer and consumer

###  Producer
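The notebook code for this step is not reproduced in this README. As a rough sketch only, a producer could read the Kaggle CSV row by row and publish each row as JSON to the `netflix_data` topic, pausing between rows to simulate a live feed. The helper names, the one-second delay, and the file name `netflix_titles.csv` are assumptions, as is the use of the `kafka-python` package.

```python
import json
import time
from typing import Callable, Iterable


def row_to_message(row: dict) -> bytes:
    """Serialize one dataset row as UTF-8 JSON, the payload sent to Kafka."""
    return json.dumps(row).encode("utf-8")


def stream_rows(rows: Iterable[dict], send: Callable[[bytes], None], delay: float = 1.0) -> int:
    """Send rows one at a time with a pause in between; return how many were sent."""
    sent = 0
    for row in rows:
        send(row_to_message(row))
        sent += 1
        time.sleep(delay)
    return sent


def make_kafka_sender(bootstrap_server: str, topic: str = "netflix_data") -> Callable[[bytes], None]:
    """Build a sender backed by kafka-python (assumption: pip install kafka-python)."""
    from kafka import KafkaProducer  # lazy import: the helpers above work without it

    producer = KafkaProducer(bootstrap_servers=[bootstrap_server])

    def send(payload: bytes) -> None:
        producer.send(topic, value=payload)

    return send


# Possible notebook usage (hypothetical file name, your own EC2 public IP):
#   import csv
#   with open("netflix_titles.csv", newline="", encoding="utf-8") as f:
#       stream_rows(csv.DictReader(f), make_kafka_sender("<EC2 public IP>:9092"))
```

Passing the sender in as a callable keeps the serialization and pacing logic independent of the broker, so it can be tried with a plain list before connecting to Kafka.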





### Consumer
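The consumer code is likewise not reproduced in this README. Since the next step checks the data in the S3 bucket, one plausible shape, sketched here under assumptions, is a consumer that reads JSON records from `netflix_data` and writes each one to S3 as its own object. The helper names, the key layout, and the use of `kafka-python` and `boto3` are assumptions, not the author's confirmed implementation.

```python
import json
from typing import Callable, Iterable, Iterator


def object_key(prefix: str, index: int) -> str:
    """Build a hypothetical S3 object key for the record at position index."""
    return f"{prefix}/netflix_record_{index}.json"


def store_messages(messages: Iterable[dict], write: Callable[[str, bytes], None],
                   prefix: str = "netflix_data") -> int:
    """Write each record as one JSON object via write(key, body); return the count."""
    count = 0
    for index, record in enumerate(messages):
        write(object_key(prefix, index), json.dumps(record).encode("utf-8"))
        count += 1
    return count


def make_s3_writer(bucket: str) -> Callable[[str, bytes], None]:
    """Build a writer backed by boto3 (assumption: AWS credentials are configured)."""
    import boto3  # lazy import: the helpers above work without it

    s3 = boto3.client("s3")

    def write(key: str, body: bytes) -> None:
        s3.put_object(Bucket=bucket, Key=key, Body=body)

    return write


def kafka_records(bootstrap_server: str, topic: str = "netflix_data") -> Iterator[dict]:
    """Yield decoded JSON records from the topic using kafka-python (assumption)."""
    from kafka import KafkaConsumer  # lazy import

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=[bootstrap_server],
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )
    for message in consumer:
        yield message.value


# Possible notebook usage (hypothetical bucket name, your own EC2 public IP):
#   store_messages(kafka_records("<EC2 public IP>:9092"), make_s3_writer("<your-bucket>"))
```

Writing one JSON object per file is a simple layout that a Glue crawler can typically classify as JSON, at the cost of producing many small objects.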



## Check the data in the S3 bucket




## Build a crawler in AWS Glue 

### Add the S3 bucket as a data source



### Create a database



### Run the crawler





## Run queries on the table in Athena 



![Screen Shot 2024-03-26 at 1 48 03 PM](https://github.com/gakas14/Kafka_streaming_project/assets/74584964/928adb2c-0d20-4f58-975c-a1ec7397f429)

### We can run different types of queries

#### Query titles released in 2020

```

SELECT * FROM "netflix_movies_db"."gakas_kafka_netflix_data" WHERE release_year=2020;

```





#### Count titles by type

```

SELECT type, count(*) FROM "netflix_movies_db"."gakas_kafka_netflix_data" GROUP BY type;

```
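To see what these two queries return without an AWS account, the sketch below runs equivalent statements against a tiny in-memory table. Python's `sqlite3` is only a local stand-in for Athena here, the table name `netflix_data` is shortened for the example, and the three rows are invented sample data, not taken from the Kaggle dataset.

```python
import sqlite3

# In-memory database standing in for the Athena table (assumption: only the
# columns used by the queries above are modeled).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE netflix_data (title TEXT, type TEXT, release_year INTEGER)")
conn.executemany(
    "INSERT INTO netflix_data VALUES (?, ?, ?)",
    [
        ("Sample Movie A", "Movie", 2020),    # invented row
        ("Sample Movie B", "Movie", 2019),    # invented row
        ("Sample Show C", "TV Show", 2020),   # invented row
    ],
)

# Titles released in 2020 (ORDER BY added so the result order is deterministic).
released_2020 = conn.execute(
    "SELECT title FROM netflix_data WHERE release_year = 2020 ORDER BY title"
).fetchall()

# Number of titles per type.
counts_by_type = conn.execute(
    "SELECT type, count(*) FROM netflix_data GROUP BY type ORDER BY type"
).fetchall()

print(released_2020)
print(counts_by_type)
conn.close()
```

Athena uses a different SQL engine than SQLite, so this only illustrates the shape of the results for simple filters and aggregates like the ones above.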
