https://github.com/najmaelboutaheri/data-engineering-project-youtube

This project aims to securely manage, process, and analyze structured and semi-structured YouTube data based on video categories and trending metrics. The architecture leverages AWS services to ingest, store, transform, analyze, and visualize data efficiently and at scale.

Host: GitHub
URL: https://github.com/najmaelboutaheri/data-engineering-project-youtube
Owner: najmaelboutaheri
Created: 2024-12-09T19:09:01.000Z (5 months ago)
Default Branch: main
Last Pushed: 2024-12-28T19:53:02.000Z (4 months ago)
Last Synced: 2025-02-19T21:14:22.752Z (3 months ago)
Topics: aws, aws-lambda, awscli, awsglue, awsiam, awss3, python3, shell
Language: Python
Homepage:
Size: 32.2 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# **YouTube Analysis Data Engineering Project**

## **Overview**

---

## **Project Goals**

1. **Data Ingestion**: Build a robust mechanism to ingest data from multiple sources.
2. **ETL System**: Transform raw data into a clean, enriched, and usable format.

3. **Centralized Data Lake**: Store data from multiple sources in a unified and scalable repository.
4. **Scalability**: Ensure that the system can handle growing data volumes seamlessly.
5. **Cloud Processing**: Use AWS cloud infrastructure for efficient data handling and processing.
6. **Reporting**: Build a dashboard for insightful analysis and visualization.

---

## **Architecture Diagram**

![image](https://github.com/user-attachments/assets/bf684093-10f3-471f-893c-07298e7cc0dd)

---

## **Services Used**

### **1. Amazon S3**
**Purpose**: Object storage for raw, cleansed, and processed data.
**Components**:
- **Landing Area**: Raw data ingestion zone.
- **Cleansed/Enriched**: Transformed and cleaned data storage.
- **Analytics/Reporting**: Final processed data for analysis and reporting.
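
One way to keep the three zones separate is to give each its own bucket and to put a Hive-style `region=` partition in the object key. The sketch below is illustrative only: the bucket names and the `youtube/raw_statistics` prefix are assumptions, not the project's actual layout.

```python
from pathlib import PurePosixPath

# Hypothetical bucket names, one per zone described above.
ZONE_BUCKETS = {
    "landing": "example-youtube-raw",
    "cleansed": "example-youtube-cleansed",
    "analytics": "example-youtube-analytics",
}


def object_key(filename: str, prefix: str = "youtube/raw_statistics") -> str:
    """Build an S3 key with a Hive-style region partition.

    Assumes Kaggle-style file names such as ``USvideos.csv``, where the
    first two characters are the region code.
    """
    region = filename[:2].lower()
    return str(PurePosixPath(prefix) / f"region={region}" / filename)


def s3_uri(zone: str, filename: str) -> str:
    """Return the full s3:// URI for a file in the given zone."""
    return f"s3://{ZONE_BUCKETS[zone]}/{object_key(filename)}"


print(s3_uri("landing", "USvideos.csv"))
```

Partitioning by region in the key lets query engines that understand Hive-style paths skip regions a query does not touch.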

### **2. AWS Glue**
**Purpose**: Serverless ETL service to clean, prepare, and transform data.
**Usage**:
- Discovering schemas.
- Joining datasets.
- Cataloging data for analytics.
- Processing and enriching data.
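
As a plain-Python illustration of the kind of clean-up a Glue job might apply (this is not the project's Glue script, which is not shown here), the sketch below casts numeric fields and parses the trending date. It assumes the Kaggle CSV column names and the `yy.dd.mm` date convention used in that dataset; the sample row is made up.

```python
from datetime import datetime

NUMERIC_FIELDS = ("category_id", "views", "likes", "dislikes", "comment_count")


def clean_row(row):
    """Cast numeric fields to int and parse trending_date.

    Assumes trending_date is written as yy.dd.mm
    (for example "17.14.11" for 14 November 2017).
    """
    cleaned = dict(row)
    for field in NUMERIC_FIELDS:
        cleaned[field] = int(row[field])
    cleaned["trending_date"] = datetime.strptime(
        row["trending_date"], "%y.%d.%m"
    ).date()
    return cleaned


# Made-up sample row for illustration.
sample_row = {
    "title": "Sample video",
    "trending_date": "17.14.11",
    "category_id": "10",
    "views": "1200",
    "likes": "80",
    "dislikes": "3",
    "comment_count": "12",
}

print(clean_row(sample_row))
```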

### **3. AWS Lambda**
**Purpose**: Run code without managing servers.
**Usage**:
- Triggered for automated data transformations.

- Event-driven workflows.
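
A minimal sketch of an event-driven handler, assuming the function is subscribed to S3 object-created notifications. The handler only extracts and logs the new objects; the actual transformation is omitted, and the function names are illustrative.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def lambda_handler(event, context):
    """Entry point: log each new object and report how many were seen."""
    objects = extract_s3_objects(event)
    for bucket, key in objects:
        print(f"New object: s3://{bucket}/{key}")
    return {"processed": len(objects)}
```

Keeping the event parsing in its own function makes it easy to unit-test without deploying to AWS.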

### **4. AWS Step Functions**
**Purpose**: Orchestrate AWS services into serverless workflows.
**Usage**:
- Coordinate the ETL pipeline for seamless data flow.
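
To make the coordination concrete, here is a hedged sketch of a state machine definition in Amazon States Language, built as a Python dictionary. The two Glue job names are hypothetical, and the project's real workflow may have different or additional steps.

```python
import json

# Hypothetical Glue job names; replace with the jobs defined in your account.
state_machine = {
    "Comment": "Sketch of a two-step ETL workflow",
    "StartAt": "CleanRawData",
    "States": {
        "CleanRawData": {
            "Type": "Task",
            # The .sync suffix makes the state wait for the Glue job to finish.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "example-clean-raw-data"},
            "Next": "BuildAnalyticsTable",
        },
        "BuildAnalyticsTable": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "example-build-analytics-table"},
            "End": True,
        },
    },
}

definition = json.dumps(state_machine, indent=2)
print(definition)
```

The resulting JSON string is what you would supply as the state machine definition when creating it.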

### **5. Amazon Athena**
**Purpose**: Query S3-stored data using SQL.
**Usage**:
- Interactive querying of processed data for insights.

We used Athena to join datasets that come from different sources.
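
The original query is not reproduced here, so the following is only a sketch of what such a join could look like when built and submitted from Python. The table names, the `snippet_title` and `region` columns, and the output location are all assumptions; adjust them to match your Glue Data Catalog.

```python
def build_join_query(database, stats_table, category_table, region):
    """Return an Athena SQL string joining statistics to category names."""
    # Identifiers come from trusted configuration here; do not build SQL
    # from untrusted input this way.
    return (
        "SELECT s.title, s.views, c.snippet_title AS category\n"
        f'FROM "{database}"."{stats_table}" AS s\n'
        f'INNER JOIN "{database}"."{category_table}" AS c\n'
        "  ON s.category_id = c.id\n"
        f"WHERE s.region = '{region}'\n"
        "ORDER BY s.views DESC\n"
        "LIMIT 10"
    )


def run_query(sql, database, output_location):
    """Submit the query to Athena; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the query builder has no dependencies

    client = boto3.client("athena")
    response = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Hypothetical database and table names.
print(build_join_query("example_db", "raw_statistics", "reference_data", "us"))
```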

### **6. AWS IAM**
**Purpose**: Manage access to AWS services securely.
**Usage**:
- Control permissions for AWS services.
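
As an example of scoping permissions narrowly, the sketch below builds an IAM policy document that allows listing one bucket and reading and writing its objects. The bucket name is hypothetical, and the actual roles used in the project may grant different actions.

```python
import json

BUCKET = "example-youtube-raw"  # hypothetical bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListLandingBucket",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            # ListBucket applies to the bucket itself.
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
        {
            "Sid": "ReadWriteLandingObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            # Object-level actions apply to keys inside the bucket.
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
    ],
}

print(json.dumps(policy, indent=2))
```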

### **7. Amazon QuickSight**
**Purpose**: BI and analytics dashboard.
**Usage**:
- Visualize trends, metrics, and insights.

### **8. Amazon CloudWatch**
**Purpose**: Monitor and alert on AWS services.
**Usage**:
- Monitor ETL jobs, Lambda executions, and other resources.
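
For instance, an alarm on Lambda errors could be defined roughly as follows. This is a sketch under assumptions: the function name, SNS topic ARN, threshold, and period are placeholders, and the project's own monitoring setup is not documented here.

```python
def lambda_error_alarm(function_name, sns_topic_arn):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 300,  # seconds
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params):
    """Create the alarm; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above has no dependencies

    boto3.client("cloudwatch").put_metric_alarm(**params)
```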

---

## **Dataset Used**

The dataset consists of daily trending YouTube videos for multiple regions and contains:
- **Fields**: Video title, channel title, views, likes, dislikes, tags, comment counts, and more.
- **Additional Files**: A `category_id` mapping in a JSON file.

**Source**: [YouTube Trending Videos Dataset on Kaggle](https://www.kaggle.com/datasets/datasnaek/youtube-new)
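
The category files nest the category name under `items[].snippet.title`, so a small flattening step is needed before the mapping can be joined to the video statistics. The sketch below uses a trimmed, made-up sample in that shape; the helper name `category_lookup` is illustrative.

```python
import json

# Trimmed, made-up sample in the shape of the Kaggle category files
# (for example US_category_id.json): a top-level "items" list whose
# entries carry the category name under "snippet".
sample = """
{
  "kind": "youtube#videoCategoryListResponse",
  "items": [
    {"id": "1", "snippet": {"title": "Film & Animation", "assignable": true}},
    {"id": "10", "snippet": {"title": "Music", "assignable": true}}
  ]
}
"""


def category_lookup(raw_json):
    """Map integer category_id to its human-readable title."""
    items = json.loads(raw_json)["items"]
    return {int(item["id"]): item["snippet"]["title"] for item in items}


lookup = category_lookup(sample)
print(lookup)
```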

---

## **Data Pipeline Workflow**

1. **Data Ingestion**:
   - Bulk data is ingested into the S3 Landing Area using the S3 APIs.

2. **Data Processing**:
   - AWS Glue performs ETL operations to clean and transform the data.
   - AWS Lambda triggers processing jobs for automation.

3. **Storage**:
   - Cleaned and transformed data is stored in S3 under the Cleansed/Enriched and Analytics zones.

4. **Query & Access**:
   - Amazon Athena provides SQL-based query capabilities on data stored in S3.
   - Analytical access is provided via APIs.

5. **Monitoring**:
   - Amazon CloudWatch monitors data flow, transformations, and system performance.

---

## **Technologies & Tools**

- **AWS Services**: S3, Glue, Lambda, Athena, QuickSight, CloudWatch, Step Functions, IAM.
- **Programming Language**: Python (for Glue and Lambda scripts).

---

## **How to Run the Project**

1. Set up an **AWS account** and configure the necessary services (IAM roles, S3 buckets, Glue jobs, Athena).
2. Upload the **raw dataset** to the **S3 Landing Area**.
3. Configure **AWS Glue jobs** and triggers using **Lambda** for processing.
4. **Query** processed data using **Athena** or **Redshift**.
5. Visualize results on **QuickSight** or other BI tools.

---

## **Future Enhancements**

- Automate pipeline orchestration using **Airflow**.
- Integrate real-time streaming using **Kinesis**.
- Include **anomaly detection** and **recommendation systems**.

---

## **Contacts**

- Najma el boutaheri
- LinkedIn: [LinkedIn profile](https://www.linkedin.com/in/najma-el-boutaheri-8185a1267/)

---
