https://github.com/najmaelboutaheri/data-engineering-project-youtube

# **YouTube Analysis Data Engineering Project**

## **Overview**

This project aims to securely manage, process, and analyze structured and semi-structured YouTube data based on video categories and trending metrics. The architecture leverages AWS services to ingest, store, transform, analyze, and visualize data efficiently and at scale.

---

## **Project Goals**

1. **Data Ingestion**: Build a robust mechanism to ingest data from multiple sources.
2. **ETL System**: Transform raw data into a clean, enriched, and usable format.

3. **Centralized Data Lake**: Store data from multiple sources in a unified and scalable repository.
4. **Scalability**: Ensure that the system can handle growing data volumes seamlessly.
5. **Cloud Processing**: Use AWS cloud infrastructure for efficient data handling and processing.
6. **Reporting**: Build a dashboard for insightful analysis and visualization.

---

## **Architecture Diagram**

![image](https://github.com/user-attachments/assets/bf684093-10f3-471f-893c-07298e7cc0dd)

---

## **Services Used**

### **1. Amazon S3**
**Purpose**: Object storage for raw, cleansed, and processed data.
**Components**:
- **Landing Area**: Raw data ingestion zone.
- **Cleansed/Enriched**: Transformed and cleaned data storage.
- **Analytics/Reporting**: Final processed data for analysis and reporting.
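
One way to keep the three zones apart is to give each its own key prefix. The snippet below is only a sketch under assumptions: the prefix names, the `region=` partition folder, the dataset name `raw_statistics`, and the helper `build_key` are all hypothetical and not taken from the project's actual bucket layout.

```python
from pathlib import PurePosixPath

# Hypothetical prefixes for the three zones; the real layout may differ.
ZONE_PREFIXES = {
    "landing": "landing",
    "cleansed": "cleansed",
    "analytics": "analytics",
}


def build_key(zone: str, dataset: str, region: str, filename: str) -> str:
    """Build an S3 object key for a file in one of the three zones."""
    if zone not in ZONE_PREFIXES:
        raise ValueError(f"unknown zone: {zone!r}")
    # A Hive-style 'region=<code>' folder lets Glue and Athena treat the
    # region as a partition column.
    return str(
        PurePosixPath(ZONE_PREFIXES[zone], dataset, f"region={region.lower()}", filename)
    )


print(build_key("landing", "raw_statistics", "CA", "CAvideos.csv"))
# landing/raw_statistics/region=ca/CAvideos.csv
```

Keeping the zone as the first path segment makes it straightforward to scope IAM permissions and Glue crawlers to a single zone.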

### **2. AWS Glue**
**Purpose**: Serverless ETL service to clean, prepare, and transform data.
**Usage**:
- Discovering schemas.
- Joining datasets.
- Cataloging data for analytics.
- Processing and enriching data.

### **3. AWS Lambda**
**Purpose**: Run code without managing servers.
**Usage**:
- Triggered for automated data transformations.
- Event-driven workflows.
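
As an illustration of an event-driven transformation, the sketch below shows a handler that reads the bucket and key from an S3 event notification, plus a helper that flattens the nested `items` array of a category mapping file. It assumes the trigger is an S3 object-created notification and that the category file follows the Kaggle JSON layout; the function names are hypothetical, and the handler deliberately stops short of reading from or writing to S3 so that it runs without AWS credentials.

```python
import json
from urllib.parse import unquote_plus


def extract_objects(event: dict) -> list:
    """Return (bucket, key) pairs from an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications
        # (for example '=' is sent as '%3D').
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def flatten_categories(payload: dict) -> list:
    """Turn the nested 'items' array of a category file into flat rows."""
    return [
        {"id": int(item["id"]), "title": item["snippet"]["title"]}
        for item in payload.get("items", [])
    ]


def lambda_handler(event, context):
    # A real handler would read each object, flatten it, and write the
    # result to the cleansed zone; this sketch only reports its targets.
    targets = extract_objects(event)
    return {"statusCode": 200, "body": json.dumps({"objects": targets})}


# Made-up event and payload for a local dry run.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "landing/region%3Dus/US_category_id.json"},
            }
        }
    ]
}
sample_payload = {"items": [{"id": "10", "snippet": {"title": "Music"}}]}

print(extract_objects(sample_event))
print(flatten_categories(sample_payload))
```

Separating the parsing and flattening logic from the handler keeps both parts testable locally before deployment.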

### **4. AWS Step Functions**
**Purpose**: Orchestrate AWS services into serverless workflows.
**Usage**:
- Coordinate the ETL pipeline for seamless data flow.
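
A workflow of this kind is described in Amazon States Language (JSON). The following is a minimal sketch, not the project's actual state machine: the state names, the Glue job name, and the Lambda ARN are placeholders, and the two-step order (Lambda first, then Glue) is an assumption.

```python
import json


def build_state_machine(glue_job_name: str, lambda_arn: str) -> dict:
    """Return a two-step definition: invoke a Lambda, then run a Glue job."""
    return {
        "Comment": "Illustrative ETL workflow (hypothetical names)",
        "StartAt": "CleanReferenceData",
        "States": {
            "CleanReferenceData": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": lambda_arn, "Payload.$": "$"},
                "Next": "RunGlueJob",
            },
            "RunGlueJob": {
                "Type": "Task",
                # The '.sync' suffix makes the workflow wait until the
                # Glue job run has finished.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job_name},
                "End": True,
            },
        },
    }


definition = build_state_machine(
    glue_job_name="example-etl-job",  # placeholder
    lambda_arn="arn:aws:lambda:us-east-1:123456789012:function:example-fn",  # placeholder
)
print(json.dumps(definition, indent=2))
```

The printed JSON is what would be supplied as the state machine definition when creating it in the console or through infrastructure code.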

### **5. Amazon Athena**
**Purpose**: Query S3-stored data using SQL.
**Usage**:
- Interactive querying of processed data for insights.

[Screenshot: example Athena query]

Output:

[Screenshot: query result]

We used Athena to join datasets that come from different sources. An example query:

[Screenshot: example join query]

Output:

[Screenshot: join query result]
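
A join of this kind can be tried out locally. The following is a minimal sketch and not the project's actual query: it uses Python's built-in `sqlite3` as a stand-in for Athena, the table names `raw_statistics` and `reference_data` and their columns are assumptions, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE raw_statistics (
        video_id TEXT, title TEXT, category_id INTEGER, views INTEGER
    );
    CREATE TABLE reference_data (id INTEGER, category_title TEXT);

    -- Made-up sample rows, for illustration only.
    INSERT INTO raw_statistics VALUES
        ('v1', 'Sample video A', 10, 1500),
        ('v2', 'Sample video B', 24, 900),
        ('v3', 'Sample video C', 10, 500);
    INSERT INTO reference_data VALUES (10, 'Music'), (24, 'Entertainment');
    """
)

# Join video statistics to the category mapping on the category id,
# then aggregate views per category.
JOIN_QUERY = """
    SELECT r.category_title, COUNT(*) AS videos, SUM(s.views) AS total_views
    FROM raw_statistics AS s
    INNER JOIN reference_data AS r ON s.category_id = r.id
    GROUP BY r.category_title
    ORDER BY total_views DESC
"""

rows = conn.execute(JOIN_QUERY).fetchall()
for row in rows:
    print(row)
# ('Music', 2, 2000)
# ('Entertainment', 1, 900)
```

In Athena the same join only works cleanly if `category_id` and `id` have matching types, which is one reason to cast the category id to an integer during cleansing.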

### **6. AWS IAM**
**Purpose**: Manage access to AWS services securely.
**Usage**:
- Control permissions for AWS services.
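
As an example of scoping permissions, the sketch below builds a least-privilege policy document that lets a job read from one prefix and write to another. The bucket name, prefixes, and helper name are hypothetical; the project's real roles and policies are not shown in this README.

```python
import json


def build_s3_policy(bucket: str, read_prefix: str, write_prefix: str) -> dict:
    """Return an IAM policy document limited to two prefixes of one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                "Sid": "ReadInput",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{read_prefix}/*"],
            },
            {
                "Sid": "WriteOutput",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{write_prefix}/*"],
            },
        ],
    }


# Placeholder names, for illustration only.
policy = build_s3_policy("example-bucket", "landing", "cleansed")
print(json.dumps(policy, indent=2))
```

Granting read access only on the input zone and write access only on the output zone limits the damage a misconfigured job can do.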

### **7. Amazon QuickSight**
**Purpose**: BI and analytics dashboard.
**Usage**:
- Visualize trends, metrics, and insights.

### **8. Amazon CloudWatch**
**Purpose**: Monitor and alert on AWS services.
**Usage**:
- Monitor ETL jobs, Lambda executions, and other resources.
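
One possible way to alert on failed Lambda executions is a CloudWatch alarm on the `Errors` metric. The sketch below only builds the alarm parameters; the function name, threshold, and period are assumptions, and the project may monitor its resources differently.

```python
def lambda_error_alarm(function_name: str, threshold: int = 1) -> dict:
    """Build parameters for an alarm on a Lambda function's error count."""
    return {
        "AlarmName": f"{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 300,  # seconds per evaluation window (assumed value)
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Periods with no invocations should not trigger the alarm.
        "TreatMissingData": "notBreaching",
    }


params = lambda_error_alarm("example-fn")  # placeholder function name
print(params["AlarmName"])
# With boto3 installed and credentials configured, these could be passed as
# boto3.client("cloudwatch").put_metric_alarm(**params)
```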

---

## **Dataset Used**

The dataset consists of daily trending YouTube videos for multiple regions and contains:
- **Fields**: Video title, channel title, views, likes, dislikes, tags, comment counts, and more.
- **Additional Files**: A `category_id` mapping in a JSON file.

**Source**: [YouTube Trending Videos Dataset on Kaggle](https://www.kaggle.com/datasets/datasnaek/youtube-new)
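
To give a feel for the cleaning these fields need, here is a small sketch that parses one row and casts the numeric columns. The sample row is made up, the column names are assumed to match the Kaggle CSV headers, and the tags are assumed to be stored as a single pipe-separated string.

```python
import csv
import io

# Made-up sample row using a subset of the assumed column names.
SAMPLE_CSV = '''video_id,title,channel_title,category_id,tags,views,likes,dislikes,comment_count
abc123,Sample title,Sample channel,10,"music|""live show""|2017",1500,120,3,45
'''

INT_FIELDS = ("category_id", "views", "likes", "dislikes", "comment_count")


def parse_rows(text: str) -> list:
    """Parse CSV text into dicts with integer counts and a list of tags."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = dict(raw)
        for field in INT_FIELDS:
            row[field] = int(row[field])
        # Split the pipe-separated tag string and drop surrounding quotes.
        row["tags"] = [tag.strip('"') for tag in row["tags"].split("|")]
        rows.append(row)
    return rows


parsed = parse_rows(SAMPLE_CSV)
print(parsed[0]["views"], parsed[0]["tags"])
# 1500 ['music', 'live show', '2017']
```

Casting `category_id` to an integer here also lines it up with the numeric ids in the category mapping file.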

---

## **Data Pipeline Workflow**

1. **Data Ingestion**:
   - Bulk data is ingested into the S3 Landing Area using S3 APIs.

2. **Data Processing**:
   - AWS Glue performs ETL operations to clean and transform data.
   - AWS Lambda triggers processing jobs for automation.

3. **Storage**:
   - Cleaned and transformed data is stored in S3 under the Cleansed/Enriched and Analytics zones.

4. **Query & Access**:
   - Amazon Athena provides SQL-based query capabilities on data stored in S3.
   - Analytical access is provided via APIs.

5. **Monitoring**:
   - Amazon CloudWatch monitors data flow, transformations, and system performance.

---

## **Technologies & Tools**

- **AWS Services**: S3, Glue, Lambda, Athena, QuickSight, CloudWatch, Step Functions, IAM.
- **Programming Language**: Python (for Glue and Lambda scripts).

---

## **How to Run the Project**

1. Set up an **AWS account** and configure the necessary services (IAM roles, S3 buckets, Glue jobs, Athena).
2. Upload the **raw dataset** to the **S3 Landing Area**.
3. Configure **AWS Glue jobs** and triggers using **Lambda** for processing.
4. **Query** processed data using **Athena** or **Redshift**.
5. Visualize results on **QuickSight** or other BI tools.

---

## **Future Enhancements**

- Automate pipeline orchestration using **Airflow**.
- Integrate real-time streaming using **Kinesis**.
- Include **anomaly detection** and **recommendation systems**.

---

## **Contacts**

- Najma el boutaheri
- Email: [[email protected]]([email protected])
- LinkedIn: [LinkedIn Profile](https://www.linkedin.com/in/najma-el-boutaheri-8185a1267/)

---