https://github.com/cjunwon/youtube-data-analysis

End-to-end Youtube data analysis project using Youtube Data API, MySQL, AWS, Flask

aws-rds data-analysis datapipeline flask nlp pandas python shell sql vader-sentiment-analysis youtube youtube-api

# Youtube-Data-Analysis

An end-to-end YouTube data analysis pipeline. Quick link to the [final visualization](https://cjunwon.github.io/Youtube-Data-Analysis/).

## TABLE OF CONTENTS

* [Background](#background)
* [Objective](#objective)
* [Tools and Packages](#tools)
* [Data Pre-Processing](#data-preprocessing)
* [Data Modeling (Pipeline)](#data-modeling-pipeline)
* [Data Visualization](#data-visualization)
* [Conclusion](#conclusion)
* [Challenges and Future Work](#challenges-and-futurework)


## BACKGROUND
YouTube is often described as the second-largest search engine after Google. It provides a valuable platform for analyzing the general public's attitude toward certain topics and how information is presented to them. This project focuses on user comment interactions and video performance across several factors. In particular, it explores ***whether polar/extreme video titles attract more views and interactions***. It also demonstrates automated and scalable data pipelines built on APIs and SQL databases.


## OBJECTIVE
* Build data pipeline
  * Retrieve raw channel data and comment thread data using the YouTube Data API
  * Upload data to AWS RDS MySQL database
  * Automate/schedule data updates using ngrok and Invictify
  * Pull data from MySQL database back into pandas dataframe for analysis
* Data Analysis (Natural Language Processing)
  * Main question to answer: *Do videos with stronger sentiment values show higher view counts?*


## TOOLS
**Languages Used:** Python, SQL (MySQL), Shell


| Task | Technique | Tools/Packages Used |
| --- | --- | --- |
| Data Collection | Channel and comment data extraction through the YouTube Data API | YouTube API, pandas, AWS RDS, mysql.connector |
| Data Pre-processing | Converted string/object values to appropriate quantitative data types, extracted the published day of the week from the given date values, converted ISO-formatted video duration values to seconds, added tag counts, removed unused columns | pandas, numpy, datetime, isodate |
| Data Modeling (Pipeline) | Automation/scheduling scripts to pull and push data to and from the MySQL database | Flask, ngrok, Invictify |
| Text Analytics | Natural Language Processing using the VADER (Valence Aware Dictionary and sEntiment Reasoner) analysis tool from the NLTK package | NLP, VADER, NLTK |
| Data Visualization | Exploratory Data Analysis. Plotted view/like counts against the average comment sentiment value for each video to analyze patterns | Matplotlib, seaborn, plotly |
| Environments & Platforms | Main functions stored and organized in Python scripts; analysis and comment extraction run in a Jupyter notebook | YouTube, AWS RDS, Jupyter Notebook |


## DATA-PREPROCESSING

The data cleaning process had two stages: the first for video information collected through the YouTube Data API, and the second for preparing the comments for natural language processing.
* Video Information:
  * The code for this process can be found in [youtube_api_functions.py](https://github.com/cjunwon/Youtube-Data-Analysis/blob/main/youtube_api_functions.py) under the 'clean_video_df' function.
  * The YouTube API returns all video information as object values. Columns containing numeric information were converted to numeric data types.
  * Added a column showing the published day of the week, derived from Python datetime values
  * Converted duration (originally in ISO format) to seconds using the 'isodate' library
  * Removed unused columns
* Comment Texts:
  * The code for this process can be found in [main.ipynb](https://github.com/cjunwon/Youtube-Data-Analysis/blob/main/main.ipynb) under the 'preprocess' function.
  * Since the VADER model was used for NLP analysis, the comment texts did not require heavy cleaning: the model comfortably handles emojis, stopwords, etc. The only cleaning step was removing the newline character ("\n") from every comment.
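
The cleaning steps above can be illustrated with a minimal, dependency-free sketch. It is not the project's `clean_video_df` or `preprocess` code: the function names, the flattened record layout, and the regex-based duration parser (a stand-in for the `isodate` library the project uses) are all assumptions made for illustration. Replacing newlines with a space rather than deleting them is also an assumption, chosen so that adjacent words do not fuse.

```python
import re
from datetime import datetime

# Stand-in for isodate: handles day/hour/minute/second durations such as "PT4M13S".
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(iso_duration):
    """Convert an ISO 8601 duration string to a whole number of seconds."""
    match = _DURATION.match(iso_duration)
    if match is None:
        raise ValueError(f"unsupported duration: {iso_duration!r}")
    parts = {name: int(value) if value else 0 for name, value in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


def clean_video(raw):
    """Clean one video record (hypothetical flat dict of string values)."""
    published = datetime.strptime(raw["publishedAt"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "video_id": raw["video_id"],
        "view_count": int(raw["viewCount"]),        # string -> numeric
        "like_count": int(raw["likeCount"]),        # string -> numeric
        "published_day": published.strftime("%A"),  # day of the week
        "duration_seconds": duration_to_seconds(raw["duration"]),
        "tag_count": len(raw.get("tags") or []),    # a missing tag list counts as 0
    }


def preprocess_comment(text):
    """Strip newline characters; everything else is left for VADER to handle."""
    return text.replace("\n", " ")


raw_video = {
    "video_id": "abc123",
    "viewCount": "1200",
    "likeCount": "34",
    "publishedAt": "2021-03-05T17:30:00Z",
    "duration": "PT1H2M3S",
    "tags": ["news", "tech"],
}
cleaned = clean_video(raw_video)
comment = preprocess_comment("Great video!\nLoved it")
```

In a pandas workflow the same conversions would typically be applied column-wise rather than one record at a time.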


## DATA-MODELING-PIPELINE

1. **YouTube video data**: Channel playlist data is extracted using the YouTube API, then stored and cleaned in a pandas dataframe.
2. **AWS MySQL RDS**: The processed dataframe is uploaded to a MySQL RDS instance hosted on AWS. The upload function uses unique video IDs to detect videos that are already stored and updates them with current statistics.
3. **Automation using Flask, ngrok, Invictify**: The API collection and SQL upload functions are hosted on a Flask server, which is exposed through ngrok, and the scripts are scheduled with Invictify.
4. **Data pulled from the MySQL database back into a pandas dataframe for analysis**: Video IDs and other relevant columns can be selected for further analysis.
5. **YouTube comment data**: Top-level comments for the video IDs selected above are extracted through the YouTube API for analysis.
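
The upload step, which checks for videos already stored by their unique IDs and refreshes their statistics, can be sketched as follows. This is an illustration under stated assumptions, not the project's upload function: the table name, column names, and the `upsert_videos` helper are hypothetical, and an in-memory SQLite database stands in for the AWS-hosted MySQL instance so the example runs anywhere.

```python
import sqlite3


def upsert_videos(conn, rows):
    """Insert unseen videos and update statistics for ones already stored.

    Returns a tuple (inserted, updated).
    """
    cur = conn.cursor()
    inserted = updated = 0
    for row in rows:
        # Check whether this unique video ID is already in the table.
        cur.execute("SELECT 1 FROM videos WHERE video_id = ?", (row["video_id"],))
        if cur.fetchone() is None:
            cur.execute(
                "INSERT INTO videos (video_id, title, view_count, like_count) "
                "VALUES (?, ?, ?, ?)",
                (row["video_id"], row["title"], row["view_count"], row["like_count"]),
            )
            inserted += 1
        else:
            cur.execute(
                "UPDATE videos SET title = ?, view_count = ?, like_count = ? "
                "WHERE video_id = ?",
                (row["title"], row["view_count"], row["like_count"], row["video_id"]),
            )
            updated += 1
    conn.commit()
    return inserted, updated


# SQLite stand-in for the MySQL database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE videos ("
    "video_id TEXT PRIMARY KEY, title TEXT, view_count INTEGER, like_count INTEGER)"
)

first_run = upsert_videos(
    conn,
    [{"video_id": "abc123", "title": "Demo", "view_count": 100, "like_count": 5}],
)
second_run = upsert_videos(
    conn,
    [
        {"video_id": "abc123", "title": "Demo", "view_count": 250, "like_count": 9},
        {"video_id": "xyz789", "title": "Other", "view_count": 10, "like_count": 1},
    ],
)
current_views = conn.execute(
    "SELECT view_count FROM videos WHERE video_id = 'abc123'"
).fetchone()[0]
```

With `mysql.connector` the same statements would use `%s` placeholders instead of `?`.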


## DATA-VISUALIZATION

Exploratory data analysis was completed using the Matplotlib and Seaborn libraries.

The [final interactive plot](https://cjunwon.github.io/Youtube-Data-Analysis/) was created using the plotly library.
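
The plots compare each video's view count with the average sentiment of its comments. A minimal sketch of that aggregation is shown here, assuming each comment has already been scored with VADER's compound value (a number between -1 and 1). The function names and sample data are hypothetical and are not taken from the project's notebook.

```python
from collections import defaultdict
from statistics import mean


def average_sentiment_by_video(scored_comments):
    """Average the per-comment compound scores for each video ID."""
    grouped = defaultdict(list)
    for video_id, compound in scored_comments:
        grouped[video_id].append(compound)
    return {video_id: mean(scores) for video_id, scores in grouped.items()}


def plot_points(avg_sentiment, view_counts):
    """Pair average sentiment with view count, skipping videos without a view count."""
    return [
        (video_id, avg_sentiment[video_id], view_counts[video_id])
        for video_id in sorted(avg_sentiment)
        if video_id in view_counts
    ]


# Hypothetical (video_id, compound_score) pairs and view counts.
scored_comments = [("abc123", 0.5), ("abc123", 0.25), ("xyz789", -0.5)]
view_counts = {"abc123": 1200, "xyz789": 45000}

averages = average_sentiment_by_video(scored_comments)
points = plot_points(averages, view_counts)
```

Each tuple in `points` supplies one marker for a scatter plot of average sentiment against views.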


## CONCLUSION

The YouTube Data API provides a rich set of data on selected channels and videos that supports many types of analysis. The tools and methods used in this project could be applied to managing a personal YouTube channel, giving you personalized, up-to-date feedback on the channel's performance.


## CHALLENGES-AND-FUTUREWORK

* The comment extraction process through the YouTube Data API could be added to the pipeline alongside the video information, with comments uploaded to the MySQL database and stored in a separate table/schema.
* An interactive dashboard could be built to capture and display data for multiple YouTube channels in a more efficient and accessible manner.