https://github.com/steveee27/youtube-comments-scraping-analysis

This project scrapes YouTube comments from machine learning videos in Bahasa Indonesia. It includes preprocessing, text analysis, and visualization with word clouds. Techniques like One-Hot Encoding, CountVectorizer, and TF-IDF reveal key themes for further analysis.
Host: GitHub
URL: https://github.com/steveee27/youtube-comments-scraping-analysis
Owner: steveee27
License: mit
Created: 2024-11-15T06:54:54.000Z (11 months ago)
Default Branch: main
Last Pushed: 2024-11-15T07:24:20.000Z (11 months ago)
Last Synced: 2025-01-30T21:32:37.118Z (8 months ago)
Topics: machine-learning, nlp, sentiment-analysis, web-scraping, word-cloud, youtube-api-v3
Language: Jupyter Notebook
Homepage:
Size: 1.77 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# YouTube Comments Scraping and Analysis

This project scrapes and analyzes YouTube comments on machine learning videos in Bahasa Indonesia. Comments were collected systematically through the YouTube API, then processed and analyzed with several text representation techniques to surface the most common terms and themes.

## Table of Contents
- [Project Overview](#project-overview)
- [Project Workflow](#project-workflow)
- [Conclusion](#conclusion)
- [License](#license)

## Project Overview

- **Data Source**: YouTube comments on machine learning videos in Bahasa Indonesia.
- **Text Representation Techniques**: TF-IDF, One-Hot Encoding, and CountVectorizer.
- **Analysis Methods**: Word frequency analysis and word cloud visualization.
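
For the data source above, the following is a minimal sketch of how comments might be paged through the YouTube Data API v3 `commentThreads.list` endpoint using only the Python standard library. It is not taken from the project's notebook: the function names (`extract_comments`, `fetch_page`, `fetch_all_comments`) and the `fetch` hook are illustrative assumptions, and a valid API key and video ID are needed for a real request.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.googleapis.com/youtube/v3/commentThreads"


def extract_comments(page):
    """Pull the top-level comment text out of one commentThreads response page."""
    return [
        item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        for item in page.get("items", [])
    ]


def fetch_page(video_id, api_key, page_token=None):
    """Request one page (up to 100 comment threads) for a video."""
    params = {
        "part": "snippet",
        "videoId": video_id,
        "maxResults": 100,
        "textFormat": "plainText",
        "key": api_key,
    }
    if page_token:
        params["pageToken"] = page_token
    url = API_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def fetch_all_comments(video_id, api_key, fetch=fetch_page):
    """Follow nextPageToken until the API stops returning one.

    `fetch` is a hypothetical hook so the paging logic can be exercised
    without network access.
    """
    comments, token = [], None
    while True:
        page = fetch(video_id, api_key, token)
        comments.extend(extract_comments(page))
        token = page.get("nextPageToken")
        if not token:
            return comments
```

Calling `fetch_all_comments("VIDEO_ID", "API_KEY")` with real values would return a flat list of top-level comment strings; replies are not included in this sketch.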

## Project Workflow

1. **Data Collection**: Comments were scraped from YouTube using the YouTube API.
2. **Data Preprocessing**:
   - Cleansing: Removed irrelevant text, emojis, and special characters.
   - Tokenization: Split comments into individual words.
   - Stopword Removal: Removed common stopwords to focus on relevant words.
   - Lemmatization: Reduced words to their base forms.
3. **Text Representation**:
   - **TF-IDF**: Term Frequency-Inverse Document Frequency weights each word by how often it appears in a comment and how rare it is across all comments, producing a numerical feature vector.
   - **One-Hot Encoding**: Represents each unique word as its own column, marked 1 if the word is present in a comment and 0 otherwise.
   - **CountVectorizer**: Counts how many times each word occurs in each comment, producing a document-term count matrix.
4. **Analysis and Visualization**:
   - **Word Frequency Analysis**: Identified the most common terms in the comments.
   - **Word Cloud**: Visualized frequent words to reveal popular themes.
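
The preprocessing, representation, and frequency steps above can be sketched on a toy corpus as follows. This is an illustration written in plain Python rather than the project's actual notebook code, so that the arithmetic behind each representation is visible. The sample comments and the short stopword list are made up, lemmatization is omitted, and the TF-IDF shown is the textbook `count * log(N / df)` form, which differs from scikit-learn's smoothed and normalized variant.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real run would use a full Indonesian list.
STOPWORDS = {"yang", "dan", "di", "ini", "itu", "untuk", "saya", "nya"}


def preprocess(comment):
    """Lowercase, replace non-letter characters (emojis, digits, punctuation)
    with spaces, split into tokens, and drop stopwords."""
    cleaned = re.sub(r"[^a-z\s]", " ", comment.lower())
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]


def count_vectors(docs, vocab):
    """Bag-of-words counts: one row per comment, one column per vocabulary term."""
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return rows


def one_hot_vectors(count_rows):
    """Binary presence/absence version of the count matrix."""
    return [[1 if c > 0 else 0 for c in row] for row in count_rows]


def tfidf_vectors(count_rows):
    """Textbook TF-IDF: raw count times log(N / document frequency)."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0]) if count_rows else 0
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / df[j]) if df[j] else 0.0 for j in range(n_terms)]
        for row in count_rows
    ]


# Hypothetical sample comments, not taken from the scraped data.
comments = [
    "Belajar machine learning itu seru! 😀",
    "Data training yang bagus untuk machine learning",
    "Terima kasih, videonya membantu belajar data",
]
docs = [preprocess(c) for c in comments]
vocab = sorted({tok for doc in docs for tok in doc})
counts = count_vectors(docs, vocab)
one_hot = one_hot_vectors(counts)
tfidf = tfidf_vectors(counts)

# Corpus-wide word frequencies, the kind of input a word cloud is drawn from.
frequencies = Counter(tok for doc in docs for tok in doc)
print(frequencies.most_common(3))
```

A word that appears in every comment gets a TF-IDF weight of zero under this formula, which is why TF-IDF tends to highlight distinctive terms while raw counts highlight frequent ones.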

## Conclusion

This project demonstrates the process of collecting, processing, and analyzing text data from YouTube comments, highlighting various text representation techniques. Through word frequency analysis and word cloud visualization, we uncovered common themes and key terms discussed in the comments, such as "machine learning," "data," and "training." These insights provide an overview of popular topics in Indonesian-language machine learning discussions on YouTube.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
