https://github.com/motcapbovit/big-data-powered-vietnamese-hate-speech-detection-a-kafka-deep-learning-approach
This is the project for the course DS200 - Big Data at the University of Information Technology - Vietnam National University, Ho Chi Minh City.
- Host: GitHub
- URL: https://github.com/motcapbovit/big-data-powered-vietnamese-hate-speech-detection-a-kafka-deep-learning-approach
- Owner: motcapbovit
- Created: 2024-03-08T16:51:24.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-03-11T15:14:10.000Z (11 months ago)
- Last Synced: 2024-11-21T00:54:03.373Z (3 months ago)
- Topics: colab-notebook, hate-speech-detection, jupyter-notebook, kafka, python
- Language: Jupyter Notebook
- Homepage: https://youtu.be/mbf41jPhKSA
- Size: 68.4 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Big Data-powered Vietnamese Hate Speech Detection: A Kafka - Deep Learning Approach
## Abstract
In this research, we developed a Kafka-based system for detecting hate speech in the live chat of videos streamed on YouTube. The system comprises two components: offline and online. In the offline component, we experimented with various Machine Learning and Deep Learning models on two datasets to determine the most effective model for the hate speech detection task. In the online component, comments are crawled from the chat box of live YouTube videos, and the model predicts whether each comment is hate speech or clean. We then label and evaluate these comments to improve the model's performance, particularly on comments it misclassifies. This two-part approach yields a robust and adaptive system for real-time hate speech detection in YouTube live streams.
## Usage
The project comprises two code files: "Youtube_Live_Chat" and "Kafka_Deep_Learning". The "Youtube_Live_Chat" file collects real-time comments from a specified YouTube live stream and stores them in CSV files, with each file containing 10 comments.
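The batching step can be illustrated with a minimal sketch. It is not the notebook's actual code: the function names, the `comment` column header, and the `comments_00000.csv` file naming are hypothetical, and writing a trailing file with fewer than 10 comments is an assumption the README does not address.

```python
import csv
from pathlib import Path
from typing import Iterable, Iterator, List

BATCH_SIZE = 10  # the README states that each CSV file holds 10 comments


def batched(comments: Iterable[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Group an incoming stream of comments into lists of at most `size`."""
    batch: List[str] = []
    for comment in comments:
        batch.append(comment)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Assumption: a trailing partial batch is still emitted.
        yield batch


def write_batches(comments: Iterable[str], out_dir: str, size: int = BATCH_SIZE) -> List[Path]:
    """Write each batch to its own CSV file and return the file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    for index, batch in enumerate(batched(comments, size)):
        # Hypothetical naming scheme; zero-padding keeps files sorted by arrival.
        path = out / f"comments_{index:05d}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["comment"])  # hypothetical column header
            writer.writerows([c] for c in batch)
        paths.append(path)
    return paths
```

In the real notebook the `comments` iterable would be fed by the live chat crawler; here any iterable of strings works, which keeps the batching logic testable offline.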
The "Kafka_Deep_Learning" file builds the full pipeline, from loading the CSV files to making predictions. Its steps are loading the CSV files, preprocessing the data, writing to Kafka topics, and using Kafka to stream the data to the model for prediction. The final cell of the file contains the complete pipeline, run in a loop: whenever a new CSV file is detected, the pipeline automatically executes these steps.
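A rough sketch of one pass of such a loop follows, under several assumptions: the CSV files have a header column named `comment`, the `preprocess` function is a placeholder rather than the project's actual cleaning step, and the topic name `comments` is hypothetical. The Kafka producer is abstracted as a `send` callable so the logic can run without a broker; the prediction step on the consumer side is not shown.

```python
import csv
import json
from pathlib import Path
from typing import Callable, List, Set


def preprocess(text: str) -> str:
    """Placeholder cleaning step: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def load_comments(path: Path) -> List[str]:
    """Read the assumed `comment` column from one CSV file."""
    with path.open(newline="", encoding="utf-8") as handle:
        return [row["comment"] for row in csv.DictReader(handle)]


def process_new_files(
    watch_dir: Path,
    seen: Set[Path],
    send: Callable[[str, bytes], None],
    topic: str = "comments",  # hypothetical topic name
) -> int:
    """Run one pass: load, preprocess, and publish every CSV not seen before.

    Returns the number of messages handed to `send`.
    """
    sent = 0
    for path in sorted(watch_dir.glob("*.csv")):
        if path in seen:
            continue  # already processed in an earlier pass
        for comment in load_comments(path):
            payload = json.dumps(
                {"comment": preprocess(comment)}, ensure_ascii=False
            ).encode("utf-8")
            send(topic, payload)
            sent += 1
        seen.add(path)
    return sent
```

To poll continuously, one could call `process_new_files` inside a `while True` loop with a short `time.sleep` between passes. With the kafka-python client, `send` could plausibly be `lambda topic, value: producer.send(topic, value=value)` for a `KafkaProducer` instance, though the notebook may wire this differently.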
## System Architecture
## Model Evaluation
## Contact
[Chi Thanh Dang](https://github.com/motcapbovit), [Thuy Hong Thi Dang](https://github.com/KaytlynDangDS) and Van Nguyen Dinh

Faculty of Information Science and Engineering, University of Information Technology, Vietnam National University, Ho Chi Minh City, Vietnam.