https://github.com/sayamalt/cyberbullying-classification-using-fine-tuned-distilbert
A pretrained DistilBERT transformer model fine-tuned to classify social media text into one of four labels (ethnicity/race, gender/sexual, religion, or not cyberbullying), with a reported accuracy of 99%.
- Host: GitHub
- URL: https://github.com/sayamalt/cyberbullying-classification-using-fine-tuned-distilbert
- Owner: SayamAlt
- License: apache-2.0
- Created: 2024-06-10T15:20:01.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-06-10T15:30:56.000Z (over 1 year ago)
- Last Synced: 2024-12-28T08:09:33.843Z (10 months ago)
- Topics: cyberbullying-detection, data-exploration, distilbert-model, exploratory-data-analysis, fine-tune-bert-tensorflow, llm, model-inference, model-training-and-evaluation, multiclass-classification, natural-language-processing, text-classification, text-preprocessing, text-tokenization
- Language: Jupyter Notebook
- Homepage:
- Size: 7.24 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
## About Dataset
This repository contains a balanced dataset for cyberbullying detection in social media. The dataset has been carefully curated and labeled so that researchers and developers can build accurate cyberbullying detection models. It covers three types of cyberbullying content (race/ethnicity, gender/sexual, and religion-related) as well as non-cyberbullying instances. The dataset accompanies the paper "Self-Training for Cyberbully Detection: Achieving High Accuracy with a Balanced Multi-Class Dataset".
The dataset consists of approximately 100,000 tweets collected from social media. It is labeled for multi-class classification, with each tweet falling into one of the following categories:
- Non-cyberbullying: 50,000 instances
- Race/Ethnicity-related cyberbullying: 17,000 instances
- Gender/Sexual-related cyberbullying: 17,000 instances
- Religion-related cyberbullying: around 16,000 instances

The dataset is balanced between non-cyberbullying and cyberbullying tweets (about 50,000 each), and the three cyberbullying categories are roughly equal in size, which supports effective training and evaluation of cyberbullying detection models.
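As a quick check on the class distribution above, the following minimal sketch computes each class's share of the dataset and an inverse-frequency weight per class. The counts are the approximate figures quoted in this README. The label keys and function names are hypothetical, chosen for illustration only. Whether the notebook applies class weights during fine-tuning is not stated here, so treat the weighting as an optional idea and not as the author's method. The weight formula, `total / (num_classes * count)`, is the same heuristic scikit-learn uses for its "balanced" class weights.

```python
# Approximate class counts quoted in this README
# (the religion figure is given as "around 16,000").
# The dictionary keys are illustrative label names, not taken from the notebook.
CLASS_COUNTS = {
    "not_cyberbullying": 50_000,
    "ethnicity/race": 17_000,
    "gender/sexual": 17_000,
    "religion": 16_000,
}


def class_shares(counts):
    """Return each class's fraction of the total number of examples."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def inverse_frequency_weights(counts):
    """Weight each class by total / (num_classes * count).

    Classes with fewer examples get larger weights.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


shares = class_shares(CLASS_COUNTS)
weights = inverse_frequency_weights(CLASS_COUNTS)

for label in CLASS_COUNTS:
    print(f"{label:>18}: share={shares[label]:.2f}, weight={weights[label]:.2f}")
```

With these counts the non-cyberbullying class makes up half of the data, so a weighted loss would down-weight it relative to the three smaller cyberbullying classes.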