https://github.com/arcangelofranco/cyberbullying_classification_hlt_2023-2024
Human Language Technologies Project for the Academic Year 2023-2024
- Host: GitHub
- URL: https://github.com/arcangelofranco/cyberbullying_classification_hlt_2023-2024
- Owner: arcangelofranco
- License: mit
- Created: 2024-07-21T10:28:26.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-04-05T10:51:42.000Z (about 2 months ago)
- Last Synced: 2025-04-05T11:31:16.024Z (about 2 months ago)
- Topics: academic-project, cyberbullying-detection, data-science, human-language-technologies, machine-learning, ml, nlp, text-classification, unipi
- Language: Jupyter Notebook
- Homepage:
- Size: 33.3 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# Human Language Technologies
## Academic Year Project 2023/2024

# Cyberbullying Classification
## Introduction
This project, conducted as part of the Human Language Technologies (HLT) course, aims to develop and evaluate a Natural Language Processing (NLP) model for the classification of tweets from the social media platform X (formerly known as Twitter) as potential acts of cyberbullying or offensive behavior.
## Motivations
The project is driven by three primary motivations:
1. **Application of Knowledge**: To apply the theoretical and methodological concepts learned during the HLT course.
2. **Social Relevance**: To address the growing social and psychological issue of cyberbullying, which has become more prevalent since the Covid-19 pandemic.
3. **Challenging Goals**: To meet the challenging goals set by the authors of the dataset and to contribute meaningful insights into the domain of cyberbullying detection.

## Project Structure
The project is organized into several key directories:
- **`_chunckdevs`**: A custom library developed by the team specifically for this project.
- **`data`**: Contains all datasets used for training and evaluation.
- **`notebooks`**: Includes commented Jupyter notebooks for preprocessing, baseline models, advanced models, and transformer-based models.
- **`outputs`**: Stores generated outputs, including trained models and other relevant files.
- **`requirements.txt`**: Contains all the libraries needed for code execution.

## Dataset and Goal
The dataset, sourced from Kaggle, consists of over 47,000 tweets, each labeled according to the type of cyberbullying. The dataset is balanced, with each class containing approximately 8,000 tweets. Tweets are categorized either as descriptions of bullying events or as the bullying acts themselves. The primary objectives are:
1. **Binary Classification**: To identify whether a tweet constitutes an act of cyberbullying or not.
2. **Multiclass Classification**: To detect the specific type of discriminatory act, with labels including:
   - Age
   - Ethnicity
   - Gender
   - Religion
   - Other types of cyberbullying
   - Not cyberbullying

## Data Understanding and Preparation
### Data Understanding
Initial exploration included the creation of word clouds for each class, revealing significant semantic differences related to cyberbullying. Hashtags, initially considered, were eventually excluded due to their low frequency and lack of specificity.
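The kind of counting behind a per-class word cloud, and behind the decision to drop hashtags, can be sketched with the standard library alone. This is a minimal illustration, not the notebooks' actual code: the function names, the regular expressions, the label strings, and the toy tweets are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

HASHTAG_RE = re.compile(r"#\w+")
TOKEN_RE = re.compile(r"[a-z']+")


def class_term_counts(tweets, labels):
    """Count lowercase word tokens per class (the raw data behind a word cloud)."""
    counts = defaultdict(Counter)
    for text, label in zip(tweets, labels):
        # Remove hashtags first so they are only counted by hashtag_counts().
        cleaned = HASHTAG_RE.sub(" ", text.lower())
        counts[label].update(TOKEN_RE.findall(cleaned))
    return counts


def hashtag_counts(tweets):
    """Count how often each hashtag occurs across the whole corpus."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts


# Toy data, invented for illustration only.
tweets = [
    "School bullies picked on me in high school",
    "Those school bullies again #neverforget",
    "Lovely weather today",
]
labels = ["age", "age", "not_cyberbullying"]

terms = class_term_counts(tweets, labels)
tags = hashtag_counts(tweets)
print(terms["age"].most_common(2))
print(tags.most_common(1))
```

Per-class counts like these can be handed to a word-cloud renderer, while the hashtag counter gives a quick check of how rare individual hashtags are.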
### Data Preprocessing
Two versions of the dataset were prepared: one containing all tweets and another with only English texts. Both versions were split into development and test sets. Normalization was applied exclusively to the development set tweets. Duplicate tweets, particularly those labeled as "other cyberbullying," were identified and removed.
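A rough sketch of this pipeline follows, using only the standard library. The specific normalization steps (lowercasing, removing URLs and @mentions), the 80/20 split ratio, the order of operations, and every function name are assumptions for illustration; the project's notebooks may do this differently.

```python
import random
import re
from collections import defaultdict

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
SPACE_RE = re.compile(r"\s+")


def normalize(text):
    """Lowercase, strip URLs and @mentions, collapse whitespace (assumed steps)."""
    text = URL_RE.sub(" ", text.lower())
    text = MENTION_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def drop_duplicates(rows):
    """Keep the first (text, label) row for each distinct raw tweet text."""
    seen, unique = set(), []
    for text, label in rows:
        if text not in seen:
            seen.add(text)
            unique.append((text, label))
    return unique


def stratified_split(rows, test_fraction=0.2, seed=42):
    """Split rows into development and test sets, class by class."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    rng = random.Random(seed)
    dev, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        dev.extend(group[n_test:])
    return dev, test


# Toy rows, invented for illustration; the last row duplicates the first.
rows = [(f"tweet {i} @user http://t.co/x{i}", "age") for i in range(5)]
rows += [(f"Hello {i}", "not_cyberbullying") for i in range(5)]
rows.append(("tweet 0 @user http://t.co/x0", "age"))

unique = drop_duplicates(rows)
dev, test = stratified_split(unique)
# Normalization touches the development set only; test tweets stay raw.
dev = [(normalize(text), label) for text, label in dev]
print(len(unique), len(dev), len(test))
```

Splitting per class keeps the balance of the original dataset in both subsets, which matters when the test set is small.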
## Classification
### Models Implemented
A variety of models were implemented and evaluated:
- Baseline models
- Advanced models
- Transformer-based models
- Ensemble models

### Feature Engineering
Features were engineered for both baseline and advanced models, with extensive hyperparameter tuning to optimize performance.
### Evaluation Metrics
Model performance was evaluated using metrics such as precision, recall, and F1-score.
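For reference, these metrics can be computed per class from true and predicted labels as in the sketch below. It is a plain-Python illustration with hypothetical function names and toy labels; the macro average shown is one common way to summarize multiclass results and is not stated in this README.

```python
def per_class_scores(y_true, y_pred):
    """Return precision, recall, and F1 for every label seen in either list."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 values."""
    return sum(s["f1"] for s in scores.values()) / len(scores)


# Toy labels, invented for illustration only.
y_true = ["age", "age", "gender", "gender", "religion", "religion"]
y_pred = ["age", "gender", "gender", "gender", "religion", "age"]

scores = per_class_scores(y_true, y_pred)
for label, s in scores.items():
    print(label, round(s["precision"], 3), round(s["recall"], 3), round(s["f1"], 3))
print("macro F1:", round(macro_f1(scores), 3))
```

Reporting the per-class values alongside the average makes it visible when a single class, rather than the model as a whole, is dragging precision down.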
## Results for Classification
### Baseline and Advanced Models
Ensemble models achieved the highest F1-scores, although precision for certain classes remained challenging.
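This README does not say how the ensembles combine their members' predictions. One common scheme is hard majority voting, sketched below as an assumption rather than a description of the project's implementation; `majority_vote`, the tie-breaking rule, and the toy predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists by hard voting.

    `predictions` holds one list of labels per model, all of equal length.
    Ties go to the earliest model in the list whose vote is among the leaders.
    """
    combined = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


# Toy predictions from three hypothetical models on three tweets.
model_a = ["age", "gender", "religion"]
model_b = ["age", "age", "ethnicity"]
model_c = ["gender", "gender", "not_cyberbullying"]

print(majority_vote([model_a, model_b, model_c]))
```

Ordering the models from strongest to weakest makes the tie-break favor the most reliable member.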
### Comparison with State-of-the-Art
Our models were benchmarked against state-of-the-art (SOTA) models to evaluate relative performance.
## Conclusions
Our analysis demonstrates that while machine learning models can effectively distinguish between different types of cyberbullying, they struggle with context and intent, particularly in distinguishing non-cyberbullying tweets from harmful messages. This underscores the need for further research into context disambiguation and intent understanding to improve the efficacy of cyberbullying detection models.