https://github.com/syedt1/shared_task1_hatespeech
- Host: GitHub
- URL: https://github.com/syedt1/shared_task1_hatespeech
- Owner: SyedT1
- Created: 2025-07-31T15:51:53.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-11-08T05:05:12.000Z (7 months ago)
- Last Synced: 2025-11-08T06:17:39.904Z (7 months ago)
- Language: Jupyter Notebook
- Size: 5.86 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 5
Metadata Files:
- Readme: README.md
# Shared Task 1: Hate Speech Detection in Bengali
## Project Overview
This repository contains implementations for the Bengali Multi-task Hate Speech Identification shared task at the BLP Workshop @ IJCNLP-AACL 2025. The project addresses hate speech detection in Bengali across three related subtasks: hate type classification, target identification, and multi-task analysis. The approaches range from recurrent deep learning baselines to pretrained transformer models combined with K-Fold cross-validation, text normalization, and adversarial training.
## Competition Phases
### 🔬 **Developmental Phase**
- **Objective**: Model experimentation, architecture exploration, and hyperparameter tuning
- **Data**: Training and validation datasets provided by organizers
- **Focus**: Testing various approaches and techniques to identify best-performing models
- **Metrics**: Validation F1 scores on development set
### 🏆 **Evaluation Phase**
- **Objective**: Final model evaluation on unseen test data
- **Data**: Hidden test set released during evaluation period
- **Focus**: Deploying best models from developmental phase with refined configurations
- **Metrics**: Test F1 scores on official evaluation set
## Repository Structure
### Subtask 1A - Hate Speech Type Classification
Multi-class classification of Bengali text into: Abusive, Sexism, Religious Hate, Political Hate, Profane, or None.
#### 📊 **Developmental Phase Results**
##### **Deep Learning Models**
- **BiLSTM** - F1 Score: 56.25%
- **LSTM with Attention** - F1 Score: 55.18%
##### **Large Language Models (LLMs)**
- **XLM-RoBERTa-large** - F1 Score: 72.81%
- **MuRIL-large-cased** - F1 Score: 71.02%
- **BanglaBERT (csebuetnlp)** - F1 Score: 70.74%
- **BanglaBERT-large (csebuetnlp)** - F1 Score: 70.51%
- **XLM-RoBERTa-base** - F1 Score: 70.50%
- **DistilBERT-multilingual** - F1 Score: 68.03%
##### **LLMs with K-Fold Cross Validation**
- **MuRIL-large-cased with K-Fold** - F1 Score: 73.61%
- **XLM-RoBERTa-large with K-Fold** - F1 Score: 73.45%
- **BanglaBERT with K-Fold** - F1 Score: 73.29%
##### **K-Fold with Text Normalizer**
- **BanglaBERT with Normalizer** - F1 Score: 74.32%
- **MuRIL-large-cased with Normalizer** - F1 Score: 73.73%
- **XLM-RoBERTa-large with Normalizer** - F1 Score: 73.29%
##### **LLMs with Adversarial Training (K-Fold + FGM)**
- **BanglaBERT with K-Fold + FGM** - F1 Score: 73.87%
- **MuRIL-large-cased with K-Fold + FGM** - F1 Score: 73.68%
##### **Advanced Combined Approaches (K-Fold + FGM + Normalizer)**
- **BanglaBERT + K-Fold + FGM + Normalizer** - F1 Score: 74.88% ⭐ (Best Development Score)
- **MuRIL-large-cased + K-Fold + FGM + Normalizer** - F1 Score: 73.81%
#### 🎯 **Evaluation Phase Results**
- **BanglaBERT + K-Fold + FGM + Normalizer** - Test F1: 72.33% ⭐ (Best Test Score)
- **MuRIL-large-cased + K-Fold + Normalizer** - Test F1: 72.30%
- **BanglaBERT + K-Fold + FGM** - Test F1: 72.17%
- **BanglaBERT + K-Fold** - Test F1: 72.05%
- **MuRIL-large-cased + K-Fold + FGM** - Test F1: 71.90%
- **MuRIL-large-cased + K-Fold** - Test F1: 71.88%
- **XLM-RoBERTa-large + K-Fold** - Test F1: 71.72%
- **XLM-RoBERTa-large + K-Fold + Normalizer** - Test F1: 71.57%
- **MuRIL-large-cased + K-Fold + FGM + Normalizer** - Test F1: 71.31%
- **BanglaBERT + K-Fold + Normalizer** - Test F1: 71.14%
- **BanglaBERT (Base)** - Test F1: 70.31%
### Subtask 1B - Hate Speech Target Classification
Classification of hate speech targets into: Individuals, Organizations, Communities, or Society.
#### 📊 **Developmental Phase Results**
##### **Deep Learning Models**
- Traditional deep learning approaches implemented (scores pending)
##### **Large Language Models (LLMs)**
- **BanglaBERT** - F1 Score: 72.09%
- **MuRIL-large-cased** - F1 Score: 71.93%
- **XLM-RoBERTa-large** - F1 Score: 71.38%
##### **LLMs with K-Fold Cross Validation**
- **MuRIL-large-cased with K-Fold** - F1 Score: 74.96% ⭐ (Best Development Score)
- **BanglaBERT with K-Fold** - F1 Score: 73.69%
- **XLM-RoBERTa-large with K-Fold** - F1 Score: 71.53%
##### **K-Fold with Text Normalizer**
- **BanglaBERT with Normalizer** - F1 Score: 74.72%
- **MuRIL-large-cased with Normalizer** - F1 Score: 74.48%
- **XLM-RoBERTa-large with Normalizer** - F1 Score: 72.39%
##### **LLMs with K-Fold and Adversarial Attacks (FGM)**
- **XLM-RoBERTa-large with K-Fold + FGM** - F1 Score: 74.20%
- **BanglaBERT with K-Fold + FGM** - F1 Score: 74.12%
- **MuRIL-large-cased with K-Fold + FGM** - F1 Score: 73.89%
##### **Advanced Combined Approaches (K-Fold + Adversarial + Normalizer)**
- **BanglaBERT + K-Fold + FGM + Normalizer** - F1 Score: 74.64%
- **MuRIL-large-cased + K-Fold + FGM + Normalizer** - F1 Score: 74.56%
- **XLM-RoBERTa-large + K-Fold + FGM + Normalizer** - F1 Score: 74.32%
#### 🎯 **Evaluation Phase Results**
##### **Base LLMs (without K-Fold)**
- **XLM-RoBERTa-large** - Test F1: 71.23%
- **MuRIL-large-cased** - Test F1: 70.93%
- **BanglaBERT** - Test F1: 70.25%
##### **LLMs with K-Fold Cross Validation**
- **MuRIL-large-cased + K-Fold** - Test F1: 73.44%
- **BanglaBERT + K-Fold** - Test F1: 71.85%
- **XLM-RoBERTa-large + K-Fold** - Test F1: 68.07%
##### **K-Fold with Text Normalizer**
- **MuRIL-large-cased + K-Fold + Normalizer** - Test F1: 73.44%
- **BanglaBERT + K-Fold + Normalizer** - Test F1: 72.89%
- **XLM-RoBERTa-large + K-Fold + Normalizer** - Test F1: 71.66%
##### **LLMs with K-Fold and Adversarial Attacks (FGM)**
- **XLM-RoBERTa-large + K-Fold + FGM** - Test F1: 73.28%
- **MuRIL-large-cased + K-Fold + FGM** - Test F1: 72.92%
- **BanglaBERT + K-Fold + FGM** - Test F1: 72.25%
##### **Advanced Combined Approaches (K-Fold + FGM + Normalizer)**
- **BanglaBERT + K-Fold + FGM + Normalizer** - Test F1: 73.12%
- **MuRIL-large-cased + K-Fold + FGM + Normalizer** - Test F1: 72.95%
- **XLM-RoBERTa-large + K-Fold + FGM + Normalizer** - Test F1: 72.17%
### Subtask 1C - Multi-task Hate Speech Analysis
Multi-task classification combining hate type (Abusive, Sexism, Religious Hate, Political Hate, Profane, None), severity (Little to None, Mild, Severe), and target group (Individuals, Organizations, Communities, Society).
#### 📊 **Developmental Phase Results**
##### **Base LLMs**
- Basic transformer implementations (scores pending)
##### **LLMs with K-Fold Cross Validation**
- Standard K-Fold implementations (scores pending)
##### **LLMs with Adversarial Training and K-Fold**
All runs use BanglaBERT (csebuetnlp) with K-Fold and a different adversarial technique each:
- **BanglaBERT + FreeLB** - F1 Score: 74.52% ⭐ (Best Development Score)
- **BanglaBERT + Simple FreeLB** - F1 Score: 73.91%
- **BanglaBERT + GAT** - F1 Score: 73.79%
- **BanglaBERT + FGM** - F1 Score: 73.75%
##### **LLMs with K-Fold and Normalizer**
- Text normalization implementations (scores pending)
##### **Advanced Combined Approaches (K-Fold + Adversarial + Normalizer)**
- Comprehensive technique combinations (scores pending)
#### 🎯 **Evaluation Phase Results**
##### **LLMs with K-Fold and Normalizer**
- **BanglaBERT + K-Fold + Normalizer** - Test F1: 73.00%
##### **LLMs with Adversarial Training and K-Fold**
- **BanglaBERT + FreeLB + K-Fold** - Test F1: 72.00%
## Technical Implementation Details
### Advanced Training Techniques
#### **Adversarial Training Methods**
- **FGM (Fast Gradient Method)**: Simple and efficient adversarial perturbations
- **AWP (Adversarial Weight Perturbation)**: Weight-space adversarial training
- **FreeLB**: Free large-batch adversarial training for improved generalization
- **Simple FreeLB**: Streamlined version of FreeLB
- **GAT (Geometry-Aware Training)**: Advanced geometry-aware adversarial training
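As a rough illustration of the FGM idea listed above: the perturbation is the loss gradient with respect to the input embedding, rescaled to a fixed L2 norm `epsilon`. The dependency-free sketch below applies it to a toy logistic classifier. The function names, the toy weights, and `epsilon=0.5` are illustrative assumptions, not code taken from the notebooks.

```python
import math


def fgm_perturbation(grad, epsilon=1.0):
    """Return epsilon * grad / ||grad||_2 (zero vector if the gradient is zero)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0 for _ in grad]
    return [epsilon * g / norm for g in grad]


def logistic_loss_and_grad(weights, embedding, label):
    """Binary cross-entropy of a toy linear classifier and its gradient w.r.t. the embedding."""
    logit = sum(w * x for w, x in zip(weights, embedding))
    prob = 1.0 / (1.0 + math.exp(-logit))
    loss = -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))
    grad = [(prob - label) * w for w in weights]  # d(loss)/d(embedding)
    return loss, grad


# Hypothetical toy values standing in for an embedding and classifier weights.
weights = [0.5, -1.0, 2.0]
embedding = [1.0, 0.2, 0.3]
label = 1

clean_loss, grad = logistic_loss_and_grad(weights, embedding, label)
delta = fgm_perturbation(grad, epsilon=0.5)
adv_embedding = [x + d for x, d in zip(embedding, delta)]
adv_loss, _ = logistic_loss_and_grad(weights, adv_embedding, label)

# In FGM training, gradients from the clean and the perturbed forward pass are
# accumulated before the optimizer step, and the perturbation is then removed.
print(f"clean loss={clean_loss:.4f}, adversarial loss={adv_loss:.4f}")
```

Because the perturbation follows the gradient direction, the perturbed loss is higher than the clean loss, and the model is trained to stay correct under that shift.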
#### **Text Normalization Pipeline**
```python
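# Assumption: `normalize` comes from the csebuetnlp normalizer package.
from normalizer import normalize

text = "আমি বাংলায় গান গাই।"  # any raw Bengali input string

normalized = normalize(
    text,
    unicode_norm="NFKC",           # Compatibility decomposition, then canonical composition
    punct_replacement=None,        # Preserve original punctuation
    url_replacement=None,          # Preserve URLs
    emoji_replacement=None,        # Preserve emojis
    apply_unicode_norm_last=True,  # Apply Unicode normalization as the final step
)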
```
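To show what the NFKC step alone does to Bengali text, here is a small sketch using Python's standard `unicodedata` module. It is not the normalizer call shown above, and the helper name `nfkc` is an assumption made for illustration.

```python
import unicodedata


def nfkc(text):
    """Apply Unicode NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


# U+09DF (য়) is a precomposed letter; U+09AF + U+09BC is the same letter typed
# as a base consonant plus nukta. After NFKC both spellings are identical.
same_letter = nfkc("\u09df") == nfkc("\u09af\u09bc")

# The vowel sign ো typed as U+09C7 + U+09BE is composed into U+09CB.
composed_o = nfkc("\u09c7\u09be")

# Compatibility characters such as fullwidth digits fold to their plain forms.
digits = nfkc("\uff11\uff12\uff13")

print(same_letter, composed_o == "\u09cb", digits)
```

Collapsing visually identical spellings this way means the tokenizer sees one byte sequence per word, which is one plausible reason normalization helps on noisy social media text.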
#### **Custom Model Architectures**
- **Attention-Based Pooling Head**: Dynamic token weighting for better representation
- **Multi-Head Classification**: Custom classification layers for Bengali text
- **Enhanced Dropout Strategies**: Improved regularization techniques
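The attention-based pooling head can be sketched without any framework: each token gets a score, the scores are softmax-normalized over the non-padding positions, and the pooled vector is the weighted sum of token vectors. In the sketch below, the function name, the fixed `score_weights` vector (learnable in a real model), and the toy inputs are assumptions for illustration.

```python
import math


def attention_pool(token_vectors, score_weights, mask):
    """Pool token vectors with softmax attention weights.

    token_vectors: per-token hidden vectors (lists of floats)
    score_weights: scoring vector (learnable in a real model, fixed here)
    mask: 1 for real tokens, 0 for padding (at least one real token assumed)
    """
    scores = [sum(w * h for w, h in zip(score_weights, vec)) for vec in token_vectors]
    # Padding positions get -inf so they receive zero attention weight.
    scores = [s if m else float("-inf") for s, m in zip(scores, mask)]
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    dim = len(token_vectors[0])
    pooled = [sum(a * vec[d] for a, vec in zip(attn, token_vectors)) for d in range(dim)]
    return pooled, attn


# Two real tokens and one padding token (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
pooled, attn = attention_pool(tokens, score_weights=[1.0, 1.0], mask=[1, 1, 0])
print(pooled, attn)
```

Unlike using only the first token's representation, this lets the classifier weight informative tokens more heavily while ignoring padding.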
#### **Cross-Validation Strategy**
- **K-Fold Implementation**: 5-fold cross-validation for robust evaluation
- **Stratified Sampling**: Maintaining class distribution across folds
- **Ensemble Averaging**: Combining predictions from multiple folds
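The three points above can be sketched in plain Python. This is a minimal, unshuffled illustration with hypothetical function names and toy probabilities; in practice scikit-learn's `StratifiedKFold` would normally do the splitting.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits=5):
    """Assign each example index to a fold, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


def average_fold_probabilities(fold_probs):
    """Average class probabilities from several fold models and take the argmax."""
    n_models = len(fold_probs)
    n_examples = len(fold_probs[0])
    n_classes = len(fold_probs[0][0])
    predictions = []
    for i in range(n_examples):
        mean = [sum(model[i][c] for model in fold_probs) / n_models for c in range(n_classes)]
        predictions.append(max(range(n_classes), key=mean.__getitem__))
    return predictions


# Toy data: 10 "None" and 5 "Abusive" examples split into 5 folds.
labels = ["None"] * 10 + ["Abusive"] * 5
folds = stratified_folds(labels, n_splits=5)

# Three fold models scoring one test example over two classes.
fold_probs = [[[0.6, 0.4]], [[0.3, 0.7]], [[0.2, 0.8]]]
ensemble_prediction = average_fold_probabilities(fold_probs)
print(folds, ensemble_prediction)
```

Each fold keeps the 2:1 class ratio of the full set, and the ensemble predicts class 1 even though one fold model preferred class 0.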
## Performance Analysis
### 📈 Best Performing Models by Phase
#### Developmental Phase Champions:
| Subtask | Model | F1 Score | Technique |
|---------|-------|----------|-----------|
| **1A** | BanglaBERT | 74.88% | K-Fold + FGM + Normalizer |
| **1B** | MuRIL-large-cased | 74.96% | K-Fold Cross Validation |
| **1C** | BanglaBERT | 74.52% | FreeLB Adversarial Training |
#### Evaluation Phase Performance:
| Subtask | Model | Dev F1 | Test F1 | Change (pp) |
|---------|-------|--------|---------|-------------|
| **1A** | BanglaBERT + K-Fold + FGM + Normalizer | 74.88% | 72.33% | -2.55 |
| **1B** | MuRIL-large-cased + K-Fold | 74.96% | 73.44% | -1.52 |
| **1C** | BanglaBERT + K-Fold + FreeLB | 74.52% | 72.00% | -2.52 |
| **1C** | BanglaBERT + K-Fold + Normalizer | - | 73.00% | - |
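
The development-to-test gap is a difference of two percentages, so it is best read in percentage points (pp). A small sketch of that arithmetic for the two configurations that have both a development and a test score for the same setup (the dictionary layout is an illustrative assumption):

```python
# Development and test F1 scores (in %) copied from the result lists above.
scores = {
    "1A BanglaBERT + K-Fold + FGM + Normalizer": (74.88, 72.33),
    "1B MuRIL-large-cased + K-Fold": (74.96, 73.44),
}

# Test minus development, rounded to two decimals to hide float noise.
gaps = {name: round(test - dev, 2) for name, (dev, test) in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} pp")
```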
#### Best Test Phase Models (Subtask 1A):
| Approach | BanglaBERT | MuRIL-large | XLM-RoBERTa-large |
|----------|------------|-------------|-------------------|
| **Base LLM** | 70.31% | - | - |
| **+ K-Fold** | 72.05% | 71.88% | 71.72% |
| **+ K-Fold + Normalizer** | 71.14% | 72.30% | 71.57% |
| **+ K-Fold + FGM** | 72.17% | 71.90% | - |
| **+ K-Fold + FGM + Normalizer** | 72.33% ⭐ | 71.31% | - |
#### Best Test Phase Models (Subtask 1B):
| Approach | BanglaBERT | MuRIL-large | XLM-RoBERTa-large |
|----------|------------|-------------|-------------------|
| **Base LLM** | 70.25% | 70.93% | 71.23% |
| **+ K-Fold** | 71.85% | 73.44% ⭐ | 68.07% |
| **+ K-Fold + Normalizer** | 72.89% | 73.44% ⭐ | 71.66% |
| **+ K-Fold + FGM** | 72.25% | 72.92% | 73.28% |
| **+ K-Fold + FGM + Normalizer** | 73.12% | 72.95% | 72.17% |
#### Best Test Phase Models (Subtask 1C):
| Approach | BanglaBERT | Development | Test |
|----------|------------|-------------|------|
| **K-Fold + Normalizer** | ✅ | - | 73.00% ⭐ |
| **K-Fold + FreeLB** | ✅ | 74.52% | 72.00% |
| **Simple FreeLB** | ✅ | 73.91% | - |
| **GAT** | ✅ | 73.79% | - |
| **FGM** | ✅ | 73.75% | - |
### Key Performance Insights
#### Development vs Evaluation Observations:
- **Generalization Gap**: Roughly 1-3 percentage points of F1 lost from development to test across all subtasks
- **Most Stable**: K-Fold + Normalizer combinations showed the best consistency (especially in Subtask 1C)
- **Overfitting Risk**: Single models without cross-validation showed higher variance
- **Best Generalization**:
  - Subtask 1A: Adversarial training methods (FGM + Normalizer)
  - Subtask 1B: Combined approaches (K-Fold + FGM + Normalizer)
  - Subtask 1C: Normalization techniques (best test score, 73.00%; no development score is listed for this configuration, so its drop cannot be computed directly)
#### Technical Effectiveness:
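- **K-Fold Cross Validation**: About 1.5-3 percentage points of development F1 for BanglaBERT and MuRIL-large-cased, but under 1 point for XLM-RoBERTa-large, which also fell from 71.23% to 68.07% on the Subtask 1B test set
- **Text Normalization**: About 1 percentage point for BanglaBERT in development; mixed results (from -0.5 to +0.9 points) for MuRIL-large-cased and XLM-RoBERTa-large
- **Adversarial Training**: Model-dependent; adding FGM to K-Fold changed development F1 by between -1.1 and +2.7 percentage points
- **Combined Techniques**: Best development and test scores in Subtask 1A; in Subtask 1B, MuRIL-large-cased with K-Fold alone scored highest
- **Transformer Superiority**: Roughly 12-18 percentage points over the BiLSTM and LSTM-with-attention baselines in Subtask 1A, before K-Fold or other additions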
## Model Architecture Details
### Transformer Models Utilized
- **BanglaBERT (csebuetnlp)**: Specialized Bengali language model
- **MuRIL-large-cased**: Multilingual model with strong Bengali support
- **XLM-RoBERTa (base & large)**: Cross-lingual transformer variants
- **DistilBERT-multilingual**: Lightweight multilingual model
### Custom Implementations
- **Enhanced Tokenization**: Bengali-specific preprocessing pipelines
- **Dynamic Padding**: Efficient batch processing strategies
- **Label Smoothing**: Improved training stability
- **Learning Rate Scheduling**: Optimized training convergence
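Label smoothing, listed above, replaces the one-hot target with a slightly softened distribution before the cross-entropy is computed. A minimal sketch follows. The smoothing value 0.1 and the function names are assumptions, and the formula is the common `(1 - s) * one_hot + s / K` variant rather than a confirmed detail of the notebooks.

```python
import math


def smoothed_targets(true_class, n_classes, smoothing=0.1):
    """Return (1 - smoothing) on the true class plus smoothing / n_classes on every class."""
    base = smoothing / n_classes
    return [base + (1.0 - smoothing if c == true_class else 0.0) for c in range(n_classes)]


def cross_entropy(probs, targets):
    """Cross-entropy between predicted probabilities and a target distribution."""
    return -sum(t * math.log(p) for p, t in zip(probs, targets))


# Six classes, as in Subtask 1A; class index 2 is the (hypothetical) gold label.
targets = smoothed_targets(true_class=2, n_classes=6, smoothing=0.1)
uniform_prediction = [1.0 / 6] * 6
loss = cross_entropy(uniform_prediction, targets)
print(targets, loss)
```

Because the target never reaches exactly 1.0, the model is not rewarded for pushing logits to extremes, which tends to stabilize fine-tuning on small datasets.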
## File Organization
### Directory Structure:
```
Shared_Task1_HateSpeech/
├── subtask1A/                                  # Hate speech type classification
│   ├── Developmental Phase/
│   │   ├── DL Models/                          # BiLSTM, LSTM-Attention
│   │   ├── LLMs/                               # Base transformer models
│   │   ├── LLMS with K Fold CV/                # K-Fold implementations
│   │   ├── K Folds with normalizer/
│   │   ├── LLMs_KFolds_adversarial attacks/
│   │   ├── LLMS_KFolds_attacks_normalizer/
│   │   └── Various classification heads/
│   └── Evaluation Phase/                       # Final test submissions
├── subtask1B/                                  # Hate speech target classification
│   ├── Developmental Phase/
│   │   ├── DL Models/
│   │   ├── LLMs/
│   │   ├── LLMS with K Fold CV/
│   │   ├── K Folds with normalizer/
│   │   ├── LLMs_KFolds_adversarial attacks/
│   │   └── LLMS_KFolds_attacks_normalizer/
│   └── Evaluation Phase/
│       ├── LLMs/
│       ├── LLMS with K Fold CV/
│       ├── K Folds with normalizer/
│       ├── LLMs_KFolds_adversarial attacks/
│       └── LLMS_KFolds_attacks_normalizer/
└── subtask1C/                                  # Multi-task hate speech analysis
    ├── Developmental Phase/
    │   ├── LLMs/
    │   ├── LLMS with K Fold CV/
    │   ├── LLMs with adversarial attacks and K Fold CV/
    │   ├── LLMs with K Fold CV and normalizer/
    │   └── K Fold CV with attacks and normalizer/
    └── Evaluation Phase/
        ├── LLMs/
        ├── LLMS with K Fold CV/
        ├── LLMs with adversarial attacks and K Fold CV/
        ├── LLMs with K Fold CV and normalizer/
        └── K Fold CV with attacks and normalizer/
```
### Naming Convention:
- **Model directories**: `v{f1_score}_{model_name}`
  - Example: `v0.7488_banglabert-fgm` = 74.88% F1 score using BanglaBERT with FGM
- **Each directory contains**:
  - Jupyter notebook (.ipynb) with complete implementation
  - Dataset file (subtask_1X.tsv)
  - Model checkpoints and outputs
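
If the directory names follow the convention above, the score and model tag can be recovered programmatically, for example to rank runs. The regular expression and the function name below are assumptions based only on the single example given.

```python
import re

# Assumed pattern: "v" + F1 as a decimal fraction + "_" + free-form model tag.
DIR_PATTERN = re.compile(r"^v(?P<f1>\d+\.\d+)_(?P<model>.+)$")


def parse_model_dir(name):
    """Split a name like 'v0.7488_banglabert-fgm' into (F1 in %, model tag)."""
    match = DIR_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected directory name: {name!r}")
    return round(float(match.group("f1")) * 100, 2), match.group("model")


f1_percent, model_tag = parse_model_dir("v0.7488_banglabert-fgm")
print(f1_percent, model_tag)
```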
## Performance Evolution
### Developmental Phase Progression:
1. **Baseline Models**: 55-56% F1 (BiLSTM and LSTM with attention)
2. **Base Transformers**: 68-73% F1 (Standard LLM implementations)
3. **K-Fold Enhancement**: 71-75% F1 (Cross-validation improvements)
4. **Normalization Boost**: 72-75% F1 (Text preprocessing optimization)
5. **Adversarial Training**: 73-75% F1 (Robustness improvements)
6. **Combined Excellence**: 74-75% F1 (Best technique combinations)
### Development → Evaluation Trends:
- **Average Performance Drop**: Roughly 1-3 percentage points of F1 on unseen test data
- **Most Stable Approaches**: K-Fold + Normalizer combinations
- **Highest Risk**: Single model implementations without regularization
- **Best Generalization**: Models with adversarial training components
## Technologies and Frameworks
### Core Technologies:
- **Deep Learning**: PyTorch, TensorFlow
- **Transformers**: Hugging Face Transformers library
- **Text Processing**: Custom Bengali normalizers, NLTK
- **Evaluation**: Scikit-learn, Custom metrics implementations
- **Adversarial**: Custom FGM, AWP, FreeLB implementations
- **Cross-Validation**: Stratified K-Fold with scikit-learn
### Hardware and Training:
- **GPU Acceleration**: CUDA-enabled training
- **Mixed Precision**: For memory efficiency
- **Gradient Accumulation**: Effective batch size optimization
- **Early Stopping**: Preventing overfitting
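Early stopping, the last item above, can be expressed as a small helper that tracks the best validation score. This is a generic sketch: the class name, `patience=2`, and the per-epoch scores are hypothetical and not taken from the notebooks.

```python
class EarlyStopping:
    """Stop training when validation F1 has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_f1):
        """Record one epoch's score; return True when training should stop."""
        if val_f1 > self.best + self.min_delta:
            self.best = val_f1
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
history = [0.70, 0.73, 0.72, 0.73, 0.74]  # hypothetical validation F1 per epoch
stopped_at = None
for epoch, f1 in enumerate(history, start=1):
    if stopper.step(f1):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

With these scores, training stops at epoch 4 and never sees the later 0.74, which illustrates the trade-off behind the `patience` setting.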
## Key Contributions
### Novel Techniques Implemented:
1. **Bengali-Specific Normalization**: NFKC Unicode with preservation strategies
2. **Advanced Adversarial Training**: Multiple adversarial techniques comparison
3. **Custom Attention Heads**: Learnable pooling mechanisms
4. **Robust Cross-Validation**: Stratified K-Fold with ensemble strategies
5. **Multi-Phase Evaluation**: Systematic development vs evaluation analysis
### Research Insights:
- **Language-Specific Approaches**: Bengali text requires specialized preprocessing
- **Adversarial Robustness**: Significant impact on generalization
- **Cross-Validation Importance**: Critical for reliable performance estimation
- **Model Ensemble Benefits**: Combining techniques yields optimal results
## Usage Instructions
### Running Experiments:
1. Navigate to desired subtask directory
2. Choose appropriate approach folder
3. Open corresponding Jupyter notebook
4. Ensure required dependencies are installed
5. Execute cells sequentially for complete pipeline
### Model Training:
- Each notebook contains complete training pipeline
- Data preprocessing and normalization included
- Model evaluation and metrics calculation automated
- Results saved with performance indicators
## Future Work
### Potential Improvements:
- **Multi-Modal Approaches**: Incorporating contextual information
- **Advanced Ensembling**: Sophisticated model combination strategies
- **Real-Time Processing**: Optimized inference pipelines
- **Transfer Learning**: Cross-task knowledge transfer
- **Data Augmentation**: Synthetic data generation for Bengali
### Research Directions:
- **Explainability**: Understanding model decision processes
- **Fairness Analysis**: Bias detection and mitigation
- **Cross-Lingual Transfer**: Knowledge sharing across languages
- **Domain Adaptation**: Generalization to different text domains
## Official Task Information
### Task Details
- **Competition**: Bengali Multi-task Hate Speech Identification Shared Task
- **Workshop**: BLP Workshop @ IJCNLP-AACL 2025
- **Website**: https://multihate.github.io/
- **Evaluation Metrics**:
  - Subtask 1A & 1B: Micro-F1
  - Subtask 1C: Weighted Micro-F1
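
For reference, micro-F1 pools true positives, false positives, and false negatives over all classes before computing F1. For single-label multi-class tasks such as 1A and 1B it therefore equals accuracy. The sketch below is a plain-Python illustration with a hypothetical function name and toy labels. It does not cover the weighted variant used for Subtask 1C, whose exact definition is given by the organizers.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = ["Abusive", "None", "Profane", "None"]
y_pred = ["Abusive", "None", "None", "Profane"]
score = micro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(score, accuracy)
```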
### Data Format
#### Subtask 1A
```
id text label
```
Labels: Abusive, Sexism, Religious Hate, Political Hate, Profane, None
#### Subtask 1B
```
id text label
```
Labels: Individuals, Organizations, Communities, Society
#### Subtask 1C
```
id text hate_type hate_severity to_whom
```
- hate_type: Abusive, Sexism, Religious Hate, Political Hate, Profane, None
- hate_severity: Little to None, Mild, Severe
- to_whom: Individuals, Organizations, Communities, Society
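
A minimal sketch of reading the Subtask 1C layout with Python's standard `csv` module. It assumes the files are tab-separated with a header row. The two sample rows are invented placeholders, not dataset content.

```python
import csv
import io

# Hypothetical two-row sample in the Subtask 1C column layout.
sample = (
    "id\ttext\thate_type\thate_severity\tto_whom\n"
    "1\texample text one\tAbusive\tMild\tIndividuals\n"
    "2\texample text two\tProfane\tSevere\tCommunities\n"
)

# QUOTE_NONE keeps quotation marks inside the text column as literal characters.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

for row in rows:
    print(row["id"], row["hate_type"], row["hate_severity"], row["to_whom"])
```

For a real file, replace `io.StringIO(sample)` with `open(path, encoding="utf-8", newline="")`.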
## Citation and Acknowledgments
This work is a broad exploration of Bengali hate speech detection for the BLP Workshop @ IJCNLP-AACL 2025 shared task, contributing to multilingual NLP and social media content moderation.
### Organizers
- Md Arid Hasan, PhD Student, The University of Toronto
- Firoj Alam, Senior Scientist, Qatar Computing Research Institute
- Md Fahad Hossain, Lecturer, Daffodil International University
- Usman Naseem, Assistant Professor, Macquarie University
- Syed Ishtiaque Ahmed, Associate Professor, The University of Toronto
---
**Note**: This repository demonstrates state-of-the-art approaches for Bengali hate speech detection across multiple classification tasks, with particular emphasis on robust evaluation methodology and practical implementation strategies for the official shared task.