https://github.com/syedt1/shared_task1_hatespeech


Host: GitHub
URL: https://github.com/syedt1/shared_task1_hatespeech
Owner: SyedT1
Created: 2025-07-31T15:51:53.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-11-08T05:05:12.000Z (7 months ago)
Last Synced: 2025-11-08T06:17:39.904Z (7 months ago)
Language: Jupyter Notebook
Size: 5.86 MB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 5
Metadata Files:
- Readme: README.md


# Shared Task 1: Hate Speech Detection in Bengali

## Project Overview
This repository contains implementations for the Bengali Multi-task Hate Speech Identification shared task at the BLP Workshop @ IJCNLP-AACL 2025. The task covers three related subtasks: hate type classification (1A), target identification (1B), and multi-task analysis (1C). The experiments range from recurrent baselines (BiLSTM, LSTM with attention) to fine-tuned transformer models trained with K-fold cross-validation, text normalization, and adversarial training.

## Competition Phases

### 🔬 **Developmental Phase**
- **Objective**: Model experimentation, architecture exploration, and hyperparameter tuning
- **Data**: Training and validation datasets provided by organizers
- **Focus**: Testing various approaches and techniques to identify best-performing models
- **Metrics**: Validation F1 scores on development set

### 🏆 **Evaluation Phase**
- **Objective**: Final model evaluation on unseen test data
- **Data**: Hidden test set released during evaluation period
- **Focus**: Deploying best models from developmental phase with refined configurations
- **Metrics**: Test F1 scores on official evaluation set

## Repository Structure

### Subtask 1A - Hate Speech Type Classification
Multi-class classification of Bengali text into: Abusive, Sexism, Religious Hate, Political Hate, Profane, or None.

#### 📊 **Developmental Phase Results**

##### **Deep Learning Models**
- **BiLSTM** - F1 Score: 56.25%
- **LSTM with Attention** - F1 Score: 55.18%

##### **Large Language Models (LLMs)**
- **XLM-RoBERTa-large** - F1 Score: 72.81%
- **MuRIL-large-cased** - F1 Score: 71.02%
- **BanglaBERT (csebuetnlp)** - F1 Score: 70.74%
- **BanglaBERT-large (csebuetnlp)** - F1 Score: 70.51%
- **XLM-RoBERTa-base** - F1 Score: 70.50%
- **DistilBERT-multilingual** - F1 Score: 68.03%

##### **LLMs with K-Fold Cross Validation**
- **MuRIL-large-cased with K-Fold** - F1 Score: 73.61%
- **XLM-RoBERTa-large with K-Fold** - F1 Score: 73.45%
- **BanglaBERT with K-Fold** - F1 Score: 73.29%

##### **K-Fold with Text Normalizer**
- **BanglaBERT with Normalizer** - F1 Score: 74.32%
- **MuRIL-large-cased with Normalizer** - F1 Score: 73.73%
- **XLM-RoBERTa-large with Normalizer** - F1 Score: 73.29%

##### **LLMs with Adversarial Training (K-Fold + FGM)**
- **BanglaBERT with K-Fold + FGM** - F1 Score: 73.87%
- **MuRIL-large-cased with K-Fold + FGM** - F1 Score: 73.68%

##### **Advanced Combined Approaches (K-Fold + FGM + Normalizer)**
- **BanglaBERT + K-Fold + FGM + Normalizer** - F1 Score: 74.88% ⭐ (Best Development Score)
- **MuRIL-large-cased + K-Fold + FGM + Normalizer** - F1 Score: 73.81%

#### 🎯 **Evaluation Phase Results**
- **BanglaBERT + K-Fold + FGM + Normalizer** - Test F1: 72.33% ⭐ (Best Test Score)
- **MuRIL-large-cased + K-Fold + Normalizer** - Test F1: 72.30%
- **BanglaBERT + K-Fold + FGM** - Test F1: 72.17%
- **BanglaBERT + K-Fold** - Test F1: 72.05%
- **MuRIL-large-cased + K-Fold + FGM** - Test F1: 71.90%
- **MuRIL-large-cased + K-Fold** - Test F1: 71.88%
- **XLM-RoBERTa-large + K-Fold** - Test F1: 71.72%
- **XLM-RoBERTa-large + K-Fold + Normalizer** - Test F1: 71.57%
- **MuRIL-large-cased + K-Fold + FGM + Normalizer** - Test F1: 71.31%
- **BanglaBERT + K-Fold + Normalizer** - Test F1: 71.14%
- **BanglaBERT (Base)** - Test F1: 70.31%

### Subtask 1B - Hate Speech Target Classification
Classification of hate speech targets into: Individuals, Organizations, Communities, or Society.

#### 📊 **Developmental Phase Results**

##### **Deep Learning Models**
- Traditional deep learning approaches implemented (scores pending)

##### **Large Language Models (LLMs)**
- **BanglaBERT** - F1 Score: 72.09%
- **MuRIL-large-cased** - F1 Score: 71.93%
- **XLM-RoBERTa-large** - F1 Score: 71.38%

##### **LLMs with K-Fold Cross Validation**
- **MuRIL-large-cased with K-Fold** - F1 Score: 74.96% ⭐ (Best Development Score)
- **BanglaBERT with K-Fold** - F1 Score: 73.69%
- **XLM-RoBERTa-large with K-Fold** - F1 Score: 71.53%

##### **K-Fold with Text Normalizer**
- **BanglaBERT with Normalizer** - F1 Score: 74.72%
- **MuRIL-large-cased with Normalizer** - F1 Score: 74.48%
- **XLM-RoBERTa-large with Normalizer** - F1 Score: 72.39%

##### **LLMs with K-Fold and Adversarial Attacks (FGM)**
- **XLM-RoBERTa-large with K-Fold + FGM** - F1 Score: 74.20%
- **BanglaBERT with K-Fold + FGM** - F1 Score: 74.12%
- **MuRIL-large-cased with K-Fold + FGM** - F1 Score: 73.89%

##### **Advanced Combined Approaches (K-Fold + Adversarial + Normalizer)**
- **BanglaBERT + K-Fold + FGM + Normalizer** - F1 Score: 74.64%
- **MuRIL-large-cased + K-Fold + FGM + Normalizer** - F1 Score: 74.56%
- **XLM-RoBERTa-large + K-Fold + FGM + Normalizer** - F1 Score: 74.32%

#### 🎯 **Evaluation Phase Results**

##### **Base LLMs (without K-Fold)**
- **XLM-RoBERTa-large** - Test F1: 71.23%
- **MuRIL-large-cased** - Test F1: 70.93%
- **BanglaBERT** - Test F1: 70.25%

##### **LLMs with K-Fold Cross Validation**
- **MuRIL-large-cased + K-Fold** - Test F1: 73.44% ⭐ (Best Test Score, tied)
- **BanglaBERT + K-Fold** - Test F1: 71.85%
- **XLM-RoBERTa-large + K-Fold** - Test F1: 68.07%

##### **K-Fold with Text Normalizer**
- **MuRIL-large-cased + K-Fold + Normalizer** - Test F1: 73.44% ⭐ (Best Test Score, tied)
- **BanglaBERT + K-Fold + Normalizer** - Test F1: 72.89%
- **XLM-RoBERTa-large + K-Fold + Normalizer** - Test F1: 71.66%

##### **LLMs with K-Fold and Adversarial Attacks (FGM)**
- **XLM-RoBERTa-large + K-Fold + FGM** - Test F1: 73.28%
- **MuRIL-large-cased + K-Fold + FGM** - Test F1: 72.92%
- **BanglaBERT + K-Fold + FGM** - Test F1: 72.25%

##### **Advanced Combined Approaches (K-Fold + FGM + Normalizer)**
- **BanglaBERT + K-Fold + FGM + Normalizer** - Test F1: 73.12%
- **MuRIL-large-cased + K-Fold + FGM + Normalizer** - Test F1: 72.95%
- **XLM-RoBERTa-large + K-Fold + FGM + Normalizer** - Test F1: 72.17%

### Subtask 1C - Multi-task Hate Speech Analysis
Multi-task classification combining hate type (Abusive, Sexism, Religious Hate, Political Hate, Profane, None), severity (Little to None, Mild, Severe), and target group (Individuals, Organizations, Communities, Society).

#### 📊 **Developmental Phase Results**

##### **Base LLMs**
- Basic transformer implementations (scores pending)

##### **LLMs with K-Fold Cross Validation**
- Standard K-Fold implementations (scores pending)

##### **LLMs with Adversarial Training and K-Fold**
All runs use BanglaBERT (csebuetnlp) with different adversarial techniques:
- **BanglaBERT + FreeLB** - F1 Score: 74.52% ⭐ (Best Development Score)
- **BanglaBERT + Simple FreeLB** - F1 Score: 73.91%
- **BanglaBERT + GAT** - F1 Score: 73.79%
- **BanglaBERT + FGM** - F1 Score: 73.75%

##### **LLMs with K-Fold and Normalizer**
- Text normalization implementations (scores pending)

##### **Advanced Combined Approaches (K-Fold + Adversarial + Normalizer)**
- Comprehensive technique combinations (scores pending)

#### 🎯 **Evaluation Phase Results**

##### **LLMs with K-Fold and Normalizer**
- **BanglaBERT + K-Fold + Normalizer** - Test F1: 73.00%

##### **LLMs with Adversarial Training and K-Fold**
- **BanglaBERT + FreeLB + K-Fold** - Test F1: 72.00%

## Technical Implementation Details

### Advanced Training Techniques

#### **Adversarial Training Methods**
- **FGM (Fast Gradient Method)**: Simple and efficient adversarial perturbations
- **AWP (Adversarial Weight Perturbation)**: Weight-space adversarial training
- **FreeLB**: Free large-batch adversarial training for improved generalization
- **Simple FreeLB**: Streamlined version of FreeLB
- **GAT (Geometry-Aware Training)**: Advanced geometry-aware adversarial training
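
As a rough illustration of the first method in this list, FGM perturbs the input embedding along the L2-normalized gradient of the loss, `r_adv = epsilon * g / ||g||`, and then also trains on the perturbed input. The sketch below uses a toy logistic model in plain Python, not the repository's transformer code; `logistic_loss_and_grad`, `fgm_perturbation`, the weights, and the `epsilon` value are all illustrative assumptions.

```python
import math

def logistic_loss_and_grad(embedding, weights, label):
    """Binary cross-entropy of sigmoid(w . e) and its gradient w.r.t. e."""
    z = sum(w * e for w, e in zip(weights, embedding))
    p = 1.0 / (1.0 + math.exp(-z))
    loss = -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
    grad = [(p - label) * w for w in weights]  # dL/de for a linear logit
    return loss, grad

def fgm_perturbation(grad, epsilon=1.0):
    """FGM step: r_adv = epsilon * g / ||g||_2 (zero vector if g is zero)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]

# Toy values (hypothetical, not taken from the notebooks).
weights = [0.5, -1.0, 2.0]
embedding = [1.0, 0.5, -0.25]
label = 1

clean_loss, grad = logistic_loss_and_grad(embedding, weights, label)
r_adv = fgm_perturbation(grad, epsilon=0.5)
adv_embedding = [e + r for e, r in zip(embedding, r_adv)]
adv_loss, _ = logistic_loss_and_grad(adv_embedding, weights, label)

print(f"clean loss: {clean_loss:.4f}, adversarial loss: {adv_loss:.4f}")
```

In a typical FGM training loop the adversarial loss is backpropagated in addition to the clean loss, and the embedding is restored before the optimizer step.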

#### **Text Normalization Pipeline**
```python
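# Assumes the csebuetnlp `normalizer` package, whose keyword arguments match this call.
from normalizer import normalize

# `text` is the raw input string.
normalized_text = normalize(
    text,
    unicode_norm="NFKC",           # Compatibility decomposition, then canonical composition
    punct_replacement=None,        # Preserve original punctuation
    url_replacement=None,          # Preserve URLs
    emoji_replacement=None,        # Preserve emojis
    apply_unicode_norm_last=True   # Apply normalization as final step
)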
```
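
To show what the `NFKC` setting does on its own, the following sketch applies Python's standard `unicodedata` module to a few strings. It covers only the Unicode normalization step, not the rest of the pipeline, and the sample strings are illustrative choices rather than data from the task.

```python
import unicodedata

samples = {
    "ligature": "\ufb01",                # LATIN SMALL LIGATURE FI
    "fullwidth_digits": "\uff11\uff12",  # FULLWIDTH DIGIT ONE, FULLWIDTH DIGIT TWO
    "bengali_o_sign": "\u09c7\u09be",    # vowel signs E + AA, canonically equal to U+09CB
}

# NFKC = compatibility decomposition followed by canonical composition.
normalized = {
    name: unicodedata.normalize("NFKC", text) for name, text in samples.items()
}

for name, text in normalized.items():
    print(name, [f"U+{ord(ch):04X}" for ch in text])
```

The Bengali case matters for tokenization: two visually identical strings can differ at the code point level until they are normalized to the same form.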

#### **Custom Model Architectures**
- **Attention-Based Pooling Head**: Dynamic token weighting for better representation
- **Multi-Head Classification**: Custom classification layers for Bengali text
- **Enhanced Dropout Strategies**: Improved regularization techniques
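
A minimal sketch of attention-based pooling, assuming the common formulation in which each token gets a scalar score, the scores are softmax-normalized over non-padding positions, and the pooled vector is the weighted sum of token vectors. The function name, the single scoring vector, and the toy inputs are assumptions; the notebooks may implement the head differently.

```python
import math

def attention_pool(hidden_states, score_weights, mask):
    """Pool token vectors with softmax attention over unmasked positions.

    hidden_states: one list of floats per token
    score_weights: vector used to score each token (learned in a real model)
    mask: 1 for real tokens, 0 for padding (at least one real token is assumed)
    """
    scores = [sum(w * h for w, h in zip(score_weights, vec)) for vec in hidden_states]
    # Padding positions get -inf so they receive zero attention weight.
    scores = [s if m else float("-inf") for s, m in zip(scores, mask)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(a * vec[d] for a, vec in zip(attn, hidden_states)) for d in range(dim)]
    return pooled, attn

# Two real tokens and one padding token (toy values).
hidden = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
pooled, attn = attention_pool(hidden, score_weights=[1.0, 1.0], mask=[1, 1, 0])
print(pooled, attn)
```

Compared with taking only the first token's vector, this lets the classifier weight tokens differently for each input.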

#### **Cross-Validation Strategy**
- **K-Fold Implementation**: 5-fold cross-validation for robust evaluation
- **Stratified Sampling**: Maintaining class distribution across folds
- **Ensemble Averaging**: Combining predictions from multiple folds
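
The three points above can be sketched without any dependencies. The code below is a simplified stand-in for scikit-learn's `StratifiedKFold` (which the repository lists as its cross-validation tool): it deals each class's samples round-robin into folds, then averages the class probabilities that each fold's model would produce. The function names, the toy labels, and the probability values are hypothetical.

```python
from collections import defaultdict

def stratified_folds(labels, n_splits=5):
    """Assign each sample index to a fold, spreading every class evenly."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

def average_probabilities(fold_probs):
    """Average the per-class probabilities predicted by each fold's model."""
    n_models = len(fold_probs)
    n_classes = len(fold_probs[0])
    return [sum(p[c] for p in fold_probs) / n_models for c in range(n_classes)]

# Toy label distribution: 10 / 5 / 5 samples across three classes.
labels = ["None"] * 10 + ["Abusive"] * 5 + ["Profane"] * 5
folds = stratified_folds(labels, n_splits=5)

# Hypothetical probabilities from three fold models for one test sample.
fold_probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.7, 0.2]]
avg = average_probabilities(fold_probs)
predicted = max(range(len(avg)), key=avg.__getitem__)
print(folds, avg, predicted)
```

Each fold serves once as the validation split while the model trains on the remaining four; at test time the fold models' probabilities are averaged before taking the argmax.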

## Performance Analysis

### 📈 Best Performing Models by Phase

#### Developmental Phase Champions:
| Subtask | Model | F1 Score | Technique |
|---------|-------|----------|-----------|
| **1A** | BanglaBERT | 74.88% | K-Fold + FGM + Normalizer |
| **1B** | MuRIL-large-cased | 74.96% | K-Fold Cross Validation |
| **1C** | BanglaBERT | 74.52% | FreeLB Adversarial Training |

#### Evaluation Phase Performance:
| Subtask | Model | Dev F1 | Test F1 | Performance Drop |
|---------|-------|--------|---------|------------------|
| **1A** | BanglaBERT + K-Fold + FGM + Normalizer | 74.88% | 72.33% | -2.55% |
| **1B** | MuRIL-large-cased + K-Fold | 74.96% | 73.44% | -1.52% |
| **1C** | BanglaBERT (dev: K-Fold + FreeLB; test: K-Fold + Normalizer) | 74.52% | 73.00% | -1.52% |
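
The "Performance Drop" column is the test score minus the development score, so its values are percentage points rather than relative percentages. A quick check of the arithmetic, using the scores from the table above (the dictionary layout is just an illustrative choice):

```python
# (dev F1, test F1) in percent, copied from the table above.
results = {
    "1A": (74.88, 72.33),
    "1B": (74.96, 73.44),
    "1C": (74.52, 73.00),
}

# Difference in percentage points, rounded to two decimals.
drops = {task: round(test - dev, 2) for task, (dev, test) in results.items()}

for task, drop in drops.items():
    print(f"Subtask {task}: {drop:+.2f} pp")
```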

#### Best Test Phase Models (Subtask 1A):
| Approach | BanglaBERT | MuRIL-large | XLM-RoBERTa-large |
|----------|------------|-------------|-------------------|
| **Base LLM** | 70.31% | - | - |
| **+ K-Fold** | 72.05% | 71.88% | 71.72% |
| **+ K-Fold + Normalizer** | 71.14% | 72.30% | 71.57% |
| **+ K-Fold + FGM** | 72.17% | 71.90% | - |
| **+ K-Fold + FGM + Normalizer** | 72.33% ⭐ | 71.31% | - |

#### Best Test Phase Models (Subtask 1B):
| Approach | BanglaBERT | MuRIL-large | XLM-RoBERTa-large |
|----------|------------|-------------|-------------------|
| **Base LLM** | 70.25% | 70.93% | 71.23% |
| **+ K-Fold** | 71.85% | 73.44% ⭐ | 68.07% |
| **+ K-Fold + Normalizer** | 72.89% | 73.44% ⭐ | 71.66% |
| **+ K-Fold + FGM** | 72.25% | 72.92% | 73.28% |
| **+ K-Fold + FGM + Normalizer** | 73.12% | 72.95% | 72.17% |

#### Best Test Phase Models (Subtask 1C):
| Approach | BanglaBERT | Development | Test |
|----------|------------|-------------|------|
| **K-Fold + Normalizer** | ✅ | - | 73.00% ⭐ |
| **K-Fold + FreeLB** | ✅ | 74.52% | 72.00% |
| **Simple FreeLB** | ✅ | 73.91% | - |
| **GAT** | ✅ | 73.79% | - |
| **FGM** | ✅ | 73.75% | - |

### Key Performance Insights

#### Development vs Evaluation Observations:
- **Generalization Gap**: The best model for each subtask lost roughly 1.5-2.6 percentage points of F1 from development to test
- **Most Stable**: K-Fold + Normalizer combinations were the most consistent (especially in Subtask 1C)
- **Overfitting Risk**: Single models without cross-validation showed higher variance
- **Best Generalization**:
  - Subtask 1A: Adversarial training methods (FGM + Normalizer)
  - Subtask 1B: Combined approaches (K-Fold + FGM + Normalizer)
  - Subtask 1C: Normalization techniques (best test score, 73.00%; no development score is reported for this configuration)

#### Technical Effectiveness:
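- **K-Fold Cross Validation**: Up to about 3 percentage points over the single-model baseline on development data (MuRIL-large-cased on 1B: 71.93% → 74.96%), though gains were uneven: XLM-RoBERTa-large improved by under 1 point on development data and fell from 71.23% to 68.07% on the 1B test set
- **Text Normalization**: About 1 point for BanglaBERT on development data (1A: 73.29% → 74.32%; 1B: 73.69% → 74.72%); mixed for MuRIL and XLM-RoBERTa, from -0.5 to +0.9 points
- **Adversarial Training (FGM)**: From -1.1 to +2.7 points over plain K-Fold on development data; the largest gain was XLM-RoBERTa-large on 1B (71.53% → 74.20%)
- **Combined Techniques**: Best development score (74.88%) and best test score (72.33%) on Subtask 1A
- **Transformer Superiority**: On 1A development data, base transformers scored about 12-17 points above the BiLSTM baseline (56.25%)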

## Model Architecture Details

### Transformer Models Utilized
- **BanglaBERT (csebuetnlp)**: Specialized Bengali language model
- **MuRIL-large-cased**: Multilingual model with strong Bengali support
- **XLM-RoBERTa (base & large)**: Cross-lingual transformer variants
- **DistilBERT-multilingual**: Lightweight multilingual model

### Custom Implementations
- **Enhanced Tokenization**: Bengali-specific preprocessing pipelines
- **Dynamic Padding**: Efficient batch processing strategies
- **Label Smoothing**: Improved training stability
- **Learning Rate Scheduling**: Optimized training convergence
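
Two of these items are simple enough to sketch in plain Python. `pad_batch` pads each batch only to its own longest sequence, and `smoothed_cross_entropy` uses the usual label-smoothing target (the true class keeps `1 - smoothing`, and `smoothing` is spread uniformly over all classes). Both function names, the `pad_id` default, and the smoothing value are assumptions, not settings taken from the notebooks.

```python
import math

def pad_batch(token_id_lists, pad_id=0):
    """Pad sequences to the longest length in this batch, not a global maximum."""
    longest = max(len(ids) for ids in token_id_lists)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in token_id_lists]
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in token_id_lists
    ]
    return input_ids, attention_mask

def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy against a label-smoothed target distribution."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    log_probs = [x - log_z for x in logits]
    n = len(logits)
    soft_target = [
        smoothing / n + (1.0 - smoothing) * (1.0 if i == target else 0.0)
        for i in range(n)
    ]
    return -sum(q * lp for q, lp in zip(soft_target, log_probs))

input_ids, attention_mask = pad_batch([[5, 6, 7], [8]])
hard_loss = smoothed_cross_entropy([5.0, 0.0, 0.0], target=0, smoothing=0.0)
soft_loss = smoothed_cross_entropy([5.0, 0.0, 0.0], target=0, smoothing=0.1)
print(input_ids, attention_mask, hard_loss, soft_loss)
```

With smoothing enabled, a very confident correct prediction is penalized slightly, which discourages overconfident logits.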

## File Organization

### Directory Structure:
```
Shared_Task1_HateSpeech/
├── subtask1A/                                  # Hate speech type classification
│   ├── Developmental Phase/
│   │   ├── DL Models/                          # BiLSTM, LSTM-Attention
│   │   ├── LLMs/                               # Base transformer models
│   │   ├── LLMS with K Fold CV/                # K-Fold implementations
│   │   ├── K Folds with normalizer/
│   │   ├── LLMs_KFolds_adversarial attacks/
│   │   ├── LLMS_KFolds_attacks_normalizer/
│   │   └── Various classification heads/
│   └── Evaluation Phase/                       # Final test submissions
├── subtask1B/                                  # Hate speech target classification
│   ├── Developmental Phase/
│   │   ├── DL Models/
│   │   ├── LLMs/
│   │   ├── LLMS with K Fold CV/
│   │   ├── K Folds with normalizer/
│   │   ├── LLMs_KFolds_adversarial attacks/
│   │   └── LLMS_KFolds_attacks_normalizer/
│   └── Evaluation Phase/
│       ├── LLMs/
│       ├── LLMS with K Fold CV/
│       ├── K Folds with normalizer/
│       ├── LLMs_KFolds_adversarial attacks/
│       └── LLMS_KFolds_attacks_normalizer/
└── subtask1C/                                  # Multi-task hate speech analysis
    ├── Developmental Phase/
    │   ├── LLMs/
    │   ├── LLMS with K Fold CV/
    │   ├── LLMs with adversarial attacks and K Fold CV/
    │   ├── LLMs with K Fold CV and normalizer/
    │   └── K Fold CV with attacks and normalizer/
    └── Evaluation Phase/
        ├── LLMs/
        ├── LLMS with K Fold CV/
        ├── LLMs with adversarial attacks and K Fold CV/
        ├── LLMs with K Fold CV and normalizer/
        └── K Fold CV with attacks and normalizer/
```

### Naming Convention:
- **Model directories**: `v{f1_score}_{model_name}`
  - Example: `v0.7488_banglabert-fgm` = 74.88% F1 score using BanglaBERT with FGM
- **Each directory contains**:
  - Jupyter notebook (.ipynb) with complete implementation
  - Dataset file (subtask_1X.tsv)
  - Model checkpoints and outputs
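
Because the score is encoded in the directory name, it can be recovered programmatically. The helper below is a hypothetical convenience (it is not part of the repository) and assumes every model directory follows the `v{f1_score}_{model_name}` pattern exactly.

```python
import re

# v<decimal F1>_<model name>, e.g. v0.7488_banglabert-fgm
DIR_PATTERN = re.compile(r"^v(?P<f1>\d+\.\d+)_(?P<model>.+)$")

def parse_model_dir(name):
    """Split a model directory name into (F1 score as a float, model name)."""
    match = DIR_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a model directory name: {name!r}")
    return float(match.group("f1")), match.group("model")

f1, model = parse_model_dir("v0.7488_banglabert-fgm")
print(f"{model}: {f1:.2%}")
```

Sorting directory names by the parsed score gives a quick leaderboard for a given approach folder.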

## Performance Evolution

### Developmental Phase Progression:
1. **Baseline Models**: 55-56% F1 (BiLSTM and LSTM with attention)
2. **Base Transformers**: 68-73% F1 (Standard LLM implementations)
3. **K-Fold Enhancement**: 71-75% F1 (Cross-validation improvements)
4. **Normalization Boost**: 72-75% F1 (Text preprocessing optimization)
5. **Adversarial Training**: 73-75% F1 (Robustness improvements)
6. **Combined Excellence**: 74-75% F1 (Best technique combinations)

### Development → Evaluation Trends:
- **Average Performance Drop**: 1-3% on unseen test data
- **Most Stable Approaches**: K-Fold + Normalizer combinations
- **Highest Risk**: Single model implementations without regularization
- **Best Generalization**: Models with adversarial training components

## Technologies and Frameworks

### Core Technologies:
- **Deep Learning**: PyTorch, TensorFlow
- **Transformers**: Hugging Face Transformers library
- **Text Processing**: Custom Bengali normalizers, NLTK
- **Evaluation**: Scikit-learn, Custom metrics implementations
- **Adversarial**: Custom FGM, AWP, FreeLB implementations
- **Cross-Validation**: Stratified K-Fold with scikit-learn

### Hardware and Training:
- **GPU Acceleration**: CUDA-enabled training
- **Mixed Precision**: For memory efficiency
- **Gradient Accumulation**: Effective batch size optimization
- **Early Stopping**: Preventing overfitting
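
Early stopping, the last item above, is commonly implemented as a patience counter on the validation metric. The sketch below shows one such formulation; the class name, the `patience` and `min_delta` parameters, and the per-epoch F1 values are hypothetical and not taken from the notebooks.

```python
class EarlyStopping:
    """Stop when the validation score has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = None
        self.bad_epochs = 0

    def step(self, score):
        """Record a validation score; return True when training should stop."""
        if self.best is None or score > self.best + self.min_delta:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
val_f1_per_epoch = [0.70, 0.72, 0.73, 0.729, 0.728, 0.74]  # hypothetical values
stopped_at = None
for epoch, val_f1 in enumerate(val_f1_per_epoch, start=1):
    if stopper.step(val_f1):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best validation F1 {stopper.best}")
```

The checkpoint from the best epoch, not the last one, is the one to keep for inference.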

## Key Contributions

### Novel Techniques Implemented:
1. **Bengali-Specific Normalization**: NFKC Unicode with preservation strategies
2. **Advanced Adversarial Training**: Multiple adversarial techniques comparison
3. **Custom Attention Heads**: Learnable pooling mechanisms
4. **Robust Cross-Validation**: Stratified K-Fold with ensemble strategies
5. **Multi-Phase Evaluation**: Systematic development vs evaluation analysis

### Research Insights:
- **Language-Specific Approaches**: Bengali text requires specialized preprocessing
- **Adversarial Robustness**: Significant impact on generalization
- **Cross-Validation Importance**: Critical for reliable performance estimation
- **Model Ensemble Benefits**: Combining techniques yields optimal results

## Usage Instructions

### Running Experiments:
1. Navigate to desired subtask directory
2. Choose appropriate approach folder
3. Open corresponding Jupyter notebook
4. Ensure required dependencies are installed
5. Execute cells sequentially for complete pipeline

### Model Training:
- Each notebook contains complete training pipeline
- Data preprocessing and normalization included
- Model evaluation and metrics calculation automated
- Results saved with performance indicators

## Future Work

### Potential Improvements:
- **Multi-Modal Approaches**: Incorporating contextual information
- **Advanced Ensembling**: Sophisticated model combination strategies
- **Real-Time Processing**: Optimized inference pipelines
- **Transfer Learning**: Cross-task knowledge transfer
- **Data Augmentation**: Synthetic data generation for Bengali

### Research Directions:
- **Explainability**: Understanding model decision processes
- **Fairness Analysis**: Bias detection and mitigation
- **Cross-Lingual Transfer**: Knowledge sharing across languages
- **Domain Adaptation**: Generalization to different text domains

## Official Task Information

### Task Details
- **Competition**: Bengali Multi-task Hate Speech Identification Shared Task
- **Workshop**: BLP Workshop @ IJCNLP-AACL 2025
- **Website**: https://multihate.github.io/
- **Evaluation Metrics**:
  - Subtask 1A & 1B: Micro-F1
  - Subtask 1C: Weighted Micro-F1
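
For reference, micro-F1 pools true positives, false positives, and false negatives across all classes before computing precision and recall. The sketch below is a plain-Python illustration rather than the official scorer; it covers the single-label case used in Subtasks 1A and 1B and does not attempt the 1C weighting, which this README does not define. The function name and sample labels are hypothetical.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all classes for single-label predictions."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = ["Abusive", "None", "Profane", "None"]
y_pred = ["Abusive", "None", "None", "None"]
score = micro_f1(y_true, y_pred)
print(score)
```

When every sample has exactly one gold label and one predicted label, each error counts once as a false positive and once as a false negative, so micro-F1 equals accuracy.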

### Data Format
#### Subtask 1A
```
id text label
```
Labels: Abusive, Sexism, Religious Hate, Political Hate, Profane, None

#### Subtask 1B
```
id text label
```
Labels: Individuals, Organizations, Communities, Society

#### Subtask 1C
```
id text hate_type hate_severity to_whom
```
- hate_type: Abusive, Sexism, Religious Hate, Political Hate, Profane, None
- hate_severity: Little to None, Mild, Severe
- to_whom: Individuals, Organizations, Communities, Society
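
A minimal sketch of reading the Subtask 1C layout with Python's standard `csv` module. It assumes the files are tab-separated with a header row using the column names shown above; the two sample rows and their placeholder text are made up for illustration.

```python
import csv
import io

# Hypothetical rows in the Subtask 1C layout (tab-separated, with header).
SAMPLE_TSV = (
    "id\ttext\thate_type\thate_severity\tto_whom\n"
    "1\t<bengali text 1>\tAbusive\tMild\tIndividuals\n"
    "2\t<bengali text 2>\tNone\tLittle to None\tSociety\n"
)

def read_rows(handle):
    """Read a TSV file object into a list of dicts keyed by column name."""
    # QUOTE_NONE keeps quotation marks inside the text column untouched.
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))

rows = read_rows(io.StringIO(SAMPLE_TSV))
for row in rows:
    print(row["id"], row["hate_type"], row["hate_severity"], row["to_whom"])
```

The `csv` module returns the label `None` as the literal string `"None"`, which keeps it distinct from a missing value.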

## Citation and Acknowledgments

This work represents comprehensive exploration of Bengali hate speech detection for the BLP Workshop @ IJCNLP-AACL 2025 shared task, contributing to the advancement of multilingual NLP and social media content moderation.

### Organizers
- Md Arid Hasan, PhD Student, The University of Toronto
- Firoj Alam, Senior Scientist, Qatar Computing Research Institute
- Md Fahad Hossain, Lecturer, Daffodil International University
- Usman Naseem, Assistant Professor, Macquarie University
- Syed Ishtiaque Ahmed, Associate Professor, The University of Toronto

---

**Note**: This repository demonstrates state-of-the-art approaches for Bengali hate speech detection across multiple classification tasks, with particular emphasis on robust evaluation methodology and practical implementation strategies for the official shared task.
