https://github.com/sanyambk/hate-speech-detection
Multiclass hate-speech detection pipeline: data cleaning, neural model training, and serialized inference artifact.
deep-learning machine-learning nlp pytorch tensorflow text-classification
- Host: GitHub
- URL: https://github.com/sanyambk/hate-speech-detection
- Owner: SanyamBK
- Created: 2025-09-06T18:54:40.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-09-06T19:04:28.000Z (10 months ago)
- Last Synced: 2025-09-06T21:21:45.730Z (10 months ago)
- Topics: deep-learning, machine-learning, nlp, pytorch, tensorflow, text-classification
- Language: Python
- Size: 8.13 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Hate Speech Detection
### Overview
This project implements a text-classification pipeline to categorize short social media posts (tweets) into one of three classes:
- Hate Speech (0)
- Offensive Language (1)
- Neither / Neutral (2)
This is a multiclass classification problem. The solution includes data exploration, cleaning and preprocessing, model training using a neural network, and a saved model artifact for inference.
### Contest
Developed for CodeChef Weekend Dev Challenge 14: "DL Projects" (attempted on 6 Sep 2025).
## Repository layout
- `main.ipynb`: exploratory analysis and experiment log.
- `Part 1/`
  - `main.py`: data loading and initial EDA.
  - `hate_speech.csv`: original raw dataset (columns: `tweet`, `class`).
- `Part 2/`
  - `main.py`: data cleaning and preprocessing pipeline; outputs `cleaned_hate_dataset.csv`.
  - `hate_dataset.csv`: intermediate dataset.
- `Part 3/`
  - `main.py`: model definition, training, evaluation, and inference utilities.
  - `cleaned_hate_dataset.csv`: final cleaned dataset used for training.
  - `hate_speech_model.pkl`: serialized trained model for deployment.
## Key steps
1. Data exploration (Part 1): inspect class balance, token distributions, and common tokens.
2. Cleaning & preprocessing (Part 2): normalize text, remove noise, tokenize, and vectorize (TF-IDF or embeddings).
3. Model training & evaluation (Part 3): train a neural classifier and evaluate using accuracy, precision/recall, F1, and confusion matrix.
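As a rough illustration of step 2, the sketch below normalizes a tweet, removes common noise, tokenizes on whitespace, and computes plain TF-IDF weights using only the standard library. The function names (`clean_tweet`, `tfidf`), the specific regex rules, and the unsmoothed TF-IDF formula are assumptions made for this example; the actual rules in `Part 2/main.py` may differ.

```python
import html
import math
import re
from collections import Counter

# Hypothetical cleaning rules; the real pipeline may use different ones.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
RETWEET_RE = re.compile(r"\brt\b")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text):
    """Normalize a raw tweet and return its tokens."""
    text = html.unescape(text).lower()    # "&amp;" -> "&", then lowercase
    text = URL_RE.sub(" ", text)          # drop links
    text = MENTION_RE.sub(" ", text)      # drop @user mentions
    text = RETWEET_RE.sub(" ", text)      # drop the "rt" retweet marker
    text = NON_LETTER_RE.sub(" ", text)   # drop anything that is not an ASCII letter
    return text.split()                   # whitespace tokenization


def tfidf(docs):
    """Return one {token: weight} dict per tokenized document.

    Uses term frequency times log(N / document frequency), without smoothing.
    """
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return vectors


tokens = clean_tweet("RT @user: Check THIS out!! https://t.co/abc #wow &amp; more")
print(tokens)
```

A token that appears in every document gets weight 0 under this formula, which is why library implementations usually add smoothing.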
### Part 1 — Data Exploration & Analysis
In Part 1 we perform an exploratory data analysis to understand the dataset before cleaning and modeling. The dataset (`hate_dataset.csv` / `hate_speech.csv`) contains two columns:
- `tweet`: raw tweet text
- `class`: label (0 = Hate Speech, 1 = Offensive Language, 2 = Neither/Neutral)
The EDA includes:
- Class distribution and imbalance checks
- Token length and distribution plots
- Frequent token and n-gram analysis per class
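The checks listed above can be sketched with the standard library alone. This is a minimal illustration, not the code in `Part 1/main.py`: the helper names `load_rows` and `summarize`, the whitespace tokenization, and the UTF-8 encoding are assumptions; only the column names `tweet` and `class` come from the dataset description.

```python
import csv
from collections import Counter, defaultdict
from statistics import mean


def load_rows(path):
    """Yield (tweet, label) pairs from a CSV with `tweet` and `class` columns."""
    with open(path, newline="", encoding="utf-8") as f:  # encoding is an assumption
        for record in csv.DictReader(f):
            yield record["tweet"], int(record["class"])


def summarize(rows, n=2, top_k=3):
    """Per class: share of rows, mean token count, and the most frequent n-grams."""
    class_counts = Counter()
    lengths = defaultdict(list)
    ngrams = defaultdict(Counter)
    for tweet, label in rows:
        tokens = tweet.lower().split()
        class_counts[label] += 1
        lengths[label].append(len(tokens))
        # zip over n shifted copies of the token list yields consecutive n-grams
        ngrams[label].update(zip(*(tokens[i:] for i in range(n))))
    total = sum(class_counts.values())
    return {
        label: {
            "share": class_counts[label] / total,
            "mean_tokens": mean(lengths[label]),
            "top_ngrams": ngrams[label].most_common(top_k),
        }
        for label in class_counts
    }


# Toy rows for illustration only; replace with load_rows("hate_speech.csv").
toy_rows = [("a b c", 0), ("a b", 0), ("x y", 2)]
for label, stats in summarize(toy_rows).items():
    print(label, stats)
```

A strongly skewed `share` across the three classes is the signal to consider class weights or resampling before training.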
## Quick start
1. Create a virtual environment and install dependencies (from repo root):
```powershell
python -m venv venv
.\venv\Scripts\Activate.ps1
pip install -r requirements.txt
```
2. Run Part 1 (EDA):
```powershell
Set-Location -LiteralPath "Hate Speech Detection project\Part 1"
python main.py
```
3. Run final training & evaluation (Part 3):
```powershell
Set-Location -LiteralPath "Hate Speech Detection project\Part 3"
python main.py
```
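Part 3 evaluates the classifier with accuracy, precision/recall, F1, and a confusion matrix. To show how those numbers relate to each other, here is a small dependency-free sketch. The function names are hypothetical and the labels in the usage lines are made up for the example; the repository's own evaluation code may rely on a library instead.

```python
def confusion_matrix(y_true, y_pred, num_classes=3):
    """matrix[t][p] counts samples with true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / total if total else 0.0


def per_class_scores(matrix):
    """Precision, recall, and F1 for each class, in class-ID order."""
    scores = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        predicted = sum(row[k] for row in matrix)  # column sum: predicted as k
        actual = sum(matrix[k])                    # row sum: truly k
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append({"precision": precision, "recall": recall, "f1": f1})
    return scores


# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm)
print(accuracy(cm))
print(per_class_scores(cm))
```

On an imbalanced dataset, the per-class scores are usually more informative than accuracy alone, since a model can score well overall while missing most of the smallest class.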
### Inference example
```python
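import joblib

# Load the serialized model; the path is relative to the repository root.
model = joblib.load('Part 3/hate_speech_model.pkl')

# Assumes the artifact accepts raw strings (preprocessing bundled with the model).
text = "This is a sample tweet to classify"
pred = model.predict([text])
print(pred)  # expected: a class ID, 0 = Hate Speech, 1 = Offensive Language, 2 = Neither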
```
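To turn raw model output into readable labels, a small helper like the one below may be useful. It is a sketch: `decode_predictions` is a hypothetical name, and the exact output shape of the saved model is not documented here, so the helper accepts either integer class IDs or rows of per-class scores.

```python
# Class IDs as defined in the Overview section.
LABELS = {0: "Hate Speech", 1: "Offensive Language", 2: "Neither / Neutral"}


def decode_predictions(preds):
    """Map model outputs to label names.

    Each item may be an integer class ID or a row of per-class scores;
    for a score row, the index of the highest score is used.
    """
    names = []
    for p in preds:
        if hasattr(p, "__len__"):      # a row of per-class scores
            scores = list(p)
            class_id = scores.index(max(scores))
        else:                          # a plain class ID
            class_id = int(p)
        names.append(LABELS[class_id])
    return names


print(decode_predictions([0, 2, 1]))
print(decode_predictions([[0.1, 0.7, 0.2]]))
```

With the inference snippet above, this would be called as `decode_predictions(pred)`.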