https://github.com/rohansardar/speechflowguard

A machine learning web API that detects toxic language in user comments using classical ML

docker logistic-regression machine-learning python3 scikit-learn tf-idf tfidf-text-analysis tfidf-vectorizer

Host: GitHub
URL: https://github.com/rohansardar/speechflowguard
Owner: RohanSardar
Created: 2025-05-17T17:01:07.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-05-21T11:09:00.000Z (about 1 year ago)
Last Synced: 2025-06-25T18:45:49.649Z (about 1 year ago)
Topics: docker, logistic-regression, machine-learning, python3, scikit-learn, tf-idf, tfidf-text-analysis, tfidf-vectorizer
Language: Jupyter Notebook
Homepage:
Size: 39.9 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# SpeechFlowGuard

A machine learning web API that detects toxic language in user comments using classical ML models (TF-IDF + Logistic Regression). Built with **FastAPI** and trained on the **Jigsaw Toxic Comment Classification Challenge** dataset.

## ✅ Features

- Multi-label classification:
  - `toxic`, `severe_toxic`, `obscene`, `threat`, `insult`, `identity_hate`
- Real-time REST API (FastAPI)
- Modular codebase
- Dockerized for portability
- Preprocessed with custom regex cleaner
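
The repository's own cleaner is not reproduced in this README, so the following is only a minimal sketch of what a regex-based comment cleaner could look like. The function name `clean_comment` and the specific rules (lowercasing, dropping URLs, keeping only letters) are assumptions for illustration, not the project's actual preprocessing.

```python
import re

# Hypothetical patterns; the project's real cleaning rules may differ.
_URL = re.compile(r"https?://\S+|www\.\S+")
_NON_LETTERS = re.compile(r"[^a-z\s]")
_WHITESPACE = re.compile(r"\s+")


def clean_comment(text: str) -> str:
    """Lowercase a comment, drop URLs and non-letter characters, collapse spaces."""
    text = text.lower()
    text = _URL.sub(" ", text)          # remove links before stripping punctuation
    text = _NON_LETTERS.sub(" ", text)  # keep only a-z and whitespace
    return _WHITESPACE.sub(" ", text).strip()
```

Whatever rules are used, the same function would need to run both at training time and inside the API, so that the vectorizer sees text in the same shape in both places.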

## 🧪 Model Details

- Vectorizer: `TfidfVectorizer` (max_features=4096, stop_words='english')
- Classifier: `LogisticRegression` (class_weight='balanced', max_iter=500, C=1.6)
- Trained on: [Jigsaw Toxic Comment Dataset](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)
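
As a rough illustration of the configuration above, the sketch below fits one `TfidfVectorizer` and one `LogisticRegression` per label using the listed hyperparameters. The helper names `train_binary_relevance` and `predict_scores`, and the shape of the `targets` argument, are assumptions made for this example; the actual training code lives in the notebooks.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def train_binary_relevance(texts, targets, labels):
    """Fit a shared TF-IDF vectorizer and one classifier per label.

    `targets` is assumed to map each label name to a list of 0/1 values,
    one per text (hypothetical layout for this sketch).
    """
    vectorizer = TfidfVectorizer(max_features=4096, stop_words="english")
    features = vectorizer.fit_transform(texts)
    classifiers = {}
    for label in labels:
        clf = LogisticRegression(class_weight="balanced", max_iter=500, C=1.6)
        clf.fit(features, targets[label])
        classifiers[label] = clf
    return vectorizer, classifiers


def predict_scores(text, vectorizer, classifiers):
    """Return the positive-class probability for each label, rounded to 4 places."""
    features = vectorizer.transform([text])
    return {
        label: round(float(clf.predict_proba(features)[0, 1]), 4)
        for label, clf in classifiers.items()
    }
```

Because each label gets its own classifier, the returned scores are separate per-label probabilities and are not expected to sum to 1.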

## 🗂️ Project Structure

```
SpeechFlowGuard/
├── app/
│   ├── main.py
│   ├── api.py
│   ├── model.py
│   ├── schemas.py
│   ├── utils.py
│   └── config.py
├── data/
│   ├── data_processed.csv
│   └── trains.csv
├── models/
│   ├── tf-idf_vectorizer.pkl
│   └── classifier.pkl
├── notebooks/
│   ├── data_cleaning.ipynb
│   └── tf-idf_model_train.ipynb
├── .gitignore
├── docker-requirements.txt
├── Dockerfile
├── README.md
└── requirements.txt
```

## 🧰 Technical Stack
- **Language:** Python 3.12+
- **Framework:** FastAPI (ASGI-compatible)
- **ML Model:**
  - TfidfVectorizer for feature extraction
  - LogisticRegression (one classifier per label, binary relevance method)
- **Serialization:** `dill` for saving sklearn models
- **Request Schema:** Pydantic-based input validation
- **Serving:** Uvicorn for ASGI serving
- **Containerization:** Docker

## 📡 API Endpoints
The FastAPI server exposes the following endpoints:

### `GET /`

Returns a welcome message to confirm the API is live.

**Request:**

`curl http://localhost:8000/`

**Response:**
```
{
  "message": "Hello and welcome to SpeechFlowGuard API"
}
```
### `POST /predict`

Performs multi-label classification on the input text and returns the predicted probabilities for each toxicity label.

**Request:**
```
POST /predict
Content-Type: application/json
```
**Request Body:**
```
{
  "text": "You are a criminal person"
}
```
**Response:**
```
{
  "toxic": 0.6774,
  "severe_toxic": 0.039,
  "obscene": 0.0994,
  "threat": 0.1204,
  "insult": 0.5151,
  "identity_hate": 0.6681
}
```
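
The endpoint returns a probability per label rather than a yes/no decision, so a caller has to choose a cutoff. The sketch below shows one way to turn the response into a list of flagged labels. The function name `flagged_labels` and the default threshold of 0.5 are assumptions for illustration; the API itself does not define a threshold.

```python
import json


def flagged_labels(scores: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Return the labels whose probability meets or exceeds the cutoff."""
    return [label for label, prob in scores.items() if prob >= threshold]


# The sample response shown above, parsed as a client would receive it.
sample = json.loads(
    '{"toxic": 0.6774, "severe_toxic": 0.039, "obscene": 0.0994,'
    ' "threat": 0.1204, "insult": 0.5151, "identity_hate": 0.6681}'
)
print(flagged_labels(sample))  # ['toxic', 'insult', 'identity_hate']
```

A stricter or looser cutoff per label may be more appropriate in practice, for example a lower one for rare labels such as `threat`.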

## 🛠️ Git Setup & Repository Cloning
If Git is not already installed, install it for your platform first.

### 🔨 Install Git
**Windows:**

Download from https://git-scm.com/download/win and install with default settings.

**Ubuntu/Linux:**
```
sudo apt update
sudo apt install git
```

**macOS:**
```
brew install git
```

### 📦 Clone the Repository
```
git clone https://github.com/RohanSardar/SpeechFlowGuard.git
cd SpeechFlowGuard
```

## 🔧 How to Train the Model

Ensure you have Python and one of the two environment managers installed:
- Python (≥ 3.12)
- Conda (for the Conda-based setup)
- Virtualenv (for the virtualenv-based setup; install via `pip install virtualenv` if not already available)

### 🐍 Using conda
#### Create a conda virtual environment
Run the following command to create a virtual environment in the `venv` directory:
```
conda create -p venv python=3.12 -y
```
#### Activate it
```
conda activate venv/
```
#### Install dependencies
```
pip install -r requirements.txt
```

### 💻 Using virtualenv
#### Create a virtual environment
Run the following command to create a virtual environment in the `venv` directory:
```
python -m virtualenv venv
```
#### Activate it
- **Windows**
```
venv\Scripts\activate
```
- **Linux/macOS**
```
source venv/bin/activate
```
#### Install dependencies
```
pip install -r requirements.txt
```

With either environment ready, use the Jupyter notebooks in `notebooks/` or write a script to:

1. Load and preprocess the dataset.
2. Train TF-IDF and LogisticRegression models.
3. Save them using `dill`.
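
The saving step can be sketched with `dill.dump` and `dill.load`, which mirror the standard `pickle` interface. The helper names `save_artifact` and `load_artifact` are hypothetical; the target paths in the usage note come from the project structure above, but how the notebooks actually write them is an assumption.

```python
from pathlib import Path

import dill


def save_artifact(obj, path):
    """Serialize a fitted object to disk with dill, creating parent folders."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        dill.dump(obj, fh)


def load_artifact(path):
    """Load an object previously written by save_artifact."""
    with Path(path).open("rb") as fh:
        return dill.load(fh)
```

A training script might call `save_artifact(vectorizer, "models/tf-idf_vectorizer.pkl")` and `save_artifact(classifiers, "models/classifier.pkl")`. As with `pickle`, only load files from sources you trust, since deserialization can execute arbitrary code.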

## 🐳 Docker Setup
### 🔥 1. Build the Image
```
docker build -t speechflowguard .
```
### 🚀 2. Run the Container
```
docker run -p 8000:8000 speechflowguard
```
Once the container is running, the interactive API docs are available at:

- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc

**Example using cURL**
```
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"text": "You are a criminal person"}'
```

**Response**
```
{
  "toxic": 0.6774,
  "severe_toxic": 0.039,
  "obscene": 0.0994,
  "threat": 0.1204,
  "insult": 0.5151,
  "identity_hate": 0.6681
}
```
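
For callers who prefer Python over cURL, the same request can be sketched with the standard library alone. The helper names `build_predict_request` and `predict` are made up for this example, and the base URL assumes the container is published on port 8000 as in the `docker run` command above.

```python
import json
import urllib.request


def build_predict_request(text: str, base_url: str = "http://localhost:8000"):
    """Build the POST /predict request without sending it."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, base_url: str = "http://localhost:8000", timeout: float = 10.0):
    """Send the request and return the parsed JSON scores (needs a running server)."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the container running, `predict("You are a criminal person")` should return a dictionary of per-label probabilities in the same shape as the response shown above.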
