# SpeechFlowGuard

A machine learning web API that detects toxic language in user comments using classical ML models (TF-IDF + Logistic Regression). Built with **FastAPI**, trained on the **Jigsaw Toxic Comment Classification Challenge** dataset.

## ✅ Features

- Multi-label classification:
  - `toxic`, `severe_toxic`, `obscene`, `threat`, `insult`, `identity_hate`
- Real-time REST API (FastAPI)
- Modular codebase
- Dockerized for portability
- Input text preprocessed with a custom regex cleaner
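
This README does not show the cleaning rules themselves. As a rough illustration of what a regex-based cleaner can look like, the sketch below lowercases the text, drops URLs, replaces anything that is not a letter or whitespace, and collapses repeated whitespace. The function name `clean_text` and every pattern in it are assumptions, not the project's actual code.

```python
import re

# Hypothetical patterns; the project's own cleaner may use different rules.
_URL = re.compile(r"https?://\S+|www\.\S+")
_NON_LETTER = re.compile(r"[^a-z\s]")
_WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Normalize a raw comment before it is vectorized."""
    text = text.lower()
    text = _URL.sub(" ", text)         # remove links
    text = _NON_LETTER.sub(" ", text)  # drop digits, punctuation, symbols
    return _WHITESPACE.sub(" ", text).strip()


print(clean_text("Visit http://x.com NOW!!!  You're  #1"))  # visit now you re
```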

## 🧪 Model Details

- Vectorizer: `TfidfVectorizer` (max_features=4096, stop_words='english')
- Classifier: `LogisticRegression` (class_weight='balanced', max_iter=500, C=1.6)
- Trained on: [Jigsaw Toxic Comment Dataset](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)

## 🗂️ Project Structure

```
SpeechFlowGuard/
├── app/
│   ├── main.py
│   ├── api.py
│   ├── model.py
│   ├── schemas.py
│   ├── utils.py
│   └── config.py
├── data/
│   ├── data_processed.csv
│   └── trains.csv
├── models/
│   ├── tf-idf_vectorizer.pkl
│   └── classifier.pkl
├── notebooks/
│   ├── data_cleaning.ipynb
│   └── tf-idf_model_train.ipynb
├── .gitignore
├── docker-requirements.txt
├── Dockerfile
├── README.md
└── requirements.txt
```

## 🧰 Technical Stack
- **Language:** Python 3.12+
- **Framework:** FastAPI (ASGI-compatible)
- **ML Model:**
  - TfidfVectorizer for feature extraction
  - LogisticRegression (one classifier per label, binary relevance method)
- **Serialization:** `dill` for saving sklearn models
- **Request Schema:** Pydantic-based input validation
- **Serving:** Uvicorn for ASGI serving
- **Containerization:** Docker
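
As an illustration of the Pydantic-based validation listed above, a request model for the `text` field could look like the following. The class name `PredictRequest`, the helper `is_valid_payload`, and the `min_length=1` constraint are assumptions for this sketch; the project's real schema lives in `app/schemas.py` and may differ.

```python
from pydantic import BaseModel, Field, ValidationError


class PredictRequest(BaseModel):
    """Hypothetical request body for POST /predict (name is an assumption)."""

    # Assumed constraint: reject empty strings before they reach the model.
    text: str = Field(..., min_length=1)


def is_valid_payload(payload: dict) -> bool:
    """Return True when the payload passes validation, False otherwise."""
    try:
        PredictRequest(**payload)
        return True
    except ValidationError:
        return False
```

With a schema of this shape, FastAPI rejects a malformed body before the endpoint function runs, so the model code only ever sees validated text.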

## 📡 API Endpoints
The FastAPI server exposes the following endpoints:

### `GET /`

Returns a welcome message to confirm the API is live.

**Request:**

`curl http://localhost:8000/`

**Response:**
```
{
  "message": "Hello and welcome to SpeechFlowGuard API"
}
```
### `POST /predict`

Performs multi-label classification on the input text and returns the predicted probabilities for each toxicity label.

**Request:**
```
POST /predict
Content-Type: application/json
```
**Request Body:**
```
{
  "text": "You are a criminal person"
}
```
**Response:**
```
{
  "toxic": 0.6774,
  "severe_toxic": 0.039,
  "obscene": 0.0994,
  "threat": 0.1204,
  "insult": 0.5151,
  "identity_hate": 0.6681
}
```
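
The endpoint returns one independent probability per label, so the client decides how to turn scores into decisions. A minimal sketch, assuming a single cutoff of 0.5 for every label (the threshold and the helper name `flagged_labels` are assumptions, not part of the API):

```python
# Sample response copied from the example above.
SAMPLE_RESPONSE = {
    "toxic": 0.6774,
    "severe_toxic": 0.039,
    "obscene": 0.0994,
    "threat": 0.1204,
    "insult": 0.5151,
    "identity_hate": 0.6681,
}


def flagged_labels(scores: dict, threshold: float = 0.5) -> list:
    """Return the labels whose probability meets or exceeds the threshold."""
    return [label for label, probability in scores.items() if probability >= threshold]


print(flagged_labels(SAMPLE_RESPONSE))  # ['toxic', 'insult', 'identity_hate']
```

A per-label threshold tuned on held-out data may work better than one global cutoff.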

## 🛠️ Git Setup & Repository Cloning
Git is required to clone the repository. If it is not installed yet, install it first.

### 🔨 Install Git
**Windows:**

Download from https://git-scm.com/download/win and install with default settings.

**Ubuntu/Linux:**
```
sudo apt update
sudo apt install git
```

**macOS:**
```
brew install git
```

### 📦 Clone the Repository
```
git clone https://github.com/RohanSardar/SpeechFlowGuard.git
cd SpeechFlowGuard
```

## 🔧 How to Train the Model

Ensure you have the following installed:
- Python (≥ 3.12)
- Conda (only for the Conda-based setup)
- virtualenv (only for the virtualenv-based setup; install via `pip install virtualenv` if not already available)

### 🐍 Using conda
#### Create a conda virtual environment
Run the following command to create the environment in a `venv` directory inside the project:
```
conda create -p venv python=3.12 -y
```
#### Activate it
```
conda activate venv/
```
#### Install dependencies
```
pip install -r requirements.txt
```

### 💻 Using virtualenv
#### Create a virtual environment
Run the following command to create the environment in a `venv` directory inside the project:
```
python -m virtualenv venv
```
#### Activate it
- **Windows**
```
venv\Scripts\activate
```
- **Linux/macOS**
```
source venv/bin/activate
```
#### Install dependencies
```
pip install -r requirements.txt
```

Use the Jupyter notebooks in `notebooks/` or create a script to:

1. Load and preprocess the dataset.
2. Train TF-IDF and LogisticRegression models.
3. Save them using `dill`.
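
The three steps above can be sketched as follows, using the hyperparameters from the Model Details section and one binary classifier per label. The toy texts and labels stand in for the Jigsaw CSV, and the helper names `train`, `predict_scores`, and `save` are hypothetical. Two further assumptions: the fallback to `pickle` (which exposes the same `dump`/`load` calls as `dill`) is there only so the sketch runs without extra installs, and all per-label classifiers are stored together in one file. The notebooks in `notebooks/` remain the authoritative version.

```python
import os
import tempfile

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

try:
    import dill as serializer  # the README names dill for serialization
except ImportError:
    import pickle as serializer  # assumption: stdlib fallback, same dump/load calls

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Toy stand-in for the Jigsaw data; every label column needs both classes.
TOY_TEXTS = [
    "you are a horrible idiot",
    "thanks for the helpful edit",
    "i will hurt you stupid idiot",
    "great article and nice work",
]
TOY_LABELS = np.array([
    [1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
])


def train(texts, label_matrix):
    """Fit one TF-IDF vectorizer and one LogisticRegression per label."""
    vectorizer = TfidfVectorizer(max_features=4096, stop_words="english")
    features = vectorizer.fit_transform(texts)
    classifiers = {}
    for column, label in enumerate(LABELS):
        clf = LogisticRegression(class_weight="balanced", max_iter=500, C=1.6)
        clf.fit(features, label_matrix[:, column])
        classifiers[label] = clf
    return vectorizer, classifiers


def predict_scores(vectorizer, classifiers, text):
    """Return the positive-class probability for each label."""
    features = vectorizer.transform([text])
    return {
        label: round(float(clf.predict_proba(features)[0, 1]), 4)
        for label, clf in classifiers.items()
    }


def save(obj, path):
    """Serialize an object to disk with dill (or pickle as a fallback)."""
    with open(path, "wb") as handle:
        serializer.dump(obj, handle)


vectorizer, classifiers = train(TOY_TEXTS, TOY_LABELS)
scores = predict_scores(vectorizer, classifiers, "you stupid idiot")

# Written to a temporary directory here; the repository keeps its artifacts in models/.
output_dir = tempfile.mkdtemp()
save(vectorizer, os.path.join(output_dir, "tf-idf_vectorizer.pkl"))
# Assumption: the dict of per-label classifiers is stored as a single file.
save(classifiers, os.path.join(output_dir, "classifier.pkl"))
```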

## 🐳 Docker Setup
### 🔥 1. Build the Image
```
docker build -t speechflowguard .
```
### 🚀 2. Run the Container
```
docker run -p 8000:8000 speechflowguard
```
Once the container is running, the API is available at http://localhost:8000. Interactive API docs are served at:

- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc

**Example using cURL**
```
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"text": "You are a criminal person"}'
```

**Response**
```
{
  "toxic": 0.6774,
  "severe_toxic": 0.039,
  "obscene": 0.0994,
  "threat": 0.1204,
  "insult": 0.5151,
  "identity_hate": 0.6681
}
```
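
The same request can be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_request` and `predict` are hypothetical, and the URL assumes the container is reachable on `localhost:8000` as in the `docker run` command above.

```python
import json
import urllib.request

# Assumed address: matches the port mapping used in `docker run -p 8000:8000`.
API_URL = "http://localhost:8000/predict"


def build_request(text: str, url: str = API_URL) -> urllib.request.Request:
    """Build the POST request with a JSON body, without sending it."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, url: str = API_URL, timeout: float = 10.0) -> dict:
    """Send the text to the API and return the decoded JSON scores."""
    with urllib.request.urlopen(build_request(text, url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, `predict("You are a criminal person")` should return a dictionary with the same six keys as the response above.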