https://github.com/rohansardar/speechflowguard
A machine learning web API that detects toxic language in user comments using classical ML
- Host: GitHub
- URL: https://github.com/rohansardar/speechflowguard
- Owner: RohanSardar
- Created: 2025-05-17T17:01:07.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-21T11:09:00.000Z (about 1 year ago)
- Last Synced: 2025-06-25T18:45:49.649Z (11 months ago)
- Topics: docker, logistic-regression, machine-learning, python3, scikit-learn, tf-idf, tfidf-text-analysis, tfidf-vectorizer
- Language: Jupyter Notebook
- Homepage:
- Size: 39.9 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# SpeechFlowGuard
A machine learning web API that detects toxic language in user comments using classical ML models (TF-IDF + Logistic Regression). It is built with **FastAPI** and trained on the **Jigsaw Toxic Comment Classification Challenge** dataset.
## Features
- Multi-label classification:
  - `toxic`, `severe_toxic`, `obscene`, `threat`, `insult`, `identity_hate`
- Real-time REST API (FastAPI)
- Modular codebase
- Dockerized for portability
- Input text preprocessed with a custom regex cleaner
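
The README does not show the regex cleaner itself, so the following is only a minimal sketch of what such a preprocessing step might look like. The function name `clean_text` and the specific rules (lowercasing, dropping URLs, dropping non-letter characters, collapsing whitespace) are assumptions for illustration, not the repository's actual implementation.

```python
import re

# Hypothetical patterns; the real cleaner in this repository may use different rules.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_PATTERN = re.compile(r"[^a-z\s]")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Normalize a raw comment before it is passed to the vectorizer."""
    text = text.lower()                        # case-fold so "IDIOT" and "idiot" match
    text = URL_PATTERN.sub(" ", text)          # drop links, which carry little signal
    text = NON_LETTER_PATTERN.sub(" ", text)   # drop digits, punctuation, and symbols
    return WHITESPACE_PATTERN.sub(" ", text).strip()  # collapse runs of whitespace


cleaned = clean_text("Check THIS out: http://example.com !!!")
print(cleaned)
```

Whatever rules the cleaner applies, the same function should run at training time and at inference time so that the vectorizer sees consistently normalized text.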
## Model Details
- Vectorizer: `TfidfVectorizer` (max_features=4096, stop_words='english')
- Classifier: `LogisticRegression` (class_weight='balanced', max_iter=500, C=1.6)
- Trained on: [Jigsaw Toxic Comment Dataset](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)
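
As a rough illustration of how these settings fit together, the sketch below fits one `LogisticRegression` per label on top of a shared `TfidfVectorizer`, using the hyperparameters listed above. The toy comments, the two-label subset, and the helper `predict_probabilities` are made up for the example; the real model is trained on the full Jigsaw dataset with all six labels, and the training notebook may be organized differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data for illustration only; not taken from the Jigsaw dataset.
texts = [
    "you are an idiot",
    "have a wonderful day",
    "I hate this stupid garbage",
    "thanks for the helpful answer",
]
targets = {
    "toxic": [1, 0, 1, 0],
    "insult": [1, 0, 0, 0],
}

# Shared feature extractor with the settings listed above.
vectorizer = TfidfVectorizer(max_features=4096, stop_words="english")
features = vectorizer.fit_transform(texts)

# Binary relevance: an independent binary classifier for each label.
classifiers = {}
for label, y in targets.items():
    clf = LogisticRegression(class_weight="balanced", max_iter=500, C=1.6)
    clf.fit(features, y)
    classifiers[label] = clf


def predict_probabilities(text):
    """Return the positive-class probability for every label."""
    row = vectorizer.transform([text])
    return {
        label: round(float(clf.predict_proba(row)[0, 1]), 4)
        for label, clf in classifiers.items()
    }


print(predict_probabilities("you stupid idiot"))
```

Because each label has its own classifier, the probabilities are independent and do not sum to 1, which matches the shape of the `/predict` response.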
## Project Structure
```
SpeechFlowGuard/
├── app/
│   ├── main.py
│   ├── api.py
│   ├── model.py
│   ├── schemas.py
│   ├── utils.py
│   └── config.py
├── data/
│   ├── data_processed.csv
│   └── trains.csv
├── models/
│   ├── tf-idf_vectorizer.pkl
│   └── classifier.pkl
├── notebooks/
│   ├── data_cleaning.ipynb
│   └── tf-idf_model_train.ipynb
├── .gitignore
├── docker-requirements.txt
├── Dockerfile
├── README.md
└── requirements.txt
```
## Technical Stack
- **Language:** Python 3.12+
- **Framework:** FastAPI (ASGI-compatible)
- **ML Model:**
  - `TfidfVectorizer` for feature extraction
  - `LogisticRegression` (one classifier per label, binary relevance method)
- **Serialization:** `dill` for saving sklearn models
- **Request Schema:** Pydantic-based input validation
- **Serving:** Uvicorn for ASGI serving
- **Containerization:** Docker
## API Endpoints
The FastAPI server exposes the following endpoints:
### `GET /`
Returns a welcome message to confirm the API is live.
**Request:**
`curl http://localhost:8000/`
**Response:**
```
{
  "message": "Hello and welcome to SpeechFlowGuard API"
}
```
### `POST /predict`
Performs multi-label classification on the input text and returns the predicted probabilities for each toxicity label.
**Request:**
```
POST /predict
Content-Type: application/json
```
**Request Body:**
```
{
  "text": "You are a criminal person"
}
```
**Response:**
```
{
  "toxic": 0.6774,
  "severe_toxic": 0.039,
  "obscene": 0.0994,
  "threat": 0.1204,
  "insult": 0.5151,
  "identity_hate": 0.6681
}
```
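
The endpoint returns raw probabilities, so the caller decides what counts as toxic. The sketch below, using only the Python standard library, shows one way a client might build the request and turn a response into a list of flagged labels. The helper names `build_predict_request` and `flagged_labels`, the `0.5` threshold, and the `http://localhost:8000` base URL are assumptions for illustration; none of them is defined by the API.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/predict"  # assumes the server runs locally on port 8000


def build_predict_request(text, url=API_URL):
    """Build the POST request for /predict without sending it."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def flagged_labels(probabilities, threshold=0.5):
    """Return the labels whose probability meets the (hypothetical) threshold."""
    return [label for label, p in probabilities.items() if p >= threshold]


# Sample response copied from the example above.
sample_response = {
    "toxic": 0.6774,
    "severe_toxic": 0.039,
    "obscene": 0.0994,
    "threat": 0.1204,
    "insult": 0.5151,
    "identity_hate": 0.6681,
}
print(flagged_labels(sample_response))

# To call a running server:
# with urllib.request.urlopen(build_predict_request("some comment")) as resp:
#     print(flagged_labels(json.load(resp)))
```

A single global threshold is the simplest choice; rare labels such as `threat` may warrant their own thresholds tuned on validation data.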
## Git Setup & Repository Cloning
If you haven't installed Git:
### Install Git
**Windows:**
Download from https://git-scm.com/download/win and install with default settings.
**Ubuntu/Linux:**
```
sudo apt update
sudo apt install git
```
**macOS:**
```
brew install git
```
### Clone the Repository
```
git clone https://github.com/RohanSardar/SpeechFlowGuard.git
cd SpeechFlowGuard
```
## How to Train the Model
Ensure you have the following installed:
- Python (≥ 3.12)
- Conda (for Conda-based setup)
- Virtualenv (install via `pip install virtualenv` if not already available)
### Using conda
#### Create a conda virtual environment
Run the following command to create a virtual environment in a specific directory:
```
conda create -p venv python=3.12 -y
```
#### Activate it
```
conda activate venv/
```
#### Install dependencies
```
pip install -r requirements.txt
```
### Using virtualenv
Run the following command to create a virtual environment in a specific directory:
```
python -m virtualenv venv
```
#### Activate it
- **Windows**
```
venv\Scripts\activate
```
- **Linux/macOS**
```
source venv/bin/activate
```
#### Install dependencies
```
pip install -r requirements.txt
```
Use the Jupyter notebooks in `notebooks/` or create a script to:
1. Load and preprocess the dataset.
2. Train TF-IDF and LogisticRegression models.
3. Save them using `dill`.
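
Step 3 can be sketched as follows. `dill` exposes the same `dump`/`load` interface as the standard-library `pickle`, so this example falls back to `pickle` when `dill` is not installed, purely so that it runs anywhere. The helper names and the placeholder dictionary are hypothetical; in practice the saved objects would be the fitted vectorizer and classifier that end up under `models/`.

```python
import os
import tempfile

try:
    import dill as serializer  # the serializer named in this README
except ImportError:
    import pickle as serializer  # stdlib fallback, only so the sketch runs without dill


def save_artifact(obj, path):
    """Write a fitted object (vectorizer or classifier) to disk."""
    with open(path, "wb") as handle:
        serializer.dump(obj, handle)


def load_artifact(path):
    """Read a previously saved object back into memory."""
    with open(path, "rb") as handle:
        return serializer.load(handle)


# A plain dict stands in for a fitted model in this sketch.
placeholder_model = {"name": "classifier", "C": 1.6, "max_iter": 500}

with tempfile.TemporaryDirectory() as tmp_dir:
    artifact_path = os.path.join(tmp_dir, "classifier.pkl")
    save_artifact(placeholder_model, artifact_path)
    restored_model = load_artifact(artifact_path)

print(restored_model == placeholder_model)
```

Serialized files of this kind can execute arbitrary code when loaded, so only load artifacts that come from a trusted source.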
## Docker Setup
1. Build the Image
```
docker build -t speechflowguard .
```
2. Run the Container
```
docker run -p 8000:8000 speechflowguard
```
You can also access the interactive API docs at:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
**Example using cURL**
```
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "You are a criminal person"}'
```
**Response**
```
{
  "toxic": 0.6774,
  "severe_toxic": 0.039,
  "obscene": 0.0994,
  "threat": 0.1204,
  "insult": 0.5151,
  "identity_hate": 0.6681
}
```