https://github.com/devroopsaha744/hatespeechdetect-text

This is an effort to detect/classify hatespeech in text
Host: GitHub
URL: https://github.com/devroopsaha744/hatespeechdetect-text
Owner: devroopsaha744
License: mit
Created: 2024-05-22T20:04:23.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-05-22T20:09:11.000Z (almost 2 years ago)
Last Synced: 2024-05-22T21:34:41.457Z (almost 2 years ago)
Language: Jupyter Notebook
Size: 998 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# HateSpeechDetect-text

## **Project Overview**

In this project, I focused on benchmarking various machine learning models, deep learning architectures, and fine-tuned BERT-based models to evaluate their performance across multiple metrics. The aim was to establish a robust and efficient framework for text classification tasks, ultimately improving the overall accuracy of predictions by 25%.

### **Key Highlights**

1. **Data Collection:**

   - Scraped data from Twitter using `BeautifulSoup` (BS4), collecting tweets related to a specific domain for text classification tasks.

   - Preprocessed the scraped data to clean, tokenize, and transform it for effective model training.

2. **Machine Learning Models:**

   - Benchmarked traditional ML models, including:

     - Logistic Regression

     - Support Vector Classifier

     - Decision Tree Classifier

     - Random Forest Classifier

     - Gradient Boosting Classifier

     - AdaBoost Classifier

     - XGBoost Classifier

3. **Deep Learning Architectures:**

   - Evaluated the performance of deep learning models:

     - LSTM (Long Short-Term Memory)

     - Bi-LSTM (Bidirectional LSTM)

     - GRU (Gated Recurrent Unit)

4. **Fine-Tuned BERT-Based Models:**

   - Implemented and fine-tuned transformer-based models, including:

     - BERT

     - RoBERTa

     - ALBERT

     - DistilBERT

   - Also applied **QLoRA-based fine-tuning** to the **Phi-2 model**, a decoder-only language model rather than a BERT variant, for comparison with the BERT family.

5. **Performance Metrics:**

   - Evaluated models on the following metrics:

     - Accuracy

     - Precision (Macro and Weighted)

     - Recall (Macro and Weighted)

     - F1-Score (Macro and Weighted)

6. **Achievements:**

   - Improved the best-performing model’s accuracy by **25%** through advanced fine-tuning and hyperparameter optimization.

   - Achieved consistent performance with BERT-based models, maintaining **91% accuracy** across multiple datasets.

7. **Best Performing Model:**

   - After rigorous benchmarking, **DistilBERT** emerged as the top-performing model, achieving a balanced performance with **91% accuracy**. 

   - The fine-tuned DistilBERT model has been pushed to the [datafreak/hatespeech-distill-bert](https://huggingface.co/datafreak/hatespeech-distill-bert) repository.

8. **FastAPI App:**

   - The FastAPI application serving the model has been **dockerized** to make it easily deployable and scalable.
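
The preprocessing step in item 1 (clean, tokenize, transform) can be sketched as follows. This is a minimal illustration only: the exact cleaning rules used in the notebooks are not described here, so the regular expressions and the helper names `clean_tweet` and `tokenize` are assumptions.

```python
import re

# Hypothetical cleaning rules; the project's actual rules may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
RETWEET_RE = re.compile(r"\brt\b")
NON_WORD_RE = re.compile(r"[^a-z0-9\s']")
SPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, mentions, retweet markers, and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")      # keep the hashtag word, drop the symbol
    text = RETWEET_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def tokenize(text):
    """Split a cleaned tweet into whitespace-separated tokens."""
    return clean_tweet(text).split()


print(tokenize("RT @user: Check THIS out!! https://t.co/abc #NLP"))
# ['check', 'this', 'out', 'nlp']
```

Whitespace tokenization like this would suit the traditional ML and recurrent models; the transformer models would normally use their own subword tokenizers instead.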

## **Benchmarking Results**

| **Model**                     | **Accuracy** | **Precision (Macro)** | **Recall (Macro)** | **F1 (Macro)** | **Precision (Weighted)** | **Recall (Weighted)** | **F1 (Weighted)** |
|-------------------------------|--------------|------------------------|--------------------|----------------|--------------------------|-----------------------|--------------------|
| Logistic Regression           | 0.75         | 0.58                  | 0.67              | 0.59           | 0.86                    | 0.75                 | 0.79              |
| Support Vector Classifier     | 0.75         | 0.54                  | 0.60              | 0.55           | 0.83                    | 0.75                 | 0.78              |
| Decision Tree Classifier      | 0.72         | 0.51                  | 0.53              | 0.51           | 0.79                    | 0.72                 | 0.75              |
| Random Forest Classifier      | 0.83         | 0.59                  | 0.60              | 0.59           | 0.82                    | 0.83                 | 0.82              |
| Gradient Boosting Classifier  | 0.75         | 0.57                  | 0.63              | 0.57           | 0.84                    | 0.75                 | 0.79              |
| AdaBoost Classifier           | 0.71         | 0.54                  | 0.59              | 0.54           | 0.83                    | 0.71                 | 0.76              |
| XGBoost Classifier            | 0.81         | 0.59                  | 0.62              | 0.60           | 0.83                    | 0.81                 | 0.82              |
| LSTM                          | 0.73         | 0.57                  | 0.54              | 0.51           | 0.85                    | 0.73                 | 0.77              |
| Bi-LSTM                       | 0.75         | 0.55                  | 0.63              | 0.57           | 0.85                    | 0.75                 | 0.78              |
| GRU                           | 0.79         | 0.57                  | 0.66              | 0.60           | 0.85                    | 0.79                 | 0.81              |
| BERT                          | 0.91         | 0.76                  | 0.69              | 0.71           | 0.90                    | 0.91                 | 0.90              |
| RoBERTa                       | 0.91         | 0.75                  | 0.72              | 0.74           | 0.90                    | 0.91                 | 0.90              |
| ALBERT                        | 0.91         | 0.76                  | 0.66              | 0.67           | 0.90                    | 0.91                 | 0.91              |
| DistilBERT                    | 0.91         | 0.79                  | 0.73              | 0.75           | 0.91                    | 0.91                 | 0.91              |
| Phi-2 (QLoRA Fine-Tuned)      | 0.90         | 0.75                  | 0.68              | 0.70           | 0.89                    | 0.90                 | 0.89              |
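
The gap between the macro and weighted columns in the table is what class imbalance typically produces: macro averaging gives every class equal weight, while weighted averaging weights each class by its number of true examples. The sketch below computes both in plain Python on a small made-up label set (not the project's data); the function names are illustrative. In practice the same numbers would usually come from scikit-learn's `precision_recall_fscore_support` with `average="macro"` or `average="weighted"`.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1, support[label])
    return scores


def averaged(scores, how):
    """Average (precision, recall, f1): 'macro' is unweighted, 'weighted' uses support."""
    total = sum(s[3] for s in scores.values())
    result = []
    for i in range(3):
        if how == "macro":
            result.append(sum(s[i] for s in scores.values()) / len(scores))
        else:
            result.append(sum(s[i] * s[3] for s in scores.values()) / total)
    return tuple(result)


# Toy, imbalanced example: 8 "offensive" tweets and 2 "hate" tweets.
y_true = ["offensive"] * 8 + ["hate"] * 2
y_pred = ["offensive"] * 8 + ["offensive", "hate"]

scores = per_class_scores(y_true, y_pred)
macro = averaged(scores, "macro")
weighted = averaged(scores, "weighted")
print(f"macro recall    = {macro[1]:.2f}")     # 0.75
print(f"weighted recall = {weighted[1]:.2f}")  # 0.90
```

For single-label classification, weighted recall always equals accuracy, which is why the Accuracy and Recall (Weighted) columns match in every row of the table. The macro columns are the ones that show how well the minority classes are handled.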

---

## **Technical Tools and Frameworks**

- **Data Collection:** BeautifulSoup (BS4), Python for scraping and preprocessing.

- **Modeling:** Scikit-learn, TensorFlow, PyTorch, and HuggingFace Transformers.

- **Fine-Tuning:** QLoRA-based approach for parameter-efficient fine-tuning of large language models like Phi-2.

- **Visualization:** Matplotlib and Seaborn for plotting evaluation metrics.

## **Results**

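The fine-tuned transformer models (BERT, RoBERTa, ALBERT, DistilBERT) and the QLoRA-fine-tuned Phi-2 outperformed the traditional machine learning and recurrent deep learning models on every reported metric except macro recall, where Logistic Regression (0.67) narrowly exceeded ALBERT (0.66). DistilBERT led or tied for the lead on all metrics. Phi-2 reached 0.90 accuracy, just below the 0.91 of the BERT-family models. By accuracy, Random Forest (0.83) and XGBoost (0.81) were the top-performing traditional ML models, and GRU (0.79) was the best of the recurrent architectures.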

---

## **How to Use the API Locally**

To run the **FastAPI app** locally and test the **DistilBERT** model for hate speech detection, follow the instructions below:

### **1. Clone the Repository**

Clone the repository containing the FastAPI application and model files.

```bash
git clone https://github.com/devroopsaha744/HateSpeechDetect-text.git
cd HateSpeechDetect-text
```

### **2. Install Dependencies**

Ensure you have Python 3.10+ installed and set up a virtual environment (recommended).

```bash
python3 -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
pip install -r requirements.txt
```

### **3. Run the Docker Container (Optional)**

To run the API using Docker, use the following commands. The first builds the image from the repository root; the second starts a container from that image.

```bash
docker build -t hatespeech-api .
docker run -p 8000:8000 hatespeech-api
```

This will start the FastAPI app and expose it on port `8000`.

### **4. Access the API**

Once the container is running, you can access the API via the following endpoint in your browser or Postman:

```
http://127.0.0.1:8000
```

## **API Endpoints**

### **1. GET `/`**

- **Description**: Health check endpoint to verify if the API is running.  

- **Response**:

  ```json
  {
    "Hello": "World!"
  }
  ```

### **2. POST `/predict/`**

- **Description**: Predict whether the provided text is hate speech or not.

- **Body**:  

  ```json
  {
    "text": "shut the fuck up, you asshole!"
  }
  ```

- **Response**:

  ```json
  {
    "prediction": "Offensive Language"
  }
  ```

  This will return a prediction of whether the text is hate speech, offensive language or neither.
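
As an alternative to a browser or Postman, the endpoint can be called from a short script. The sketch below uses only the Python standard library and assumes the request and response shapes shown above (`{"text": ...}` in, `{"prediction": ...}` out) and the local address `http://127.0.0.1:8000`. The helper names `build_predict_request` and `predict` are illustrative and not part of the repository.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # local address used in this README


def build_predict_request(text, base_url=BASE_URL):
    """Build a POST request for /predict/ with a JSON body of the form {"text": ...}."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, base_url=BASE_URL, timeout=10):
    """Send the request and return the "prediction" field of the JSON response."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["prediction"]


# With the API running locally:
# print(predict("have a nice day"))
```

Passing a different `base_url` would point the same helper at another deployment, provided it exposes the same `/predict/` route.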

You can also access the **deployed API** on Hugging Face Spaces:  

```
https://datafreak-hatespeech-detect.hf.space
```

## **Screenshots**  

 

![img-1](examples/img-1.png)   

![img-2](examples/img-2.png)  

![img-3](examples/img-3.png)