https://github.com/msthamizh/toxicity-detection-in-tweets
Developed a machine learning model to classify tweets as toxic or non-toxic using NLP techniques like tokenization, lemmatization, and TF-IDF vectorization. Evaluated multiple classifiers and analyzed performance using classification reports, confusion matrices, and ROC-AUC curves.
machine-learning natural-language-processing nltk python
- Host: GitHub
- URL: https://github.com/msthamizh/toxicity-detection-in-tweets
- Owner: MSThamizh
- Created: 2024-11-21T15:01:09.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-12-20T12:42:43.000Z (4 months ago)
- Last Synced: 2025-01-26T04:11:20.939Z (3 months ago)
- Topics: machine-learning, natural-language-processing, nltk, python
- Language: Jupyter Notebook
- Homepage:
- Size: 3.05 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# **Toxicity Detection in Tweets**
This project detects toxic tweets using Natural Language Processing (NLP) techniques: text cleaning, tokenization, lemmatization, and TF-IDF vectorization. The goal is to classify each tweet as toxic or non-toxic based on its content, to help prevent harmful online communication.
## **Problem Statement**
The task is to build a model that classifies tweets into two categories: toxic and non-toxic. Toxicity is defined as language that is harmful, abusive, or hateful. The model is trained using a dataset of labeled tweets and evaluated using classification metrics.

## Workflow
1. **Data Collection**: The dataset consists of labeled tweets, where each tweet is classified as either toxic or non-toxic.
2. **Data Preprocessing**: Text data is cleaned and preprocessed by:
   - Removing special characters, URLs, and stopwords.
   - Tokenizing text and lemmatizing words.
3. **Feature Extraction**:
   - Used **TF-IDF vectorization** to convert the text data into numerical features for the machine learning models.
4. **Model Training**:
   - Trained multiple machine learning classifiers like **Logistic Regression**, **Random Forest**, and **SVM**.
5. **Model Evaluation**:
   - Evaluated models using metrics such as **Accuracy**, **Precision**, **Recall**, **F1-Score**, and **ROC-AUC**.
6. **Prediction**:
   - The model predicts whether a given tweet is toxic or non-toxic.

## Features
- **Text Preprocessing**:
  - Removal of special characters, URLs, and non-alphabetic characters.
  - Tokenization and lemmatization for better feature extraction.
- **TF-IDF Vectorization**:
  - Converts text data into numerical features using Term Frequency-Inverse Document Frequency (TF-IDF).
- **Model Training & Evaluation**:
  - Multiple models are trained, evaluated, and compared based on classification metrics.
- **User Interface**:
  - A user can enter a tweet, and the model classifies it as toxic or non-toxic.
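
The features above can be sketched end to end as follows. This is a minimal illustration, not the notebook's actual code: the toy tweets and labels are made up, `clean_tweet` and `predict_toxicity` are hypothetical helper names, and Multinomial Naive Bayes is used only because it appears in the results table. Lemmatization with NLTK (for example `WordNetLemmatizer`) is left out so the sketch runs without downloading corpora, and stopword removal uses scikit-learn's built-in English list rather than NLTK's.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text: str) -> str:
    """Lowercase, drop URLs, and keep only letters and single spaces."""
    text = URL_RE.sub(" ", text.lower())
    text = NON_ALPHA_RE.sub(" ", text)
    return " ".join(text.split())


# Hypothetical toy data, NOT the project's dataset (1 = toxic, 0 = non-toxic).
tweets = [
    "You are an idiot and everyone hates you",
    "Shut up, you stupid loser!!!",
    "I hate you, worthless idiot http://example.com",
    "Have a wonderful day, friend :)",
    "Thanks for the great help today",
    "What a lovely morning, see https://example.org",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF turns each cleaned tweet into a weighted term vector;
# the classifier is then fit on those vectors.
model = make_pipeline(
    TfidfVectorizer(preprocessor=clean_tweet, stop_words="english"),
    MultinomialNB(),
)
model.fit(tweets, labels)


def predict_toxicity(tweet: str) -> str:
    """Classify a single raw tweet with the fitted pipeline."""
    return "toxic" if model.predict([tweet])[0] == 1 else "non-toxic"


print(predict_toxicity("you stupid idiot"))
print(predict_toxicity("thanks, have a great day"))
```

With so little data the predictions only illustrate the flow. A real model would be fit on the labeled tweet dataset and scored on a held-out split.
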
## Technologies Used
- **Python**: Primary language for model development.
- **NLP Libraries**:
  - **NLTK** for text preprocessing (tokenization, lemmatization).
  - **Scikit-learn** for machine learning algorithms and evaluation.
- **TF-IDF**: For converting text into numerical features.

## **Results**
### **Classification Models**
| Model | Test Accuracy | Precision (Toxic) | Recall (Toxic) | F1-Score (Toxic) | Key Insights |
|-------------------------|---------------|--------------------|----------------|------------------|-------------------------------------------|
| Decision Tree | 87% | 80% | 91% | 85% | High interpretability; prone to overfitting. |
| Random Forest | 91% | 88% | 90% | 89% | Robust to overfitting; high accuracy. |
| Multinomial Naive Bayes | 89% | 87% | 87% | 87% | Performs well with text data. |
| K-Nearest Neighbors     | 81%           | 71%                | 91%            | 80%              | Struggles with imbalanced classes.        |

### **Visualizations**
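
To make the quantities behind confusion matrices and ROC curves concrete, the following sketch computes them in plain Python. The labels and scores are hypothetical and are not taken from the project's notebook, and the helper names (`confusion_counts`, `f1_from`, `roc_auc`) are illustrative only. The final loop checks that the F1-scores reported in the Classification Models table match the harmonic mean of the reported precision and recall after rounding.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating 1 (toxic) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """AUC as the share of (toxic, non-toxic) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical ground truth and model scores (probability of "toxic").
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print("confusion (tp, fp, fn, tn):", (tp, fp, fn, tn))
print("precision:", precision, "recall:", recall, "f1:", f1_from(precision, recall))
print("roc auc:", roc_auc(y_true, scores))

# Precision, recall, and F1 for the toxic class as reported in the results table.
reported = {
    "Decision Tree": (0.80, 0.91, 0.85),
    "Random Forest": (0.88, 0.90, 0.89),
    "Multinomial Naive Bayes": (0.87, 0.87, 0.87),
    "K-Nearest Neighbors": (0.71, 0.91, 0.80),
}
for name, (p, r, f1) in reported.items():
    print(name, "recomputed F1:", round(f1_from(p, r), 2), "reported:", f1)
```

A false positive here is a non-toxic tweet flagged as toxic, and a false negative is a toxic tweet that slips through. Precision falls as false positives rise, and recall falls as false negatives rise.
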
- **ROC Curves**:
  Each model's ROC curve shows its capability to distinguish between toxic and non-toxic tweets.
- **Confusion Matrices**:
  Each model's confusion matrix shows how often it produces false positives and false negatives.

## References
- **Python**: [https://docs.python.org/3/](https://docs.python.org/3/)
- **NLTK**: [https://www.nltk.org/](https://www.nltk.org/)
- **Scikit-learn Documentation**: [https://scikit-learn.org/stable/](https://scikit-learn.org/stable/)